[examples] Add MLIR OMP attributes to control thread count and NUMA affinity

sdh1014 (Contributor) commented Oct 29, 2025

Changes

  • Use proc_bind and num_threads from the MLIR OMP dialect to control thread binding and NUMA affinity
  • Add warm-up passes and average execution times across multiple runs
  • Introduce fast sampling for large outputs: extract one element from each corner of the output matrix, compare against the expected values, and print the results to validate correctness (see the sketch after this list)
  • Replace scf.parallel with omp.wsloop schedule(static)
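
A minimal sketch of the corner-sampling check, assuming a dynamically shaped f32 output buffer; the SSA names (%out, %m, %n) are illustrative rather than the exact identifiers in the PR:

```mlir
// Load one element from each corner of the M x N output and print it;
// the printed values are compared against the expected result.
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%last_m = arith.subi %m, %c1 : index
%last_n = arith.subi %n, %c1 : index
%top_left     = memref.load %out[%c0, %c0]         : memref<?x?xf32>
%top_right    = memref.load %out[%c0, %last_n]     : memref<?x?xf32>
%bottom_left  = memref.load %out[%last_m, %c0]     : memref<?x?xf32>
%bottom_right = memref.load %out[%last_m, %last_n] : memref<?x?xf32>
vector.print %top_left : f32
vector.print %top_right : f32
vector.print %bottom_left : f32
vector.print %bottom_right : f32
```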

Experimental Overview

In the matmul vectorization example, add the proc_bind and num_threads attributes from the MLIR OMP dialect, and conduct performance testing using different configurations of num_threads and proc_bind.

Hardware Configuration

  • CPU: Intel Xeon Platinum 8575C
  • Total Logical CPUs: 192 (2 sockets × 48 physical cores × SMT2)
  • NUMA Topology: 4 NUMA nodes
  • Frequency: max 4.0 GHz

Software Environment

  • Project: examples/BuddyNext/next-sgemm-unroll-vec-fixed-aot
  • Compilation Command: make next-sgemm-unroll-vec-fixed-aot
  • Execution Command: ./next-sgemm-unroll-vec-fixed-aot

Experiment Process

Step 1

To reduce the randomness of single-run measurements, a warm-up loop that repeats 10 times was added in func.func @main(). The kernel is then run 50 more times, and the average time per iteration is reported as the statistic.
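
A minimal sketch of this warm-up/timing structure, assuming the kernel entry is @sgemm_v1_32 and timing uses the rtclock helper from the MLIR runner utils; the constants and memref types are illustrative:

```mlir
func.func private @rtclock() -> f64

// Warm-up: 10 untimed runs to stabilize caches and the thread pool.
scf.for %i = %c0 to %c10 step %c1 {
  func.call @sgemm_v1_32(%a, %b, %out)
      : (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
}

// Timed region: 50 runs; the statistic is (t1 - t0) / 50.
%t0 = func.call @rtclock() : () -> f64
scf.for %i = %c0 to %c50 step %c1 {
  func.call @sgemm_v1_32(%a, %b, %out)
      : (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
}
%t1 = func.call @rtclock() : () -> f64
%elapsed = arith.subf %t1, %t0 : f64
%runs = arith.constant 50.0 : f64
%avg = arith.divf %elapsed, %runs : f64
vector.print %avg : f64
```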

Step 2

Keep the original implementation of func.func @sgemm_v1_32 and comment out its print operation. The average time per iteration is around 0.016 s, and CPU utilization fluctuates significantly, peaking at 19200% (all 192 logical CPUs busy).

Step 3

Wrap the original scf.parallel in an omp.parallel to create a fixed-size OpenMP thread team for explicit management. The code modification is as follows:

```mlir
omp.parallel num_threads(%numThreads : i32) proc_bind(close) {
  scf.parallel (%m_idx) = (%c0) to (%m) step (%unroll) {
    // ... original loop body unchanged ...
  }
  omp.terminator
}
```

With proc_bind(close) fixed, compare num_threads=1/2/8/48/96/128/192. The results show that CPU utilization is still very high when num_threads=1, which clearly does not match the expectation for a single thread.

But when num_threads=2/8/48, CPU utilization meets expectations, yet the average time per iteration is around 0.67 s, significantly slower than the num_threads=1 case. At num_threads=96 the average time per iteration is 0.78 s, and at 128 it is even longer, 1.19337 s. The screenshot below shows the execution with num_threads=2:
(screenshot: execution with num_threads=2)

Afterward, fix num_threads=1/48 and compare the differences between the various proc_bind settings.

From the results, the slowdown after simply adding omp.parallel is likely because scf.parallel is itself parallelized during subsequent lowering; wrapping it in omp.parallel therefore produces nested parallelism and extra scheduling overhead.
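
For illustration, assuming the pipeline lowers scf.parallel with the upstream -convert-scf-to-openmp pass, which emits its own omp.parallel/omp.wsloop pair, the hand-written wrapper would yield nested parallel regions roughly like the following; the exact IR depends on the MLIR version:

```mlir
omp.parallel num_threads(%numThreads : i32) proc_bind(close) {  // hand-written
  omp.parallel {  // inserted by the scf.parallel lowering
    omp.wsloop {
      omp.loop_nest (%m_idx) : index = (%c0) to (%m) step (%unroll) {
        // ... loop body ...
        omp.yield
      }
    }
    omp.terminator
  }
  omp.terminator
}
```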

Step 4

Use omp.wsloop schedule(static) to replace scf.parallel. The code modification is the final version submitted in this PR's commit; a sketch is shown below.
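
A minimal sketch of the replacement (the PR commit is authoritative; clause and wrapper syntax varies slightly across MLIR versions):

```mlir
omp.parallel num_threads(%numThreads : i32) proc_bind(close) {
  // Statically partition the m dimension across the fixed thread team.
  omp.wsloop schedule(static) {
    omp.loop_nest (%m_idx) : index = (%c0) to (%m) step (%unroll) {
      // ... original loop body unchanged ...
      omp.yield
    }
  }
  omp.terminator
}
```

With schedule(static), each thread receives a contiguous chunk of iterations, so there is no runtime scheduling overhead and the placement established by proc_bind stays stable across iterations.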

Fix proc_bind(close) and compare num_threads=1/2/8/24/48/96/144/192. The time cost is lowest at num_threads=144.

Afterward, fix num_threads=48/96/144 and compare the differences between various proc_bind settings.

Experiment Results

Only adding omp.parallel

| Group | numThreads | proc_bind | OMP_PLACES | Average time (s) | Peak CPU % | Memory (KiB) | Result Verification |
|-------|------------|-----------|------------|------------------|------------|--------------|---------------------|
| 1     | 1          | close     | Default    | 0.0175304        | 17683      | 76708        |                     |
| 2     | 1          | spread    | Default    | 0.0156646        | 16689      | 75768        |                     |
| 3     | 1          | primary   | Default    | 0.0167163        | 19200      | 77580        |                     |
| 4     | 2          | close     | Default    | 0.572729         | 212        | 75628        |                     |
| 5     | 8          | close     | Default    | 0.688629         | 818        | 75840        |                     |
| 6     | 48         | close     | Default    | 0.680136         | 4761       | 76476        |                     |
| 7     | 48         | spread    | Default    | 0.678398         | 4898       | 76380        |                     |
| 8     | 48         | primary   | Default    | 0.668907         | 4674       | 76616        |                     |
| 9     | 96         | close     | Default    | 0.843882         | 9812       | 84504        |                     |
| 10    | 128        | close     | Default    | 1.19893          | 12811      | 92364        |                     |
| 11    | 196        | close     | Default    | 1.59193          | 92000      | 105000       |                     |

Replacing scf.parallel with omp.wsloop schedule(static)

| Group | numThreads | proc_bind | OMP_PLACES | Average time (s) | Peak CPU % | Memory (KiB) | Result Verification |
|-------|------------|-----------|------------|------------------|------------|--------------|---------------------|
| 1     | 1          | close     | Default    | 0.564425         | 105        | 75588        |                     |
| 2     | 2          | close     | Default    | 0.328066         | 204        | 75432        |                     |
| 3     | 4          | close     | Default    | 0.172736         | 403        | 75404        |                     |
| 4     | 24         | close     | Default    | 0.039234         | 2421       | 75620        |                     |
| 5     | 48         | close     | Default    | 0.0204416        | 4917       | 75660        |                     |
| 6     | 48         | spread    | Default    | 0.0195666        | 4850       | 75788        |                     |
| 7     | 48         | primary   | Default    | 0.0201573        | 4721       | 75596        |                     |
| 8     | 96         | close     | Default    | 0.01596          | 11129      | 75740        |                     |
| 9     | 96         | spread    | Default    | 0.0145797        | 11465      | 75672        |                     |
| 10    | 96         | primary   | Default    | 0.0161171        | 9474       | 75716        |                     |
| 11    | 144        | close     | Default    | 0.0101853        | 14168      | 75584        |                     |
| 12    | 144        | spread    | Default    | 0.00982          | 14400      | 75672        |                     |
| 13    | 144        | primary   | Default    | 0.00981438       | 14073      | 75616        |                     |
| 14    | 196        | close     | Default    | 0.0164667        | 18149      | 75792        |                     |

Conclusion

The experimental results should be analyzed in two parts:

  • The original scf.parallel is already parallelized during lowering. Simply adding num_threads and proc_bind via omp.parallel therefore fails to exploit that existing parallelism, despite explicitly declaring a thread count, and instead becomes a computational burden.

  • To explicitly control the number of parallel threads, replacing scf.parallel with omp.wsloop schedule(static) achieves the expected experimental results:

    • For num_threads, a value of 144 yields the lowest average time per iteration, outperforming the default parallelized scf.parallel implementation.
    • For proc_bind, at the same thread count the differences between settings are within the margin of error. If a choice must be made, the results suggest spread ≈ primary > close; spread and primary appear more stable.

Note

  • Although the reported times use warm-ups and averaging over multiple iterations, the values still exhibit some fluctuation.
  • The peak CPU and memory figures are instantaneous observations and carry a certain margin of error.

Related to #600
