[examples] Add MLIR OMP attributes to control thread count and NUMA affinity #602
Changes
- Add the proc_bind and num_threads attributes from the MLIR OMP dialect to control thread binding and NUMA affinity
- Replace scf.parallel with omp.wsloop schedule(static)

Experimental Overview
In the matmul vectorization example, add the proc_bind and num_threads attributes from the MLIR OMP dialect, and conduct performance testing using different configurations of num_threads and proc_bind.
Hardware Configuration
Software Environment
examples/BuddyNext/next-sgemm-unroll-vec-fixed-aot

make next-sgemm-unroll-vec-fixed-ao
./next-sgemm-unroll-vec-fixed-ao

Experiment Process
Step 1
To reduce the randomness of single measurements, a warm-up loop was added in func.func @main(), repeating 10 times. After that, it iterates 50 times and calculates the average time per iteration as the statistic.
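A minimal sketch of such a harness is shown below, assuming rtclock() from the MLIR C runner utilities for wall-clock timing and a no-argument stand-in for the kernel (the real func.func @sgemm_v1_32 takes memref arguments); the names and constants are illustrative, not necessarily identical to the code in this PR.

```mlir
// Hypothetical timing harness: 10 warm-up runs, then 50 timed runs.
func.func private @rtclock() -> f64

// Stand-in for the kernel under test (func.func @sgemm_v1_32 in the example).
func.func @sgemm_kernel() {
  return
}

func.func @main() {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c10 = arith.constant 10 : index
  %c50 = arith.constant 50 : index

  // Warm-up loop: results are discarded; caches and the thread pool settle.
  scf.for %i = %c0 to %c10 step %c1 {
    func.call @sgemm_kernel() : () -> ()
  }

  // Timed loop: total wall time over 50 iterations.
  %t0 = func.call @rtclock() : () -> f64
  scf.for %i = %c0 to %c50 step %c1 {
    func.call @sgemm_kernel() : () -> ()
  }
  %t1 = func.call @rtclock() : () -> f64

  // Average time per iteration = (t1 - t0) / 50.
  %total = arith.subf %t1, %t0 : f64
  %n = arith.constant 50.0 : f64
  %avg = arith.divf %total, %n : f64
  vector.print %avg : f64
  return
}
```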
Step 2

Keep the original implementation of func.func @sgemm_v1_32, and comment out the print operation in the original op. The average time per iteration is around 0.016s, and the CPU utilization fluctuates significantly, peaking at 19200%.

Step 3
Wrap the original scf.parallel in an omp.parallel to create a fixed-size OpenMP thread team for explicit management. The code modification is as follows:
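A sketch of the shape of this change is given below; the team size constant is illustrative, the loop bounds (%c0, %cM, %cStep) are assumed to be defined by the surrounding function, and the original loop body is elided.

```mlir
// Hypothetical sketch: wrap the existing scf.parallel in an explicit,
// fixed-size OpenMP thread team.
%c48 = arith.constant 48 : i32
omp.parallel num_threads(%c48 : i32) proc_bind(close) {
  // The original scf.parallel loop nest stays unchanged; it is simply
  // nested inside the explicit omp.parallel region.
  scf.parallel (%i) = (%c0) to (%cM) step (%cStep) {
    // ... original vectorized SGEMM body ...
  }
  omp.terminator
}
```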

With proc_bind=close fixed, compare the situations when num_thread = 1/2/8/40/96/128/192. The result shows that the CPU utilization is still very high when num_thread = 1, which obviously does not meet the expected CPU utilization for num_thread = 1.

But when num_thread = 2/8/48, the CPU utilization meets expectations, yet the average time per iteration is around 0.67s, significantly slower than the num_thread = 1 case. When num_thread = 96, the average time per iteration is 0.78s, and when the thread count is increased to 128, the time is even longer at 1.19337s. The screenshot below shows the execution with num_thread = 2:
Afterward, fix num_thread = 1/48 and compare the differences between the various proc_bind settings.

From the results, simply adding omp.parallel does not bring the expected speedup. This might be because scf.parallel itself is parallelized during subsequent lowering, and the introduction of omp.parallel causes nested parallelism and more scheduling overhead.

Step 4
Use omp.wsloop schedule(static) to replace scf.parallel. The code modification is the final version submitted in the PR commit.
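The committed code is authoritative; the sketch below only illustrates the resulting shape. The textual form of omp.wsloop differs across MLIR versions (recent versions wrap an omp.loop_nest, older ones inline the loop bounds on omp.wsloop itself), and the thread count and loop bounds here are assumptions.

```mlir
// Hypothetical sketch: the parallel loop is now an OpenMP worksharing loop,
// so num_threads and proc_bind apply to it directly, without nesting a
// second level of parallelism.
%c144 = arith.constant 144 : i32
omp.parallel num_threads(%c144 : i32) proc_bind(close) {
  omp.wsloop schedule(static) {
    // schedule(static): iterations are split into equal contiguous chunks,
    // one chunk per thread in the team.
    omp.loop_nest (%i) : index = (%c0) to (%cM) step (%cStep) {
      // ... original vectorized SGEMM body ...
      omp.yield
    }
  }
  omp.terminator
}
```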

Fix proc_bind=close and compare the situations when num_threads = 1/2/8/24/48/96/144/192. It can be observed from the results that the time cost is lowest when num_threads = 144.

Afterward, fix num_threads = 48/96/144 and compare the differences between the various proc_bind settings.

Experiment Result
Only adding omp.parallel

Replace scf.parallel with omp.wsloop schedule(static)

Conclusion
Based on the analysis of the experimental results, it should be viewed in two parts:
The original scf.parallel is parallelized during lowering. Therefore, simply adding num_threads and proc_bind from the omp dialect, despite explicitly declaring num_threads, fails to utilize the parallelism of scf.parallel and instead becomes a computational burden.

To explicitly control the number of parallel threads, replacing scf.parallel with omp.wsloop schedule(static) achieves the expected experimental results. Regarding the results:
- For num_threads, a value of 144 yields the lowest average time per iteration, outperforming the parallelized scf.parallel implementation.
- For proc_bind, the differences between the various proc_bind settings are not significant for the same thread count and are within the margin of error. If a choice must be made, the intuitive impression is that spread ≈ primary > close; spread and primary appear to be more stable.

Note
Related to #600