[examples] Add MLIR OMP attributes to control thread count and NUMA affinity #602
Changes
- Add the proc_bind and num_threads attributes from the MLIR OMP dialect to control thread binding and NUMA affinity
- Replace scf.parallel with omp.wsloop schedule(static)

Experimental Overview
In the matmul vectorization example, add the proc_bind and num_threads attributes from the MLIR OMP dialect, and conduct performance testing using different configurations of num_threads and proc_bind.
Hardware Configuration
Software Environment
examples/BuddyNext/next-sgemm-unroll-vec-fixed-aot

make next-sgemm-unroll-vec-fixed-ao
./next-sgemm-unroll-vec-fixed-ao

Experiment Process
Step 1
To reduce the randomness of single measurements, a warm-up loop was added in func.func @main(), repeating 10 times. After that, it iterates 50 times and calculates the average time per iteration as the statistic.
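A minimal sketch of such a harness is shown below, assuming rtclock() from the MLIR C runner utilities for wall-clock timing and a no-argument stand-in for the kernel (the real func.func @sgemm_v1_32 takes memref arguments); the names and constants are illustrative, not necessarily identical to the code in this PR.

```mlir
// Hypothetical timing harness: 10 warm-up runs, then 50 timed runs.
func.func private @rtclock() -> f64

// Stand-in for the kernel under test (func.func @sgemm_v1_32 in the example).
func.func @sgemm_kernel() {
  return
}

func.func @main() {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c10 = arith.constant 10 : index
  %c50 = arith.constant 50 : index

  // Warm-up loop: results are discarded; caches and the thread pool settle.
  scf.for %i = %c0 to %c10 step %c1 {
    func.call @sgemm_kernel() : () -> ()
  }

  // Timed loop: total wall time over 50 iterations.
  %t0 = func.call @rtclock() : () -> f64
  scf.for %i = %c0 to %c50 step %c1 {
    func.call @sgemm_kernel() : () -> ()
  }
  %t1 = func.call @rtclock() : () -> f64

  // Average time per iteration = (t1 - t0) / 50.
  %total = arith.subf %t1, %t0 : f64
  %n = arith.constant 50.0 : f64
  %avg = arith.divf %total, %n : f64
  vector.print %avg : f64
  return
}
```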
Step 2

Keep the original implementation of func.func @sgemm_v1_32, and comment out the print operation in the original op. The average time per iteration is around 0.016s, and the CPU utilization fluctuates significantly, peaking at 19200%.

Step 3
Wrap the original scf.parallel in an omp.parallel to create a fixed-size OpenMP thread team for explicit management. The code modification is as follows:
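A sketch of the shape of this change is given below; the team size constant is illustrative, the loop bounds (%c0, %cM, %cStep) are assumed to be defined by the surrounding function, and the original loop body is elided.

```mlir
// Hypothetical sketch: wrap the existing scf.parallel in an explicit,
// fixed-size OpenMP thread team.
%c48 = arith.constant 48 : i32
omp.parallel num_threads(%c48 : i32) proc_bind(close) {
  // The original scf.parallel loop nest stays unchanged; it is simply
  // nested inside the explicit omp.parallel region.
  scf.parallel (%i) = (%c0) to (%cM) step (%cStep) {
    // ... original vectorized SGEMM body ...
  }
  omp.terminator
}
```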

With proc_bind=close fixed, compare the situations when num_thread = 1/2/8/40/96/128/192. The result shows that the CPU utilization is still very high when num_thread = 1, which obviously does not meet the expected CPU utilization for num_thread = 1.

But when num_thread = 2/8/48, the CPU utilization meets expectations, yet the average time per iteration is around 0.67s, significantly slower than the num_thread = 1 case. When num_thread = 96, the average time per iteration is 0.78s, and when the thread count is increased to 128, the time is even longer at 1.19337s. The screenshot below shows the execution with num_thread = 2:
Afterward, fix num_thread = 1/48 and compare the differences between the various proc_bind settings.

From the results, simply adding omp.parallel does not bring the expected speedup. This might be because scf.parallel itself is parallelized during subsequent lowering, and the introduction of omp.parallel causes nested parallelism and more scheduling overhead.

Step 4
Use omp.wsloop schedule(static) to replace scf.parallel. The code modification is the final version submitted in the PR commit.
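The committed code is authoritative; the sketch below only illustrates the resulting shape. The textual form of omp.wsloop differs across MLIR versions (recent versions wrap an omp.loop_nest, older ones inline the loop bounds on omp.wsloop itself), and the thread count and loop bounds here are assumptions.

```mlir
// Hypothetical sketch: the parallel loop is now an OpenMP worksharing loop,
// so num_threads and proc_bind apply to it directly, without nesting a
// second level of parallelism.
%c144 = arith.constant 144 : i32
omp.parallel num_threads(%c144 : i32) proc_bind(close) {
  omp.wsloop schedule(static) {
    // schedule(static): iterations are split into equal contiguous chunks,
    // one chunk per thread in the team.
    omp.loop_nest (%i) : index = (%c0) to (%cM) step (%cStep) {
      // ... original vectorized SGEMM body ...
      omp.yield
    }
  }
  omp.terminator
}
```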

Fix proc_bind=close and compare the situations when num_threads = 1/2/8/24/48/96/144/192. It can be observed from the results that the time cost is lowest when num_threads = 144.

Afterward, fix num_threads = 48/96/144 and compare the differences between the various proc_bind settings.

Experiment Result
Only adding omp.parallel

Replace scf.parallel with omp.wsloop schedule(static)

Conclusion
Based on the analysis of the experimental results, it should be viewed in two parts:
The original scf.parallel is parallelized during lowering. Therefore, simply adding num_threads and proc_bind from the omp dialect, despite explicitly declaring num_threads, fails to utilize the parallelism of scf.parallel and instead becomes a computational burden.

To explicitly control the number of parallel threads, replacing scf.parallel with omp.wsloop schedule(static) achieves the expected experimental results. Regarding the results:
- For num_threads, a value of 144 yields the lowest average time per iteration, outperforming the parallelized scf.parallel implementation.
- For proc_bind, the differences between the various proc_bind settings are not significant for the same thread count and are within the margin of error. If a choice must be made, the intuitive impression is that spread ≈ primary > close; spread and primary appear to be more stable.

Note
Related to #600