
Conversation

christiangnrd
Member

Opened to run benchmarks.

Todo:

  • Add a compat bound once the corresponding GPUArrays version is released

Contributor

github-actions bot left a comment


Metal Benchmarks

Benchmark suite Current: b296d15 Previous: 1942968 Ratio
latency/precompile 9804167000 ns 9844653958 ns 1.00
latency/ttfp 3988041166.5 ns 3972040229 ns 1.00
latency/import 1280372083 ns 1275530958.5 ns 1.00
integration/metaldevrt 822834 ns 828500 ns 0.99
integration/byval/slices=1 1544479.5 ns 1536750 ns 1.01
integration/byval/slices=3 10917542 ns 9632625 ns 1.13
integration/byval/reference 1542000 ns 1543583 ns 1.00
integration/byval/slices=2 2646833.5 ns 2621958.5 ns 1.01
kernel/indexing 544708 ns 567792 ns 0.96
kernel/indexing_checked 552500 ns 569292 ns 0.97
kernel/launch 8875 ns 9208 ns 0.96
array/construct 6000 ns 6625 ns 0.91
array/broadcast 567417 ns 583375 ns 0.97
array/random/randn/Float32 788125 ns 784333 ns 1.00
array/random/randn!/Float32 634250 ns 623250 ns 1.02
array/random/rand!/Int64 566645.5 ns 547458 ns 1.04
array/random/rand!/Float32 593833 ns 585291 ns 1.01
array/random/rand/Int64 750291.5 ns 771250 ns 0.97
array/random/rand/Float32 611167 ns 622687 ns 0.98
array/accumulate/Int64/1d 2082292 ns 1277104.5 ns 1.63
array/accumulate/Int64/dims=1 2177000 ns 1868333 ns 1.17
array/accumulate/Int64/dims=2 2116083 ns 2183625 ns 0.97
array/accumulate/Int64/dims=1L 6449187.5 ns 11737104 ns 0.55
array/accumulate/Int64/dims=2L 17938375 ns 9771416.5 ns 1.84
array/accumulate/Float32/1d 1693042 ns 1142833 ns 1.48
array/accumulate/Float32/dims=1 1982625 ns 1570458 ns 1.26
array/accumulate/Float32/dims=2 2028583 ns 1931625 ns 1.05
array/accumulate/Float32/dims=1L 4866500 ns 9864375 ns 0.49
array/accumulate/Float32/dims=2L 14886187.5 ns 7308021 ns 2.04
array/reductions/reduce/Int64/1d 1432291.5 ns 1373353.5 ns 1.04
array/reductions/reduce/Int64/dims=1 1059375 ns 1069291.5 ns 0.99
array/reductions/reduce/Int64/dims=2 1181312.5 ns 1193292 ns 0.99
array/reductions/reduce/Int64/dims=1L 2088917 ns 2113062.5 ns 0.99
array/reductions/reduce/Int64/dims=2L 3433958 ns 3456458 ns 0.99
array/reductions/reduce/Float32/1d 946750 ns 971625 ns 0.97
array/reductions/reduce/Float32/dims=1 783562.5 ns 808458 ns 0.97
array/reductions/reduce/Float32/dims=2 771312.5 ns 768979 ns 1.00
array/reductions/reduce/Float32/dims=1L 1720625 ns 1739041 ns 0.99
array/reductions/reduce/Float32/dims=2L 1760062.5 ns 1772125 ns 0.99
array/reductions/mapreduce/Int64/1d 1489271 ns 1456146 ns 1.02
array/reductions/mapreduce/Int64/dims=1 1097750 ns 1074875 ns 1.02
array/reductions/mapreduce/Int64/dims=2 1118541.5 ns 1206417 ns 0.93
array/reductions/mapreduce/Int64/dims=1L 2112562.5 ns 2119292 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 3424187.5 ns 3444375 ns 0.99
array/reductions/mapreduce/Float32/1d 995729 ns 990792 ns 1.00
array/reductions/mapreduce/Float32/dims=1 782500 ns 810062.5 ns 0.97
array/reductions/mapreduce/Float32/dims=2 763291 ns 761104 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 1735041 ns 1740812.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2L 1761854 ns 1781292 ns 0.99
array/private/copyto!/gpu_to_gpu 637416 ns 651375 ns 0.98
array/private/copyto!/cpu_to_gpu 802750 ns 805542 ns 1.00
array/private/copyto!/gpu_to_cpu 800291.5 ns 817667 ns 0.98
array/private/iteration/findall/int 1828291.5 ns 1646500 ns 1.11
array/private/iteration/findall/bool 1601167 ns 1444584 ns 1.11
array/private/iteration/findfirst/int 1853209 ns 1754958.5 ns 1.06
array/private/iteration/findfirst/bool 1771604 ns 1703625 ns 1.04
array/private/iteration/scalar 4173625 ns 4772500 ns 0.87
array/private/iteration/logical 2695062 ns 2536917 ns 1.06
array/private/iteration/findmin/1d 1856916 ns 1815666 ns 1.02
array/private/iteration/findmin/2d 1412041 ns 1431750 ns 0.99
array/private/copy 558250 ns 538167 ns 1.04
array/shared/copyto!/gpu_to_gpu 83959 ns 86375 ns 0.97
array/shared/copyto!/cpu_to_gpu 82541.5 ns 86583 ns 0.95
array/shared/copyto!/gpu_to_cpu 82792 ns 84833 ns 0.98
array/shared/iteration/findall/int 1850292 ns 1609874.5 ns 1.15
array/shared/iteration/findall/bool 1607187.5 ns 1464354 ns 1.10
array/shared/iteration/findfirst/int 1408875 ns 1377750 ns 1.02
array/shared/iteration/findfirst/bool 1347166.5 ns 1319166 ns 1.02
array/shared/iteration/scalar 205708 ns 217500 ns 0.95
array/shared/iteration/logical 2387416 ns 2288708.5 ns 1.04
array/shared/iteration/findmin/1d 1399167 ns 1421750 ns 0.98
array/shared/iteration/findmin/2d 1422937.5 ns 1430854.5 ns 0.99
array/shared/copy 249042 ns 248666 ns 1.00
array/permutedims/4d 2439875 ns 2438438 ns 1.00
array/permutedims/2d 1169750 ns 1193250 ns 0.98
array/permutedims/3d 1740250 ns 1768458 ns 0.98
metal/synchronization/stream 19042 ns 19916 ns 0.96
metal/synchronization/context 20333 ns 20375 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
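For reference, the Ratio column above is current time divided by previous time, so values above 1.00 mean this PR is slower than the baseline. A minimal sketch of that computation (the 10% flagging threshold is an assumption for illustration, not part of the workflow):

```julia
# Ratio as reported above: current / previous, rounded to two decimals.
# ratio > 1 means the PR commit (b296d15) is slower than the baseline (1942968).
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits = 2)

# Example: the array/accumulate/Int64/dims=2L row from the table.
r = ratio(17938375, 9771416.5)   # the ~2x regression discussed below

# Hypothetical regression flag; the 10% threshold is an assumption.
is_regression = r > 1.10
```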

Contributor

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Suggested changes:
diff --git a/test/runtests.jl b/test/runtests.jl
index 332f4a45..45c343ad 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -6,7 +6,7 @@ import REPL
 using Test
 
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
 
 # Quit without erroring if Metal loaded without issues on unsupported platforms
 if !Sys.isapple()

@christiangnrd
Member Author

christiangnrd commented Jul 20, 2025

As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.

The performance improvement for column-wise accumulation on 3x1000000 matrices comes from Metal missing an easy optimization (see #626). Edit: I was confused; that optimization is only present for reductions.
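For anyone untangling the dims naming: on a 3x1000000 matrix, dims=1 scans the short 3-element columns, while dims=2 scans the 1000000-element rows, which is the slow case discussed here. A small illustration of the semantics (tiny array, not the benchmark itself):

```julia
# accumulate along each dimension of a small matrix; the benchmarks above do
# the same on a 3x1_000_000 matrix, where dims=2 scans million-element rows.
A = [1 2 3;
     4 5 6]

col_scan = accumulate(+, A; dims = 1)  # running sums down each column
row_scan = accumulate(+, A; dims = 2)  # running sums along each row

# col_scan == [1 2 3; 5 7 9]
# row_scan == [1 3 6; 4 9 15]
```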

@christiangnrd changed the title to "Switch to GPUArrays.jl accumulate implementation" on Jul 20, 2025

codecov bot commented Jul 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.35%. Comparing base (1942968) to head (b296d15).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
- Coverage   80.63%   80.35%   -0.29%     
==========================================
  Files          61       60       -1     
  Lines        2722     2678      -44     
==========================================
- Hits         2195     2152      -43     
+ Misses        527      526       -1     


@christiangnrd changed the title to "[Do not merge] Switch to GPUArrays.jl accumulate implementation" on Jul 23, 2025
@maleadt
Member

maleadt commented Jul 29, 2025

As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.

I don't see a massive slowdown?

@christiangnrd
Member Author

@maleadt The accumulate dims=2L benchmarks show a 2x slowdown. Did I get my rows/columns mixed up in my comment?

@maleadt
Member

maleadt commented Jul 30, 2025

Oh OK, I didn't consider 2x a "massive slowdown" :-) Still something to look at, of course, but much less dramatic than, e.g., the 7x regressions we saw against CUDA.jl's reductions.
