[Do not merge] Switch to GPUArrays.jl reduction implementation #628
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:
diff --git a/perf/runbenchmarks.jl b/perf/runbenchmarks.jl
index ba5e0d40..1d7901c5 100644
--- a/perf/runbenchmarks.jl
+++ b/perf/runbenchmarks.jl
@@ -1,6 +1,6 @@
# benchmark suite execution and codespeed submission
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
using Metal
diff --git a/test/runtests.jl b/test/runtests.jl
index 4ee51134..fb376e4f 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -6,7 +6,7 @@ import REPL
using Test
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="akreduce")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "akreduce")
# Quit without erroring if Metal loaded without issues on unsupported platforms
if !Sys.isapple()
Leaving the current
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #628   +/-   ##
=======================================
  Coverage   80.63%   80.63%
=======================================
  Files          61       61
  Lines        2722     2722
=======================================
  Hits         2195     2195
  Misses        527      527
Metal Benchmarks
| Benchmark suite | Current: c0eddd1 | Previous: 1942968 | Ratio |
|---|---|---|---|
| latency/precompile | 9830015416 ns | 9844653958 ns | 1.00 |
| latency/ttfp | 3989128875 ns | 3972040229 ns | 1.00 |
| latency/import | 1281988208 ns | 1275530958.5 ns | 1.01 |
| integration/metaldevrt | 830312.5 ns | 828500 ns | 1.00 |
| integration/byval/slices=1 | 1532291.5 ns | 1536750 ns | 1.00 |
| integration/byval/slices=3 | 8864917 ns | 9632625 ns | 0.92 |
| integration/byval/reference | 1535333 ns | 1543583 ns | 0.99 |
| integration/byval/slices=2 | 2554083 ns | 2621958.5 ns | 0.97 |
| kernel/indexing | 582792 ns | 567792 ns | 1.03 |
| kernel/indexing_checked | 577208 ns | 569292 ns | 1.01 |
| kernel/launch | 9042 ns | 9208 ns | 0.98 |
| array/construct | 6125 ns | 6625 ns | 0.92 |
| array/broadcast | 579250 ns | 583375 ns | 0.99 |
| array/random/randn/Float32 | 821167 ns | 784333 ns | 1.05 |
| array/random/randn!/Float32 | 622625 ns | 623250 ns | 1.00 |
| array/random/rand!/Int64 | 555395.5 ns | 547458 ns | 1.01 |
| array/random/rand!/Float32 | 584125 ns | 585291 ns | 1.00 |
| array/random/rand/Int64 | 777375 ns | 771250 ns | 1.01 |
| array/random/rand/Float32 | 628375 ns | 622687 ns | 1.01 |
| array/accumulate/Int64/1d | 1261292 ns | 1277104.5 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 1800500 ns | 1868333 ns | 0.96 |
| array/accumulate/Int64/dims=2 | 2165958.5 ns | 2183625 ns | 0.99 |
| array/accumulate/Int64/dims=1L | 11643104 ns | 11737104 ns | 0.99 |
| array/accumulate/Int64/dims=2L | 9718917 ns | 9771416.5 ns | 0.99 |
| array/accumulate/Float32/1d | 1141375 ns | 1142833 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 1562333.5 ns | 1570458 ns | 0.99 |
| array/accumulate/Float32/dims=2 | 1865875 ns | 1931625 ns | 0.97 |
| array/accumulate/Float32/dims=1L | 9890916.5 ns | 9864375 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 7298500 ns | 7308021 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 1077583 ns | 1373353.5 ns | 0.78 |
| array/reductions/reduce/Int64/dims=1 | 987500 ns | 1069291.5 ns | 0.92 |
| array/reductions/reduce/Int64/dims=2 | 935145.5 ns | 1193292 ns | 0.78 |
| array/reductions/reduce/Int64/dims=1L | 2350750 ns | 2113062.5 ns | 1.11 |
| array/reductions/reduce/Int64/dims=2L | 2815291 ns | 3456458 ns | 0.81 |
| array/reductions/reduce/Float32/1d | 1029750 ns | 971625 ns | 1.06 |
| array/reductions/reduce/Float32/dims=1 | 956125 ns | 808458 ns | 1.18 |
| array/reductions/reduce/Float32/dims=2 | 870375 ns | 768979 ns | 1.13 |
| array/reductions/reduce/Float32/dims=1L | 1659354.5 ns | 1739041 ns | 0.95 |
| array/reductions/reduce/Float32/dims=2L | 2781167 ns | 1772125 ns | 1.57 |
| array/reductions/mapreduce/Int64/1d | 1000375 ns | 1456146 ns | 0.69 |
| array/reductions/mapreduce/Int64/dims=1 | 936083 ns | 1074875 ns | 0.87 |
| array/reductions/mapreduce/Int64/dims=2 | 873500 ns | 1206417 ns | 0.72 |
| array/reductions/mapreduce/Int64/dims=1L | 2346562.5 ns | 2119292 ns | 1.11 |
| array/reductions/mapreduce/Int64/dims=2L | 2844729 ns | 3444375 ns | 0.83 |
| array/reductions/mapreduce/Float32/1d | 1045959 ns | 990792 ns | 1.06 |
| array/reductions/mapreduce/Float32/dims=1 | 947959 ns | 810062.5 ns | 1.17 |
| array/reductions/mapreduce/Float32/dims=2 | 868041.5 ns | 761104 ns | 1.14 |
| array/reductions/mapreduce/Float32/dims=1L | 1668167 ns | 1740812.5 ns | 0.96 |
| array/reductions/mapreduce/Float32/dims=2L | 2815354.5 ns | 1781292 ns | 1.58 |
| array/private/copyto!/gpu_to_gpu | 636791 ns | 651375 ns | 0.98 |
| array/private/copyto!/cpu_to_gpu | 795791 ns | 805542 ns | 0.99 |
| array/private/copyto!/gpu_to_cpu | 811292 ns | 817667 ns | 0.99 |
| array/private/iteration/findall/int | 1657000 ns | 1646500 ns | 1.01 |
| array/private/iteration/findall/bool | 1451937.5 ns | 1444584 ns | 1.01 |
| array/private/iteration/findfirst/int | 2074750 ns | 1754958.5 ns | 1.18 |
| array/private/iteration/findfirst/bool | 1635145.5 ns | 1703625 ns | 0.96 |
| array/private/iteration/scalar | 5542583.5 ns | 4772500 ns | 1.16 |
| array/private/iteration/logical | 2734958 ns | 2536917 ns | 1.08 |
| array/private/iteration/findmin/1d | 1870167 ns | 1815666 ns | 1.03 |
| array/private/iteration/findmin/2d | 1891583.5 ns | 1431750 ns | 1.32 |
| array/private/copy | 573791.5 ns | 538167 ns | 1.07 |
| array/shared/copyto!/gpu_to_gpu | 83750 ns | 86375 ns | 0.97 |
| array/shared/copyto!/cpu_to_gpu | 82625 ns | 86583 ns | 0.95 |
| array/shared/copyto!/gpu_to_cpu | 91458 ns | 84833 ns | 1.08 |
| array/shared/iteration/findall/int | 1643437.5 ns | 1609874.5 ns | 1.02 |
| array/shared/iteration/findall/bool | 1471812.5 ns | 1464354 ns | 1.01 |
| array/shared/iteration/findfirst/int | 1830375 ns | 1377750 ns | 1.33 |
| array/shared/iteration/findfirst/bool | 1385917 ns | 1319166 ns | 1.05 |
| array/shared/iteration/scalar | 206917 ns | 217500 ns | 0.95 |
| array/shared/iteration/logical | 2750042 ns | 2288708.5 ns | 1.20 |
| array/shared/iteration/findmin/1d | 1607895.5 ns | 1421750 ns | 1.13 |
| array/shared/iteration/findmin/2d | 1917291.5 ns | 1430854.5 ns | 1.34 |
| array/shared/copy | 251042 ns | 248666 ns | 1.01 |
| array/permutedims/4d | 2442208 ns | 2438438 ns | 1.00 |
| array/permutedims/2d | 1184291.5 ns | 1193250 ns | 0.99 |
| array/permutedims/3d | 1737625 ns | 1768458 ns | 0.98 |
| metal/synchronization/stream | 19667 ns | 19916 ns | 0.99 |
| metal/synchronization/context | 20292 ns | 20375 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
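For readers skimming the table: the Ratio column appears to be the current time divided by the previous time, so values above 1.00 would be slowdowns and values below 1.00 speedups. That reading is inferred from the numbers, not stated anywhere in this thread. A minimal Julia sketch under that assumption follows; the helper names `ratio` and `verdict` are hypothetical, and the timings are copied from three rows of the table.

```julia
# Assumption: Ratio = current / previous, rounded to two decimals.
# `ratio` and `verdict` are illustrative helpers, not part of any package.
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits = 2)

verdict(r) = r > 1.0 ? "slower" : r < 1.0 ? "faster" : "unchanged"

# (benchmark name, current ns, previous ns), copied from the table
rows = [
    ("array/reductions/reduce/Int64/1d", 1077583.0, 1373353.5),
    ("array/reductions/mapreduce/Int64/1d", 1000375.0, 1456146.0),
    ("array/reductions/reduce/Float32/dims=2L", 2781167.0, 1772125.0),
]

for (name, current, previous) in rows
    r = ratio(current, previous)
    println(rpad(name, 42), r, "  ", verdict(r))
end
```

Under this reading, the Int64 1d reductions got faster while the Float32 `dims=2L` reduction regressed.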
I think I'd rather we do it in one pass, because the change needs to be made across back-ends.
In any case, despite some regressions the overall performance seems better here than over in CUDA.jl.
Don't remove the file yet to avoid merge conflict with #627