[Do not merge] Switch to GPUArrays.jl accumulate implementation #625
Metal Benchmarks
Benchmark suite | Current: b296d15 | Previous: 1942968 | Ratio |
---|---|---|---|
latency/precompile | 9804167000 ns | 9844653958 ns | 1.00 |
latency/ttfp | 3988041166.5 ns | 3972040229 ns | 1.00 |
latency/import | 1280372083 ns | 1275530958.5 ns | 1.00 |
integration/metaldevrt | 822834 ns | 828500 ns | 0.99 |
integration/byval/slices=1 | 1544479.5 ns | 1536750 ns | 1.01 |
integration/byval/slices=3 | 10917542 ns | 9632625 ns | 1.13 |
integration/byval/reference | 1542000 ns | 1543583 ns | 1.00 |
integration/byval/slices=2 | 2646833.5 ns | 2621958.5 ns | 1.01 |
kernel/indexing | 544708 ns | 567792 ns | 0.96 |
kernel/indexing_checked | 552500 ns | 569292 ns | 0.97 |
kernel/launch | 8875 ns | 9208 ns | 0.96 |
array/construct | 6000 ns | 6625 ns | 0.91 |
array/broadcast | 567417 ns | 583375 ns | 0.97 |
array/random/randn/Float32 | 788125 ns | 784333 ns | 1.00 |
array/random/randn!/Float32 | 634250 ns | 623250 ns | 1.02 |
array/random/rand!/Int64 | 566645.5 ns | 547458 ns | 1.04 |
array/random/rand!/Float32 | 593833 ns | 585291 ns | 1.01 |
array/random/rand/Int64 | 750291.5 ns | 771250 ns | 0.97 |
array/random/rand/Float32 | 611167 ns | 622687 ns | 0.98 |
array/accumulate/Int64/1d | 2082292 ns | 1277104.5 ns | 1.63 |
array/accumulate/Int64/dims=1 | 2177000 ns | 1868333 ns | 1.17 |
array/accumulate/Int64/dims=2 | 2116083 ns | 2183625 ns | 0.97 |
array/accumulate/Int64/dims=1L | 6449187.5 ns | 11737104 ns | 0.55 |
array/accumulate/Int64/dims=2L | 17938375 ns | 9771416.5 ns | 1.84 |
array/accumulate/Float32/1d | 1693042 ns | 1142833 ns | 1.48 |
array/accumulate/Float32/dims=1 | 1982625 ns | 1570458 ns | 1.26 |
array/accumulate/Float32/dims=2 | 2028583 ns | 1931625 ns | 1.05 |
array/accumulate/Float32/dims=1L | 4866500 ns | 9864375 ns | 0.49 |
array/accumulate/Float32/dims=2L | 14886187.5 ns | 7308021 ns | 2.04 |
array/reductions/reduce/Int64/1d | 1432291.5 ns | 1373353.5 ns | 1.04 |
array/reductions/reduce/Int64/dims=1 | 1059375 ns | 1069291.5 ns | 0.99 |
array/reductions/reduce/Int64/dims=2 | 1181312.5 ns | 1193292 ns | 0.99 |
array/reductions/reduce/Int64/dims=1L | 2088917 ns | 2113062.5 ns | 0.99 |
array/reductions/reduce/Int64/dims=2L | 3433958 ns | 3456458 ns | 0.99 |
array/reductions/reduce/Float32/1d | 946750 ns | 971625 ns | 0.97 |
array/reductions/reduce/Float32/dims=1 | 783562.5 ns | 808458 ns | 0.97 |
array/reductions/reduce/Float32/dims=2 | 771312.5 ns | 768979 ns | 1.00 |
array/reductions/reduce/Float32/dims=1L | 1720625 ns | 1739041 ns | 0.99 |
array/reductions/reduce/Float32/dims=2L | 1760062.5 ns | 1772125 ns | 0.99 |
array/reductions/mapreduce/Int64/1d | 1489271 ns | 1456146 ns | 1.02 |
array/reductions/mapreduce/Int64/dims=1 | 1097750 ns | 1074875 ns | 1.02 |
array/reductions/mapreduce/Int64/dims=2 | 1118541.5 ns | 1206417 ns | 0.93 |
array/reductions/mapreduce/Int64/dims=1L | 2112562.5 ns | 2119292 ns | 1.00 |
array/reductions/mapreduce/Int64/dims=2L | 3424187.5 ns | 3444375 ns | 0.99 |
array/reductions/mapreduce/Float32/1d | 995729 ns | 990792 ns | 1.00 |
array/reductions/mapreduce/Float32/dims=1 | 782500 ns | 810062.5 ns | 0.97 |
array/reductions/mapreduce/Float32/dims=2 | 763291 ns | 761104 ns | 1.00 |
array/reductions/mapreduce/Float32/dims=1L | 1735041 ns | 1740812.5 ns | 1.00 |
array/reductions/mapreduce/Float32/dims=2L | 1761854 ns | 1781292 ns | 0.99 |
array/private/copyto!/gpu_to_gpu | 637416 ns | 651375 ns | 0.98 |
array/private/copyto!/cpu_to_gpu | 802750 ns | 805542 ns | 1.00 |
array/private/copyto!/gpu_to_cpu | 800291.5 ns | 817667 ns | 0.98 |
array/private/iteration/findall/int | 1828291.5 ns | 1646500 ns | 1.11 |
array/private/iteration/findall/bool | 1601167 ns | 1444584 ns | 1.11 |
array/private/iteration/findfirst/int | 1853209 ns | 1754958.5 ns | 1.06 |
array/private/iteration/findfirst/bool | 1771604 ns | 1703625 ns | 1.04 |
array/private/iteration/scalar | 4173625 ns | 4772500 ns | 0.87 |
array/private/iteration/logical | 2695062 ns | 2536917 ns | 1.06 |
array/private/iteration/findmin/1d | 1856916 ns | 1815666 ns | 1.02 |
array/private/iteration/findmin/2d | 1412041 ns | 1431750 ns | 0.99 |
array/private/copy | 558250 ns | 538167 ns | 1.04 |
array/shared/copyto!/gpu_to_gpu | 83959 ns | 86375 ns | 0.97 |
array/shared/copyto!/cpu_to_gpu | 82541.5 ns | 86583 ns | 0.95 |
array/shared/copyto!/gpu_to_cpu | 82792 ns | 84833 ns | 0.98 |
array/shared/iteration/findall/int | 1850292 ns | 1609874.5 ns | 1.15 |
array/shared/iteration/findall/bool | 1607187.5 ns | 1464354 ns | 1.10 |
array/shared/iteration/findfirst/int | 1408875 ns | 1377750 ns | 1.02 |
array/shared/iteration/findfirst/bool | 1347166.5 ns | 1319166 ns | 1.02 |
array/shared/iteration/scalar | 205708 ns | 217500 ns | 0.95 |
array/shared/iteration/logical | 2387416 ns | 2288708.5 ns | 1.04 |
array/shared/iteration/findmin/1d | 1399167 ns | 1421750 ns | 0.98 |
array/shared/iteration/findmin/2d | 1422937.5 ns | 1430854.5 ns | 0.99 |
array/shared/copy | 249042 ns | 248666 ns | 1.00 |
array/permutedims/4d | 2439875 ns | 2438438 ns | 1.00 |
array/permutedims/2d | 1169750 ns | 1193250 ns | 0.98 |
array/permutedims/3d | 1740250 ns | 1768458 ns | 0.98 |
metal/synchronization/stream | 19042 ns | 19916 ns | 0.96 |
metal/synchronization/context | 20333 ns | 20375 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
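For reading the Ratio column: the reported values are consistent with current time divided by previous time, so a ratio above 1 means the current commit is slower. The sketch below recomputes one ratio from the `array/accumulate/Int64/1d` timings in the table. The variable names are illustrative only and are not part of the benchmark tooling.

```julia
# Timings in nanoseconds, copied from the array/accumulate/Int64/1d row.
current_ns = 2082292.0    # commit b296d15
previous_ns = 1277104.5   # commit 1942968

# Ratio as it appears to be reported: current / previous, rounded to two digits.
ratio = round(current_ns / previous_ns; digits = 2)

# A ratio above 1 indicates a slowdown relative to the previous commit.
is_regression = ratio > 1
println("ratio = ", ratio, ", regression = ", is_regression)
```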
Your PR requires formatting changes to meet the project's style guidelines. Click here to view the suggested changes.
diff --git a/test/runtests.jl b/test/runtests.jl
index 332f4a45..45c343ad 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -6,7 +6,7 @@ import REPL
using Test
using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
# Quit without erroring if Metal loaded without issues on unsupported platforms
if !Sys.isapple()
As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.
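As a reminder of what the `dims` keyword does in an accumulate call, here is a minimal CPU-only sketch using `Base.accumulate` on a small stand-in matrix. It does not use Metal.jl or GPUArrays.jl, and the 3x4 shape is an assumption chosen only to keep the output readable. For a short, wide matrix such as the 3x1000000 case mentioned above, scanning along rows means each of the few rows carries a very long chain of dependent partial sums.

```julia
# Small stand-in for a short, wide matrix (3 rows, 4 columns).
A = [1  2  3  4;
     5  6  7  8;
     9 10 11 12]

# dims=1 scans down each column: every column is an independent short scan.
down_columns = accumulate(+, A; dims = 1)

# dims=2 scans along each row: every row is an independent long scan.
along_rows = accumulate(+, A; dims = 2)

display(down_columns)
display(along_rows)
```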
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #625 +/- ##
==========================================
- Coverage 80.63% 80.35% -0.29%
==========================================
Files 61 60 -1
Lines 2722 2678 -44
==========================================
- Hits 2195 2152 -43
+ Misses 527 526 -1
☔ View full report in Codecov by Sentry.
I don't see a massive slowdown?
@maleadt The accumulate
Oh OK, I didn't consider 2x a "massive slowdown" :-) Still something to look at of course, but much less dramatic than the 7x regressions we e.g. saw against CUDA.jl's reduction.
Opened to run benchmarks.
Todo: