MasonProtter commented Jul 28, 2025

Closes #8

This PR lets users choose whether the final reduction over task results is performed serially or in parallel. In my experience so far, a serial final reduction is better the overwhelming majority of the time, but there are cases where op is slow enough that parallelizing it pays off, hence the new option.
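To make the distinction concrete, here is a minimal sketch of the two strategies using only Base.Threads. It is not the OhMyThreads implementation; the names chunked_reduce, serial_final_reduce, tree_final_reduce and the parallel_final keyword are hypothetical and exist only for illustration. It assumes a non-empty input and an associative op.

```julia
using Base.Threads: @spawn

# Hypothetical helper: combine the per-task results one after another
# on the calling task.
serial_final_reduce(op, partials) = reduce(op, partials)

# Hypothetical helper: combine the per-task results pairwise in a tree.
# The left half is reduced on a new task while the current task handles
# the right half. Element order is preserved, so op only needs to be
# associative, not commutative.
function tree_final_reduce(op, partials)
    isempty(partials) && throw(ArgumentError("cannot reduce an empty collection"))
    lo, hi = firstindex(partials), lastindex(partials)
    lo == hi && return partials[lo]
    mid = lo + (hi - lo) ÷ 2
    left = @spawn tree_final_reduce(op, view(partials, lo:mid))
    right = tree_final_reduce(op, view(partials, (mid + 1):hi))
    return op(fetch(left), right)
end

# Hypothetical driver: reduce each chunk on its own task, then combine
# the per-chunk results with one of the two strategies above.
function chunked_reduce(op, v; nchunks, parallel_final=false)
    chunks = Iterators.partition(v, cld(length(v), nchunks))
    tasks = [@spawn reduce(op, c) for c in chunks]
    partials = map(fetch, tasks)
    return parallel_final ? tree_final_reduce(op, partials) :
                            serial_final_reduce(op, partials)
end
```

With a cheap op such as +, the extra task spawns in the tree variant are usually pure overhead, which fits the observation that the serial final reduction tends to win. The tree variant only has a chance of paying off when a single call to op is expensive relative to spawning a task.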

Here's a demo of the functionality with a slow reducing op (matrix multiplication):

using OhMyThreads, BenchmarkTools, LinearAlgebra
BLAS.set_num_threads(1)

Serial final reduction:

julia> @btime treduce(*, v; nchunks) setup=begin
           N = 100
           v = [rand(N, N) for _ in 1:100]
           nchunks=50
       end;
  1.794 ms (612 allocations: 7.58 MiB)

Parallel final reduction:

julia> @btime treduce(*, v; nchunks, final_reduction_mode) setup=begin
           N = 100
           v = [rand(N, N) for _ in 1:100]
           nchunks=50
           final_reduction_mode=:parallel
       end;
  707.947 μs (1380 allocations: 7.64 MiB)

and for reference, here's the fully serial reduction:

julia> @btime reduce(*, v) setup=begin
           N = 100
           v = [rand(N, N) for _ in 1:100]
       end;
  3.203 ms (297 allocations: 7.56 MiB)

This is on an 8-core system. In this case the parallelized final reduction gets us much closer to full thread utilization (roughly 4.5x faster than the fully serial reduce, versus about 1.8x with a serial final reduction), but we're still a ways off the ideal 8x due to effects like GC and spawning more tasks than actually necessary.

cc @kpamnany who asked about this at JuliaCon


TODO:

  • Discuss this in the documentation
  • Bikeshed names (I'm open to suggestions)
  • Add tests specifically for the parallel final reduction option


Development

Successfully merging this pull request may close these issues.

Implement an option for tree-based reductions