Add dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell) #2694

Inodayy · 2025-10-13T14:50:57Z

Summary

Implements dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell) using CUTLASS 3.x.

The dual-GEMM operation implemented is:

  D0 = epilogue0(X @ B0, C0)
  D1 = epilogue1(X @ B1, C1)
  D2 = element_wise(D0, D1)
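
For reference, the semantics above can be sketched as a naive host-side loop. This is only an illustration, not the kernel in this PR: the names `silu` and `dual_gemm_reference` are hypothetical, storage is assumed to be row-major, both epilogues are assumed to be the linear combination `alpha * acc + beta * C`, and `element_wise` is assumed to be SiLU(D0) * D1 (the LeftSiLUAndMul behaviour described under Implementation details).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: SiLU(x) = x * sigmoid(x).
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Naive reference for D2 = element_wise(epilogue0(X @ B0, C0), epilogue1(X @ B1, C1)).
// Assumptions: row-major X (M x K), B0/B1 (K x N), C0/C1 (M x N);
// both epilogues are alpha * acc + beta * C; element_wise is silu(d0) * d1.
inline std::vector<float> dual_gemm_reference(
    const std::vector<float>& X,
    const std::vector<float>& B0, const std::vector<float>& B1,
    const std::vector<float>& C0, const std::vector<float>& C1,
    std::size_t M, std::size_t N, std::size_t K,
    float alpha = 1.0f, float beta = 0.0f) {
  std::vector<float> D2(M * N);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc0 = 0.0f, acc1 = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        const float x = X[m * K + k];  // X is read once and feeds both products
        acc0 += x * B0[k * N + n];
        acc1 += x * B1[k * N + n];
      }
      // D0 and D1 exist only as temporaries; only D2 is written out.
      const float d0 = alpha * acc0 + beta * C0[m * N + n];
      const float d1 = alpha * acc1 + beta * C1[m * N + n];
      D2[m * N + n] = silu(d0) * d1;
    }
  }
  return D2;
}
```

The single read of each X element for both accumulators is the data reuse that a fused dual-GEMM is meant to exploit.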

Implementation details

Based on the single-GEMM examples 48_hopper_warp_specialized_gemm.cu and 79a_blackwell_geforce_nvfp4_bf16_gemm.cu.
B0 and B1 currently share the same layout; they are still passed to the builders separately so that the layouts can be decoupled later.
(Blackwell supports only the TN layout; Hopper assumes the NK layout for make_tma_copy_B_sm90 etc.)
D2 applies LeftSiLUAndMul, as in example 45_dual_gemm; it is implemented in store() in collective/sm90_epilogue_tma_warpspecialized_dual.hpp.
D0 and D1 are intermediate results only and are not stored.
Added template<class Op0, class Op1> in fusion/sm90_callbacks… to allow distinct operations for D0 and D1.
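
To illustrate the idea of distinct per-output operations, here is a minimal scalar sketch. It is not the code in this PR and does not use CUTLASS types: `LinearCombinationSketch`, `IdentitySketch`, `SiLUAndMulSketch` and `DualEpilogueSketch` are hypothetical stand-ins, and the real callbacks operate on register fragments rather than single floats.

```cpp
#include <cmath>

// Hypothetical per-output epilogue ops (not CUTLASS types).
struct LinearCombinationSketch {
  float alpha;
  float beta;
  float operator()(float acc, float c) const { return alpha * acc + beta * c; }
};

struct IdentitySketch {
  float operator()(float acc, float /*c*/) const { return acc; }
};

// Hypothetical combine op: SiLU on the left operand, multiplied by the right one.
struct SiLUAndMulSketch {
  float operator()(float lhs, float rhs) const {
    return lhs / (1.0f + std::exp(-lhs)) * rhs;
  }
};

// Op0 and Op1 are independent template parameters, so D0 and D1
// can use different operations before they are combined into D2.
template <class Op0, class Op1, class Combine>
struct DualEpilogueSketch {
  Op0 op0;
  Op1 op1;
  Combine combine;

  float operator()(float acc0, float c0, float acc1, float c1) const {
    const float d0 = op0(acc0, c0);  // intermediate, never stored
    const float d1 = op1(acc1, c1);  // intermediate, never stored
    return combine(d0, d1);          // only D2 leaves the epilogue
  }
};
```

For example, `DualEpilogueSketch<LinearCombinationSketch, IdentitySketch, SiLUAndMulSketch>` applies a scaled epilogue to the first accumulator and passes the second through unchanged.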

Performance (all configurations kept the same as in the single-GEMM examples)

SM90 (Hopper)

Problem size: 2048×2048×2048
Rasterization: Heuristic with max CTA swizzle 2
Avg runtime: 0.20429 ms
GFLOPS: 168,191
≈5% faster than the baseline of two single GEMMs

SM120 (Blackwell)

Problem size: 2048×2048×2048
Avg runtime: 0.155648 ms
GFLOPS: 220,753
≈30% slower than the baseline of two single GEMMs (I have not yet found the root cause)

Notes

I am relatively new to CUTLASS C++; this work was implemented as a learning exercise. I followed a structure similar to example 63_hopper_gemm_with_weight_prefetch.
The SM120 example was my initial local starting point and can be removed if it is unnecessary.

Closes #1123

Inodayy · 2025-10-20T23:26:52Z

@hwu36 @mnicely Hi, just checking whether a 3.x dual-GEMM is still planned, and whether there's any chance this PR could be reviewed when time allows. I'd appreciate any feedback on whether I'm on the right track. Thanks!

Add dual-GEMM examples for SM90 (Hopper) and SM120 (Blackwell)

6f6e8c2
