
[asm] Add pipelined double-buffering support with SGPR rotation #876

Merged
harsh-nod merged 1 commit into iree-org:main from harsh-nod:mubuf_asm
Feb 17, 2026

Conversation

@harsh-nod
Collaborator

Implement memref iter_arg handling for pipelined GEMM with g2s in the C++ WaveASM backend. When scf.for carries memref iter_args for double-buffering, the LDS base offsets are now materialized as SGPRs and rotated at the loop tail using s_mov_b32 swap sequences.

Key changes:

  • RegionBuilder: detect LDS memref iter_args, resolve to SGPR offsets, propagate through block args, handle cross-swap at yield
  • TranslateFromMLIR: use V_ADD_U32 directly with SGPR offsets in vector.load/store (V_MOV_B32 rejects SGPR sources)
  • AMDGPUHandlers: handle dynamic SGPR-carried LDS base offsets in gather_to_lds m0 computation, prefer SALU when both operands are SGPRs
  • LinearScanPass: fix block arg type propagation to use the allocation mapping directly instead of condition iter_arg types (which broke for cross-swap patterns)
  • AssemblyEmitter: emit SGPR rotation copies at the loop tail, detecting independent swap pairs and using a 3-instruction swap through a temporary
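
The loop-tail rotation described in the last bullet can be illustrated with a minimal sketch. This is not the code from this PR: `SgprCopy`, `emitSgprRotation`, and the choice of scratch register are hypothetical names introduced for illustration only. The sketch assumes the yield produces a set of parallel copies between SGPRs, that the scratch SGPR is not otherwise live, and that any copy that is not part of a two-register swap is independent of the others (longer cycles are not handled).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One pending loop-tail copy: SGPR `dst` receives the value held in SGPR `src`.
// (Hypothetical type, not taken from the WaveASM sources.)
struct SgprCopy {
  unsigned dst;
  unsigned src;
};

// Hypothetical helper: lowers parallel SGPR copies to s_mov_b32 text.
// Independent swap pairs (a <- b together with b <- a) become a
// 3-instruction swap through the scratch register `tmp`.
std::vector<std::string> emitSgprRotation(const std::vector<SgprCopy> &copies,
                                          unsigned tmp) {
  auto mov = [](unsigned d, unsigned s) {
    return "s_mov_b32 s" + std::to_string(d) + ", s" + std::to_string(s);
  };

  std::vector<std::string> plain; // copies that are not part of a swap
  std::vector<std::string> swaps; // 3-instruction swap sequences
  std::vector<bool> done(copies.size(), false);

  for (std::size_t i = 0; i < copies.size(); ++i) {
    // The scratch register must not be one of the rotated registers.
    assert(copies[i].dst != tmp && copies[i].src != tmp);
    if (done[i])
      continue;
    done[i] = true;
    if (copies[i].dst == copies[i].src)
      continue; // self-copy, nothing to emit

    // Look for the mirrored copy that makes this a swap pair.
    bool paired = false;
    for (std::size_t j = i + 1; j < copies.size(); ++j) {
      if (!done[j] && copies[j].dst == copies[i].src &&
          copies[j].src == copies[i].dst) {
        swaps.push_back(mov(tmp, copies[i].dst));           // tmp = a
        swaps.push_back(mov(copies[i].dst, copies[i].src)); // a   = b
        swaps.push_back(mov(copies[i].src, tmp));           // b   = tmp
        done[j] = true;
        paired = true;
        break;
      }
    }
    if (!paired)
      plain.push_back(mov(copies[i].dst, copies[i].src));
  }

  // Plain copies go first so they read values from before any swap ran.
  plain.insert(plain.end(), swaps.begin(), swaps.end());
  return plain;
}
```

For example, calling `emitSgprRotation({{4, 5}, {5, 4}}, 6)` models two LDS base offsets held in s4 and s5 that trade places each iteration, with s6 as the scratch register. A naive pair of `s_mov_b32` instructions would overwrite one offset before it is read, which is why the swap needs the third instruction.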


Signed-off-by: Harsh Menon <harsh.menon@amd.com>
harsh-nod merged commit 47653e7 into iree-org:main on Feb 17, 2026
15 checks passed