
Conversation

@chengjunlu chengjunlu commented Aug 11, 2025

Use the transposed 2D block IO to load a column-major matrix from global memory. (The column-major case here generalizes to any case where the fast-changing dimension of the register layout is not the same as the fast-changing dimension in global memory.)

The 2D block IO can only transpose a matrix of i32 type when loading it from memory to registers. To transpose a matrix whose element type is narrower than 32 bits, we need to further transpose the matrix inside the registers.

The steps to load a matrix with transposition via 2D block IO:

  1. Load the matrix from memory into registers as a d32-typed matrix, transposed by the hardware.
  2. (Only needed if the scalar type is narrower than 32 bits) Transpose the MxNxd32 matrix to Mx(32/m)xNxdm inside the registers.

For step 2 we currently only use a bitcast operation, which handles matrices whose width equals the number of threads per warp.

Support for matrices whose width is not equal to the threads per warp will be added later.
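The two steps above can be sketched with a small numpy simulation (a hypothetical illustration for f16 elements; names and shapes are mine, not the PR's actual code, which emits LLVM IR in LoadStoreOpToLLVM.cpp):

```python
import numpy as np

M, N = 4, 8  # tile shape in f16 elements; M assumed even

# Logical tile; its column-major storage has the same bytes as row-major a.T.
a = np.arange(M * N, dtype=np.float16).reshape(M, N)
col_major_bytes = a.T.copy()

# Step 1: the hardware views memory as an (M/2) x N matrix of i32 (each i32
# packs two vertically adjacent f16 of one column) and loads it transposed,
# leaving an N x (M/2) d32 tile in registers.
loaded_d32 = col_major_bytes.view(np.uint32)  # loaded_d32[j, i] packs a[2i, j], a[2i+1, j]

# Step 2: a bitcast inside the registers splits every i32 into two f16,
# recovering the N x M transpose of the original tile.
in_regs = loaded_d32.view(np.float16)

assert np.array_equal(in_regs, a.T)
```

The key point the sketch shows is that the hardware transpose alone leaves pairs of sub-32-bit elements fused into each d32 value; only the second, in-register step restores the element-level transpose.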


Copilot AI left a comment


Pull Request Overview

This draft PR implements transpose 2D block load functionality to efficiently load column major matrices from global memory on Intel Xe+ GPUs. The implementation introduces a transpose operation when the register layout's fast-changing dimension differs from the memory layout, using d32 type matrices with bitcast operations for the transformation.

  • Added support for transpose 2D block IO operations with transpose parameter
  • Enhanced block IO tile size calculation to handle transpose scenarios
  • Implemented new test coverage for transpose and column major load operations

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

Reviewed files:

  • LoadStoreOpToLLVM.cpp: Major refactoring of the 2D block load implementation to support transpose operations and simplified layout handling
  • tensor-pointer-load-block-2d.mlir: Updated test expectations for new block load configurations and tile sizes
  • test_block_store.py: Added a transpose parameter and column-major test cases for block operations

@chengjunlu chengjunlu force-pushed the chengjun/trans_2d_load branch from efff84d to 55c896e Compare August 11, 2025 07:42
@etiotto etiotto marked this pull request as draft October 9, 2025 14:09
@chengjunlu chengjunlu force-pushed the chengjun/trans_2d_load branch from 20a1637 to 942ca37 Compare November 4, 2025 04:49
@chengjunlu chengjunlu changed the title [Draft] Transpose 2d load. [LoadStoreOpToLLVM] Transpose 2d load. Nov 4, 2025
@chengjunlu chengjunlu marked this pull request as ready for review November 4, 2025 04:50
@chengjunlu chengjunlu force-pushed the chengjun/trans_2d_load branch 7 times, most recently from 210886e to e979428 Compare November 10, 2025 05:37
packedElemSizeInBits = 32;
numPackedVals = packedElemSizeInBits / elemSizeInBits;

// Improve this. The current 2D block load only transposes the matrix at
Contributor Author


The improvements will be added in another PR, to keep the changes in a single PR minimal.

@chengjunlu chengjunlu requested a review from Copilot November 10, 2025 05:41

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.



@chengjunlu

@whitneywhtsang @etiotto, the transpose loading is ready for review.

@chengjunlu chengjunlu force-pushed the chengjun/trans_2d_load branch from e979428 to 248ae4c Compare November 12, 2025 03:00
@whitneywhtsang

Can you fix the typo in the image of the PR description or remove it?

return axisInfo ? axisInfo->getStride(dim) : -1;
if (axisInfo) {
const SmallVector<int64_t> &stride = axisInfo->getStride();
if (dim < stride.size()) {
Contributor


why would we call getStride with dim more than the size of stride?


@chengjunlu chengjunlu Nov 13, 2025


This is not a typical case in a real Triton kernel, but it occurs in LIT test cases.

There are simple LIT cases whose input is not a real user-written kernel, like this:

tt.func public @regular_pointer_gather_io(%arg0: tensor<128x64x!tt.ptr<f16>, #mma>,

Function arguments of tensor type are converted to an LLVM struct type before the axis info analysis pass runs, and the analysis initializes the AxisInfo with only one dimension for those non-tensor types. The original code would dereference the stride information with dim > 1, which is out of bounds of the AxisInfo for those operands.

This is just a simple protection that returns an unknown stride for an out-of-bounds dim of the AxisInfo.
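The guard described above can be sketched in Python (illustrative only; `AxisInfo` here is a stand-in class, not the real C++ analysis type):

```python
class AxisInfo:
    """Stand-in for the real AxisInfo analysis result (hypothetical)."""
    def __init__(self, stride):
        self.stride = stride  # one stride entry per analyzed dimension

def get_stride(axis_info, dim):
    # Return -1 ("unknown") when no analysis result exists or when `dim`
    # exceeds the AxisInfo rank, instead of reading out of bounds.
    if axis_info is not None and dim < len(axis_info.stride):
        return axis_info.stride[dim]
    return -1
```

A rank-1 AxisInfo, such as the one produced for a non-tensor LLVM struct argument, then safely reports an unknown stride for dim 1 and above.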

@chengjunlu

Can you fix the typo in the image of the PR description or remove it?

The description has been updated to match the code in this PR.



Development

Successfully merging this pull request may close these issues.

[06-fused-attention] Determine if FP8 operand B can use 2d block load

3 participants