[LoadStoreOpToLLVM] Transpose 2d load. #4870
Conversation
Pull Request Overview
This draft PR implements transpose 2D block load functionality to efficiently load column major matrices from global memory on Intel Xe+ GPUs. The implementation introduces a transpose operation when the register layout's fast-changing dimension differs from the memory layout, using d32 type matrices with bitcast operations for the transformation.
- Added support for transpose 2D block IO operations with transpose parameter
- Enhanced block IO tile size calculation to handle transpose scenarios
- Implemented new test coverage for transpose and column major load operations
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| LoadStoreOpToLLVM.cpp | Major refactoring of 2D block load implementation to support transpose operations and simplified layout handling |
| tensor-pointer-load-block-2d.mlir | Updated test expectations for new block load configurations and tile sizes |
| test_block_store.py | Added transpose parameter and column major test cases for block operations |
packedElemSizeInBits = 32;
numPackedVals = packedElemSizeInBits / elemSizeInBits;
...
// Improve this. The current 2D block load only transposes the matrix at
The improvements will be added in another PR, to keep the changes in a single PR minimal.
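For context, the packing arithmetic in the snippet above determines how many narrow elements share one 32-bit lane of the transposed load. A minimal standalone sketch (plain C++, element widths assumed for illustration; not the pass's actual code):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
  // The transposed 2D block load operates on d32 elements, so element types
  // narrower than 32 bits are packed into 32-bit lanes.
  const unsigned packedElemSizeInBits = 32;
  for (unsigned elemSizeInBits : {8u, 16u, 32u}) {
    unsigned numPackedVals = packedElemSizeInBits / elemSizeInBits;
    std::printf("element width %2u bits -> %u value(s) per packed d32 lane\n",
                elemSizeInBits, numPackedVals);
  }
  return 0;
}
```

For f16 this gives numPackedVals = 2, which matches the 32/m factor in the Mx(32/m)xNxdm reshape described in the PR description.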
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
@whitneywhtsang @etiotto, the transpose loading is ready for review.
Signed-off-by: Lu,Chengjun <[email protected]>
Can you fix the typo in the image of the PR description or remove it?
- return axisInfo ? axisInfo->getStride(dim) : -1;
+ if (axisInfo) {
+   const SmallVector<int64_t> &stride = axisInfo->getStride();
+   if (dim < stride.size()) {
Why would we call getStride with a dim larger than the size of stride?
This is not a typical case in a real Triton kernel, but it occurs in LIT test cases.
There are simple LIT cases that are not real user-written input kernels, like this one:
intel-xpu-backend-for-triton/test/TritonIntelGPU/tensor-pointer-load-block-2d.mlir (line 345 in 248ae4c):
tt.func public @regular_pointer_gather_io(%arg0: tensor<128x64x!tt.ptr<f16>, #mma>,
The tensor-typed arguments of the function are converted to the LLVM struct type before the axis info analysis pass runs, and the analysis initializes the AxisInfo with only one dimension for those non-tensor types. The original code would dereference the stride information with a dim beyond that single dimension, which is out of bounds for the AxisInfo of those operands.
This is just a simple protection that returns an unknown stride for an out-of-bounds dim of the AxisInfo.
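As a minimal sketch of that protection (using std::vector in place of the analysis result and an assumed helper name; not the PR's actual code), a 1-D stride from a non-tensor operand queried at dim 1 falls back to the unknown value:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical helper mirroring the guard: return the stride for `dim` if the
// axis info tracks that dimension, otherwise report it as unknown (-1).
static int64_t getStrideOrUnknown(const std::vector<int64_t> &stride,
                                  unsigned dim) {
  if (dim < stride.size())
    return stride[dim];
  return -1;
}

int main() {
  // A non-tensor (struct-typed) operand only carries one dimension of info.
  std::vector<int64_t> scalarStride = {1};
  std::printf("dim 0 -> %lld\n", (long long)getStrideOrUnknown(scalarStride, 0));
  std::printf("dim 1 -> %lld\n", (long long)getStrideOrUnknown(scalarStride, 1)); // unknown
  return 0;
}
```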
The description has been updated for the code in this PR.
Use the transposed 2D block IO to load a column-major matrix from global memory. (The column-major matrix here can be generalized to any case where the fast-changing dimension of the register layout is not the same as the fast-changing dimension in global memory.)
The 2D block IO can only transpose a matrix of i32 type when loading it from memory into registers. To transpose a matrix whose element type is narrower than 32 bits, we need to further transpose the matrix inside the registers.
The steps to load a matrix with transposition using the 2D block IO:
1. Load the matrix as d32 elements with the transposed 2D block load.
2. Transpose MxNxd32 to Mx(32/m)xNxdm inside the register.
Right now we only use the bitcast operation for step 2, and only for matrices whose width is equal to the threads per warp.
Further, we will support matrices whose width is not equal to the threads per warp.
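To make the two steps above concrete, here is a minimal host-side sketch (plain C++ with uint16_t standing in for f16; the matrix sizes, names, and little-endian assumption are illustrative, and the distribution of values across SIMD lanes is ignored, so this is not the backend's code):

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
  const int R = 4, C = 8; // f16 matrix in "global memory": R x C, row-major
  std::vector<uint16_t> mem(R * C);
  for (int r = 0; r < R; ++r)
    for (int c = 0; c < C; ++c)
      mem[r * C + c] = static_cast<uint16_t>(r * 100 + c);

  // Step 1: the hardware transpose works on d32, so view memory as an
  // R x (C/2) matrix of u32 and load it transposed. Each u32 packs two
  // adjacent f16 values from the same row (little-endian host assumed).
  const int C2 = C / 2; // 32 / 16 = 2 f16 values per d32 element
  std::vector<uint32_t> packedT(C2 * R);
  for (int r = 0; r < R; ++r)
    for (int c2 = 0; c2 < C2; ++c2) {
      uint32_t word;
      std::memcpy(&word, &mem[r * C + 2 * c2], sizeof(word));
      packedT[c2 * R + r] = word; // transposed at d32 granularity
    }

  // Step 2: unpack (bitcast) each d32 element into two f16 values inside the
  // "registers", turning the (C/2) x R d32 matrix into the C x R f16 matrix.
  std::vector<uint16_t> transposed(C * R);
  for (int c2 = 0; c2 < C2; ++c2)
    for (int r = 0; r < R; ++r) {
      uint32_t word = packedT[c2 * R + r];
      transposed[(2 * c2 + 0) * R + r] = static_cast<uint16_t>(word & 0xFFFF);
      transposed[(2 * c2 + 1) * R + r] = static_cast<uint16_t>(word >> 16);
    }

  // Verify: transposed[c][r] == mem[r][c].
  bool ok = true;
  for (int r = 0; r < R; ++r)
    for (int c = 0; c < C; ++c)
      ok = ok && (transposed[c * R + r] == mem[r * C + c]);
  std::printf("d32 transposed load + in-register unpack: %s\n",
              ok ? "OK" : "MISMATCH");
  return 0;
}
```

With m = 16 the unpack in step 2 expands each d32 element into 32/m = 2 dm values, which is the MxNxd32 to Mx(32/m)xNxdm reshape described above; per the PR description, the real implementation currently performs this with a bitcast only when the matrix width equals the threads per warp.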