Convert block ptr type layouts to Subgroup2DBlockEncoding layouts
#4463
base: main
Conversation
Adds end-to-end dot product on block ptr testing to the block load unit test (maybe it should be renamed `test_block_ptr.py`?). Adds additional shapes, A transpose, and B transpose. The cold runtime (no cache) is approximately 1 minute on PVC 1100 in my environment. I picked the block shapes somewhat randomly, trying to balance breadth and runtime. This somewhat duplicates tutorial 10 but allows us to run many more combinations in less time. I added this because #4463 is passing CI but has a few bugs that are not being caught by existing unit tests, including tutorials. --------- Co-authored-by: Whitney Tsang <[email protected]>
third_party/intel/include/Dialect/TritonIntelGPU/Transforms/Passes.td (outdated, resolved)
def TritonIntelGPUOptimizeBlockIOEncodingPass : Pass<"tritonintelgpu-optimize-block-io-encoding", "mlir::ModuleOp"> {
  let summary = "Set encodings on candidates for Subgroup 2D Block IO ops";

  let description = [{
Question: tt.load operations feeding a tt.dot operation are expected to have "dot" encoding in their type. How does this work with the new encoding? Will we need a layout conversion operation to change the layout from "BlockEncoding" to "Dot" encoding?
Yes, currently a tt.load op is expected to have DotOperandEncoding and the block_io tag if we want to lower it to Subgroup 2D Block loads. This pass changes the type of the tt.load to have a Subgroup2DBlockEncoding layout but inserts a layout conversion back to the original type. Because the Subgroup2DBlockEncoding layout is not an anchor layout, it is replaced in RemoveLayoutConversions with the DotOperandEncoding layout.
In #4500 I make Subgroup2DBlockEncoding an anchor layout and teach LoadStoreOpToLLVM to "see through" the layout conversion, which becomes a no-op. But the ultimate goal (which I am working on now) is to move the layout conversion (from subgroup 2d block load to DPAS in registers) out of LoadStoreOpToLLVM and into a ConvertLayoutOp, so eventually LoadStoreOpToLLVM would not depend directly on DPAS to lower block ptr / tensor desc loads to 2d block IO instructions.
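For illustration, a hand-written sketch of the rewrite described above (hypothetical shapes; #dpas and #subgroup_2d_block stand in for the actual attribute definitions):
// Before: the load result carries the dot-operand encoding directly.
%a = tt.load %ptr {ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #dpas, kWidth = 1}>>>
// After: the load uses the block-IO encoding, and an inserted conversion
// restores the original type so downstream ops are unchanged.
%a0 = tt.load %ptr2 {ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #subgroup_2d_block>>
%a1 = ttg.convert_layout %a0 : tensor<256x32xf16, #subgroup_2d_block> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #dpas, kWidth = 1}>>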
@chengjunlu have you taken an initial look at this proposed pass? Opinions?
etiotto
left a comment
Can we add lit tests that exercise an scf.while loop and an if statement as well?
namespace mlir {
namespace triton {
namespace gpu::intel {
Suggested change:
- namespace mlir {
- namespace triton {
- namespace gpu::intel {
+ namespace mlir::triton::gpu::intel {
why?
class TritonIntelGPUOptimizeBlockIOEncodingPass
    : public impl::TritonIntelGPUOptimizeBlockIOEncodingPassBase<
          TritonIntelGPUOptimizeBlockIOEncodingPass> {
[Suggestion]: Can you put the public members at the top and the private members at the bottom of the class? That way the public interface is upfront.
It seemed like upstream passes put runOnOperation at the bottom of the file by convention, so when you want to know what a pass does, you start by scrolling all the way down and reading bottom to top. I tried to follow that, but if we want to standardize and write down a different convention I think that would be fine.
// CHECK: #mma1 = #ttig.subgroup_2d_block<{warpsPerCTA = [8, 4], instrShape = [16, 16], numBlocks=2, order=[0, 1], kWidth=2, threadsPerWarp=16}>
// CHECK: #mma2 = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
#mma = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32, ttg.target = "xpu", "ttg.threads-per-warp" = 16 : i32, ttig.min_sg_size = 16 : i32, ttig.support_bf16_conversion, ttig.support_dpas, ttig.support_sg_2d_block, ttig.target_arch = "spir64"} {
Can the module declaration be simplified to:
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32} {
No. I removed some of the attributes but left a combination of attributes that I believe are relevant and attributes that are required.
%11 = arith.muli %8, %c256_i32 : i32
// CHECK: tt.make_tensor_ptr {{.*}} : <tensor<32x256xf16, #mma1>>
%12 = tt.make_tensor_ptr %arg1, [%c5120_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%c0_i32, %11] {order = array<i32: 1, 0>} : <tensor<32x256xf16, #blocked2>>
%13:3 = scf.for %arg3 = %c0_i32 to %c5120_i32 step %c32_i32 iter_args(%arg4 = %cst, %arg5 = %10, %arg6 = %12) -> (tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #blocked1>>, !tt.ptr<tensor<32x256xf16, #blocked2>>) : i32 {
Add a CHECK to ensure that the loop init args have the new tensor ptr type with the #mma layout.
The return types of the scf.for operation should be CHECKed too.
The return type of the scf is not modified, so I am not sure what we need to check. The tt.dot return type is unchanged, and the make tensor ptr / store uses the same layout as before.
%26 = tt.dot %24, %25, %23, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma2, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma2, kWidth = 2}>> -> tensor<256x256xf32, #mma2>
%27 = ttg.convert_layout %26 : tensor<256x256xf32, #mma2> -> tensor<256x256xf32, #blocked>
%28 = tt.advance %arg5, [%c0_i32, %c32_i32] : <tensor<256x32xf16, #mma>>
%29 = tt.advance %arg6, [%c32_i32, %c0_i32] : <tensor<32x256xf16, #mma1>>
scf.yield %27, %28, %29 : tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #mma>>, !tt.ptr<tensor<32x256xf16, #mma1>>
}
%14 = tt.make_tensor_ptr %arg2, [%c1024_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%9, %11] {order = array<i32: 1, 0>} : <tensor<256x256xf16, #blocked2>>
%15 = arith.truncf %13#0 : tensor<256x256xf32, #blocked> to tensor<256x256xf16, #blocked>
%16 = ttg.convert_layout %15 : tensor<256x256xf16, #blocked> -> tensor<256x256xf16, #blocked2>
tt.store %14, %16 {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<256x256xf16, #blocked2>>
tt.return
I can add a check to make sure the arguments of the scf for loop are using the new values, but the verifier for the scf.for should not allow a type mismatch, so I don't think we need another type check (I briefly looked through other lit examples and did not see arg type checking for scf.for).
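For reference, the kind of CHECK that could be added for the iter_args (hypothetical FileCheck patterns, not lines from the PR's test):
// CHECK: scf.for {{.*}} iter_args({{.*}}) -> (tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #mma>>, !tt.ptr<tensor<32x256xf16, #mma1>>)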
scf.yield %25, %26, %27 : tensor<256x256xf32, #blocked>, !tt.ptr<tensor<256x32xf16, #blocked1>>, !tt.ptr<tensor<32x256xf16, #blocked2>>
}
%14 = tt.make_tensor_ptr %arg2, [%c1024_i64, %c4096_i64], [%c4096_i64, %c1_i64], [%9, %11] {order = array<i32: 1, 0>} : <tensor<256x256xf16, #blocked2>>
%15 = arith.truncf %13#0 : tensor<256x256xf32, #blocked> to tensor<256x256xf16, #blocked>
Add a CHECK to verify that %13#0 has the DPAS layout.
It won't have the DPAS layout, it will be blocked. The DPAS layout is converted to blocked in #25.
namespace {

SmallVector<Value> getTiedArgs(Operation *op, int resultIdx) {
  if (auto forOp = dyn_cast<scf::ForOp>(op)) {
The code for scf.for and scf.while can be commoned up by dynamically casting op to a LoopLikeOpInterface, which both operations implement.
We still need to access op-specific methods, don't we?
LoopLikeOpInterface should let you get the loop info through common accessor names.
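A rough sketch of the suggested commoning, assuming the LoopLikeOpInterface accessors getInits(), getRegionIterArgs(), and getYieldedValues() (an scf.while may still need op-specific handling for its before/after regions):
// Hypothetical sketch of getTiedArgs via LoopLikeOpInterface; not the PR's code.
SmallVector<Value> getTiedArgs(Operation *op, int resultIdx) {
  auto loopOp = dyn_cast<LoopLikeOpInterface>(op);
  if (!loopOp)
    return {};
  Value iterArg = loopOp.getRegionIterArgs()[resultIdx]; // block arg in the loop body
  Value yieldVal = loopOp.getYieldedValues()[resultIdx]; // value yielded back to the loop
  Value initVal = loopOp.getInits()[resultIdx];          // init operand
  Value result = loopOp->getResult(resultIdx);           // corresponding loop result
  return {iterArg, result, yieldVal, initVal};
}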
auto yieldVal =
    whileOp.getBeforeBody()->getTerminator()->getOperand(resultIdx);
auto initVal = whileOp.getOperands()[resultIdx];
return {iterArg, result, iterArg, initVal};
After a return there is no need for an else if; can you please change it to an if?
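A minimal illustration of the style point (hypothetical branches, not the PR's code):
// Since each branch returns, the else-if chain can be flattened to plain ifs.
if (auto forOp = dyn_cast<scf::ForOp>(op))
  return getTiedArgsForFor(forOp);     // hypothetical helper
if (auto whileOp = dyn_cast<scf::WhileOp>(op))
  return getTiedArgsForWhile(whileOp); // hypothetical helper
return {};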
Hi @etiotto we just discussed in the architecture meeting refactoring…
tt.store %14, %16 {boundaryCheck = array<i32: 0, 1>} : !tt.ptr<tensor<256x256xf16, #blocked2>>
tt.return
}
}
Here is a test containing an scf.while loop:
#blocked = #ttg.blocked<{sizePerThread = [4, 4], threadsPerWarp = [1, 16], warpsPerCTA = [8, 4], order = [1, 0]}>
#blocked1 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [4, 4], warpsPerCTA = [32, 1], order = [1, 0]}>
#blocked2 = #ttg.blocked<{sizePerThread = [1, 8], threadsPerWarp = [1, 16], warpsPerCTA = [16, 2], order = [1, 0]}>
#mma = #ttig.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 16, warpsPerCTA = [8, 4], repCluster = [4, 2], A = [32, 16], B = [16, 32], C = [32, 32]}>
module attributes {"ttg.num-ctas" = 1 : i32, "ttg.num-warps" = 32 : i32, "ttg.threads-per-warp" = 16 : i32} {
tt.func public @test_while_loop(%arg0: !tt.ptr<f16>, %arg1: !tt.ptr<f16>, %arg2: i1, %B: tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>>) {
%c0_i32 = arith.constant 0 : i32
%c1_i64 = arith.constant 1 : i64
%c32_i32 = arith.constant 32 : i32
%c256_i32 = arith.constant 256 : i32
%c1024_i64 = arith.constant 1024 : i64
%cst = arith.constant dense<0.000000e+00> : tensor<256x256xf32, #mma>
%0 = tt.make_tensor_ptr %arg1, [%c1024_i64, %c1_i64], [%c1_i64, %c1024_i64], [%c0_i32, %c0_i32] {order = array<i32: 1, 0>} : <tensor<256x32xf16, #blocked1>>
%1:2 = scf.while (%arg3 = %0, %arg4 = %c0_i32) : (!tt.ptr<tensor<256x32xf16, #blocked1>>, i32) -> (!tt.ptr<tensor<256x32xf16, #blocked1>>, i32) {
scf.condition(%arg2) %arg3, %arg4 : !tt.ptr<tensor<256x32xf16, #blocked1>>, i32
} do {
^bb0(%arg3: !tt.ptr<tensor<256x32xf16, #blocked1>>, %arg4: i32):
%3 = tt.advance %arg3, [%c256_i32, %c0_i32] : <tensor<256x32xf16, #blocked1>>
%A = tt.load %3 {boundaryCheck = array<i32: 0, 1>, ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #blocked1>>
%convA1 = ttg.convert_layout %A : tensor<256x32xf16, #blocked1> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>>
%convA2 = ttg.convert_layout %convA1 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #blocked}>> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>>
%res = tt.dot %convA2, %B, %cst, inputPrecision = tf32 : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #mma, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #mma, kWidth = 2}>> -> tensor<256x256xf32, #mma>
%7 = arith.addi %arg4, %c32_i32 : i32
scf.yield %3, %7 : !tt.ptr<tensor<256x32xf16, #blocked1>>, i32
}
tt.return
}
}
This is currently failing; I'd expect this to work given that the pass contains some code to handle while loops.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: /home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt /home/jovyan/tmp/test4.mlir --split-input-file --tritonintelgpu-optimize-block-io-encoding
#0 0x00005633d27e59b7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt+0x483d9b7)
#1 0x00005633d27e361e llvm::sys::RunSignalHandlers() (/home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt+0x483b61e)
#2 0x00005633d27e60b5 SignalHandler(int, siginfo_t*, void*) Signals.cpp:0:0
#3 0x00007fbcfefb1520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#4 0x00007fbcff0059fc pthread_kill (/lib/x86_64-linux-gnu/libc.so.6+0x969fc)
#5 0x00007fbcfefb1476 gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x42476)
#6 0x00007fbcfef977f3 abort (/lib/x86_64-linux-gnu/libc.so.6+0x287f3)
#7 0x00005633d2793231 (/home/jovyan/intel-xpu-backend-for-triton/build/cmake.linux-x86_64-cpython-3.10/bin/triton-opt+0x47eb231)
#8 0x00005633ce562bf1 mlir::triton::getMakeTensorPtrOp(mlir::Value) /home/jovyan/intel-xpu-backend-for-triton/lib/Dialect/Triton/IR/Utility.cpp:93:1
#9 0x00005633ce562639 getMakeTensorPtrOpImpl(mlir::Operation*, mlir::Value) /home/jovyan/intel-xpu-backend-for-triton/lib/Dialect/Triton/IR/Utility.cpp:30:34
#10 0x00005633ce562989 mlir::triton::getMakeTensorPtrOp(mlir::Value) /home/jovyan/intel-xpu-backend-for-triton/lib/Dialect/Triton/IR/Utility.cpp:69:34
#11 0x00005633cebcf32e mlir::triton::gpu::intel::TritonIntelGPUOptimizeBlockIOEncodingPass::getSubgroup2DBlockLayoutForOperand(mlir::Value, mlir::triton::gpu::intel::DpasEncodingAttr, llvm::MapVector<mlir::Operation*, mlir::Attribute, llvm::DenseMap<mlir::Operation*, unsigned int, llvm::DenseMapInfo<mlir::Operation*, void>, llvm::detail::DenseMapPair<mlir::Operation*, unsigned int>>, llvm::SmallVector<std::pair<mlir::Operation*, mlir::Attribute>, 0u>>&) /home/jovyan/intel-xpu-backend-for-triton/third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp:245:57
#12 0x00005633cebcf8a3 mlir::triton::gpu::intel::TritonIntelGPUOptimizeBlockIOEncodingPass::runOnOperation()::'lambda'(mlir::triton::DotOp)::operator()(mlir::triton::DotOp) const /home/jovyan/intel-xpu-backend-for-triton/third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp:300:41
#13 0x00005633cebd253f _ZZN4mlir6detail4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_S8_vEENSt9enable_ifIXaantsrSt11disjunctionIJSt7is_sameIT2_PNS_9OperationEESC_ISD_PNS_6RegionEESC_ISD_PNS_5BlockEEEE5valuesrSC_IT3_vE5valueESO_E4typeESF_OT1_ENKUlSF_E_clESF_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:336:20
#14 0x00005633cebd3c98 _ZN4llvm12function_refIFvPN4mlir9OperationEEE11callback_fnIZNS1_6detail4walkILNS1_9WalkOrderE1ENS1_15ForwardIteratorEZNS1_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNSB_5DotOpEE_SF_vEENSt9enable_ifIXaantsrSt11disjunctionIJSt7is_sameIT2_S3_ESJ_ISK_PNS1_6RegionEESJ_ISK_PNS1_5BlockEEEE5valuesrSJ_IT3_vE5valueEST_E4typeES3_OT1_EUlS3_E_EEvlS3_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/llvm/ADT/STLFunctionalExtras.h:47:40
#15 0x00005633ce354d47 llvm::function_ref<void (mlir::Operation*)>::operator()(mlir::Operation*) const /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/llvm/ADT/STLFunctionalExtras.h:69:62
#16 0x00005633ce352580 void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:187:1
#17 0x00005633ce3524ef void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:179:7
#18 0x00005633ce3524ef void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:179:7
#19 0x00005633ce3524ef void mlir::detail::walk<mlir::ForwardIterator>(mlir::Operation*, llvm::function_ref<void (mlir::Operation*)>, mlir::WalkOrder) /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:179:7
#20 0x00005633cebd25b4 _ZN4mlir6detail4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_S8_vEENSt9enable_ifIXaantsrSt11disjunctionIJSt7is_sameIT2_PNS_9OperationEESC_ISD_PNS_6RegionEESC_ISD_PNS_5BlockEEEE5valuesrSC_IT3_vE5valueESO_E4typeESF_OT1_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Visitors.h:341:38
#21 0x00005633cebd12f0 _ZN4mlir9Operation4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_vEENSt9enable_ifIXeqsrN4llvm15function_traitsINSt5decayIT1_E4typeEXsrSt8is_classISG_E5valueEEE8num_argsLi1EET2_E4typeEOSE_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/Operation.h:798:75
#22 0x00005633cebd019f _ZN4mlir7OpState4walkILNS_9WalkOrderE1ENS_15ForwardIteratorEZNS_6triton3gpu5intel41TritonIntelGPUOptimizeBlockIOEncodingPass14runOnOperationEvEUlNS4_5DotOpEE_vEENSt9enable_ifIXeqsrN4llvm15function_traitsINSt5decayIT1_E4typeEXsrSt8is_classISG_E5valueEEE8num_argsLi1EET2_E4typeEOSE_ /home/jovyan/.triton/llvm/llvm-8957e64a-ubuntu-x64/include/mlir/IR/OpDefinition.h:169:68
#23 0x00005633cebcf962 mlir::triton::gpu::intel::TritonIntelGPUOptimizeBlockIOEncodingPass::runOnOperation() /home/jovyan/intel-xpu-backend-for-triton/third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp:311:21
...
I think it is probably failing because the advance occurs before the load, and the argument to the advance is not a MakeTensorPtrOp but the while loop construct. Why do you expect this to work? Can this ever be generated from a Triton kernel written in Python? I have run this PR on all the tests and benchmarks and not seen this error, so if it is legal then we are missing unit test coverage for this pattern.
third_party/intel/include/Dialect/TritonIntelGPU/Transforms/Passes.td (outdated, resolved)
third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp (one outdated and four resolved review threads)
                 << itr->second.desiredEncoding << " for value "
                 << typedVal << ". Ensure new encoding "
                 << info.desiredEncoding << " matches.\n");
assert(itr->second == info && "already visited encoding info for "
I think the assertion is not triggered because only a subset of use kinds is considered.
Right, but if a use is missed wouldn't we see the IR fail to verify?
third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp (outdated, resolved)
FYI
I added a test with more complex control flow - I also wanted to hunt down the source of the crash @etiotto posted above - it turns out that it is a result of upstream code that assumes all…
Thanks @etiotto and @whitneywhtsang for your comments - I have increased the testing to show more structured control flow examples to include if, while, and for loops. While doing this I discovered a bug in the getTiedArgs function which I fixed, and which is also merged upstream. I also fixed an issue in getMakeTensorPtr for while loops which I have PRed separately (#4567) but included and tested here (I think it would be nice to have that commit be logically separate). Finally, I added support for…
Properly handle while loops in `getMakeTensorPtr`. This is common code, but given upstream is deprecating the feature, I don't think they want to take this change. The test for this is in #4463, but I think it is better to keep the commit logically separate and not have it squashed and merged with #4463.
Merged to our repo.
We need the bug fix too: triton-lang/triton@e71689d
third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp (resolved)
    oldTensorTy.getShape(), oldTensorTy.getElementType(), newEncoding);

val.setType(PointerType::get(newTensorTy, oldType.getAddressSpace()));
if (einfo.requiresConvert) {
What's the scenario where requiresConvert is false and there exists a user of type LoadOp? I wonder if requiresConvert is needed.
The goal of the algorithm in the pass is to attach new encoding info to each value with the initial propagation and then update all values in the second pass. It seemed like requiresConvert fit more naturally in the first phase, where the encoding info is being determined and propagated. But I think you are right that we could look at all users of each Value in the second phase and add the convert if any one of the users is a load.
I think I prefer storing requiresConvert in the first phase, both because it seems to fit more naturally there and because it serves as an additional sanity check on the algorithm (if we see the same Value again, we check to make sure requiresConvert matches the last time we saw that Value). I suppose we could store the index of the user in requiresConvert instead of making it a boolean, and avoid the second users loop in phase 2, but the users loop doesn't seem too expensive. I also think we could extend the use of this pass to propagate transpose layouts too, so having requiresConvert, or at least the machinery to process additional properties, may be useful.
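A sketch of the phase-2 alternative being discussed, using only standard APIs (llvm::any_of over Value::getUsers); here val names the pointer-typed value being rewritten:
// Hypothetical: derive the convert requirement from the value's users in the
// second phase instead of recording it during propagation.
bool requiresConvert = llvm::any_of(val.getUsers(), [](Operation *user) {
  return isa<triton::LoadOp>(user);
});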
third_party/intel/lib/TritonIntelGPUTransforms/OptimizeBlockIOEncoding.cpp (outdated, resolved)
For some reason I can't reply to your…
FYI merged too.
whitneywhtsang
left a comment
ARs captured from the architecture meeting:
- Rebase to utilize the upstream getTiedArgs implementation.
- Remove the requiresConvert field.
- Add a lit test for the case where the result of an advance op is utilized by the load op.
- Ensure that the performance of tutorial 6 remains unaffected by these changes.
To ensure Subgroup2DBlockIO layouts survive RemoveLayoutConversions we need to identify them as anchor layouts on the LoadOp. To do that, we need to change the load layout to be a Subgroup2DBlockIO layout and propagate that layout as necessary through the existing IR. This PR introduces TritonIntelGPU::OptimizeBlockEncoding, which replaces the encoding for all ptr types between the candidate LoadOp and the MakeTensorPtrOp which creates the ptr for the load. The pass is based on a similar pass from upstream, TritonNVIDIAGPU::OptimizeDescriptorEncoding, which modifies layouts in-place for tensor descriptors. To avoid modifying any non-ptr types, which should be modified using ConvertLayoutOp, we add a dummy layout conversion after the LoadOp which will be removed when the old load layout is removed in RemoveLayoutConversions.
With this pass we are able to create the desired chain of Subgroup2DBlockIO -> ConvertLayout -> DPAS, which will allow us to begin using Subgroup2DBlockIO in the LLVM lowering and drive the layout conversion via LinearLayout objects. The pass is introduced in this PR unused, but I did include a lit test which covers the most common for loop use case. I intend to improve test coverage as we expand use of the pass.
Depends on #4461
Close #4362
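For illustration, that desired chain might look roughly like the following hand-written sketch (attribute parameters are invented or elided; #block_a and #dpas are hypothetical aliases):
#block_a = #ttig.subgroup_2d_block<{warpsPerCTA = [8, 4], instrShape = [8, 32], numBlocks = 2, order = [1, 0], kWidth = 1, threadsPerWarp = 16}>
#dpas = #ttig.dpas<{...}>
%a = tt.load %ptr {ttig.block_io = "row_major"} : !tt.ptr<tensor<256x32xf16, #block_a>>
%a1 = ttg.convert_layout %a : tensor<256x32xf16, #block_a> -> tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #dpas, kWidth = 1}>>
%d = tt.dot %a1, %b, %acc : tensor<256x32xf16, #ttg.dot_op<{opIdx = 0, parent = #dpas, kWidth = 1}>> * tensor<32x256xf16, #ttg.dot_op<{opIdx = 1, parent = #dpas, kWidth = 2}>> -> tensor<256x256xf32, #dpas>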