-
Notifications
You must be signed in to change notification settings - Fork 75
[BACKEND] Enhance the remove layout implementation to reduce the duplicated values with different layout in scf.for. #4527
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
486ed4a to
f42bd66
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the remove layout implementation to better handle layout propagation across scf.for operations, addressing limitations that create performance bottlenecks on Intel GPU. The changes focus on reducing duplicated layout conversion operations by improving support for multi-result operations and nested basic blocks.
- Adds support for propagating layouts through
scf.foroperations with a newincludeForOpparameter - Refactors
mappedValuesto handle multiple attribute mappings per value instead of single mappings - Includes debug output and unreachable code handling for
scf.foroperations
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| Utility.h | Adds includeForOp parameter to getConvertBackwardSlice function signature |
| Utility.cpp | Implements scf.for layout propagation logic with early return check and debug output |
| RemoveLayoutConversions.cpp | Updates data structures to support multiple encodings per value and enables scf.for processing |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
f42bd66 to
7a8f858
Compare
|
The flex attn backward ttgir has been simplified by these changes. There are only two root tiling layout of the dpas and the transpose of dot of dpas. Another major in-efficient issue on Xe-Xe3 is that the regular pointer under different layout like: The simplified ttgir |
7a8f858 to
5828b55
Compare
I checkout this branch and run the benchmark code in flex attn bwdBut the TTGI I get is not the same as the one you mentioned, the IR around the Snippet from TTGIRThe IR still contains the convert_layout operations and there are other ops (e.g. ttg.local_alloc) which do not appear for you. How did you compile the benchmark? Does this PR contains all of your code ? |
5828b55 to
d85a2ba
Compare
|
@etiotto Added the missed changes for debug the backward. |
…d values with different layout in scf.for. Signed-off-by: Lu,Chengjun <[email protected]>
Signed-off-by: Lu,Chengjun <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
…ntel/intel-xpu-backend-for-triton into chengjun/enhance_remove_layout
…ntel/intel-xpu-backend-for-triton into chengjun/enhance_remove_layout
Signed-off-by: Ettore Tiotto <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
third_party/intel/lib/TritonIntelGPUTransforms/RemoveLayoutConversions.cpp
Outdated
Show resolved
Hide resolved
| LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | ||
| targetType.getEncoding(), | ||
| slice, layout, nullptr); |
Copilot
AI
Sep 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Adding a nullptr parameter without documentation or context makes the API unclear. Consider adding a comment explaining what this parameter represents or using a named constant instead of nullptr.
| LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | |
| targetType.getEncoding(), | |
| slice, layout, nullptr); | |
| // No filter function is provided, so we pass kNoRematerializationFilter (nullptr). | |
| static constexpr auto kNoRematerializationFilter = nullptr; | |
| LogicalResult result = getRematerializableSlice(convertOp.getSrcMutable(), | |
| targetType.getEncoding(), | |
| slice, layout, kNoRematerializationFilter); |
Signed-off-by: Ettore Tiotto <[email protected]>
Signed-off-by: Ettore Tiotto <[email protected]>

The layout propagation across the
scf.forop in RemoveLayout is not implemented well for these aspects:scf.forops.With the limitations, the
scf.foroperation is the bottle neck of the efficient after the remove layout pass.This is not issue on NV GPU because the NV GPU convert the layout convert operations to async.cp in software pipeline.
But it is an issue for Intel GPU. We rely on the remove layout to get a simple program with less convert layout operations.
Plan to enhance the remove layout to enhance the limitations of the remove layout.
scf.forops on demand.This is an PR for CI.