[ROCm] Enable Triton ScaledMM fallback + kernel selection fix #26668
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request addresses an issue where triton_scaled_mm was not being used on ROCm by fixing the kernel selection logic. It correctly adds TritonScaledMMLinearKernel as a fallback for both ROCm and CUDA, and introduces an is_supported check to ensure kernels are compatible with the current platform. The changes are accompanied by a new integration test to verify the fix.
My review focuses on improving the robustness of the kernel selection. I've suggested making the get_min_capability check in the Triton kernel platform-aware to prevent it from being selected on unsupported ROCm hardware. Additionally, I've pointed out a confusing try-except block in the new test file that should be simplified for clarity and to avoid masking potential errors.
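The reviewer's suggestion about a platform-aware `get_min_capability` could look roughly like the sketch below; the platform helper and the capability thresholds are assumptions for illustration, not vllm's actual values.

```python
# Illustrative sketch only: a platform-aware get_min_capability(), as the
# review suggests, so the Triton kernel is not selected on unsupported
# ROCm hardware. The platform helper and thresholds below are assumptions.


class FakePlatform:
    """Stand-in for vllm's current_platform (hypothetical)."""

    def __init__(self, rocm: bool) -> None:
        self._rocm = rocm

    def is_rocm(self) -> bool:
        return self._rocm


class TritonScaledMMLinearKernel:
    @classmethod
    def get_min_capability(cls, platform: FakePlatform) -> int:
        # On ROCm, gate on a stricter (assumed) capability floor; on CUDA,
        # keep the usual compute-capability threshold (value also assumed).
        return 90 if platform.is_rocm() else 75


cuda_min = TritonScaledMMLinearKernel.get_min_capability(FakePlatform(rocm=False))
rocm_min = TritonScaledMMLinearKernel.get_min_capability(FakePlatform(rocm=True))
```

Branching on the platform inside the capability check keeps the selection logic itself platform-agnostic: callers only compare one number.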
Is this ready for review again?
@ProExpertProg yes!
Purpose
Fixes #14397: triton_scaled_mm was never used on ROCm due to missing dispatch and checks. This PR:
Enables the Triton fallback on ROCm when AITriton is unavailable
Adds the Triton fallback after CUTLASS on CUDA
Implements is_supported() checks for kernel selection
Adds a lightweight integration test validating ROCm dispatch logic
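The selection flow described above might be sketched like this; the kernel class names mirror the PR, but the dispatch function, platform strings, and availability handling are illustrative, not vllm's exact API.

```python
# Sketch of the selection logic under assumed names: each kernel reports
# is_supported() for the current platform, and the first supported entry
# in a per-platform preference list wins. Triton sits last as the shared
# fallback on both CUDA and ROCm.


class Kernel:
    def __init__(self, name: str, platforms: set[str]) -> None:
        self.name = name
        self.platforms = platforms  # platforms this kernel can run on

    def is_supported(self, platform: str) -> bool:
        return platform in self.platforms


CUTLASS = Kernel("CutlassScaledMMLinearKernel", {"cuda"})
AITER = Kernel("AiterScaledMMLinearKernel", {"rocm"})  # AITriton; name assumed
TRITON = Kernel("TritonScaledMMLinearKernel", {"cuda", "rocm"})

PREFERENCE = {
    "cuda": [CUTLASS, TRITON],  # Triton fallback after CUTLASS
    "rocm": [AITER, TRITON],    # Triton fallback when AITriton is unavailable
}


def choose_scaled_mm_kernel(platform: str, available=None) -> Kernel:
    # 'available' lets callers exclude kernels whose backing library is
    # missing (e.g. AITriton not installed on a ROCm box).
    for kernel in PREFERENCE.get(platform, []):
        if available is not None and kernel not in available:
            continue
        if kernel.is_supported(platform):
            return kernel
    raise ValueError(f"no scaled_mm kernel for platform {platform!r}")
```

With this shape, adding a new backend is one list entry per platform, and the is_supported() gate keeps an entry from being picked on hardware it cannot serve.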
Test Plan
1. Mocked test (no GPU)
Result
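A no-GPU dispatch test along these lines could patch the platform check and assert on the chosen kernel; the `select_kernel()` helper and platform object here are hypothetical stand-ins for vllm's internals.

```python
# Hypothetical sketch of a mocked (no-GPU) dispatch test: patch the ROCm
# platform check and assert the Triton kernel is chosen. Names are
# stand-ins, not vllm's real internals.
from unittest import mock


class FakePlatform:
    def is_rocm(self) -> bool:
        return False


current_platform = FakePlatform()


def select_kernel() -> str:
    # On ROCm, fall back to the Triton scaled_mm kernel; on CUDA, prefer
    # CUTLASS (ordering per the PR description).
    if current_platform.is_rocm():
        return "TritonScaledMMLinearKernel"
    return "CutlassScaledMMLinearKernel"


with mock.patch.object(FakePlatform, "is_rocm", return_value=True):
    rocm_choice = select_kernel()

cuda_choice = select_kernel()
```

Patching at the class level means the test exercises the real dispatch path without needing ROCm hardware present.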
2. MI300X (ROCm 7.0, vLLM built from this PR)
(a) Triton kernel functional test
(b) OpenAI-compatible API test
Then:
Response
Confirms successful end-to-end inference on ROCm.
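The OpenAI-compatible check could be driven by a small client like the one below; the base URL, model name, and prompt are placeholders for whatever the server was launched with, and only the request is built here since sending it requires a running server.

```python
# Hypothetical client for the OpenAI-compatible API check. The base URL,
# model name, and prompt are placeholders; adjust them to your deployment.
# Only the request object is constructed here; urlopen(req) would return
# the JSON completion once the server is up.
import json
from urllib.request import Request


def build_completion_request(base_url: str, model: str, prompt: str) -> Request:
    payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    return Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000", "example/quantized-model", "Hello from MI300X"
)
```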