[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building #28579
Conversation
Documentation preview: https://vllm--28579.org.readthedocs.build/en/28579/
Code Review
This pull request refactors the CUDA graph padding logic, moving it from individual attention backends into the gpu_model_runner. This centralization is a good improvement for maintainability. The BatchDescriptor has also been updated to be more descriptive. While the overall direction is positive, I've identified two critical bugs in the implementation within gpu_model_runner.py that could lead to incorrect behavior or prevent CUDA graph optimizations from being applied. Please see the detailed comments for fixes.
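For orientation, here is a minimal sketch of what such a dispatch-key descriptor could look like; the field names (`num_tokens`, `uniform`) follow the commit notes later in this thread and are assumptions, not the exact vLLM definition:

```python
from typing import NamedTuple


class BatchDescriptor(NamedTuple):
    """Illustrative stand-in for the descriptor used as a CUDA graph dispatch key."""

    # Padded token count for the batch; it must match one of the captured
    # CUDA graph sizes exactly for a graph replay to be possible.
    num_tokens: int
    # True when every request contributes the same query length (pure decode),
    # which is what allows dispatching to a FULL cudagraph.
    uniform: bool = False


# Example: a padded uniform-decode batch of 32 tokens.
desc = BatchDescriptor(num_tokens=32, uniform=True)
print(desc)  # BatchDescriptor(num_tokens=32, uniform=True)
```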
vllm/v1/worker/gpu_model_runner.py (Outdated)

    )
    uniform_decode = (
        (max_num_scheduled_tokens == self.uniform_decode_query_len)
        and (num_reqs == max_num_scheduled_tokens)
The condition for uniform_decode seems incorrect. num_reqs == max_num_scheduled_tokens will only be true in very specific cases (e.g., a single decode request when uniform_decode_query_len is 1), preventing most uniform decode batches from being correctly identified. This will likely disable CUDA graph optimizations for decode paths.
The condition should probably check if the total number of tokens is equal to the number of requests multiplied by the query length, similar to the previous implementation.
Suggested change:

    -    and (num_reqs == max_num_scheduled_tokens)
    +    and (num_tokens_unpadded == num_reqs * max_num_scheduled_tokens)
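To make the suggested condition concrete, here is a small hedged sketch (the helper `is_uniform_decode` is hypothetical; it simply mirrors the check proposed above):

```python
def is_uniform_decode(
    num_tokens_unpadded: int,
    num_reqs: int,
    max_num_scheduled_tokens: int,
    uniform_decode_query_len: int,
) -> bool:
    """Mirror of the suggested check: no request exceeds the uniform decode
    query length, and every request schedules exactly that many tokens."""
    return (
        max_num_scheduled_tokens == uniform_decode_query_len
        and num_tokens_unpadded == num_reqs * max_num_scheduled_tokens
    )


# Example: 8 decode requests scheduling 1 token each with query length 1.
# The original `num_reqs == max_num_scheduled_tokens` check (8 == 1) fails,
# while the token-count check (8 == 8 * 1) correctly identifies the batch.
assert is_uniform_decode(
    num_tokens_unpadded=8, num_reqs=8,
    max_num_scheduled_tokens=1, uniform_decode_query_len=1,
)
```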
vllm/v1/worker/gpu_model_runner.py (Outdated)

    attn_metadata, spec_decode_common_attn_metadata = (
        self._build_attention_metadata(
    -       total_num_scheduled_tokens=total_num_scheduled_tokens,
    +       total_num_scheduled_tokens=num_reqs_padded,
The total_num_scheduled_tokens argument for _build_attention_metadata is being passed num_reqs_padded, which is the number of requests. It should be num_tokens_padded, the total number of tokens. This will likely lead to incorrect attention metadata and could cause errors or incorrect model outputs.
Suggested change:

    -    total_num_scheduled_tokens=num_reqs_padded,
    +    total_num_scheduled_tokens=num_tokens_padded,
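A short sketch of why the two padded quantities differ, assuming an illustrative speculative-decoding setup (the numbers and the relationship via `uniform_decode_query_len` are assumptions for this example):

```python
# With speculative decoding, each request can contribute several query tokens,
# so the padded request count and the padded token count are different values.
num_reqs_padded = 16          # request count rounded up to a captured graph size
uniform_decode_query_len = 2  # e.g. 1 verified token + 1 draft token per request
num_tokens_padded = num_reqs_padded * uniform_decode_query_len  # 32

# Attention metadata describes the token dimension of the batch, so it should
# be built from num_tokens_padded (32), not num_reqs_padded (16).
assert num_tokens_padded != num_reqs_padded
```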
💡 Codex Review
Here are some automated review suggestions for this pull request.
SageMoore left a comment:
Looks like a good cleanup @LucasWilkinson. Thanks for the contribution.
Force-pushed from 8d3975f to dd6ad9e.
ProExpertProg left a comment:
LGTM overall, just a few questions above.
Force-pushed from bb224a0 to 22ab0f9.
SageMoore left a comment:
This all looks good to me, Lucas.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a7a04ba to 2b58a28.
Force-pushed from 38cac6d to bf3731f.
- Move padding calculation before CUDA graph dispatch
- Update dispatch() to take uniform_decode directly instead of computing it
- Remove max_num_scheduled_tokens parameter from dispatch()
- Update BatchDescriptor to use 'uniform' field consistently
- Fix _prepare_inputs to handle new padding flow
- Update attention backends to work with new padding approach
- Add documentation for BatchDescriptor fields

Co-authored-by: ayushsatyam146 <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>

Follow-up commits (all signed off by Lucas Wilkinson): remove files, cleanup (several), review comment, remove dead code, fix doc error, wip, clean-up, pad ubatches, test fixes, fix CPU backend, fix typo, Update vllm/v1/worker/gpu_model_runner.py (Co-authored-by: Luka Govedič <[email protected]>), format, fix mamba, fix mypy, test fix, more test fixes, fix.
This reverts commit 993ca5a.
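As a rough illustration of the interface change described in the commit notes above (the `dispatch()` signature and the `CUDAGraphMode` values here are hypothetical stand-ins, not vLLM's actual API):

```python
from enum import Enum


class CUDAGraphMode(Enum):
    FULL = "full"            # uniform decode batches can replay a full graph
    PIECEWISE = "piecewise"  # mixed prefill/decode batches use piecewise graphs


def dispatch(num_tokens_padded: int, uniform_decode: bool) -> CUDAGraphMode:
    """Hypothetical dispatcher: the runner pads the batch and decides
    uniformity first, so dispatch() no longer needs max_num_scheduled_tokens."""
    return CUDAGraphMode.FULL if uniform_decode else CUDAGraphMode.PIECEWISE


# The model runner computes padding and the uniform flag up front:
print(dispatch(num_tokens_padded=32, uniform_decode=True))   # CUDAGraphMode.FULL
print(dispatch(num_tokens_padded=96, uniform_decode=False))  # CUDAGraphMode.PIECEWISE
```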
Force-pushed from b6707c4 to 89f0ca7.
mgoin left a comment:
Thanks for the clarification. Looks good to me!
It seems this PR adds some runtime overhead: #29760
1. Fix vllm-project/vllm#28542. The model structure modifications we are involved in: Qwen2.5-VL (some patches still exist), Qwen2-VL, Qwen2, the DeepSeek series, and the Qwen-moe series.
2. Fix vllm-project/vllm#29121: the output token type changed from np to `list[list[int]]`.
3. Fix vllm-project/vllm#29262: the `xformers` backend for multimodal has been deprecated.
4. Fix vllm-project/vllm#29342.
5. Fix vllm-project/vllm#28579.
6. Fix vllm-project/vllm#28718.
7. Fix vllm-project/vllm#28665.
8. Fix vllm-project/vllm#26847: vLLM introduced the `optimization-level`, some default config values changed, and the `--enforce-eager` param has been deprecated.
9. Fix http://github.com/vllm-project/vllm/pull/29223: it returns a tuple for the sampler.
10. Fix vllm-project/vllm#29471: we'll remove the related patch to avoid this kind of error.

vLLM version: v0.11.2

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
    )

    ubatch_slices, num_tokens_across_dp = coordinate_batch_across_dp(
        num_tokens_unpadded=num_tokens_padded,
Should this be `num_tokens_unpadded=num_tokens`?
FIX #23789
The goal of this PR is to:
- update to the latest FA3 ("FA3 variable length attention sort/swizzle", flash-attention#82)
- remove hacks like "vLLM: Easier cudagraph integration" (FlashMLA#3)
- remove `pad_for_cudagraphs` from attention backends; this is done for FlashInfer but will be done for `GDNAttentionBackend`, `Mamba1AttentionBackend`, `Mamba2AttentionMetadata`, and `ShortConvAttentionBackend` in future PRs

`pad_for_cudagraphs` was called multiple times inside `execute_model` before the forward pass, making it challenging to reason about the padding order. This PR starts to make the padding order in gpu_model_runner clearer, but more work still needs to be done (a simplified sketch of the intended ordering follows the list below).

Future related work that will be based off this PR:

- remove `pad_for_cudagraphs` from the remaining attention backends (see 1)
- remove `pad_for_cudagraphs` from config, transferring ownership to the CUDAGraphDispatcher; this will make it easier to have separate cudagraph sizes for FULL and PIECEWISE, which is important for a more robust and long-term solution to "[Bug]: CUDA Graph Capture Issue: Unexpected Prefill Branches in Uniform Decode Graphs when MTP=2" (#28207)
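For illustration, a minimal sketch of the reordering this PR moves toward: pad once, up front, then build attention metadata for the already-padded batch. The helper name and the plain list of captured graph sizes are assumptions, not vLLM's actual API:

```python
def pad_then_build_metadata(num_tokens: int, cudagraph_sizes: list[int]) -> dict:
    """Hypothetical flow: CUDA graph padding happens before metadata building."""
    # 1. Round the token count up to the nearest captured CUDA graph size;
    #    fall back to the unpadded count (eager execution) if nothing fits.
    num_tokens_padded = next(
        (size for size in sorted(cudagraph_sizes) if size >= num_tokens),
        num_tokens,
    )
    # 2. Build attention metadata directly for the padded batch, so individual
    #    backends no longer need their own pad_for_cudagraphs step.
    return {"total_num_scheduled_tokens": num_tokens_padded}


print(pad_then_build_metadata(num_tokens=13, cudagraph_sizes=[8, 16, 32]))
# -> {'total_num_scheduled_tokens': 16}
```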
Shout-out to @ayushsatyam146 for the preliminary work in #24002
Co-authored-by: ayushsatyam146 [email protected]