
Conversation

@LucasWilkinson (Collaborator) commented Nov 12, 2025

FIX #23789

The goal of this PR is to:

  1. Pad for cudagraphs before building attention metadata (see the sketch after this list); this will allow us to
    - update to the latest FA3 ("FA3 variable length attention sort/swizzle", flash-attention#82)
    - remove hacks like "vLLM Easier cudagraph integration" (FlashMLA#3)
    - remove pad_for_cudagraphs from attention backends; this PR does so for FlashInfer, with GDNAttentionBackend, Mamba1AttentionBackend, Mamba2AttentionMetadata, and ShortConvAttentionBackend to follow in future PRs
  2. Pad for cudagraphs in fewer places; prior to this PR, pad_for_cudagraphs was called multiple times inside execute_model before the forward pass, which made it hard to reason about the padding order. This PR starts to make the padding order in gpu_model_runner clearer, but more work remains.
  3. Generally make the padding logic more isolated and easier to reason about
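
For orientation, here is a minimal sketch of the intended ordering in Python; the capture sizes, the pad_to_captured_size helper, and the build_attn_metadata call are illustrative assumptions, not the actual vLLM APIs:

import bisect

# Hypothetical sketch of the new ordering: pad once, up front, then build
# attention metadata with the already-padded sizes (instead of each backend
# calling pad_for_cudagraphs itself). All names here are illustrative.
CAPTURED_GRAPH_SIZES = [1, 2, 4, 8, 16, 32, 64]  # assumed capture sizes, ascending

def pad_to_captured_size(num_tokens: int) -> int:
    """Round num_tokens up to the nearest captured CUDA graph size."""
    idx = bisect.bisect_left(CAPTURED_GRAPH_SIZES, num_tokens)
    if idx == len(CAPTURED_GRAPH_SIZES):
        return num_tokens  # larger than any captured graph: run eagerly, no padding
    return CAPTURED_GRAPH_SIZES[idx]

num_tokens_unpadded = 13
num_tokens_padded = pad_to_captured_size(num_tokens_unpadded)  # -> 16
# attn_metadata = build_attn_metadata(num_tokens=num_tokens_padded, ...)  # hypothetical builder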

Future related work that will be based on this PR:

Shout-out to @ayushsatyam146 for the preliminary work in #24002
Co-authored-by: ayushsatyam146 [email protected]

mergify bot commented Nov 12, 2025

Documentation preview: https://vllm--28579.org.readthedocs.build/en/28579/

mergify bot added the documentation, nvidia, and v1 labels on Nov 12, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request refactors the CUDA graph padding logic, moving it from individual attention backends into the gpu_model_runner. This centralization is a good improvement for maintainability. The BatchDescriptor has also been updated to be more descriptive. While the overall direction is positive, I've identified two critical bugs in the implementation within gpu_model_runner.py that could lead to incorrect behavior or prevent CUDA graph optimizations from being applied. Please see the detailed comments for fixes.

        )
        uniform_decode = (
            (max_num_scheduled_tokens == self.uniform_decode_query_len)
            and (num_reqs == max_num_scheduled_tokens)

critical

The condition for uniform_decode seems incorrect. num_reqs == max_num_scheduled_tokens will only be true in very specific cases (e.g., a single decode request when uniform_decode_query_len is 1), preventing most uniform decode batches from being correctly identified. This will likely disable CUDA graph optimizations for decode paths.

The condition should probably check if the total number of tokens is equal to the number of requests multiplied by the query length, similar to the previous implementation.

Suggested change:
-            and (num_reqs == max_num_scheduled_tokens)
+            and (num_tokens_unpadded == num_reqs * max_num_scheduled_tokens)
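
For illustration, a self-contained version of the suggested check; the helper below is a sketch using the names from the snippet above, not the vLLM implementation:

def is_uniform_decode(
    num_tokens_unpadded: int,
    num_reqs: int,
    max_num_scheduled_tokens: int,
    uniform_decode_query_len: int,
) -> bool:
    # A batch is a uniform decode when every request contributes exactly
    # uniform_decode_query_len tokens, i.e. the total token count equals
    # num_reqs * uniform_decode_query_len.
    return (
        max_num_scheduled_tokens == uniform_decode_query_len
        and num_tokens_unpadded == num_reqs * max_num_scheduled_tokens
    )

# 8 decode requests, 1 token each -> uniform decode.
assert is_uniform_decode(8, 8, 1, 1)
# A mixed batch (one 5-token prefill plus 7 decodes) is not.
assert not is_uniform_decode(12, 8, 5, 1)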

        attn_metadata, spec_decode_common_attn_metadata = (
            self._build_attention_metadata(
-               total_num_scheduled_tokens=total_num_scheduled_tokens,
+               total_num_scheduled_tokens=num_reqs_padded,

critical

The total_num_scheduled_tokens argument for _build_attention_metadata is being passed num_reqs_padded, which is the number of requests. It should be num_tokens_padded, the total number of tokens. This will likely lead to incorrect attention metadata and could cause errors or incorrect model outputs.

Suggested change:
-                total_num_scheduled_tokens=num_reqs_padded,
+                total_num_scheduled_tokens=num_tokens_padded,
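
To make the distinction concrete, a small numerical sketch; the values below are made up for illustration:

# Hypothetical uniform-decode batch with a spec-decode query length of 2.
uniform_decode_query_len = 2
num_reqs = 6                  # requests actually scheduled this step
num_reqs_padded = 8           # padded up to the nearest captured graph size
num_tokens_padded = num_reqs_padded * uniform_decode_query_len  # 16

# Passing num_reqs_padded (8) where the builder expects the total token
# count (16) would size the attention metadata for half the tokens.
assert num_tokens_padded == 16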

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@LucasWilkinson changed the title from "[Core] Pad for CUDA graphs before attention metadata building and refactor padding logic" to "[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building and" on Nov 13, 2025
@LucasWilkinson added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Nov 13, 2025
@LucasWilkinson changed the title to "[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building" on Nov 13, 2025

@SageMoore (Contributor) left a comment

Looks like a good cleanup @LucasWilkinson. Thanks for the contribution.

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch 3 times, most recently from 8d3975f to dd6ad9e Compare November 17, 2025 20:21
@ProExpertProg (Collaborator) left a comment

LGTM overall, just a few questions above.

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Nov 17, 2025
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch from bb224a0 to 22ab0f9 Compare November 18, 2025 15:34
@SageMoore (Contributor) left a comment

This all looks good to me, Lucas.

mergify bot commented Nov 19, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 19, 2025
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch 2 times, most recently from a7a04ba to 2b58a28 Compare November 21, 2025 04:55
@mergify mergify bot removed the needs-rebase label Nov 21, 2025
@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch from 38cac6d to bf3731f Compare November 22, 2025 06:00
- Move padding calculation before CUDA graph dispatch
- Update dispatch() to take uniform_decode directly instead of computing it
- Remove max_num_scheduled_tokens parameter from dispatch()
- Update BatchDescriptor to use 'uniform' field consistently
- Fix _prepare_inputs to handle new padding flow
- Update attention backends to work with new padding approach
- Add documentation for BatchDescriptor fields

Co-authored-by: ayushsatyam146 <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>
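
The BatchDescriptor changes listed in the commit message above are roughly of this shape; this is a hypothetical sketch, and the class and field names below are illustrative rather than the actual vLLM definition:

from typing import NamedTuple

class BatchDescriptorSketch(NamedTuple):
    """Dispatch key for a captured CUDA graph (illustrative only)."""
    num_tokens: int        # padded token count the graph was captured for
    uniform: bool = False  # True when every request has the same query length

    @property
    def non_uniform(self) -> "BatchDescriptorSketch":
        # Fall back to the non-uniform (piecewise) graph key for the same size.
        return self._replace(uniform=False)

# An 8-token uniform decode batch dispatches on (num_tokens=8, uniform=True);
# dropping uniformity keeps the size but selects the other graph variant.
key = BatchDescriptorSketch(num_tokens=8, uniform=True)
assert key.non_uniform == BatchDescriptorSketch(num_tokens=8)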

Follow-up commits on this branch (each signed off by Lucas Wilkinson <[email protected]>):

- remove files
- cleanup
- cleanup
- review comment
- remove dead code
- cleanup
- fix doc error
- cleanup
- wip
- clean-up
- cleanup
- cleanup
- wip
- pad ubatches
- test fixes
- fix CPU backend
- fix typo
- Update vllm/v1/worker/gpu_model_runner.py (Co-authored-by: Luka Govedič <[email protected]>)
- format
- fix mamba
- cleanup
- format
- fix mypy
- test fix
- more test fixes
- test fix
- fix
- fix
- This reverts commit 993ca5a.

@LucasWilkinson LucasWilkinson force-pushed the lwilkinson/padding-refactor branch from b6707c4 to 89f0ca7 Compare November 26, 2025 04:35
@mgoin (Member) left a comment

Thanks for the clarification. Looks good to me!

@mgoin mgoin merged commit 56539cd into vllm-project:main Nov 26, 2025
53 checks passed
@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Nov 26, 2025
@BoyuanFeng (Contributor) commented:

It seems this PR adds some runtime overhead: #29760

penfever pushed a commit to mlfoundations/vllm that referenced this pull request Dec 1, 2025
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Dec 2, 2025
1. fix vllm-project/vllm#28542
   The model structure modifications involved are:
     - Qwen2.5-VL (some patches still remain)
     - Qwen2-VL
     - Qwen2
     - DeepSeek series
     - Qwen-moe series
2. fix vllm-project/vllm#29121
   the output token type has changed from a NumPy array to `list[list[int]]`

3. fix vllm-project/vllm#29262
   the `xformers` backend for multimodal has been deprecated
4. fix vllm-project/vllm#29342

5. fix vllm-project/vllm#28579
6. fix vllm-project/vllm#28718
7. fix vllm-project/vllm#28665
8. fix vllm-project/vllm#26847
   vLLM introduced `optimization-level`; some default config values have changed, and the `--enforce-eager` param has been deprecated
9. fix http://github.com/vllm-project/vllm/pull/29223: the sampler now returns a tuple
10. fix vllm-project/vllm#29471: we'll remove the related patch to avoid this kind of error.

Co-authored-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>


- vLLM version: v0.11.2

---------

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
        )
        ubatch_slices, num_tokens_across_dp = coordinate_batch_across_dp(
            num_tokens_unpadded=num_tokens_padded,

num_tokens_unpadded = num_tokens?
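
For context, a minimal sketch of why the unpadded count matters when coordinating a batch across data-parallel ranks; the helper below is an illustrative assumption, not the vLLM coordinate_batch_across_dp API:

# Each DP rank must launch the same token count, so ranks exchange their
# local unpadded counts and then pad to a common size. Feeding an already
# CUDA-graph-padded count into this step bakes local padding into the
# cross-rank decision. Names are illustrative only.
def pad_across_dp(num_tokens_unpadded_per_rank: list[int]) -> int:
    """Return the per-rank token count every rank should pad to."""
    return max(num_tokens_unpadded_per_rank)

# Ranks scheduled 13, 7, and 9 tokens -> every rank pads to 13.
assert pad_across_dp([13, 7, 9]) == 13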

ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025