[Core] Refactor padding logic and pad for CUDA graphs before attention metadata building #28579
Conversation
Documentation preview: https://vllm--28579.org.readthedocs.build/en/28579/
Code Review
This pull request refactors the CUDA graph padding logic, moving it from individual attention backends into the gpu_model_runner. This centralization is a good improvement for maintainability. The BatchDescriptor has also been updated to be more descriptive. While the overall direction is positive, I've identified two critical bugs in the implementation within gpu_model_runner.py that could lead to incorrect behavior or prevent CUDA graph optimizations from being applied. Please see the detailed comments for fixes.
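For orientation, here is a minimal sketch of what such a dispatch-key descriptor could look like; the field names (`num_tokens`, `uniform`) follow the commit notes later in this thread and are assumptions, not the exact vLLM definition:

```python
from typing import NamedTuple


class BatchDescriptor(NamedTuple):
    """Illustrative stand-in for the descriptor used as a CUDA graph dispatch key."""

    # Padded token count for the batch; it must match one of the captured
    # CUDA graph sizes exactly for a graph replay to be possible.
    num_tokens: int
    # True when every request contributes the same query length (pure decode),
    # which is what allows dispatching to a FULL cudagraph.
    uniform: bool = False


# Example: a padded uniform-decode batch of 32 tokens.
desc = BatchDescriptor(num_tokens=32, uniform=True)
print(desc)  # BatchDescriptor(num_tokens=32, uniform=True)
```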
vllm/v1/worker/gpu_model_runner.py (Outdated)

    )
    uniform_decode = (
        (max_num_scheduled_tokens == self.uniform_decode_query_len)
        and (num_reqs == max_num_scheduled_tokens)
The condition for uniform_decode seems incorrect. num_reqs == max_num_scheduled_tokens will only be true in very specific cases (e.g., a single decode request when uniform_decode_query_len is 1), preventing most uniform decode batches from being correctly identified. This will likely disable CUDA graph optimizations for decode paths.
The condition should probably check if the total number of tokens is equal to the number of requests multiplied by the query length, similar to the previous implementation.
Suggested change:

    -    and (num_reqs == max_num_scheduled_tokens)
    +    and (num_tokens_unpadded == num_reqs * max_num_scheduled_tokens)
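To make the suggested condition concrete, here is a small hedged sketch (the helper `is_uniform_decode` is hypothetical; it simply mirrors the check proposed above):

```python
def is_uniform_decode(
    num_tokens_unpadded: int,
    num_reqs: int,
    max_num_scheduled_tokens: int,
    uniform_decode_query_len: int,
) -> bool:
    """Mirror of the suggested check: no request exceeds the uniform decode
    query length, and every request schedules exactly that many tokens."""
    return (
        max_num_scheduled_tokens == uniform_decode_query_len
        and num_tokens_unpadded == num_reqs * max_num_scheduled_tokens
    )


# Example: 8 decode requests scheduling 1 token each with query length 1.
# The original `num_reqs == max_num_scheduled_tokens` check (8 == 1) fails,
# while the token-count check (8 == 8 * 1) correctly identifies the batch.
assert is_uniform_decode(
    num_tokens_unpadded=8, num_reqs=8,
    max_num_scheduled_tokens=1, uniform_decode_query_len=1,
)
```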
vllm/v1/worker/gpu_model_runner.py (Outdated)

    attn_metadata, spec_decode_common_attn_metadata = (
        self._build_attention_metadata(
    -       total_num_scheduled_tokens=total_num_scheduled_tokens,
    +       total_num_scheduled_tokens=num_reqs_padded,
The total_num_scheduled_tokens argument for _build_attention_metadata is being passed num_reqs_padded, which is the number of requests. It should be num_tokens_padded, the total number of tokens. This will likely lead to incorrect attention metadata and could cause errors or incorrect model outputs.
Suggested change:

    -    total_num_scheduled_tokens=num_reqs_padded,
    +    total_num_scheduled_tokens=num_tokens_padded,
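A short sketch of why the two padded quantities differ, assuming an illustrative speculative-decoding setup (the numbers and the relationship via `uniform_decode_query_len` are assumptions for this example):

```python
# With speculative decoding, each request can contribute several query tokens,
# so the padded request count and the padded token count are different values.
num_reqs_padded = 16          # request count rounded up to a captured graph size
uniform_decode_query_len = 2  # e.g. 1 verified token + 1 draft token per request
num_tokens_padded = num_reqs_padded * uniform_decode_query_len  # 32

# Attention metadata describes the token dimension of the batch, so it should
# be built from num_tokens_padded (32), not num_reqs_padded (16).
assert num_tokens_padded != num_reqs_padded
```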
💡 Codex Review
Here are some automated review suggestions for this pull request.
SageMoore left a comment:
Looks like a good cleanup @LucasWilkinson. Thanks for the contribution.
Force-pushed from 8d3975f to dd6ad9e.
ProExpertProg left a comment:
LGTM overall, just a few questions above.
Force-pushed from bb224a0 to 22ab0f9.
SageMoore left a comment:
This all looks good to me, Lucas.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from a7a04ba to 2b58a28.
Force-pushed from 38cac6d to bf3731f.
- Move padding calculation before CUDA graph dispatch
- Update dispatch() to take uniform_decode directly instead of computing it
- Remove max_num_scheduled_tokens parameter from dispatch()
- Update BatchDescriptor to use 'uniform' field consistently
- Fix _prepare_inputs to handle new padding flow
- Update attention backends to work with new padding approach
- Add documentation for BatchDescriptor fields

Co-authored-by: ayushsatyam146 <[email protected]>
Signed-off-by: Lucas Wilkinson <[email protected]>

Follow-up commits (all signed off by Lucas Wilkinson): remove files, cleanup (several), review comment, remove dead code, fix doc error, wip, clean-up, pad ubatches, test fixes, fix CPU backend, fix typo, Update vllm/v1/worker/gpu_model_runner.py (Co-authored-by: Luka Govedič <[email protected]>), format, fix mamba, fix mypy, test fix, more test fixes, fix.
This reverts commit 993ca5a.
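As a rough illustration of the interface change described in the commit notes above (the `dispatch()` signature and the `CUDAGraphMode` values here are hypothetical stand-ins, not vLLM's actual API):

```python
from enum import Enum


class CUDAGraphMode(Enum):
    FULL = "full"            # uniform decode batches can replay a full graph
    PIECEWISE = "piecewise"  # mixed prefill/decode batches use piecewise graphs


def dispatch(num_tokens_padded: int, uniform_decode: bool) -> CUDAGraphMode:
    """Hypothetical dispatcher: the runner pads the batch and decides
    uniformity first, so dispatch() no longer needs max_num_scheduled_tokens."""
    return CUDAGraphMode.FULL if uniform_decode else CUDAGraphMode.PIECEWISE


# The model runner computes padding and the uniform flag up front:
print(dispatch(num_tokens_padded=32, uniform_decode=True))   # CUDAGraphMode.FULL
print(dispatch(num_tokens_padded=96, uniform_decode=False))  # CUDAGraphMode.PIECEWISE
```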
Force-pushed from b6707c4 to 89f0ca7.
mgoin left a comment:
Thanks for the clarification. Looks good to me!
It seems this PR adds some runtime overhead: #29760
1. Fix vllm-project/vllm#28542. The model structure modifications we are involved in: Qwen2.5-VL (some patches still exist), Qwen2-VL, Qwen2, the DeepSeek series, and the Qwen-moe series.
2. Fix vllm-project/vllm#29121: the output token type changed from np to `list[list[int]]`.
3. Fix vllm-project/vllm#29262: the `xformers` backend for multimodal has been deprecated.
4. Fix vllm-project/vllm#29342.
5. Fix vllm-project/vllm#28579.
6. Fix vllm-project/vllm#28718.
7. Fix vllm-project/vllm#28665.
8. Fix vllm-project/vllm#26847: vLLM introduced the `optimization-level`, some default config values changed, and the `--enforce-eager` param has been deprecated.
9. Fix http://github.com/vllm-project/vllm/pull/29223: it returns a tuple for the sampler.
10. Fix vllm-project/vllm#29471: we'll remove the related patch to avoid this kind of error.

vLLM version: v0.11.2

Signed-off-by: wangxiyuan <[email protected]>
Signed-off-by: wangli <[email protected]>
Signed-off-by: hfadzxy <[email protected]>
Co-authored-by: wangli <[email protected]>
Co-authored-by: hfadzxy <[email protected]>
    )

    ubatch_slices, num_tokens_across_dp = coordinate_batch_across_dp(
        num_tokens_unpadded=num_tokens_padded,
Should this be `num_tokens_unpadded=num_tokens`?
FIX #23789
The goal of this PR is to:
- update to the latest FA3 ("FA3 variable length attention sort/swizzle", flash-attention#82)
- remove hacks like "vLLM: Easier cudagraph integration" (FlashMLA#3)
- remove `pad_for_cudagraphs` from attention backends; this is done for FlashInfer but will be done for `GDNAttentionBackend`, `Mamba1AttentionBackend`, `Mamba2AttentionMetadata`, and `ShortConvAttentionBackend` in future PRs

`pad_for_cudagraphs` was called multiple times inside `execute_model` before the forward pass, making it challenging to reason about the padding order. This PR starts to make the padding order in gpu_model_runner clearer, but more work still needs to be done (a simplified sketch of the intended ordering follows the list below).

Future related work that will be based off this PR:

- remove `pad_for_cudagraphs` from the remaining attention backends (see 1)
- remove `pad_for_cudagraphs` from config, transferring ownership to the CUDAGraphDispatcher; this will make it easier to have separate cudagraph sizes for FULL and PIECEWISE, which is important for a more robust and long-term solution to "[Bug]: CUDA Graph Capture Issue: Unexpected Prefill Branches in Uniform Decode Graphs when MTP=2" (#28207)
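For illustration, a minimal sketch of the reordering this PR moves toward: pad once, up front, then build attention metadata for the already-padded batch. The helper name and the plain list of captured graph sizes are assumptions, not vLLM's actual API:

```python
def pad_then_build_metadata(num_tokens: int, cudagraph_sizes: list[int]) -> dict:
    """Hypothetical flow: CUDA graph padding happens before metadata building."""
    # 1. Round the token count up to the nearest captured CUDA graph size;
    #    fall back to the unpadded count (eager execution) if nothing fits.
    num_tokens_padded = next(
        (size for size in sorted(cudagraph_sizes) if size >= num_tokens),
        num_tokens,
    )
    # 2. Build attention metadata directly for the padded batch, so individual
    #    backends no longer need their own pad_for_cudagraphs step.
    return {"total_num_scheduled_tokens": num_tokens_padded}


print(pad_then_build_metadata(num_tokens=13, cudagraph_sizes=[8, 16, 32]))
# -> {'total_num_scheduled_tokens': 16}
```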
Shout-out to @ayushsatyam146 for the preliminary work in #24002
Co-authored-by: ayushsatyam146 [email protected]