
Conversation

@LucasWilkinson (Collaborator) commented Dec 3, 2025

Since #28579 we could end up building attention metadata with padded ubatch slices for PIECEWISE/NONE cudagraphs. Because CUTLASS MLA assumes num_decodes_reqs == num_decode_tokens, and only the tokens were padded, this could trigger backend assertions. This PR fixes it by making sure we only use padded ubatch slices for metadata construction when using FULL cudagraphs.
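
Roughly, the selection this change moves metadata construction toward looks like the sketch below (illustrative only; cudagraph_mode, unpadded_slices, and padded_slices are placeholder names, not the exact vLLM code paths):

```python
# Illustrative sketch of the fix described above (hypothetical names, not the
# actual vLLM code paths).

def slices_for_attn_metadata(cudagraph_mode, unpadded_slices, padded_slices):
    # CUTLASS MLA assumes one decode token per decode request, so padding only
    # the token dimension of a ubatch slice breaks that invariant. Only FULL
    # cudagraphs need the padded shapes baked into attention metadata;
    # PIECEWISE/NONE keep the unpadded slices.
    if cudagraph_mode == "FULL":
        return padded_slices
    return unpadded_slices
```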

Failing CI: https://buildkite.com/organizations/vllm/pipelines/ci/builds/41668/jobs/019ae1ba-2178-4a96-adba-97fb4a46fe4f/log

Test Plan:

python -m pytest tests/v1/distributed/test_dbo.py was failing on B200 machines (it was passing on H100s because the backends used there do not assert on this).

Passes now.

@LucasWilkinson LucasWilkinson added ready ONLY add when PR is ready to merge/full CI is needed ready-run-all-tests Trigger CI with all tests for wide-ranging PRs labels Dec 3, 2025
@mergify mergify bot added the v1 label Dec 3, 2025
@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 213 to +215
    )

    # Don't microbatch unless every other DP worker is also microbatching
    if not should_ubatch:
        return (None, num_tokens_after_padding)

    # This doesn't actually pad the ubatch slices. It just initializes the
    # split point to the padded value so that padding can be applied
    # to the second ubatch in pad_out_ubatch_slice after attention
    # metadata creation
    assert num_tokens_after_padding is not None
    num_tokens_padded = int(num_tokens_after_padding[0].item())
    token_split_point = int(num_tokens_padded) // 2

    assert num_scheduled_tokens_per_request is not None
    ubatch_slices = create_ubatch_slices(
        num_scheduled_tokens_per_request, token_split_point
    )
    ubatch_slices = _pad_out_ubatch_slice(ubatch_slices, num_tokens_padded)
    assert sum(s.num_tokens for s in ubatch_slices) == num_tokens_padded

-   return (ubatch_slices, num_tokens_after_padding)
+   return (should_ubatch, num_tokens_after_padding)


P1: Update DP coordination call sites for bool return

The new coordinate_batch_across_dp now returns a (bool, Tensor) pair (return (should_ubatch, num_tokens_after_padding)), but vllm/v1/spec_decode/eagle.py::_pad_batch_across_dp still expects the first value to be None when microbatching is disabled and asserts ubatch_slices is None. With data-parallel EAGLE runs this now receives False instead, triggering the assertion before any work is done. The call sites at eagle.py:1261/1270 need to be adjusted to the new return type to avoid failing every DP EAGLE execution.
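
A minimal sketch of the mismatch described above (simplified and hypothetical; not the actual eagle.py code):

```python
# Simplified illustration of the call-site mismatch (not the real vLLM code).

def coordinate_batch_across_dp_new(should_ubatch, num_tokens_after_padding):
    # New behavior: the first element is a bool, never Optional[UbatchSlices].
    return (should_ubatch, num_tokens_after_padding)

def old_style_caller(result):
    ubatch_slices, num_tokens_after_padding = result
    # Old expectation: the first element is None when microbatching is
    # disabled. With the new bool return this receives False, so the assert
    # fires before any work is done.
    assert ubatch_slices is None
```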


@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a bug causing an assertion failure by ensuring that padded microbatch slices are used for metadata construction only when the cudagraph_mode is FULL. The changes are well-implemented and correctly resolve the issue.

The key modifications include:

  • Refactoring the microbatch slice creation logic from dp_utils.py to a new maybe_create_ubatch_slices function in ubatch_utils.py, which now provides both padded and unpadded slices.
  • Updating gpu_model_runner.py to conditionally use the correct slices for attention metadata based on the cudagraph_mode.
  • Simplifying coordinate_batch_across_dp to only decide whether to microbatch, not how to slice.
  • Fixing a subtle bug where a padded token count was incorrectly passed as an unpadded count.

The code is now more robust, and the refactoring improves the separation of concerns. The fix is sound, and I have no further recommendations.

@gemini-code-assist (bot) left a comment

Code Review

This pull request addresses a bug related to Dual Batch Overlap (DBO) that caused an assertion failure (assert B_block_table == B_q) with certain cudagraph modes. The root cause was incorrect handling of padded microbatch slices, particularly for PIECEWISE and NONE cudagraphs where only token slices were being padded, leading to shape mismatches in attention metadata.

The fix is comprehensive and well-structured:

  1. The logic for creating and padding microbatch slices has been refactored and centralized into a new maybe_create_ubatch_slices function in ubatch_utils.py. This function now correctly generates both padded and unpadded versions of the slices.
  2. The padding logic itself has been corrected to pad both request and token slices, resolving the shape mismatch that caused the assertion (see the sketch after this review).
  3. Crucially, the gpu_model_runner now conditionally uses the padded slices for attention metadata construction only when using FULL cudagraphs, which is the correct behavior.
  4. A related bug was also fixed where sequence-parallelism-padded token counts were incorrectly used to determine microbatching thresholds.

Overall, the changes are correct, improve code clarity through refactoring, and directly address the reported issue. The test plan confirms the fix. The code quality is high.
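
For point 2 above, a rough sketch of padding both the request and the token slice of the second ubatch (field and helper names are assumptions, not the real ubatch_utils.py code):

```python
# Hypothetical sketch: extend the second ubatch's token range *and* its request
# range, so decode-only backends still see matching request and token counts.
from dataclasses import dataclass


@dataclass
class UbatchSlice:
    request_slice: slice
    token_slice: slice


def pad_out_second_ubatch(ubatch: UbatchSlice, num_total_reqs: int,
                          num_total_tokens: int) -> UbatchSlice:
    return UbatchSlice(
        request_slice=slice(ubatch.request_slice.start, num_total_reqs),
        token_slice=slice(ubatch.token_slice.start, num_total_tokens),
    )
```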

@mergify (bot) commented Dec 3, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LucasWilkinson.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Lucas Wilkinson <[email protected]>
@robertgshaw2-redhat merged commit c8ab988 into vllm-project:main Dec 4, 2025
136 checks passed
Labels: ready, ready-run-all-tests, speculative-decoding, v1
