[BugFix] Fix DBO assert assert B_block_table == B_q
#29933
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Diff excerpt (`coordinate_batch_across_dp` in `dp_utils.py`):

```diff
     )

-    # Don't microbatch unless every other DP worker is also microbatching
-    if not should_ubatch:
-        return (None, num_tokens_after_padding)
-
-    # This doesn't actually pad the ubatch slices. It just initializes the
-    # split point to the padded value so that padding can be applied
-    # to the second ubatch in pad_out_ubatch_slice after attention
-    # metadata creation
-    assert num_tokens_after_padding is not None
-    num_tokens_padded = int(num_tokens_after_padding[0].item())
-    token_split_point = int(num_tokens_padded) // 2
-
-    assert num_scheduled_tokens_per_request is not None
-    ubatch_slices = create_ubatch_slices(
-        num_scheduled_tokens_per_request, token_split_point
-    )
-    ubatch_slices = _pad_out_ubatch_slice(ubatch_slices, num_tokens_padded)
-    assert sum(s.num_tokens for s in ubatch_slices) == num_tokens_padded
-
-    return (ubatch_slices, num_tokens_after_padding)
+    return (should_ubatch, num_tokens_after_padding)
```
Update DP coordination call sites for bool return
The new coordinate_batch_across_dp now returns a (bool, Tensor) pair (return (should_ubatch, num_tokens_after_padding)), but vllm/v1/spec_decode/eagle.py::_pad_batch_across_dp still expects the first value to be None when microbatching is disabled and asserts ubatch_slices is None. With data-parallel EAGLE runs this now receives False instead, triggering the assertion before any work is done. The call sites at eagle.py:1261/1270 need to be adjusted to the new return type to avoid failing every DP EAGLE execution.
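A minimal sketch of the kind of call-site adjustment this comment suggests. The stub below is a hypothetical stand-in for the real `coordinate_batch_across_dp` and the surrounding `eagle.py` code, not the actual fix; only the `(bool, Tensor)` return shape is taken from the comment above:

```python
import torch

def coordinate_batch_across_dp_stub(num_tokens: int) -> tuple[bool, torch.Tensor]:
    # Stand-in for the real coordinate_batch_across_dp, which per this
    # review now returns (should_ubatch, num_tokens_after_padding).
    return False, torch.tensor([num_tokens])

def pad_batch_across_dp(num_tokens: int) -> torch.Tensor:
    should_ubatch, num_tokens_after_padding = coordinate_batch_across_dp_stub(num_tokens)
    # Old call sites asserted the first element was None when microbatching
    # was disabled; with the (bool, Tensor) return, branch on the flag instead.
    if not should_ubatch:
        return num_tokens_after_padding
    # ... microbatched path elided ...
    return num_tokens_after_padding
```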
Code Review
This pull request addresses a bug causing an assertion failure by ensuring that padded microbatch slices are used for metadata construction only when the cudagraph_mode is FULL. The changes are well-implemented and correctly resolve the issue.
The key modifications include:
- Refactoring the microbatch slice creation logic from `dp_utils.py` to a new `maybe_create_ubatch_slices` function in `ubatch_utils.py`, which now provides both padded and unpadded slices.
- Updating `gpu_model_runner.py` to conditionally use the correct slices for attention metadata based on the `cudagraph_mode`.
- Simplifying `coordinate_batch_across_dp` to only decide whether to microbatch, not how to slice.
- Fixing a subtle bug where a padded token count was incorrectly passed as an unpadded count.
The code is now more robust, and the refactoring improves the separation of concerns. The fix is sound, and I have no further recommendations.
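To make the shape of that refactor concrete, here is a minimal sketch. The `UbatchSlice` dataclass and the exact signature are simplified assumptions for illustration, not the real `maybe_create_ubatch_slices` API; only the "returns both padded and unpadded slices" convention comes from the review above:

```python
from dataclasses import dataclass

@dataclass
class UbatchSlice:
    # Simplified stand-in for vLLM's ubatch slice: a half-open token range.
    token_start: int
    token_end: int

    @property
    def num_tokens(self) -> int:
        return self.token_end - self.token_start

def maybe_create_ubatch_slices_sketch(
    num_total_tokens: int, num_tokens_padded: int
) -> tuple[list[UbatchSlice], list[UbatchSlice]]:
    # Split at half the padded token count (as the removed dp_utils.py code
    # did), then build both variants: unpadded slices covering only the real
    # tokens, and padded slices whose second ubatch extends to the padded
    # total, for use with FULL-cudagraph attention metadata.
    split = num_tokens_padded // 2
    unpadded = [UbatchSlice(0, split), UbatchSlice(split, num_total_tokens)]
    padded = [UbatchSlice(0, split), UbatchSlice(split, num_tokens_padded)]
    return padded, unpadded
```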
Code Review
This pull request addresses a bug related to Dual Batch Overlap (DBO) that caused an assertion failure (assert B_block_table == B_q) with certain cudagraph modes. The root cause was incorrect handling of padded microbatch slices, particularly for PIECEWISE and NONE cudagraphs where only token slices were being padded, leading to shape mismatches in attention metadata.
The fix is comprehensive and well-structured:
- The logic for creating and padding microbatch slices has been refactored and centralized into a new `maybe_create_ubatch_slices` function in `ubatch_utils.py`. This function now correctly generates both padded and unpadded versions of the slices.
- The padding logic itself has been corrected to pad both request and token slices, resolving the shape mismatch that caused the assertion.
- Crucially, the `gpu_model_runner` now conditionally uses the padded slices for attention metadata construction only when using `FULL` cudagraphs, which is the correct behavior.
- A related bug was also fixed where sequence-parallelism-padded token counts were incorrectly used to determine microbatching thresholds.
Overall, the changes are correct, improve code clarity through refactoring, and directly address the reported issue. The test plan confirms the fix. The code quality is high.
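The selection step the review describes could look roughly like the sketch below. The `CUDAGraphMode` enum and function here are simplified stand-ins, not the actual `gpu_model_runner.py` code; only the FULL-vs-PIECEWISE/NONE behavior is taken from the reviews:

```python
from enum import Enum

class CUDAGraphMode(Enum):
    # Simplified stand-in for vLLM's cudagraph mode setting.
    NONE = 0
    PIECEWISE = 1
    FULL = 2

def slices_for_attn_metadata(mode, padded_slices, unpadded_slices):
    # Only FULL cudagraphs replay fixed captured shapes, so only they need
    # the padded slices when building attention metadata; PIECEWISE/NONE
    # use the unpadded slices so request and token counts stay consistent.
    return padded_slices if mode is CUDAGraphMode.FULL else unpadded_slices
```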
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucas Wilkinson <[email protected]>
Force-pushed from a8963a1 to 9f1b5e1
Since #28579, we could end up building attention metadata with padded ubatch slices for PIECEWISE/NONE cudagraphs. Because CUTLASS MLA assumes num_decodes_reqs == num_decode_tokens, this could cause issues since only the tokens were padded. This PR fixes it by ensuring we only use padded ubatch slices for metadata construction when using FULL cudagraphs.
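A numeric illustration of the mismatch; the counts below are made up for exposition, not taken from the PR:

```python
# Illustrative numbers only: why padding tokens but not requests violates
# the CUTLASS MLA decode assumption of one token per decode request,
# tripping `assert B_block_table == B_q`.
num_decode_reqs = 6     # decode requests in the ubatch -> B_block_table
num_decode_tokens = 6   # one decode token each         -> B_q
token_padding = 2       # extra tokens added to reach the cudagraph size

B_q = num_decode_tokens + token_padding  # 8: the token slice was padded
B_block_table = num_decode_reqs          # 6: the request slice was not
print(B_block_table == B_q)              # False -> the failing assert

# With the fix, PIECEWISE/NONE metadata is built from unpadded slices,
# so both counts stay at 6 and the assertion holds.
```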
Failing CI: https://buildkite.com/organizations/vllm/pipelines/ci/builds/41668/jobs/019ae1ba-2178-4a96-adba-97fb4a46fe4f/log
Test Plan:
`python -m pytest tests/v1/distributed/test_dbo.py` was failing on B200 machines (it was passing on H100s due to the backends not asserting on this). Passes now.