[Bugfix]: Fix RuntimeError due to wrong split in CUDAGraphs for Mamba1 #29404
Conversation
Signed-off-by: asafg <[email protected]>
Code Review
This pull request correctly fixes a RuntimeError in Mamba1 with CUDA graphs by using the actual number of decodes instead of a padded value for tensor splitting. The changes are consistent and address the described bug. However, I've identified a latent bug in vllm/model_executor/layers/mamba/mamba_mixer.py where the code for splitting tensors relies on the assumption that the number of decode tokens equals the number of decode requests. This is currently true for Mamba1 but is a fragile assumption that could break with future changes like speculative decoding. I've provided suggestions to make the code more robust by explicitly handling token and request counts separately.
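A minimal sketch of the separation the review is suggesting (the function and parameter names are illustrative, not the actual mamba_mixer.py API; it assumes decode tokens are laid out before prefill tokens, as the [56, 0] split order in the error suggests):

import torch

def split_batch(hidden_states: torch.Tensor,
                num_decode_tokens: int,
                num_prefill_tokens: int) -> tuple[torch.Tensor, torch.Tensor]:
    # Split a flattened token batch into its decode and prefill parts.
    # Splitting by token counts (rather than request counts) stays correct even
    # if a future feature such as speculative decoding makes a decode request
    # produce more than one token, i.e. num_decode_tokens != num_decode_requests.
    decode_hs, prefill_hs = torch.split(
        hidden_states, [num_decode_tokens, num_prefill_tokens], dim=0
    )
    return decode_hs, prefill_hs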
Can you please add a "Test Plan" and "Test Result" (ideally with a command to repro the bug)?
@LucasWilkinson Updated the Test Plan and Test Results with reproduction steps.
LucasWilkinson
left a comment
LGTM; @tdoublep can you please take a look
tdoublep
left a comment
The change looks OK (consistent with Mamba2), but I'm still not 100% clear on why this fixes it. In your example, wouldn't the 50 sequences get padded up to batch size 56 for decode?
Running your script, it looks like the issue is that the tensors we are trying to split don't have a consistent shape:
padded_decodes = self.vllm_config.pad_for_cudagraph(num_decodes)
self.state_indices_tensor[:num_decodes].copy_(
    state_indices_tensor, non_blocking=True
)
state_indices_tensor = self.state_indices_tensor[:padded_decodes]
state_indices_tensor[num_decodes:] = PAD_SLOT_ID
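# Assuming self.state_indices_tensor is the persistent buffer allocated for the
# largest CUDA graph capture size: the real indices are copied into its first
# num_decodes rows, the buffer is sliced to the padded batch size, and the
# remaining rows are filled with PAD_SLOT_ID so that graph replay only touches
# dummy slots.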
I'm confused by this code in particular. It looks like we actively try to make the state_indices_tensor have the size num_padded_decodes (56 in your example), but it actually has size 50 because the underlying common_attn_metadata.block_table_tensor[:, 0] has size 50.
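For reference, torch (like Python) silently truncates an out-of-range slice rather than padding it, so a [:padded_decodes] slice only reaches the padded size if the underlying tensor is already at least that long, e.g.:

import torch
buf = torch.arange(50)
print(buf[:56].shape)  # torch.Size([50]): slicing past the end truncates, it never pads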
It looks like this PR changed the behavior and it seems that
Actually, I have a preference for the existing PR, as it fixes more backends:
#29352
@LucasWilkinson That's great. Do you think it will be merged before the new release? We need this fix for Jamba, as the bug causes runtime crashes. If not, I'd be happy to get this one merged first.
Yes, it will be merged once the CI is online again (which this PR would have to wait for anyway 👍)
Closing in favor of #29352
Purpose
There was a bug with Mamba1 when CUDA graphs were enabled: if max_num_seqs was set to a number that is not in cudagraph_capture_sizes (e.g. 50), we'd get this error:

(EngineCore_DP0 pid=80423) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 50 (input tensor's size at dimension 0), but got split_sizes=[56, 0]

The issue was that the padded_decodes variable was used for the split instead of the actual number of decodes plus the number of prefills.

Test Plan
This script will raise the error
This script will finish with no errors. Changed max_num_seqs to 56, which is a batch size in cudagraph_capture_sizes.

Test Result
Results are written above
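To illustrate the failure mode described in the Purpose section, a minimal sketch with hypothetical tensor sizes (not the actual vLLM code): splitting a 50-row tensor with the padded decode count reproduces the RuntimeError, while splitting with the actual decode count succeeds.

import torch

num_decodes = 50       # actual decode requests in the batch
padded_decodes = 56    # nearest CUDA graph capture size
hidden_states = torch.randn(num_decodes, 8)  # the tensor really has 50 rows

# Before the fix: the split sizes summed to the padded size, not the real one.
try:
    torch.split(hidden_states, [padded_decodes, 0], dim=0)
except RuntimeError as e:
    print(e)  # split_with_sizes expects split_sizes to sum exactly to 50 ...

# After the fix: split by the actual number of decodes (and prefills).
decode_part, prefill_part = torch.split(hidden_states, [num_decodes, 0], dim=0)
assert decode_part.shape[0] == num_decodes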