
Conversation

@Josephasafg
Contributor

@Josephasafg Josephasafg commented Nov 25, 2025

Purpose

There was a bug in Mamba1 when CUDA graphs were enabled: if max_num_seqs was set to a number that is not in
cudagraph_capture_sizes (e.g. 50), we'd get this error:

(EngineCore_DP0 pid=80423) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 50 (input tensor's size at dimension 0), but got split_sizes=[56, 0]

The issue was that the padded_decodes variable was used for the split instead of the actual number of decodes plus the number of prefills.
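
To illustrate the mismatch, here is a minimal sketch with made-up sizes (50 real decode requests padded to 56 for CUDA graph capture, zero prefills); it only demonstrates the split behaviour and is not the actual vLLM code path:

import torch

# Minimal sketch: 50 real decode requests, padded to 56 for CUDA graph
# capture, zero prefill requests. Names and sizes are illustrative only.
num_decodes, padded_decodes, num_prefills = 50, 56, 0

# A tensor sized for the real requests, like the tensor in the error above.
state_indices = torch.arange(num_decodes)

try:
    # Splitting by the padded decode count fails: [56, 0] does not sum to 50.
    state_indices.split_with_sizes([padded_decodes, num_prefills])
except RuntimeError as e:
    print(e)

# Splitting by the actual decode count matches the tensor's shape and succeeds.
decode_part, prefill_part = state_indices.split_with_sizes([num_decodes, num_prefills])
print(decode_part.shape, prefill_part.shape)  # torch.Size([50]) torch.Size([0])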

Test Plan

This script will raise the error

from vllm import LLM, SamplingParams

llm = LLM(model="ai21labs/AI21-Jamba-Reasoning-3B", enable_prefix_caching=False, max_num_seqs=50)
params = SamplingParams(max_tokens=16)

requests_batch = [
    [{"role": "user", "content": f"Hello #{i}"}]
    for i in range(60)
]

outputs = llm.chat(requests_batch, sampling_params=params)

This script finishes with no errors; max_num_seqs is changed to 56, which is one of the sizes in cudagraph_capture_sizes:

from vllm import LLM, SamplingParams

llm = LLM(model="ai21labs/AI21-Jamba-Reasoning-3B", enable_prefix_caching=False, max_num_seqs=56)
params = SamplingParams(max_tokens=16)

requests_batch = [
    [{"role": "user", "content": f"Hello #{i}"}]
    for i in range(60)
]

outputs = llm.chat(requests_batch, sampling_params=params)

Test Result

The results are described in the Test Plan above.


@Josephasafg
Contributor Author

@tdoublep This would have been solved if we had had the consolidation, but I didn't want to mix the PRs. Once we have this merged, I'll open the consolidation PR

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a RuntimeError in Mamba1 with CUDA graphs by using the actual number of decodes instead of a padded value for tensor splitting. The changes are consistent and address the described bug. However, I've identified a latent bug in vllm/model_executor/layers/mamba/mamba_mixer.py where the code for splitting tensors relies on the assumption that the number of decode tokens equals the number of decode requests. This is currently true for Mamba1 but is a fragile assumption that could break with future changes like speculative decoding. I've provided suggestions to make the code more robust by explicitly handling token and request counts separately.

@Josephasafg Josephasafg changed the title [Bug][Mamba1] - Fix RuntimeError due to wrong split in CUDAGraphs [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs Nov 25, 2025
@Josephasafg Josephasafg changed the title [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs for Mamba1 Nov 25, 2025
@Josephasafg Josephasafg changed the title [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs for Mamba1 [Bugfix]: Fix RuntimeError due to wrong split in CUDAGraphs for Mamba1 Nov 25, 2025
@LucasWilkinson
Collaborator

Can you please add a "Test Plan" and "Test Result" (ideally with a command to repro the bug)?

@Josephasafg
Contributor Author

@LucasWilkinson Updated Test Plan and Test Results with reproducing steps

Collaborator

@LucasWilkinson LucasWilkinson left a comment


LGTM; @tdoublep can you please take a look

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 1, 2025
Member

@tdoublep tdoublep left a comment


The change looks OK (consistent with Mamba2), but I'm still not sure I'm 100% clear on why this fixes it. In your example, wouldn't the 50 sequences get padded up to batch size 56 for decode?

@tdoublep
Member

tdoublep commented Dec 1, 2025

Running your script, it looks like the issue is that the tensors we are trying to split don't have consistent shapes:

(EngineCore_DP0 pid=1094920) num_decodes:  50
(EngineCore_DP0 pid=1094920) num_padded_decodes:  56
(EngineCore_DP0 pid=1094920) hidden_states_BC.shape:  torch.Size([5120, 56])
(EngineCore_DP0 pid=1094920) gate.shape:  torch.Size([5120, 56])
(EngineCore_DP0 pid=1094920) state_indices_tensor.shape:  torch.Size([50])

Comment on lines 126 to 131
padded_decodes = self.vllm_config.pad_for_cudagraph(num_decodes)
self.state_indices_tensor[:num_decodes].copy_(
    state_indices_tensor, non_blocking=True
)
state_indices_tensor = self.state_indices_tensor[:padded_decodes]
state_indices_tensor[num_decodes:] = PAD_SLOT_ID
Member


I'm confused by this code in particular. It looks like we actively try to make the state_indices_tensor have the size num_padded_decodes (56 in your example), but it actually has size 50 because the underlying common_attn_metadata.block_table_tensor[:, 0] has size 50.
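
For context, here is a minimal sketch of the padding pattern that code seems to be aiming for; the buffer names, sizes, and pad value are illustrative rather than the actual vLLM internals:

import torch

PAD_SLOT_ID = -1  # illustrative pad value, not necessarily vLLM's constant
num_decodes, padded_decodes = 50, 56

# Persistent buffer sized for the largest captured batch size.
persistent_state_indices = torch.empty(padded_decodes, dtype=torch.long)

# The incoming per-request indices only cover the 50 real requests.
incoming = torch.arange(num_decodes)

# Copy the real indices in, take a view sized to the padded batch,
# and mark the padded tail so the captured graph reads harmless slots.
persistent_state_indices[:num_decodes].copy_(incoming)
padded_view = persistent_state_indices[:padded_decodes]
padded_view[num_decodes:] = PAD_SLOT_ID

# Downstream code must then consistently use either the padded view (56)
# or the unpadded tensor (50); mixing the two produces the split error above.
print(padded_view.shape, incoming.shape)  # torch.Size([56]) torch.Size([50])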

@Josephasafg
Contributor Author

It looks like this PR changed the behavior, and it seems that the state_indices_tensor shape was affected by this line:
blk_table.get_device_tensor(num_reqs_padded)

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Actually, preference for the existing PR as it fixes more backends:
#29352

@Josephasafg
Contributor Author

> Actually, preference for the existing PR as it fixes more backends:
>
> #29352

@LucasWilkinson That's great. Do you think it will be merged before the new release? We need this fix for Jamba as it causes runtime crashes.

If not I'd be happy to get it merged before.

@LucasWilkinson
Collaborator

>> Actually, preference for the existing PR as it fixes more backends:
>> #29352
>
> @LucasWilkinson That's great. Do you think it will be merged before the new release? We need this fix for Jamba as it causes runtime crashes.
>
> If not I'd be happy to get it merged before.

Yes, it will be merged once the CI is online again (which this PR would have to wait for anyways 👍)

@LucasWilkinson
Collaborator

closing in favor of: #29352

@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 2, 2025