
Conversation

@Josephasafg
Contributor

@Josephasafg Josephasafg commented Nov 25, 2025

Purpose

There was a bug in Mamba1 when CUDA graphs were enabled: if max_num_seqs was set to a number that is not in
cudagraph_capture_sizes (e.g. 50), we'd get this error:

(EngineCore_DP0 pid=80423) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 50 (input tensor's size at dimension 0), but got split_sizes=[56, 0]

The issue was that the padded_decodes variable was used for the split instead of the actual number of decodes plus the number of prefills.
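
To illustrate the mismatch, here is a minimal sketch with made-up sizes (50 real decode requests padded to 56 for CUDA graph capture, zero prefills); it only demonstrates the split behaviour and is not the actual vLLM code path:

import torch

# Minimal sketch: 50 real decode requests, padded to 56 for CUDA graph
# capture, zero prefill requests. Names and sizes are illustrative only.
num_decodes, padded_decodes, num_prefills = 50, 56, 0

# A tensor sized for the real requests, like the tensor in the error above.
state_indices = torch.arange(num_decodes)

try:
    # Splitting by the padded decode count fails: [56, 0] does not sum to 50.
    state_indices.split_with_sizes([padded_decodes, num_prefills])
except RuntimeError as e:
    print(e)

# Splitting by the actual decode count matches the tensor's shape and succeeds.
decode_part, prefill_part = state_indices.split_with_sizes([num_decodes, num_prefills])
print(decode_part.shape, prefill_part.shape)  # torch.Size([50]) torch.Size([0])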

Test Plan

This script will raise the error

from vllm import LLM, SamplingParams

llm = LLM(model="ai21labs/AI21-Jamba-Reasoning-3B", enable_prefix_caching=False, max_num_seqs=50)
params = SamplingParams(max_tokens=16)

requests_batch = [
    [{"role": "user", "content": f"Hello #{i}"}]
    for i in range(60)
]

outputs = llm.chat(requests_batch, sampling_params=params)

This script finishes with no errors; max_num_seqs is changed to 56, which is one of the sizes in cudagraph_capture_sizes:

from vllm import LLM, SamplingParams

llm = LLM(model="ai21labs/AI21-Jamba-Reasoning-3B", enable_prefix_caching=False, max_num_seqs=56)
params = SamplingParams(max_tokens=16)

requests_batch = [
    [{"role": "user", "content": f"Hello #{i}"}]
    for i in range(60)
]

outputs = llm.chat(requests_batch, sampling_params=params)

Test Result

The results are described in the Test Plan above.


@Josephasafg
Contributor Author

@tdoublep This would have been solved if we had had the consolidation, but I didn't want to mix the PRs. Once we have this merged, I'll open the consolidation PR

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly fixes a RuntimeError in Mamba1 with CUDA graphs by using the actual number of decodes instead of a padded value for tensor splitting. The changes are consistent and address the described bug. However, I've identified a latent bug in vllm/model_executor/layers/mamba/mamba_mixer.py where the code for splitting tensors relies on the assumption that the number of decode tokens equals the number of decode requests. This is currently true for Mamba1 but is a fragile assumption that could break with future changes like speculative decoding. I've provided suggestions to make the code more robust by explicitly handling token and request counts separately.

@Josephasafg Josephasafg changed the title [Bug][Mamba1] - Fix RuntimeError due to wrong split in CUDAGraphs [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs Nov 25, 2025
@Josephasafg Josephasafg changed the title [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs for Mamba1 Nov 25, 2025
@Josephasafg Josephasafg changed the title [Bug]: Fix RuntimeError due to wrong split in CUDAGraphs for Mamba1 [Bugfix]: Fix RuntimeError due to wrong split in CUDAGraphs for Mamba1 Nov 25, 2025
@LucasWilkinson
Collaborator

Can you please add a "Test Plan" and "Test Result" (ideally with a command to repro the bug)?

@Josephasafg
Contributor Author

@LucasWilkinson Updated Test Plan and Test Results with reproducing steps

Collaborator

@LucasWilkinson LucasWilkinson left a comment


LGTM; @tdoublep can you please take a look

@github-project-automation github-project-automation bot moved this to In review in NVIDIA Dec 1, 2025
Member

@tdoublep tdoublep left a comment


The change looks OK (consistent with Mamba2), but I'm still not sure I'm 100% clear on why this fixes it. In your example, wouldn't the 50 sequences get padded up to batch size 56 for decode?

@tdoublep
Member

tdoublep commented Dec 1, 2025

Running your script, it looks like the issue is that the tensors we are trying to split don't have consistent shapes:

(EngineCore_DP0 pid=1094920) num_decodes:  50
(EngineCore_DP0 pid=1094920) num_padded_decodes:  56
(EngineCore_DP0 pid=1094920) hidden_states_BC.shape:  torch.Size([5120, 56])
(EngineCore_DP0 pid=1094920) gate.shape:  torch.Size([5120, 56])
(EngineCore_DP0 pid=1094920) state_indices_tensor.shape:  torch.Size([50])

Comment on lines 126 to 131
padded_decodes = self.vllm_config.pad_for_cudagraph(num_decodes)
self.state_indices_tensor[:num_decodes].copy_(
    state_indices_tensor, non_blocking=True
)
state_indices_tensor = self.state_indices_tensor[:padded_decodes]
state_indices_tensor[num_decodes:] = PAD_SLOT_ID
Member


I'm confused by this code in particular. It looks like we actively try to make the state_indices_tensor have the size num_padded_decodes (56 in your example), but it actually has size 50 because the underlying common_attn_metadata.block_table_tensor[:, 0] has size 50.
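
For context, here is a minimal sketch of the padding pattern that code seems to be aiming for; the buffer names, sizes, and pad value are illustrative rather than the actual vLLM internals:

import torch

PAD_SLOT_ID = -1  # illustrative pad value, not necessarily vLLM's constant
num_decodes, padded_decodes = 50, 56

# Persistent buffer sized for the largest captured batch size.
persistent_state_indices = torch.empty(padded_decodes, dtype=torch.long)

# The incoming per-request indices only cover the 50 real requests.
incoming = torch.arange(num_decodes)

# Copy the real indices in, take a view sized to the padded batch,
# and mark the padded tail so the captured graph reads harmless slots.
persistent_state_indices[:num_decodes].copy_(incoming)
padded_view = persistent_state_indices[:padded_decodes]
padded_view[num_decodes:] = PAD_SLOT_ID

# Downstream code must then consistently use either the padded view (56)
# or the unpadded tensor (50); mixing the two produces the split error above.
print(padded_view.shape, incoming.shape)  # torch.Size([56]) torch.Size([50])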

@Josephasafg
Contributor Author

It looks like this PR changed the behavior, and it seems that the state_indices_tensor shape was affected by this line:
blk_table.get_device_tensor(num_reqs_padded)

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Actually, preference for the existing PR as it fixes more backends:
#29352

@Josephasafg
Contributor Author

> Actually, preference for the existing PR as it fixes more backends:
>
> #29352

@LucasWilkinson That's great. Do you think it will be merged before the new release? We need this fix for Jamba as it causes runtime crashes.

If not I'd be happy to get it merged before.

@LucasWilkinson
Collaborator

>> Actually, preference for the existing PR as it fixes more backends:
>> #29352
>
> @LucasWilkinson That's great. Do you think it will be merged before the new release? We need this fix for Jamba as it causes runtime crashes.
>
> If not I'd be happy to get it merged before.

Yes, it will be merged once the CI is online again (which this PR would have to wait for anyways 👍)

@LucasWilkinson
Collaborator

closing in favor of: #29352

@github-project-automation github-project-automation bot moved this from In review to Done in NVIDIA Dec 2, 2025