Checklist
- 1. I have searched related issues but could not find the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the issue you submit lacks the corresponding environment info and a minimal reproducible demo, it will be difficult for us to reproduce and resolve it, which reduces the likelihood of receiving feedback.
- 4. If the issue you are raising is not a bug but a question, please open a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
- 5. Please use English; otherwise, the issue will be closed.
Describe the bug
[2025-10-17 17:24:21] INFO: 127.0.0.1:45546 - "POST /generate HTTP/1.1" 200 OK
[2025-10-17 17:24:21] INFO: 127.0.0.1:64983 - "POST /generate HTTP/1.1" 200 OK
[2025-10-17 17:24:21] INFO: 127.0.0.1:22743 - "POST /generate HTTP/1.1" 200 OK
[2025-10-17 17:24:21] INFO: 127.0.0.1:23776 - "POST /generate HTTP/1.1" 200 OK
[2025-10-17 17:24:21] INFO: 127.0.0.1:16184 - "POST /generate HTTP/1.1" 200 OK
[2025-10-17 17:24:21] INFO: 127.0.0.1:46376 - "POST /generate HTTP/1.1" 200 OK
[rank0]:[E1017 17:24:21.083428281 ProcessGroupNCCL.cpp:2068] [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7ed917b4eeb0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111c7 (0x7ed917be11c7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7ed89f4630c0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7ed89f4728a8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7ed89f4759c8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7ed89f477942 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7eda96cdc253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7eda99c94ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126850 (0x7eda99d26850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[2025-10-17 17:24:21 TP0] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3056, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1039, in event_loop_overlap
batch = self.get_next_batch_to_run()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1925, in get_next_batch_to_run
self.running_batch = self.update_running_batch(self.running_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2127, in update_running_batch
batch.filter_batch()
File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1604, in filter_batch
self.seq_lens_sum = self.seq_lens.sum().item()
^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
what(): [PG ID 2 PG GUID 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7ed917b4eeb0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x111c7 (0x7ed917be11c7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7ed89f4630c0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7ed89f4728a8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x978 (0x7ed89f4759c8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xd2 (0x7ed89f477942 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xdc253 (0x7eda96cdc253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x94ac3 (0x7eda99c94ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x126850 (0x7eda99d26850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2074 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7ed917b4eeb0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4ec21 (0x7ed89f44ec21 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x945664 (0x7ed89ef45664 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7eda96cdc253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7eda99c94ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: <unknown function> + 0x126850 (0x7eda99d26850 in /usr/lib/x86_64-linux-gnu/libc.so.6)
[2025-10-17 17:24:21] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
terminate called recursively
Fatal Python error: Aborted
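As the error text above notes, the illegal memory access is reported asynchronously, so the Python frame where it surfaces (schedule_batch.py:1604, self.seq_lens.sum().item()) is not necessarily where the faulting kernel was launched. A sketch of how the fault could be localized, reusing the launch flags from the Reproduction section below; CUDA_LAUNCH_BLOCKING=1 is the standard CUDA/PyTorch environment variable that forces synchronous kernel launches (it will slow the server down considerably):

# Relaunch with synchronous kernel launches so the failing kernel is reported at its call site.
CUDA_LAUNCH_BLOCKING=1 python -m sglang.launch_server \
    --model-path /mnt/external-quantized-models/models/nvidia__Llama-4-Maverick-17B-128E-Instruct-FP8 \
    --tp 8 \
    --attention-backend triton \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --trust-remote-code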
Reproduction
python -m sglang.launch_server \
--model-path /mnt/external-quantized-models/models/nvidia__Llama-4-Maverick-17B-128E-Instruct-FP8 \
--tp 8 \
--attention-backend triton \
--model-loader-extra-config '{
"enable_multithread_load": true,
"num_threads": 8
}' \
--trust-remote-code
I can reproduce this issue 2/2 times on the latest main (B200) by running the GSM8K benchmark against the server:
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
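For a quick sanity check of the endpoint that appears in the server log, a single request can be sent to the native /generate API before running the full benchmark. This is only a smoke test, not the concurrent load that triggers the crash; it assumes the server is listening on the default port 30000 and uses a placeholder prompt and sampling parameters:

# Single /generate request against the default local server; port and payload values are illustrative.
curl -s http://127.0.0.1:30000/generate \
    -H 'Content-Type: application/json' \
    -d '{"text": "What is 2 + 2? Answer:", "sampling_params": {"temperature": 0, "max_new_tokens": 64}}'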
Environment
python3 -m sglang.check_env
Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA B200
GPU 0,1,2,3,4,5,6,7 Compute Capability: 10.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 575.57.08
PyTorch: 2.8.0+cu129
sglang: 0.5.3.post3
sgl_kernel: 0.3.15
flashinfer_python: 0.4.0
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.2
aiohttp: 3.12.15
fastapi: 0.116.1
hf_transfer: 0.1.9
huggingface_hub: 0.34.0
interegular: 0.3.3
modelscope: 1.29.1
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.0.0
pydantic: 2.11.7
python-multipart: 0.0.20
pyzmq: 27.0.2
uvicorn: 0.35.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.25
openai: 1.99.1
tiktoken: 0.11.0
anthropic: 0.71.0
litellm: Module Not Found
decord: Module Not Found
Hypervisor vendor: KVM
ulimit soft: 1024