[Feature] Support mamba radix cache v0 #11214
Conversation
Summary of Changes

Hello @yizhang2077, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a foundational feature for supporting Mamba radix cache (v0) within the SGLang system. The core objective is to enhance the efficiency of KV cache management for models incorporating Mamba architectures. This is achieved by implementing a specialized radix tree that intelligently handles both standard and Mamba-specific KV states, allowing for better resource utilization and faster inference. The changes span memory allocation, request scheduling, and cache eviction policies, culminating in significant performance gains as evidenced by the provided benchmarks.

Highlights
Code Review
This pull request introduces support for Mamba radix cache, which is a significant feature enhancement. The implementation is comprehensive, touching upon scheduling, memory management, and the model execution flow. The new MambaRadixCache is well-structured, and unit tests have been added. I've identified a few areas for improvement, including a potential bug in an assertion, a type hint mismatch, and the use of a magic number that should be a constant. Overall, this is a solid contribution.
```diff
 if self.is_hybrid_gdn:
-    max_num_reqs = min(max_num_reqs, self.server_args.max_mamba_cache_size)
+    # for mamba cache radix, it need be divided by 3 (magic number now). (yizhang2077)
+    max_num_reqs = min(max_num_reqs, self.server_args.max_mamba_cache_size // 3)
```
The code uses a magic number 3 to divide max_mamba_cache_size, and the comment acknowledges this. It's better to define it as a named constant with a clear explanation of why this division is necessary; this improves code readability and maintainability. For example, MAMBA_CACHE_REQS_RATIO = 3 could be defined at the top of the file or in a constants module.
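To make the suggestion concrete, here is a minimal sketch (not the PR's actual code; MAMBA_CACHE_REQS_RATIO and the helper function are hypothetical names):

```python
# Sketch only: replace the magic number 3 with a named constant so the intent
# of the division is documented in one place.
# Ratio between mamba cache slots and admitted requests; why it is 3 should be
# explained where the constant is defined.
MAMBA_CACHE_REQS_RATIO = 3


def cap_max_num_reqs(max_num_reqs: int, max_mamba_cache_size: int) -> int:
    """Limit the number of concurrent requests by the mamba cache budget."""
    return min(max_num_reqs, max_mamba_cache_size // MAMBA_CACHE_REQS_RATIO)


# Example: with a mamba cache of 300 slots, at most 100 requests are admitted.
assert cap_max_num_reqs(512, 300) == 100
```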
You need to fix the typo and rename the token_msg variables to token_usage_msg:
class MambaRadixCache(BasePrefixCache):
    def __init__(
Is it compatible with MTP? The EAGLE fix should also be applied to MambaRadixCache.
Yes, I think we can do it in another PR.
During testing, I discovered that the server crashed when token_usage exceeded 0.99.
Do you have a reproduce command? I think token_usage > 0.99 is an abnormal state. (It is too large, and other models will crash in this state as well.)
Reproduce command (server H100, tp-size=2):

The root cause is incorrect memory availability checking for the Mamba pool. Instead of checking the Mamba pool, only the MHA pool is checked, which leads to an attempted allocation in a full Mamba pool and a subsequent server crash because mamba_pool.alloc() returns None.

Claude Sonnet's recommendations:
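The reproduce command and the model's recommendations above were not captured in this thread. Purely to illustrate the failure mode described (availability checked only against the MHA pool, so mamba_pool.alloc() can still return None), a hedged sketch of a combined check might look like the following; every name here is hypothetical:

```python
from typing import Optional, Sequence


def can_allocate(mha_pool, mamba_pool, num_tokens: int, num_mamba_slots: int) -> bool:
    """Sketch: admit work only if *both* pools have room.

    Checking only the MHA pool lets a request through when the mamba pool is
    already full, so the later mamba allocation fails.
    """
    return (
        mha_pool.available_size() >= num_tokens
        and mamba_pool.available_size() >= num_mamba_slots
    )


def alloc_mamba_or_none(mamba_pool, num_mamba_slots: int) -> Optional[Sequence[int]]:
    """Sketch: surface a full mamba pool as a retract/evict signal, not a crash."""
    slots = mamba_pool.alloc(num_mamba_slots)
    if slots is None:
        # The caller should retract or evict instead of dereferencing None.
        return None
    return slots
```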
@Swipe4057 The Mamba pool controls memory capacity by setting the available size to 3x
Run the service with the command and try testing again:

Model: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct
Commits 72812db to 67d4e34
@Swipe4057 I have tried it and it is OK. Could you share your server log? (error and
You shouldn't limit the number of in-flight requests with the new magic number 3 for all operating modes. You need to add the condition that the radix cache is enabled; see the sketch below.

sglang/python/sglang/srt/model_executor/model_runner.py, lines 1430 to 1432 in 67d4e34
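A minimal sketch of the requested change, assuming the disable_radix_cache and max_mamba_cache_size server arguments seen elsewhere in this thread (the actual code in model_runner.py may differ):

```python
def cap_max_num_reqs(max_num_reqs: int, server_args, is_hybrid_gdn: bool) -> int:
    """Sketch: apply the // 3 cap only when the radix cache is enabled."""
    if not is_hybrid_gdn:
        return max_num_reqs
    if server_args.disable_radix_cache:
        # No mamba radix tree to reserve state slots for: keep the original limit.
        return min(max_num_reqs, server_args.max_mamba_cache_size)
    # Radix cache enabled: reserve part of the mamba cache for cached states.
    return min(max_num_reqs, server_args.max_mamba_cache_size // 3)
```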
Restart no longer occurs. I tested the new code; here are my results (I wrote the server launch command earlier):

main, disable radix:
mr, enable radix:

Full:

without radix cache
{'backend': 'sglang', 'dataset_name': 'generated-shared-prefix', 'request_rate': inf, 'max_concurrency': None, 'sharegpt_output_len': None, 'random_input_len': 1024, 'random_output_len': 1024, 'random_range_ratio': 0.0, 'duration': 225.5629655458033, 'completed': 1024, 'total_input_tokens': 4039046, 'total_output_tokens': 839680, 'total_output_tokens_retokenized': 839041, 'request_throughput': 4.5397523371010315, 'input_throughput': 17906.512224764232, 'output_throughput': 3722.596916422846, 'mean_e2e_latency_ms': 131048.26526600846, 'median_e2e_latency_ms': 114622.40997888148, 'std_e2e_latency_ms': 63244.031925246876, 'p99_e2e_latency_ms': 225238.2608978264, 'mean_ttft_ms': 102325.65577756759, 'median_ttft_ms': 91108.4556542337, 'std_ttft_ms': 62789.35906378496, 'p99_ttft_ms': 202928.22604222223, 'mean_tpot_ms': 35.070341255727556, 'median_tpot_ms': 33.90271993074225, 'std_tpot_ms': 5.1958526217693874, 'p99_tpot_ms': 45.5167203996517, 'mean_itl_ms': 35.07019149012473, 'median_itl_ms': 28.129231184720993, 'std_itl_ms': 245.95497367320698, 'p95_itl_ms': 29.61028926074505, 'p99_itl_ms': 35.236083529889754, 'concurrency': 594.9266685143978, 'accept_length': None}

with radix cache
{'backend': 'sglang', 'dataset_name': 'generated-shared-prefix', 'request_rate': inf, 'max_concurrency': None, 'sharegpt_output_len': None, 'random_input_len': 1024, 'random_output_len': 1024, 'random_range_ratio': 0.0, 'duration': 220.95910985954106, 'completed': 1024, 'total_input_tokens': 4039046, 'total_output_tokens': 839680, 'total_output_tokens_retokenized': 839290, 'request_throughput': 4.634341623891111, 'input_throughput': 18279.608397080956, 'output_throughput': 3800.1601315907114, 'mean_e2e_latency_ms': 133114.31396250374, 'median_e2e_latency_ms': 145177.0340781659, 'std_e2e_latency_ms': 59586.806921843796, 'p99_e2e_latency_ms': 220620.25247588754, 'mean_ttft_ms': 94039.92438557907, 'median_ttft_ms': 100580.67605737597, 'std_ttft_ms': 57945.60860232475, 'p99_ttft_ms': 197400.40578043088, 'mean_tpot_ms': 47.70987738330239, 'median_tpot_ms': 42.28862401516483, 'std_tpot_ms': 28.60189182856475, 'p99_tpot_ms': 227.09880975949528, 'mean_itl_ms': 47.710699266049296, 'median_itl_ms': 30.049152672290802, 'std_itl_ms': 880.109046863313, 'p95_itl_ms': 34.64645054191351, 'p99_itl_ms': 279.4529473409058, 'concurrency': 616.8972059321408, 'accept_length': None}

bench (server H100, tp-size=2):

log:
[2025-10-05 18:37:30 TP0 EP0] Decode batch. #running-req: 235, #full token: 818755, full token usage: 0.99, mamba num: 470, mamba usage: 0.46, cuda graph: True, gen throughput (token/s): 7752.32, #queue-req: 789,
[2025-10-05 18:37:31 TP0 EP0] Decode batch. #running-req: 235, #full token: 828155, full token usage: 1.00, mamba num: 470, mamba usage: 0.46, cuda graph: True, gen throughput (token/s): 7751.90, #queue-req: 789,
[2025-10-05 18:37:31 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.0980 -> 0.7646
[2025-10-05 18:37:31 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.0980 -> 0.7646
[2025-10-05 18:37:32 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.7506 -> 0.7829
[2025-10-05 18:37:32 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.7506 -> 0.7829
[2025-10-05 18:37:32 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.7679 -> 0.8024
[2025-10-05 18:37:32 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.7679 -> 0.8024
[2025-10-05 18:37:32 TP0 EP0] Decode batch. #running-req: 232, #full token: 826859, full token usage: 1.00, mamba num: 464, mamba usage: 0.45, cuda graph: True, gen throughput (token/s): 7658.65, #queue-req: 792,
[2025-10-05 18:37:33 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.7884 -> 0.8207
[2025-10-05 18:37:33 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.7884 -> 0.8207
[2025-10-05 18:37:33 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8057 -> 0.8402
[2025-10-05 18:37:33 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8057 -> 0.8402
[2025-10-05 18:37:33 TP0 EP0] Decode batch. #running-req: 230, #full token: 828932, full token usage: 1.00, mamba num: 460, mamba usage: 0.45, cuda graph: True, gen throughput (token/s): 7786.25, #queue-req: 794,
[2025-10-05 18:37:34 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8262 -> 0.8585
[2025-10-05 18:37:34 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8262 -> 0.8585
[2025-10-05 18:37:34 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8435 -> 0.8781
[2025-10-05 18:37:34 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8435 -> 0.8781
[2025-10-05 18:37:35 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8630 -> 0.8976
[2025-10-05 18:37:35 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8630 -> 0.8976
[2025-10-05 18:37:35] INFO: 127.0.0.1:33552 - "GET /health HTTP/1.1" 200 OK
[2025-10-05 18:37:35 TP0 EP0] Decode batch. #running-req: 227, #full token: 827227, full token usage: 1.00, mamba num: 454, mamba usage: 0.44, cuda graph: True, gen throughput (token/s): 7641.70, #queue-req: 797,
[2025-10-05 18:37:35 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8825 -> 0.9171
[2025-10-05 18:37:35 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.8825 -> 0.9171
[2025-10-05 18:37:35 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9020 -> 0.9366
[2025-10-05 18:37:36 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9020 -> 0.9366
[2025-10-05 18:37:36 TP0 EP0] Decode batch. #running-req: 225, #full token: 828970, full token usage: 1.00, mamba num: 450, mamba usage: 0.44, cuda graph: True, gen throughput (token/s): 7525.47, #queue-req: 799,
[2025-10-05 18:37:36 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9215 -> 0.9561
[2025-10-05 18:37:36 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9215 -> 0.9561
[2025-10-05 18:37:36 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9400 -> 0.9768
[2025-10-05 18:37:36 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9400 -> 0.9768
[2025-10-05 18:37:37 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9618 -> 0.9963
[2025-10-05 18:37:37 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9618 -> 0.9963
[2025-10-05 18:37:37 TP0 EP0] Decode batch. #running-req: 222, #full token: 826827, full token usage: 1.00, mamba num: 444, mamba usage: 0.43, cuda graph: True, gen throughput (token/s): 7609.35, #queue-req: 802,
[2025-10-05 18:37:37 TP1 EP1] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9803 -> 1.0000
[2025-10-05 18:37:37 TP0 EP0] KV cache pool is full. Retract requests. #retracted_reqs: 1, #aborted_retracted_reqs: 0, #new_token_ratio: 0.9803 -> 1.0000
[2025-10-05 18:37:38 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15775, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 220, #queue-req: 799,
For disable radix cache, I think
@yizhang2077 Do you have any idea why my test performs so much worse than yours? Although my test has 8 groups of requests with the same 1000-token system prompts, in the log I see cached-token: 0.

log:
[2025-10-06 06:32:39 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15801, #cached-token: 0, full token usage: 0.23, mamba usage: 0.25, #running-req: 129, #queue-req: 891,
[2025-10-06 06:32:39 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15785, #cached-token: 0, full token usage: 0.24, mamba usage: 0.26, #running-req: 133, #queue-req: 887,
[2025-10-06 06:32:40 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15748, #cached-token: 0, full token usage: 0.25, mamba usage: 0.26, #running-req: 137, #queue-req: 883,
[2025-10-06 06:32:40 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15791, #cached-token: 0, full token usage: 0.25, mamba usage: 0.27, #running-req: 141, #queue-req: 879,
[2025-10-06 06:32:40 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15719, #cached-token: 0, full token usage: 0.26, mamba usage: 0.28, #running-req: 145, #queue-req: 875,
[2025-10-06 06:32:41 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15655, #cached-token: 0, full token usage: 0.27, mamba usage: 0.29, #running-req: 149, #queue-req: 871,
[2025-10-06 06:32:41 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15803, #cached-token: 0, full token usage: 0.27, mamba usage: 0.29, #running-req: 153, #queue-req: 867,
[2025-10-06 06:32:42 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15746, #cached-token: 0, full token usage: 0.28, mamba usage: 0.30, #running-req: 157, #queue-req: 863,
[2025-10-06 06:32:42 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15783, #cached-token: 0, full token usage: 0.29, mamba usage: 0.31, #running-req: 161, #queue-req: 859,
[2025-10-06 06:32:42 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15755, #cached-token: 0, full token usage: 0.29, mamba usage: 0.32, #running-req: 165, #queue-req: 855,
[2025-10-06 06:32:43 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15784, #cached-token: 0, full token usage: 0.30, mamba usage: 0.33, #running-req: 169, #queue-req: 851,
[2025-10-06 06:32:43 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15750, #cached-token: 0, full token usage: 0.31, mamba usage: 0.33, #running-req: 173, #queue-req: 847,
[2025-10-06 06:32:43 TP0 EP0] Prefill batch. #new-seq: 4, #new-token: 15765, #cached-token: 0, full token usage: 0.32, mamba usage: 0.34, #running-req: 177, #queue-req: 843,
there will be OOM |
Hi @Swipe4057, it might be due to too many requests causing the cache on the device to be evicted. You can try adding --max-concurrency 5.
Understood. Unfortunately, using
Overall looks good! Most comments are minor; the one regarding the assert in cache_unfinished_req is more critical. Thanks!
Logic looks good to me!
I am not sure about the status of the crashing behavior from @Swipe4057; if it is also fixed, we are good to merge.
I'll test the current code tomorrow.
Seems like the mamba tree cache sanity check is not running; let's add it?
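As a purely hypothetical illustration of what such a sanity check could assert (this is not the existing check in the codebase), it might walk the tree and compare its accounting against the pools:

```python
def mamba_tree_sanity_check(root, full_pool_used: int, mamba_pool_used: int) -> None:
    """Hypothetical sketch: verify the mamba radix tree agrees with pool usage.

    Sums the token KV indices and mamba states referenced by tree nodes and
    asserts they do not exceed what the pools report as in use.
    """
    tree_tokens = 0
    tree_mamba_states = 0
    stack = [root]
    while stack:
        node = stack.pop()
        tree_tokens += len(getattr(node, "value", None) or [])
        if getattr(node, "mamba_value", None) is not None:
            tree_mamba_states += 1
        stack.extend(getattr(node, "children", {}).values())
    assert tree_tokens <= full_pool_used, "tree references more tokens than the pool holds"
    assert tree_mamba_states <= mamba_pool_used, "tree references more mamba states than allocated"
```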
Scheduling related, no performance overhead if tested with
Currently, cache_unfinished_req will fork the mamba cache once, which might introduce overhead. @hanming-lu
Please take a look at the test I conducted earlier: #11214 (comment). My results (duplicated here) were as follows, on 2x H100 GPUs:

main, disable radix:
mr, enable radix:

I shared server logs, and all prefills looked like this:

That is, cached-token: 0! Although this should be impossible, since cache matches should definitely exist. However, if you reduce the number of concurrent requests (for example to 5), or send requests one after another, cache matches do appear! So I think there's a bug here. I can't test again at the moment, since our server is undergoing maintenance; I'll verify in the near future. It would also be interesting to test something like this (--gsp-question-prompt-len 0):
Motivation
Ref #10438. Add radix cache for mamba; we will implement page_size > 1 and Marconi soon.
Co-authored-by: hanming-lu [email protected]
Co-authored-by: hzh0425 [email protected]
Co-authored-by: thalahors [email protected]
Modifications
ref: doc
Accuracy Tests
Benchmarking and Profiling
Checklist