[Feature] Optimize Prefill Phase: Add Hybrid Chunked Prefill Support #26625
Conversation
Code Review
This pull request introduces a hybrid chunked prefill optimization, which is a great feature for improving performance. The implementation is mostly sound, but I've identified a critical bug in the configuration logic. The initialization for prefill_max_num_batched_tokens is incorrectly placed within a conditional block, which could lead to incorrect behavior if a user specifies max_num_batched_tokens. I've provided a detailed comment with a suggested fix to address this issue, which is crucial for the correctness of this new feature.
💡 Codex Review
Here are some automated review suggestions for this pull request.
`prefill_max_num_batched_tokens` is unconditionally initialized at line 200, before the `if self.max_num_batched_tokens is None:` conditional block: `self.prefill_max_num_batched_tokens = max(self.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS)`
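For illustration, here is a minimal sketch of an initialization order that would address both review comments: derive `prefill_max_num_batched_tokens` only when the user left it unset, and do so outside the `max_num_batched_tokens` branch so an explicit `max_num_batched_tokens` neither skips nor clobbers it. The class and the placeholder default below are hypothetical, not the PR's actual `vllm/config/scheduler.py` code.

```python
# Illustrative sketch only: hypothetical class and placeholder default, not
# the actual vllm/config/scheduler.py code. Field names follow the snippets
# quoted in this review.
DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048  # placeholder; the real default lives in vLLM


class SchedulerConfigSketch:
    def __init__(
        self,
        max_model_len: int,
        max_num_batched_tokens: int | None = None,
        prefill_max_num_batched_tokens: int | None = None,
    ) -> None:
        self.max_model_len = max_model_len
        self.max_num_batched_tokens = max_num_batched_tokens
        self.prefill_max_num_batched_tokens = prefill_max_num_batched_tokens

        if self.max_num_batched_tokens is None:
            self.max_num_batched_tokens = DEFAULT_MAX_NUM_BATCHED_TOKENS

        # Derive the prefill budget only when the user left it unset, and do it
        # outside the branch above so it is applied even when the user passed
        # an explicit max_num_batched_tokens, and never overrides a user value.
        if self.prefill_max_num_batched_tokens is None:
            self.prefill_max_num_batched_tokens = max(
                self.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS
            )
```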
@WoosukKwon could you take a look at this PR since you reviewed related changes before? Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
@russellb could you take a look at this PR since you reviewed related changes before? Thanks!
hmellor left a comment
Please fix the following before review:
- new config args
- pre-commit
- dco
- docs build
vllm/config/scheduler.py (outdated)

    This config has no static default. If left unspecified by the user, it will
    be set in `EngineArgs.create_engine_config` based on the usage context."""

    prefill_max_num_batched_tokens: SkipValidation[int] = None
Why SkipValidation? If it's because this will be set later by us if None you can do:

    prefill_max_num_batched_tokens: int = Field(default=None)

This will skip validation for the default None but validate passed values.
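As a side note on why this works: pydantic does not validate a field's default unless `validate_default=True`, so a `None` default is accepted even for an `int`-annotated field, while values passed by the user are still validated. A minimal standalone example (a hypothetical model for illustration, not vLLM's actual config class):

```python
from pydantic import BaseModel, Field, ValidationError


class CfgSketch(BaseModel):
    # Hypothetical standalone model for illustration only.
    # The None default is accepted because pydantic skips validation of defaults.
    prefill_max_num_batched_tokens: int = Field(default=None)


print(CfgSketch().prefill_max_num_batched_tokens)       # None, no error
print(CfgSketch(prefill_max_num_batched_tokens=4096))   # accepted, validated int
try:
    CfgSketch(prefill_max_num_batched_tokens="not-an-int")
except ValidationError as exc:
    print("rejected:", exc.errors()[0]["type"])          # passed values are validated
```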
vllm/config/scheduler.py (outdated)

    """If True, prefill requests can be chunked based
    on the remaining max_num_batched_tokens."""

    enable_hybrid_chunked_prefill: SkipValidation[bool] = None  # type: ignore
Why SkipValidation? If it's because this will be set later by us if None you can do:

    enable_hybrid_chunked_prefill: bool = Field(default=None)

This will skip validation for the default None but validate passed values.
@hmellor Thanks for the suggestion! I've updated both fields to use `Field` from pydantic:

    from pydantic import Field

    prefill_max_num_batched_tokens: int | None = Field(default=None)
    enable_hybrid_chunked_prefill: bool | None = Field(default=None)
This pull request has merge conflicts that must be resolved before it can be merged.
@Ther-LF apologies, I got sidetracked and then was sick the last couple of days. I will get back to it this week.
@njhill No worries at all, hope you feel better soon! Please take care.
Optimize Prefill Phase: Add Hybrid Chunked Prefill Support
Description
This PR introduces Hybrid Chunked Prefill, an optimization designed to dynamically switch between continuous prefill and chunked prefill in the vLLM serving pipeline.
Why this matters:
Today, users enable chunked prefill mainly to reduce inter-token latency (ITL) when prefill and decode overlap. But chunking also splits long prefill segments, increasing launch and coordination overhead and hurting throughput. The current strategy applies chunked prefill unconditionally, so we keep paying this throughput tax even when no decode requests are running.
Hybrid Chunked Prefill fixes this by enabling chunking only when decode is active; otherwise it falls back to continuous prefill, recovering baseline throughput while still preserving the ITL benefits when needed.
This feature enables vLLM to achieve higher throughput and lower latency in both low and high concurrency scenarios.
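As a rough sketch of that scheduling decision (illustrative only; the function and parameter names below are hypothetical, not the PR's actual scheduler code):

```python
def pick_prefill_chunk(
    num_prompt_tokens: int,   # prompt tokens remaining to prefill for this request
    running_decodes: int,     # number of requests currently in the decode phase
    token_budget: int,        # per-step budget, e.g. max_num_batched_tokens
    prefill_budget: int,      # larger budget, e.g. prefill_max_num_batched_tokens
) -> int:
    """Return how many prompt tokens to schedule for this request this step."""
    if running_decodes > 0:
        # Decodes are in flight: chunk the prefill so decode steps keep a low
        # inter-token latency (standard chunked-prefill behaviour).
        return min(num_prompt_tokens, token_budget)
    # No decodes running: take the prompt in one continuous pass (up to the
    # larger prefill budget) to recover baseline prefill throughput.
    return min(num_prompt_tokens, prefill_budget)
```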
Test Plan
- Model: QwQ-32B
- Concurrency: {1, 2, 4, 8}
- Baseline: --enable-chunked-prefill --max-num-batched-tokens 1024
- Hybrid: --enable-hybrid-chunked-prefill --max-num-batched-tokens 1024

Test Result
Observations
Example (QwQ-32B @ concurrency=8)
Hybrid Chunked Prefill achieves higher throughput with equal or better latency, demonstrating its effectiveness on large-scale inference workloads.
Comment
This PR adds an adaptive prefill mechanism to vLLM that dynamically balances throughput and latency.
By intelligently enabling chunked prefill only when necessary, it significantly improves efficiency for models like QwQ-32B under real concurrent serving scenarios.