[Feature] Optimize Prefill Phase: Add Hybrid Chunked Prefill Support #26625
Conversation
Code Review
This pull request introduces a hybrid chunked prefill optimization, which is a great feature for improving performance. The implementation is mostly sound, but I've identified a critical bug in the configuration logic. The initialization for prefill_max_num_batched_tokens is incorrectly placed within a conditional block, which could lead to incorrect behavior if a user specifies max_num_batched_tokens. I've provided a detailed comment with a suggested fix to address this issue, which is crucial for the correctness of this new feature.
💡 Codex Review
Here are some automated review suggestions for this pull request.
`prefill_max_num_batched_tokens` is unconditionally initialized at line 200, before the `if self.max_num_batched_tokens is None:` conditional block: `self.prefill_max_num_batched_tokens = max(self.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS)`
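For illustration, here is a minimal sketch of an initialization order that would address both review comments: derive `prefill_max_num_batched_tokens` only when the user left it unset, and do so outside the `max_num_batched_tokens` branch so an explicit `max_num_batched_tokens` neither skips nor clobbers it. The class and the placeholder default below are hypothetical, not the PR's actual `vllm/config/scheduler.py` code.

```python
# Illustrative sketch only: hypothetical class and placeholder default, not
# the actual vllm/config/scheduler.py code. Field names follow the snippets
# quoted in this review.
DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048  # placeholder; the real default lives in vLLM


class SchedulerConfigSketch:
    def __init__(
        self,
        max_model_len: int,
        max_num_batched_tokens: int | None = None,
        prefill_max_num_batched_tokens: int | None = None,
    ) -> None:
        self.max_model_len = max_model_len
        self.max_num_batched_tokens = max_num_batched_tokens
        self.prefill_max_num_batched_tokens = prefill_max_num_batched_tokens

        if self.max_num_batched_tokens is None:
            self.max_num_batched_tokens = DEFAULT_MAX_NUM_BATCHED_TOKENS

        # Derive the prefill budget only when the user left it unset, and do it
        # outside the branch above so it is applied even when the user passed
        # an explicit max_num_batched_tokens, and never overrides a user value.
        if self.prefill_max_num_batched_tokens is None:
            self.prefill_max_num_batched_tokens = max(
                self.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS
            )
```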
@WoosukKwon could you take a look at this PR since you reviewed related changes before? Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
@russellb could you take a look at this PR since you reviewed related changes before? Thanks!
hmellor left a comment
Please fix the following before review:
- new config args
- pre-commit
- dco
- docs build
vllm/config/scheduler.py (outdated)

    This config has no static default. If left unspecified by the user, it will
    be set in `EngineArgs.create_engine_config` based on the usage context."""

    prefill_max_num_batched_tokens: SkipValidation[int] = None
Why SkipValidation? If it's because this will be set later by us if None you can do:

    prefill_max_num_batched_tokens: int = Field(default=None)

This will skip validation for the default None but validate passed values.
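As a side note on why this works: pydantic does not validate a field's default unless `validate_default=True`, so a `None` default is accepted even for an `int`-annotated field, while values passed by the user are still validated. A minimal standalone example (a hypothetical model for illustration, not vLLM's actual config class):

```python
from pydantic import BaseModel, Field, ValidationError


class CfgSketch(BaseModel):
    # Hypothetical standalone model for illustration only.
    # The None default is accepted because pydantic skips validation of defaults.
    prefill_max_num_batched_tokens: int = Field(default=None)


print(CfgSketch().prefill_max_num_batched_tokens)       # None, no error
print(CfgSketch(prefill_max_num_batched_tokens=4096))   # accepted, validated int
try:
    CfgSketch(prefill_max_num_batched_tokens="not-an-int")
except ValidationError as exc:
    print("rejected:", exc.errors()[0]["type"])          # passed values are validated
```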
vllm/config/scheduler.py (outdated)

    """If True, prefill requests can be chunked based
    on the remaining max_num_batched_tokens."""

    enable_hybrid_chunked_prefill: SkipValidation[bool] = None  # type: ignore
Why SkipValidation? If it's because this will be set later by us if None you can do:

    enable_hybrid_chunked_prefill: bool = Field(default=None)

This will skip validation for the default None but validate passed values.
@hmellor Thanks for the suggestion! I've updated both fields to use `Field` from pydantic:

    from pydantic import Field

    prefill_max_num_batched_tokens: int | None = Field(default=None)
    enable_hybrid_chunked_prefill: bool | None = Field(default=None)
This pull request has merge conflicts that must be resolved before it can be merged.
@Ther-LF apologies, I got sidetracked and then was sick the last couple of days. I will get back to it this week.
@njhill No worries at all, hope you feel better soon! Please take care.
Optimize Prefill Phase: Add Hybrid Chunked Prefill Support
Description
This PR introduces Hybrid Chunked Prefill, an optimization designed to dynamically switch between continuous prefill and chunked prefill in the vLLM serving pipeline.
Why this matters:
Today, users enable chunked prefill mainly to reduce inter-token latency (ITL) when prefill and decode overlap. But chunking also splits long prefill segments, increasing launch and coordination overhead and hurting throughput. The current strategy applies chunked prefill unconditionally, so we keep paying this throughput tax even when no decode requests are running.
Hybrid Chunked Prefill fixes this by enabling chunking only when decode is active; otherwise it falls back to continuous prefill, recovering baseline throughput while still preserving the ITL benefits when needed.
This feature enables vLLM to achieve higher throughput and lower latency in both low and high concurrency scenarios.
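As a rough sketch of that scheduling decision (illustrative only; the function and parameter names below are hypothetical, not the PR's actual scheduler code):

```python
def pick_prefill_chunk(
    num_prompt_tokens: int,   # prompt tokens remaining to prefill for this request
    running_decodes: int,     # number of requests currently in the decode phase
    token_budget: int,        # per-step budget, e.g. max_num_batched_tokens
    prefill_budget: int,      # larger budget, e.g. prefill_max_num_batched_tokens
) -> int:
    """Return how many prompt tokens to schedule for this request this step."""
    if running_decodes > 0:
        # Decodes are in flight: chunk the prefill so decode steps keep a low
        # inter-token latency (standard chunked-prefill behaviour).
        return min(num_prompt_tokens, token_budget)
    # No decodes running: take the prompt in one continuous pass (up to the
    # larger prefill budget) to recover baseline prefill throughput.
    return min(num_prompt_tokens, prefill_budget)
```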
Test Plan
- Model: QwQ-32B
- Concurrency: {1, 2, 4, 8}
- Baseline: --enable-chunked-prefill --max-num-batched-tokens 1024
- Hybrid: --enable-hybrid-chunked-prefill --max-num-batched-tokens 1024

Test Result
Observations
Example (QwQ-32B @ concurrency=8)
Hybrid Chunked Prefill achieves higher throughput with equal or better latency, demonstrating its effectiveness on large-scale inference workloads.
Comment
This PR adds an adaptive prefill mechanism to vLLM that dynamically balances throughput and latency.
By intelligently enabling chunked prefill only when necessary, it significantly improves efficiency for models like QwQ-32B under real concurrent serving scenarios.