
Conversation

@Ther-LF
Contributor

@Ther-LF Ther-LF commented Oct 11, 2025

Optimize Prefill Phase: Add Hybrid Chunked Prefill Support

Description

This PR introduces Hybrid Chunked Prefill, an optimization designed to dynamically switch between continuous prefill and chunked prefill in the vLLM serving pipeline.

Why this matters:
Today, users enable chunked prefill mainly to reduce inter-token latency (ITL) when prefill and decode overlap. But chunking also splits long prefill segments, increasing launch/coordination overhead and hurting throughput. The current strategy applies chunked prefill unconditionally, so we keep paying the throughput tax even when no decode requests are running.
Hybrid Chunked Prefill fixes this by enabling chunking only when decode is active; otherwise it falls back to continuous prefill, recovering baseline throughput while still preserving the ITL benefits when needed.

This feature enables vLLM to achieve higher throughput and lower latency in both low and high concurrency scenarios.
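
To make the policy concrete, here is a minimal sketch of the switching logic described above, assuming a hypothetical scheduler interface (`choose_prefill_strategy` and `has_running_decodes` are illustrative names, not the actual vLLM scheduler API):

```python
def choose_prefill_strategy(has_running_decodes: bool, max_num_batched_tokens: int):
    """Sketch: use chunked prefill only while decode requests are in flight."""
    if has_running_decodes:
        # Decode is active: cap each prefill slice at the token budget so decode
        # steps stay interleaved and inter-token latency (ITL) stays low.
        return ("chunked", max_num_batched_tokens)
    # No decode traffic: run the whole prefill in one pass and avoid the
    # launch/coordination overhead of splitting it into chunks.
    return ("continuous", None)
```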


Purpose

  • Implement adaptive hybrid prefill scheduling that improves efficiency across different concurrency levels.
  • Reduce prefill fragmentation and launch overhead when decode traffic is low.
  • Enhance scalability and stability for large models such as QwQ-32B in multi-request serving.

Test Plan

  • Model: QwQ-32B
  • Dataset: 20 representative prompts (mixed-length, real inference-like workloads)
  • Setup: Evaluate with --concurrency {1, 2, 4, 8}
  • Comparison:
    1. --enable-chunked-prefill --max-num-batched-tokens 1024
    2. --enable-hybrid-chunked-prefill --max-num-batched-tokens 1024
  • Metrics:
    • Request throughput (req/s)
    • Token throughput (tok/s)
    • Time to First Token (TTFT)
    • Time per Output Token (TPOT)
    • End-to-End Latency (E2E)

Test Result

| Concurrency | Mode | Request Throughput (req/s) | Token Throughput (tok/s) | Mean TTFT (ms) | Mean TPOT (ms) | Mean E2E (ms) |
|---|---|---|---|---|---|---|
| 1 | Hybrid Chunked (1024) | 0.26 | 3643.7 | 1085.9 | 7.21 | 3851.0 |
| 1 | Chunked (1024) | 0.26 | 3580.4 | 1368.8 | 7.20 | 3910.3 |
| 2 | Hybrid Chunked (1024) | 0.35 | 4928.4 | 1410.0 | 9.79 | 5418.1 |
| 2 | Chunked (1024) | 0.34 | 4811.6 | 1400.5 | 9.55 | 5549.8 |
| 4 | Hybrid Chunked (1024) | 0.45 | 6336.2 | 1601.6 | 15.9 | 8080.0 |
| 4 | Chunked (1024) | 0.45 | 6281.1 | 1645.5 | 15.9 | 8107.3 |
| 8 | Hybrid Chunked (1024) | 0.52 | 7335.9 | 2830.5 | 30.5 | 13556.9 |
| 8 | Chunked (1024) | 0.51 | 7206.1 | 2847.5 | 31.3 | 13599.8 |

Observations

  • +1–2.5% higher total token throughput across all concurrency levels.
  • ~20% lower mean TTFT at concurrency 1, with roughly comparable TTFT at higher concurrency levels.
  • Stable scaling up to concurrency = 8, maintaining high throughput without latency degradation.
  • Prefill efficiency improves since hybrid mode avoids unnecessary chunk splitting when decode is idle.

Example (QwQ-32B @ concurrency=8, chunked-prefill baseline run)

  • Input tokens: 272,789
  • Generated tokens: 7,981
  • Token throughput: 7206.13 tok/s
  • Request throughput: 0.51 req/s

Hybrid Chunked Prefill achieves higher throughput with equal or better latency, demonstrating its effectiveness on large-scale inference workloads.


Comment

This PR adds an adaptive prefill mechanism to vLLM that dynamically balances throughput and latency.
By intelligently enabling chunked prefill only when necessary, it significantly improves efficiency for models like QwQ-32B under real concurrent serving scenarios.


Essential Elements of an Effective PR Description Checklist
  • Purpose: Introduce hybrid prefill scheduling to optimize serving performance.
  • Test Plan: Detailed benchmark procedure and dataset.
  • Test Results: Comparative performance table and observations.
  • (Optional) Documentation update (supported_features.md).
  • (Optional) Release notes update for upcoming vLLM version.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a hybrid chunked prefill optimization, which is a great feature for improving performance. The implementation is mostly sound, but I've identified a critical bug in the configuration logic. The initialization for prefill_max_num_batched_tokens is incorrectly placed within a conditional block, which could lead to incorrect behavior if a user specifies max_num_batched_tokens. I've provided a detailed comment with a suggested fix to address this issue, which is crucial for the correctness of this new feature.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

@Ther-LF
Contributor Author

Ther-LF commented Oct 11, 2025

> Code Review
>
> This pull request introduces a hybrid chunked prefill optimization, which is a great feature for improving performance. The implementation is mostly sound, but I've identified a critical bug in the configuration logic. The initialization for prefill_max_num_batched_tokens is incorrectly placed within a conditional block, which could lead to incorrect behavior if a user specifies max_num_batched_tokens. I've provided a detailed comment with a suggested fix to address this issue, which is crucial for the correctness of this new feature.

prefill_max_num_batched_tokens is initialized unconditionally at line 200, before the `if self.max_num_batched_tokens is None:` conditional block:

self.prefill_max_num_batched_tokens = max(self.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS)
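
For context, a simplified sketch of the initialization order under discussion (class name and constant value are placeholders, not the exact vLLM source):

```python
DEFAULT_MAX_NUM_BATCHED_TOKENS = 2048  # placeholder value for illustration


class SchedulerConfigSketch:
    def __init__(self, max_model_len: int, max_num_batched_tokens: int | None = None):
        self.max_model_len = max_model_len
        self.max_num_batched_tokens = max_num_batched_tokens

        # Set unconditionally, before the conditional below, so the prefill
        # budget is populated whether or not the user passed
        # max_num_batched_tokens.
        self.prefill_max_num_batched_tokens = max(
            self.max_model_len, DEFAULT_MAX_NUM_BATCHED_TOKENS
        )

        if self.max_num_batched_tokens is None:
            # The usage-context-dependent default would be resolved here.
            self.max_num_batched_tokens = DEFAULT_MAX_NUM_BATCHED_TOKENS
```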

@Ther-LF Ther-LF changed the title Optimize Prefill Phase: Add Hybrid Chunked Prefill Support [Feature] Optimize Prefill Phase: Add Hybrid Chunked Prefill Support Oct 11, 2025
@Ther-LF
Contributor Author

Ther-LF commented Oct 16, 2025

@WoosukKwon could you take a look at this PR since you reviewed related changes before? Thanks!

@mergify

mergify bot commented Oct 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ther-LF.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 16, 2025
@Ther-LF
Contributor Author

Ther-LF commented Oct 17, 2025

@russellb could you take a look at this PR since you reviewed related changes before? Thanks!

@Ther-LF Ther-LF force-pushed the hybrid-chunked-prefill branch from 863e537 to a434a2f Compare October 17, 2025 06:26
@mergify mergify bot removed the needs-rebase label Oct 17, 2025
Member

@hmellor hmellor left a comment


Please fix the following before review:

  • new config args
  • pre-commit
  • dco
  • docs build

This config has no static default. If left unspecified by the user, it will
be set in `EngineArgs.create_engine_config` based on the usage context."""

prefill_max_num_batched_tokens: SkipValidation[int] = None
Member


Why SkipValidation? If it's because this will be set later by us when it's None, you can do:

Suggested change
prefill_max_num_batched_tokens: SkipValidation[int] = None
prefill_max_num_batched_tokens: int = Field(default=None)

This will skip validation for the default None but validate passed values.

"""If True, prefill requests can be chunked based
on the remaining max_num_batched_tokens."""

enable_hybrid_chunked_prefill: SkipValidation[bool] = None # type: ignore
Member


Why SkipValidation? If it's because this will be set later by us when it's None, you can do:

Suggested change
enable_hybrid_chunked_prefill: SkipValidation[bool] = None # type: ignore
enable_hybrid_chunked_prefill: bool = Field(default=None)

This will skip validation for the default None but validate passed values.

@Ther-LF Ther-LF force-pushed the hybrid-chunked-prefill branch from 20306b3 to f8f9bee Compare October 17, 2025 10:06
@Ther-LF
Contributor Author

Ther-LF commented Oct 17, 2025

@hmellor Thanks for the suggestion! I’ve updated both fields to use Field(default=None) and annotated them as int | None / bool | None, so defaults aren’t validated but user-provided values are. None will be resolved in EngineArgs.create_engine_config based on the usage context.

from pydantic import Field

prefill_max_num_batched_tokens: int | None = Field(default=None)
enable_hybrid_chunked_prefill: bool | None = Field(default=None)
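
For illustration, here is a small self-contained example of the validation behavior being discussed, using a hypothetical BaseModel stand-in rather than the real vLLM config class:

```python
from pydantic import BaseModel, Field, ValidationError


class ConfigFieldsSketch(BaseModel):
    # Simplified stand-in for the two fields above.
    prefill_max_num_batched_tokens: int | None = Field(default=None)
    enable_hybrid_chunked_prefill: bool | None = Field(default=None)


print(ConfigFieldsSketch())  # default None accepted; resolved later by the engine
print(ConfigFieldsSketch(prefill_max_num_batched_tokens=2048))  # user value validated as int

try:
    ConfigFieldsSketch(prefill_max_num_batched_tokens="not-a-number")
except ValidationError as exc:
    print(exc)  # invalid user input is still rejected
```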

@Ther-LF Ther-LF force-pushed the hybrid-chunked-prefill branch from 085cf83 to 1d03ed8 Compare November 4, 2025 15:24
@Ther-LF Ther-LF requested a review from hmellor November 4, 2025 16:18
@Ther-LF Ther-LF force-pushed the hybrid-chunked-prefill branch from bdb47d3 to 6cb2f2e Compare November 7, 2025 07:24
@Ther-LF Ther-LF force-pushed the hybrid-chunked-prefill branch from b6192ba to 62b0232 Compare November 10, 2025 12:36
@Ther-LF
Contributor Author

Ther-LF commented Nov 10, 2025

Hi @hmellor, @njhill,

Could you please take a look at my latest changes when you have a moment? I’ve implemented the approach we discussed and would really appreciate any feedback on places where the code could be simplified or improved.

@Ther-LF Ther-LF force-pushed the hybrid-chunked-prefill branch from a942a0f to d520415 Compare November 11, 2025 07:09
@mergify

mergify bot commented Nov 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Ther-LF.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 12, 2025
@mergify mergify bot removed the needs-rebase label Nov 12, 2025
@njhill
Member

njhill commented Nov 12, 2025

@Ther-LF apologies I got sidetracked and then was sick the last couple of days. I will get back to it this week.

@Ther-LF
Contributor Author

Ther-LF commented Nov 12, 2025

@njhill No worries at all—hope you feel better soon! Please take care.
If you have any suggestions or questions about the code, just let me know anytime and I’ll update it right away.


Labels

ci/build, llama (Related to Llama models), multi-modality (Related to multi-modality, #4194), v1
