[P/D]kv_output_aggregator support heterogeneous
#23917
Conversation
Code Review
This pull request refactors the initialization of KVOutputAggregator to support heterogeneous configurations by querying the number of participants from the KV connector. The changes are well-structured, introducing a new method in the connector's base class and updating executors to use it. However, there is a critical issue in both MultiprocExecutor and RayDistributedExecutor where a None return from get_finished_count() is not handled, which would lead to a TypeError at runtime. I've provided suggestions to fix this by adding a fallback mechanism.
The get_finished_count() method on KVConnectorBase_V1 is defined to return Optional[int], and its base implementation returns None. The KVOutputAggregator constructor expects an int, so passing None to it will cause a TypeError. This is a critical issue that can lead to a runtime crash if a connector that does not override get_finished_count() is used. You should handle the None case, for example by falling back to self.parallel_config.world_size.
Current implementation:

```python
def init_kv_output_aggregator(self) -> None:
    if has_kv_transfer_group():
        kv_connector = get_kv_transfer_group()
        self.kv_output_aggregator = KVOutputAggregator(
            kv_connector.get_finished_count())
    else:
        self.kv_output_aggregator = KVOutputAggregator(
            self.parallel_config.world_size)
```

Suggested change:

```python
def init_kv_output_aggregator(self) -> None:
    world_size = self.parallel_config.world_size
    if has_kv_transfer_group():
        kv_connector = get_kv_transfer_group()
        finished_count = kv_connector.get_finished_count()
        if finished_count is not None:
            world_size = finished_count
    self.kv_output_aggregator = KVOutputAggregator(world_size)
```
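The suggested fallback is easy to exercise in isolation. A minimal sketch of the behavior (the classes below are simplified stand-ins for illustration, not the actual vLLM implementations):

```python
from typing import Optional

class KVOutputAggregator:
    """Stand-in: like the real class, the constructor expects an int."""
    def __init__(self, world_size: int) -> None:
        if not isinstance(world_size, int):
            raise TypeError(f"world_size must be int, got {type(world_size).__name__}")
        self.world_size = world_size

class BaseConnector:
    """Mirrors KVConnectorBase_V1's default: no override, returns None."""
    def get_finished_count(self) -> Optional[int]:
        return None

class HeterogeneousConnector(BaseConnector):
    """A connector that reports the heterogeneous participant count."""
    def get_finished_count(self) -> Optional[int]:
        return 16  # e.g. the prefill side runs TP16

def init_aggregator(connector: BaseConnector, world_size: int) -> KVOutputAggregator:
    # The None check is the fix the review suggests.
    finished_count = connector.get_finished_count()
    if finished_count is not None:
        world_size = finished_count
    return KVOutputAggregator(world_size)

print(init_aggregator(BaseConnector(), 4).world_size)           # 4 (fallback)
print(init_aggregator(HeterogeneousConnector(), 4).world_size)  # 16
```

Without the `None` check, passing `BaseConnector`'s result straight to the constructor would raise the `TypeError` described above.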
### What this PR does / why we need it?
In vLLM version 0.10.1, a new KVOutputAggregator was added to the executor, moving aggregation into the executor (vllm-project/vllm#19555). This broke mooncake_connector. This change fixes that bug and also adds a policy to forcibly release the KV cache when the prefill node times out. This PR is linked to a vLLM PR (vllm-project/vllm#23917), which modifies the finish and send-count confirmation in heterogeneous TP situations. Many UTs were deleted because a lot of the communication code was removed, so the remaining tests are more concise overall.

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@fa4311d

Signed-off-by: baxingpiaochong <[email protected]>
NickLucche left a comment:
Thanks a lot for contributing to this @LCAIZJ!
One big issue with this interface is that for some connectors, like Nixl (and, I would argue, discovery-based connectors more broadly), heterogeneous TP is a "runtime property": the connector only finds out about the heterogeneous state after the first handshake (assuming a "static" P configuration on the other side).
Hence here we wouldn't be able to return a sensible get_finished_count at init time.
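One way to accommodate such discovery-based connectors, sketched here purely to illustrate the "runtime property" point (none of these names exist in vLLM), is to size the aggregator lazily and only commit to a participant count once the handshake reveals the remote TP size:

```python
from typing import Optional

class LazySizedAggregator:
    """Defers the expected finished-count until the connector discovers
    the remote (prefill) TP size; falls back to the local world size."""
    def __init__(self, default_count: int) -> None:
        self._discovered: Optional[int] = None
        self._default = default_count

    def on_handshake(self, remote_tp_size: int) -> None:
        # Hypothetically called by the connector after the first
        # handshake with the prefill side.
        self._discovered = remote_tp_size

    @property
    def expected_count(self) -> int:
        return self._discovered if self._discovered is not None else self._default

agg = LazySizedAggregator(default_count=4)
print(agg.expected_count)  # 4: nothing discovered yet
agg.on_handshake(16)       # P turns out to run TP16
print(agg.expected_count)  # 16
```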
> Successfully processed 10,000 benchmark prompts with 4K input tokens and 1.5K output tokens per request.
What connector did you use for this, as I don't see any concrete get_finished_count impl on this PR?
Also, it would be nice if we could add some minimal unit tests, perhaps testing the example you reported.
PS
> When the TP size for P exceeds that of D, the current mechanism of kv_output_aggregator fails to function correctly
This is also partly the reason why in disagg PD settings with NixlConnector we do not allow P TP size to exceed D's https://github.com/vllm-project/vllm/blob/main/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py#L909.
But I would like to understand: how popular/important is this MLA + P-TP > D-TP use-case, rather than assuming MoE and going wide EP?
I feel maintaining both P-TP < D-TP and P-TP > D-TP in NixlConnector might complicate the code quite a bit.
Nothing against supporting that for other connectors though ofc.
Hi @NickLucche, thank you for your review comments. My responses to your questions are as follows:

Regarding the design of a connector in which P's TP size exceeds D's TP size, as you mentioned, please refer to the implementation in this PR and file: https://github.com/vllm-project/vllm-ascend/pull/2664/files#diff-033f9a4af4a59c8ca0ed782d5821675ffcc15f64625bc37afa95a886edb373d1. From my personal perspective, it would be ideal for the connector design to be compatible with various TP_SIZE ratios (relying on long-term evolution). Currently, the connector mentioned above is primarily used in the vllm-ascend community.

As a side note, if we could write the TP information of P and D into kv_connector_extra_config within kv_transfer_config, then read it in the executor, calculate the actual target_count (the true completion count), and pass it during aggregator initialization, this might also be a viable implementation approach.
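The kv_connector_extra_config idea could look roughly like this. The key names and the target-count formula below are illustrative assumptions, not an existing vLLM schema:

```python
# Hypothetical config shape; the key names are assumptions for illustration.
kv_transfer_config = {
    "kv_connector_extra_config": {
        "prefill_tp_size": 16,
        "decode_tp_size": 4,
    }
}

def compute_target_count(extra: dict, local_world_size: int) -> int:
    """Derive the aggregator's completion count from static P/D TP info,
    falling back to homogeneous behavior when the keys are absent."""
    p_tp = extra.get("prefill_tp_size")
    d_tp = extra.get("decode_tp_size")
    if p_tp is None or d_tp is None:
        return local_world_size
    # Assumed formula: with P-TP >= D-TP, each request's KV is produced
    # by p_tp prefill ranks, so the aggregator waits for that many signals.
    return max(p_tp, d_tp)

extra = kv_transfer_config["kv_connector_extra_config"]
print(compute_target_count(extra, 4))  # 16
print(compute_target_count({}, 4))     # 4 (fallback)
```

The exact formula for target_count would depend on the connector's transfer scheme; the point is only that statically configured TP sizes make it computable at init time.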
@NickLucche For MLA, people usually do Decode TP=1, and Prefill TP > 1.
Indeed
Apologies for the delay, let me get back to this PR tomorrow.
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: LCAIZJ <[email protected]>
@NickLucche All CI pipelines have passed. Would you mind reviewing the PR again when you have time?
This is likely the most minimal enabling set of changes now, thank you!
@NickLucche Thank you for your feedback. If everything looks good, could you please approve this PR? It's currently configured to require review approval before it can be merged automatically. |
Apologies, I had missed the force push |

### Purpose

### Test Plan
- Model: DeepSeek-R1
- Distributed Strategy: 1P (TP16), 1D (DP4, TP4)

### Test Result
Successfully processed 10,000 benchmark prompts with 4K input tokens and 1.5K output tokens per request.
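As a rough sanity check on the topology above (the per-rank mapping is an assumption about how heterogeneous TP shards the KV transfer, not something this PR specifies): with P at TP16 and each decode replica at TP4, every decode rank corresponds to a fixed block of prefill ranks.

```python
prefill_tp = 16              # 1P node, TP16
decode_dp, decode_tp = 4, 4  # 1D node, DP4 x TP4

# Assuming each decode rank gathers KV from a contiguous block of
# prefill ranks, and P-TP is an exact multiple of D-TP, the block size is:
block = prefill_tp // decode_tp
print(block)  # 4 prefill ranks feed each decode rank

# Total decode ranks across the DP group:
print(decode_dp * decode_tp)  # 16
```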

Since the primary goal was to verify overall functionality, we did not enable many optimization features during benchmark execution. As a result, the TTFT and TPOT metrics are suboptimal, though the verification of the core functionality was successfully completed.