[NIXL] Support P tensor-parallel-size > D tensor-parallel-size #27274
Conversation
cc @GuanLuo let me know if this PR meets the expected set of features you aimed to get with your work. Thank you!
class DummyModelRunnerOutput(ModelRunnerOutput):
    def __init__(
        self,
        finished_sending: set[str] | None = None,
ignore file, to be rebased once #26734 lands
| """ | ||
| Get the count of requests expected to complete send/receive operations | ||
| via this connector. | ||
| via this connector. This method is used to initialize the |
ignore, to be rebased once #26734 lands
    tp_ratio,
)

### (Optional) Register local agent memory regions. MLA is not split.
gist of the PR
# on notification so that dst worker can wait before freeing blocks.
# Cap to 1 when P TP > D TP: only a single rank will read from remote.
tp_ratio = max(1, self.kv_topo.tp_ratio_from_engine_id(dst_engine_id))
This is to have P wait for only 1 request instead of -tp_ratio.
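A minimal sketch of the sign convention this cap appears to rely on, inferred from the diff above; tp_ratio_from_sizes is a hypothetical stand-in for kv_topo.tp_ratio_from_engine_id, not the actual helper:

```python
def tp_ratio_from_sizes(prefill_tp: int, decode_tp: int) -> int:
    # Assumed convention: positive when D TP >= P TP (how many D ranks read
    # from each P rank), negative with magnitude P_TP // D_TP otherwise.
    if decode_tp >= prefill_tp:
        return decode_tp // prefill_tp
    return -(prefill_tp // decode_tp)

# P caps the number of "read done" notifications it waits for per request:
print(max(1, tp_ratio_from_sizes(8, 2)))  # P TP 8 > D TP 2 -> waits for 1
print(max(1, tp_ratio_from_sizes(2, 8)))  # D TP 8 > P TP 2 -> waits for 4
```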
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""
KV cache helper for store.
ignore file, to be rebased once #26734 lands
vllm/executor/executor_base.py (outdated)
import asyncio
import time
from abc import ABC, abstractmethod
ignore file, to be rebased once #26734 lands
include_finished_set=vllm_config.parallel_config.data_parallel_size > 1,
log_stats=self.log_stats,
block_size=scheduler_block_size,
)
ignore, to be rebased once #26734 lands
class KVConnectorOutput:
    # [req_ids]
    finished_sending: set[str] | None = None
    finished_recving: set[str] | None = None
ignore, to be rebased once #26734 lands
This pull request has merge conflicts that must be resolved before it can be merged.
PR's now ready for review!
cc @xuechendi for xpu
@zhenwei-intel, please help to review, thx
self.num_layers = 0

# nixl_prepped_dlist_handle.
self.src_xfer_side_handle: int = 0
Dropped the default self.src_xfer_side_handle in favor of self.src_xfer_handles_by_block_size[self.block_size].
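A minimal sketch of the resulting bookkeeping, assuming a simplified container class (not the connector's actual fields):

```python
class XferHandleBook:
    """Illustrative only: one NIXL-prepped descriptor-list handle per local
    block size, replacing the single default src_xfer_side_handle."""

    def __init__(self, block_size: int) -> None:
        self.block_size = block_size
        self.src_xfer_handles_by_block_size: dict[int, int] = {}

    def default_src_handle(self) -> int:
        # The old "default" handle is now just the entry for the local block size.
        return self.src_xfer_handles_by_block_size[self.block_size]
```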
if self.use_mla and tp_ratio < 0:
    # ..but we still need to notify the other remote ranks that we
    # have the blocks we need so they can update the request state.
Important MLA logic.
self.device_id: int = 0
# Current rank may pull from multiple remote TP workers.
self.kv_caches_base_addr: defaultdict[EngineId, dict[int, list[int]]] = (
It would be helpful to add a comment explaining the leveled dict, e.g.:
# EngineId, dict[int, list[int]] -> engine_id, tp_rank, base_addr_for_layer
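For illustration, a minimal sketch of the leveled dict with the suggested comment; EngineId is simplified to a string alias and the addresses are made up:

```python
from collections import defaultdict

EngineId = str  # simplified alias; vLLM uses its own engine identifier

# engine_id -> remote tp_rank -> per-layer KV cache base addresses.
# The inner rank key is needed because, with P TP > D TP, the current D rank
# may pull from multiple remote TP workers of the same engine.
kv_caches_base_addr: defaultdict[EngineId, dict[int, list[int]]] = defaultdict(dict)

# Example population during handshake (two layers per rank, fake addresses):
kv_caches_base_addr["prefill-0"][0] = [0x7F00_0000, 0x7F10_0000]
kv_caches_base_addr["prefill-0"][1] = [0x8F00_0000, 0x8F10_0000]
```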
PR is verified with the heter_block_size test, and it looks good.
Overview
This PR addresses the case where the prefill (P) tensor-parallel size is greater than the decode (D) tensor-parallel size.
I think it helps to differentiate two main cases:
MLA
For MLA models, the workflow is easier: each D worker reads from a single P worker (reads are fanned out so that not all D workers read from the same remote), as the MLA cache is duplicated across P workers. Some P workers will not be read from at all.
Mind that this also holds for DP/EP deployments, where the TP size on D will often be 1!
See PR #23917, which also serves as a good use case. As explained in that PR, the number of requests to "expect" is indeed the number of remote instances reading from P.
The main issue with implementing this in NIXL is that each P worker tracks requests as they come in (_reqs_to_send, _reqs_to_process), and those structures are only cleared properly when a read is detected (otherwise timeouts would be raised on P). To address that, MLA D ranks execute only one transfer, but notify all affected remotes that the read is completed (sending multiple NIXL notifs).
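A rough sketch of that flow under the assumptions above; the helper names and the notify callback are hypothetical, not the actual NixlConnector API:

```python
def mapped_remote_ranks(d_rank: int, p_tp_size: int, d_tp_size: int) -> list[int]:
    # With P TP > D TP, each D rank is responsible for a contiguous span of
    # p_tp_size // d_tp_size P ranks.
    span = p_tp_size // d_tp_size
    return list(range(d_rank * span, (d_rank + 1) * span))

def read_and_notify(d_rank: int, p_tp_size: int, d_tp_size: int,
                    req_id: str, notify) -> None:
    remotes = mapped_remote_ranks(d_rank, p_tp_size, d_tp_size)
    # Fan out reads across the span; the MLA cache is replicated, so any
    # P rank in the span holds the full KV for the request.
    read_rank = remotes[d_rank % len(remotes)]
    # ... issue a single NIXL read from read_rank here ...
    for p_rank in remotes:
        # Every mapped P rank gets a notification so it can clear
        # _reqs_to_send/_reqs_to_process and avoid spurious timeouts.
        notify(p_rank, req_id)
```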
cc @njhill @markmc
Dense
For dense models, every D worker reads from n P workers to re-compose its own KV cache, where n is referred to as tp_ratio in the code. This is possible because the number of heads on each P worker is H/n, where H is the number of heads on a D worker, so you can efficiently read into D's cache using the HND layout. That is, in memory, you're just laying out flat ND tensors of H/n heads, n times.
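To make the layout argument concrete, here is a small torch sketch; the shapes and the flat-copy pattern are illustrative, not the actual transfer code:

```python
import torch

n = 2                  # tp_ratio: number of P workers per D worker
H, N, Dh = 8, 16, 128  # D-side heads per worker, block tokens, head dim

# With HND layout a D-side KV block is [num_heads, block_size, head_dim],
# contiguous in memory, so the heads owned by each P worker form one
# contiguous slab that a single flat read per remote can fill.
d_block = torch.empty(H, N, Dh)
p_blocks = [torch.randn(H // n, N, Dh) for _ in range(n)]  # one slab per P worker

for i, p_block in enumerate(p_blocks):
    chunk = p_block.numel()
    d_block.view(-1)[i * chunk:(i + 1) * chunk] = p_block.reshape(-1)

# The re-composed block is exactly the concatenation along the head dim.
assert torch.equal(d_block, torch.cat(p_blocks, dim=0))
```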
Side note: the current design is flexible and allows for dynamic discovery of remotes with different TP sizes. This is not a feature that is currently supported, but it helps to keep in mind when considering implementation choices; it's more of an optional route I'd like to keep open.
Changes
The main change this PR introduces is allowing a D worker to read from multiple P workers.
Practical edits this PR makes to do so:
- src_xfer_side_chunked_handles: local regions need to be split differently based on how many remotes we want to read from. This is prepared during handshake, once.
- [engine_id][rank_no] indexing to accommodate the above.
- get_target_remote -> get_target_remotes for the same reason, plus a bunch of for loops over its result (see the sketch after this list).
- tp_ratio extension to indicate a remote P size greater than D's.
- _pop_done_transfers
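A hedged sketch of the get_target_remote -> get_target_remotes shape change; RemoteInfo and the rank math are illustrative assumptions, not vLLM's actual types:

```python
from dataclasses import dataclass

@dataclass
class RemoteInfo:
    engine_id: str
    tp_rank: int

def get_target_remotes(engine_id: str, d_rank: int, tp_ratio: int) -> list[RemoteInfo]:
    # Assumed convention: tp_ratio < 0 encodes P TP > D TP, so this D rank
    # maps to |tp_ratio| remote P ranks instead of exactly one.
    if tp_ratio >= 0:
        return [RemoteInfo(engine_id, d_rank // max(1, tp_ratio))]
    span = -tp_ratio
    return [RemoteInfo(engine_id, r) for r in range(d_rank * span, (d_rank + 1) * span)]

# Callers that used to handle a single remote now loop over the result:
for remote in get_target_remotes("prefill-0", d_rank=1, tp_ratio=-2):
    ...  # issue one chunked read per remote P rank
```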
How to test
And check out tp_config_sweep_accuracy with config:
TODO
Coming soon to this PR:
- [ ] On MLA with DP/EP, avoid having all workers read from the same remote (deferring).
It does NOT support the replicated KV heads scenario, tp_size > num_heads. This is definitely doable, but I believe demand is weak at the moment, so we can postpone it.