[Feat] shared expert dp for deepseek_mtp #3811
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for shared expert data parallelism for DeepSeek models, a valuable enhancement. The changes span configuration, attention mechanisms, model layers, and custom operators to accommodate sequence parallelism. The strategy to decouple this feature from the torchair framework is logical and consistently implemented. My review has uncovered a critical bug in vllm_ascend/ascend_config.py where a crucial dependency check is flawed, which could result in runtime issues. I have included a specific comment with a recommended solution for this issue. The rest of the changes seem to be well-aligned with the PR's objectives.
vllm_ascend/ascend_config.py
Outdated
```python
if not enable_sp:
    self.enable_shared_expert_dp = False
    logger.info(
        f"enable_shared_expert_dp is {self.enable_shared_expert_dp}"
        "Enable enable_shared_expert_dp must enable_sp"
    )
```
The condition if not enable_sp: on line 73 is incorrect. enable_sp is a function object, which always evaluates to True in a boolean context. Therefore, not enable_sp is always False, and the code block that disables enable_shared_expert_dp is never executed. This means the check to ensure enable_sp is active when using shared expert DP is bypassed, which can lead to runtime errors or incorrect behavior. The function should be called: if not enable_sp():.
Additionally, the log message on lines 75-78 is malformed due to a missing space between concatenated strings and should use logger.warning to alert the user that their configuration is being overridden.
```diff
-if not enable_sp:
-    self.enable_shared_expert_dp = False
-    logger.info(
-        f"enable_shared_expert_dp is {self.enable_shared_expert_dp}"
-        "Enable enable_shared_expert_dp must enable_sp"
-    )
+if not enable_sp():
+    self.enable_shared_expert_dp = False
+    logger.warning(
+        "Disabling 'enable_shared_expert_dp' because 'enable_sp' is not active. "
+        "'enable_sp' is required for shared expert data parallelism."
+    )
```
```python
if (self.enable_mlapo and
        (attn_metadata is None or not forward_context.with_prefill)):
    hidden_states = torch.ops.vllm.maybe_all_gather_and_maybe_unpad(
        hidden_states.contiguous(), need_gather_q_kv)
```
Why do we have to all-gather hidden_states here?
The all_gather is performed before attention because the MLAPO fusion operator does not include the all_gather operation, so it needs to be done in advance. It has little impact on MLA.
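A minimal sketch of this gather-before-attention pattern, using plain `torch.distributed` collectives instead of the repo's `torch.ops.vllm.maybe_all_gather_and_maybe_unpad` custom op; the helper name and group handling below are illustrative assumptions, not the PR's code:

```python
import torch
import torch.distributed as dist


def gather_hidden_states_for_attention(hidden_states: torch.Tensor,
                                       group: dist.ProcessGroup,
                                       need_gather: bool) -> torch.Tensor:
    """All-gather the local [num_local_tokens, hidden] shard along dim 0
    before a fused attention op (e.g. MLAPO) that does not gather internally."""
    if not need_gather:
        return hidden_states
    world_size = dist.get_world_size(group)
    gathered = [torch.empty_like(hidden_states) for _ in range(world_size)]
    dist.all_gather(gathered, hidden_states.contiguous(), group=group)
    return torch.cat(gathered, dim=0)
```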
vllm_ascend/ascend_config.py
Outdated
```python
from vllm_ascend.utils import enable_sp
if not enable_sp(vllm_config):
    self.enable_shared_expert_dp = False
    logger.info(
```
Please test the functionality and performance of the P (prefill) node in single-operator (eager) mode.
Currently, the scenario where only shared expert DP is enabled on the P node, without enabling SP, is not considered.
vllm_ascend/utils.py
Outdated
```diff
 # TODO remove it after vllm has this func
 def shared_expert_dp_enabled() -> bool:
-    return get_ascend_config().enable_shared_expert_dp or enable_sp()
+    return get_ascend_config().enable_shared_expert_dp and enable_sp()
```
Deleting shared_expert_dp_enabled might be better, as there is currently no scenario where enable_sp is True and enable_shared_expert_dp is False.
Currently, due to limitations in the communication process, enabling SP requires enabling the shared expert DP. An assert restriction will be added to the configuration parameters.
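A minimal sketch of such a check, written as a hypothetical standalone helper rather than the PR's exact configuration code:

```python
def validate_shared_expert_dp(enable_sp: bool,
                              enable_shared_expert_dp: bool) -> None:
    """Enforce that shared expert DP is enabled whenever SP is enabled."""
    assert not enable_sp or enable_shared_expert_dp, (
        "enable_sp=True requires enable_shared_expert_dp=True: the current "
        "communication flow assumes shared expert DP whenever SP is enabled.")


validate_shared_expert_dp(enable_sp=True, enable_shared_expert_dp=True)    # OK
validate_shared_expert_dp(enable_sp=False, enable_shared_expert_dp=False)  # OK
```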
```python
inputs_embeds[positions == 0] = 0
inputs_embeds = self.enorm(inputs_embeds)
previous_hidden_states = self.hnorm(previous_hidden_states)
```
You can refer to the SP splitting used in the main model.
The modifications to deepseek_mtp have been made using a non-patch method when enabling shared expert DP.
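A minimal sketch of the resulting data flow in the MTP proposer, assuming token-dimension sharding across the TP group; the function and argument names below are illustrative assumptions, not the PR's exact code:

```python
import torch
import torch.distributed as dist


def run_mtp_with_sp(hidden_states: torch.Tensor,
                    mtp_forward,
                    group: dist.ProcessGroup) -> torch.Tensor:
    """Reduce-scatter the MTP input to the sequence-parallel shard size,
    run the draft model locally, then all-gather the output back to the
    full sequence length. Assumes num_tokens is divisible by world_size."""
    world_size = dist.get_world_size(group)
    local = torch.empty(hidden_states.size(0) // world_size,
                        hidden_states.size(1),
                        dtype=hidden_states.dtype,
                        device=hidden_states.device)
    # Match the dimensions of the sequence-parallel input_embedding.
    dist.reduce_scatter_tensor(local, hidden_states.contiguous(), group=group)
    local_out = mtp_forward(local)
    # Restore the full sequence length for the sampling step.
    full_out = torch.empty(local_out.size(0) * world_size, local_out.size(1),
                           dtype=local_out.dtype, device=local_out.device)
    dist.all_gather_into_tensor(full_out, local_out.contiguous(), group=group)
    return full_out
```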
This pull request has conflicts, please resolve those before we can evaluate the pull request.
whx-sjtu
left a comment
LGTM
LGTM. And there are two questions:
1. `enable_shared_expert_dp` and `flash_comm` should significantly reduce the TTFT. Could you please update your experimental results with 1P1D or PD-mix deployments with these features?
2. Please evaluate whether we can arrange these features properly (`enable_shared_expert_dp`, `flash_comm`, and `multistream_overlap_shared_expert`)?
```python
if not _ENABLE_SP and enable_shared_expert_dp:
    _ENABLE_SP = True
    logger.info(
```
logger.warning
Signed-off-by: chenmenglong <[email protected]>
Signed-off-by: zengran <[email protected]>
vllm_ascend/ascend_config.py
Outdated
```python
self.enable_shared_expert_dp = False
logger.info(
    f"enable_shared_expert_dp is {self.enable_shared_expert_dp}"
    "Enable enable_shared_expert_dp must enable sp")
```
Don't use f-strings in logger calls; please use %-style formatting. f-strings are evaluated eagerly and can cause performance problems.
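A minimal standalone sketch of the difference (not the PR's code):

```python
import logging

logger = logging.getLogger(__name__)
enable_shared_expert_dp = False

# Eager: the f-string is formatted even if the INFO level is disabled.
logger.info(f"enable_shared_expert_dp is {enable_shared_expert_dp}")

# Lazy: formatting is deferred to the logging framework and skipped
# entirely when the level is disabled.
logger.info("enable_shared_expert_dp is %s", enable_shared_expert_dp)
```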
```python
if x.size(0) != residual.size(0):
    sp_enabled = forward_context.sp_enabled
    assert sp_enabled is True, ("Currently, this situation only occurs "
                                "when sp is enabled")
```
`assert sp_enabled, ("Currently, this situation only occurs " "when sp is enabled")` would be better.
What this PR does / why we need it?
Support shared expert DP for deepseek_mtp feature.
`shared_expert_dp` requires `SP==True`, with corresponding parameter restrictions.
Previously, due to the coupling between `shared_expert_dp` and torchair, and the removal of `deepseek_mtp` in vllm_ascend, shared expert DP for deepseek_mtp was temporarily removed.
Currently, by performing a `reduce_scatter` on the input of deepseek_mtp in `mtp_proposer.py`, we ensure that it matches the dimensions of `input_embedding`, and then perform an `all_gather` on the output of MTP.
How was this patch tested?
baseline:

enable shared_expert_dp and multistream_overlap_shared_expert:

TPOT: 48ms -> 45.4ms
Average TPS per rank: 117.6 -> 126.1