
Conversation

Contributor

@dragondream-chen dragondream-chen commented Oct 28, 2025

What this PR does / why we need it?

Support shared expert DP for the deepseek_mtp feature.
shared_expert_dp requires SP==True, with corresponding parameter restrictions.
Previously, because shared_expert_dp was coupled with torchair and deepseek_mtp was removed from vllm_ascend, shared expert DP for deepseek_mtp was temporarily dropped.
Now, by performing a reduce_scatter on the input of deepseek_mtp in mtp_proposer.py, we ensure that it matches the dimensions of input_embedding, and then perform an all_gather on the output of MTP.
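The following is an illustrative sketch only, not the PR's actual mtp_proposer.py code: it shows the reduce_scatter-in / all_gather-out pattern described above with plain torch.distributed collectives. The helper names, the sp_group handle, and the assumption that the token dimension is already padded to a multiple of the SP world size are all hypothetical.

```python
import torch
import torch.distributed as dist


def scatter_mtp_input(hidden_states: torch.Tensor, sp_group) -> torch.Tensor:
    """Reduce-scatter the MTP input along the token dim so its length matches
    the sequence-parallel (already split) input_embedding chunk on this rank."""
    sp_size = dist.get_world_size(sp_group)
    # Assumes hidden_states.size(0) is already padded to a multiple of sp_size.
    local_len = hidden_states.size(0) // sp_size
    out = torch.empty(local_len, hidden_states.size(1),
                      dtype=hidden_states.dtype, device=hidden_states.device)
    dist.reduce_scatter_tensor(out, hidden_states.contiguous(), group=sp_group)
    return out


def gather_mtp_output(local_out: torch.Tensor, sp_group) -> torch.Tensor:
    """All-gather the MTP output so downstream code sees the full token length."""
    sp_size = dist.get_world_size(sp_group)
    full = torch.empty(local_out.size(0) * sp_size, local_out.size(1),
                       dtype=local_out.dtype, device=local_out.device)
    dist.all_gather_into_tensor(full, local_out.contiguous(), group=sp_group)
    return full
```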

How was this patch tested?

baseline:
[benchmark screenshot]

enable shared_expert_dp and multistream_overlap_shared_expert:
[benchmark screenshot]

TPOT: 48ms -> 45.4ms
Average TPS per rank: 117.6 -> 126.1

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for shared expert data parallelism for DeepSeek models, a valuable enhancement. The changes span configuration, attention mechanisms, model layers, and custom operators to accommodate sequence parallelism. The strategy to decouple this feature from the torchair framework is logical and consistently implemented. My review has uncovered a critical bug in vllm_ascend/ascend_config.py where a crucial dependency check is flawed, which could result in runtime issues. I have included a specific comment with a recommended solution for this issue. The rest of the changes seem to be well-aligned with the PR's objectives.

Comment on lines 73 to 78
if not enable_sp:
    self.enable_shared_expert_dp = False
    logger.info(
        f"enable_shared_expert_dp is {self.enable_shared_expert_dp}"
        "Enable enable_shared_expert_dp must enable_sp"
    )
Contributor

critical

The condition if not enable_sp: on line 73 is incorrect. enable_sp is a function object, which always evaluates to True in a boolean context. Therefore, not enable_sp is always False, and the code block that disables enable_shared_expert_dp is never executed. This means the check to ensure enable_sp is active when using shared expert DP is bypassed, which can lead to runtime errors or incorrect behavior. The function should be called: if not enable_sp():.

Additionally, the log message on lines 75-78 is malformed due to a missing space between concatenated strings and should use logger.warning to alert the user that their configuration is being overridden.

Suggested change

-if not enable_sp:
-    self.enable_shared_expert_dp = False
-    logger.info(
-        f"enable_shared_expert_dp is {self.enable_shared_expert_dp}"
-        "Enable enable_shared_expert_dp must enable_sp"
-    )
+if not enable_sp():
+    self.enable_shared_expert_dp = False
+    logger.warning(
+        "Disabling 'enable_shared_expert_dp' because 'enable_sp' is not active. "
+        "'enable_sp' is required for shared expert data parallelism."
+    )

@dragondream-chen dragondream-chen force-pushed the main branch 2 times, most recently from 9526aeb to 02f4ddf on October 28, 2025 06:24
if (self.enable_mlapo and
        (attn_metadata is None or not forward_context.with_prefill)):
    hidden_states = torch.ops.vllm.maybe_all_gather_and_maybe_unpad(
        hidden_states.contiguous(), need_gather_q_kv)
Collaborator

Why do we have to all_gather hidden_states?

Contributor Author

The all_gather is performed before attention because the fused MLAPO operator does not include the all_gather operation, so it needs to be done in advance, and it has little impact on MLA.

from vllm_ascend.utils import enable_sp
if not enable_sp(vllm_config):
    self.enable_shared_expert_dp = False
    logger.info(
Contributor Author

Tested the performance and functionality of the P (prefill) node in single-operator (eager) mode.

Contributor Author

Currently, the scenario where only shared expert DP is enabled on the P (prefill) node without enabling SP is not considered.

# TODO remove it after vllm has this func
def shared_expert_dp_enabled() -> bool:
-    return get_ascend_config().enable_shared_expert_dp or enable_sp()
+    return get_ascend_config().enable_shared_expert_dp and enable_sp()
Contributor

Deleting shared_expert_dp_enabled might be better, as there is currently no scenario where enable_sp is True and enable_shared_expert_dp is False.

Contributor Author

Currently, due to limitations in the communication process, enabling SP requires enabling shared expert DP. An assert restriction will be added to the configuration parameters.
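For illustration only, a minimal sketch of the kind of configuration assert described here; the field and helper names mirror this thread, but the actual check added by the PR may be written and placed differently (e.g. in vllm_ascend/ascend_config.py):

```python
# Hypothetical placement and wording: SP currently implies shared expert DP
# because of the communication pattern used on Ascend.
if enable_sp(vllm_config):
    assert self.enable_shared_expert_dp, (
        "enable_sp requires enable_shared_expert_dp to be enabled; "
        "please set enable_shared_expert_dp=True in the ascend config.")
```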

inputs_embeds[positions == 0] = 0
inputs_embeds = self.enorm(inputs_embeds)
previous_hidden_states = self.hnorm(previous_hidden_states)

Contributor Author

You can refer to how the main model performs the SP split.

Contributor Author

When shared expert DP is enabled, the modifications to deepseek_mtp are applied via a non-patch method.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@whx-sjtu whx-sjtu added the ready (ready for review) and ready-for-test (start test by label for PR) labels Nov 10, 2025
@dragondream-chen dragondream-chen changed the title [Feat] shared expert dp for deepseek and deepseek_mtp [Feat] shared expert dp for deepseek_mtp Nov 11, 2025
Collaborator

@whx-sjtu whx-sjtu left a comment

LGTM

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Collaborator

@linfeng-yuan linfeng-yuan left a comment

LGTM. And there are two questions:

  1. enable_shared_expert_dp and flash_comm should significantly reduce the TTFT. Could you please update your experimental results with 1P1D or PD-mix deployments with these features?
  2. Please evaluate whether we can arrange these features properly (enable_shared_expert_dp, flash_comm, and multistream_overlap_shared_expert).

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@zengzengran
Contributor

Resolved conflicts. Verification outcome:

1. Single-node hybrid deployment: "multistream_overlap_shared_expert": false, "enable_shared_expert_dp": false
[benchmark screenshot]

2. Single-node hybrid deployment: "multistream_overlap_shared_expert": true, "enable_shared_expert_dp": true
[benchmark screenshot]
[benchmark screenshot]

@zengzengran zengzengran force-pushed the main branch 3 times, most recently from cd5522f to 063e683 on November 29, 2025 06:44

if not _ENABLE_SP and enable_shared_expert_dp:
    _ENABLE_SP = True
    logger.info(
Collaborator

Use logger.warning here instead of logger.info.

@github-actions

github-actions bot commented Dec 1, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

self.enable_shared_expert_dp = False
logger.info(
    f"enable_shared_expert_dp is {self.enable_shared_expert_dp}"
    "Enable enable_shared_expert_dp must enable sp")
Contributor

Don't use f-strings in logger calls; use %-style formatting instead. f-strings are formatted eagerly and can cause a performance problem.
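For example, the same message rewritten with lazy %-formatting (the wording is illustrative; only the formatting style is the point):

```python
# The argument is only interpolated if the record is actually emitted.
logger.warning(
    "enable_shared_expert_dp is %s; enabling enable_shared_expert_dp requires SP",
    self.enable_shared_expert_dp)
```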

if x.size(0) != residual.size(0):
    sp_enabled = forward_context.sp_enabled
    assert sp_enabled is True, ("Currently, this situation only occurs "
                                "when sp is enabled")
Contributor

assert sp_enabled, ("Currently, this situation only occurs when sp is enabled") would be better.

@yiz-liu yiz-liu merged commit 143e1f4 into vllm-project:main Dec 1, 2025
22 checks passed
ChenCangtao pushed a commit to ChenCangtao/vllm-ascend that referenced this pull request Dec 3, 2025
### What this PR does / why we need it?
Support shared expert DP for deepseek_mtp feature. 
`shared_expert_dp` requires `SP==True`, with corresponding parameter
restrictions.
Previously, due to the coupling between `shared_expert_dp` and torchair,
and the removal of `deepseek_mtp` in vllm_ascend, shared expert dp of
deepseek_mtp was temporarily removed.
Currently, by performing the `reduce_scatter` on the input of
deepseek_mtp in `mtp_proposer.py`, we ensure that it matches the
dimensions of `input_embedding`, and then perform the `all_gather` on
the output of mtp.

### How was this patch tested?
baseline:
<img width="1184" height="692" alt="image"
src="https://github.com/user-attachments/assets/9680d53a-7b1d-481a-accc-b8f3dae2b9e3"
/>

enable shared_expert_dp and multistream_overlap_shared_expert:
<img width="1167" height="687" alt="image"
src="https://github.com/user-attachments/assets/2531d06b-dfda-4e24-8628-6f4b0f677ddc"
/>

TPOT: 48ms -> 45.4ms
Average TPS per rank: 117.6 -> 126.1


- vLLM version: v0.11.2
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.2

---------

Signed-off-by: chenmenglong <[email protected]>
Signed-off-by: zengran <[email protected]>
Co-authored-by: zengran <[email protected]>
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 4, 2025
Meihan-chen pushed a commit to Meihan-chen/vllm-ascend that referenced this pull request Dec 5, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 9, 2025
@linfeng-yuan linfeng-yuan mentioned this pull request Dec 9, 2025
Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 10, 2025
Mercykid-bash pushed a commit to Mercykid-bash/vllm-ascend that referenced this pull request Dec 10, 2025