@linfeng-yuan linfeng-yuan commented Dec 8, 2025

What this PR does / why we need it?

This PR eliminates the implicit HD (host-device) synchronization in the sfa backend, and in _build_dummy_attn_metadata and dummy_run in mtp_proposer, significantly improving dsv3.2 performance in low-latency scenarios.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Performance improvements were observed in E2E serving benchmarks (P: DP4TP8EP32, D: DP8TP4EP32) with num_speculative_tokens=3.

DSV3.2-W8A8-EXP:
TPOT: 41.67ms -> 23.36ms
ITL: 85.93ms -> 55.96ms

DSV3.2-W8A8 (released in December):
TPOT: 18.11ms
ITL: 56.13ms
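
For readers who want the relative gain rather than raw milliseconds, the DSV3.2-W8A8-EXP before/after figures above can be converted as follows. This is only an arithmetic sketch on the numbers quoted in this description; the helper name `improvement` is made up for illustration and is not part of the PR.

```python
def improvement(before_ms, after_ms):
    """Return (latency reduction in percent, speedup factor), both rounded."""
    reduction_pct = (before_ms - after_ms) / before_ms * 100
    speedup = before_ms / after_ms
    return round(reduction_pct, 1), round(speedup, 2)


# Figures copied from the DSV3.2-W8A8-EXP results in this PR description.
results = {
    "TPOT": improvement(41.67, 23.36),
    "ITL": improvement(85.93, 55.96),
}

for metric, (pct, factor) in results.items():
    print(f"{metric}: {pct}% lower latency, {factor}x speedup")
```

On these numbers, TPOT drops by about 43.9% (roughly 1.78x) and ITL by about 34.9% (roughly 1.54x).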

@linfeng-yuan linfeng-yuan force-pushed the fix_dsv32_fullgraph_and_mtp_performance branch from 3523634 to 8907b7c Compare December 8, 2025 14:47
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces performance improvements for dsv3.2 by optimizing synchronization related to HD operations and asynchronous scheduling. Key changes include explicitly using CPU tensors for certain calculations (query_start_loc_cpu, query_lens_cpu) and disabling ACL graphs during asynchronous scheduling to prevent synchronization overhead. A minor typo was also corrected. The changes appear to be well-aligned with the stated objective of improving performance and maintaining correctness.
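
The idea of working from host-side copies can be illustrated with a minimal sketch. It assumes the usual convention that query_start_loc holds cumulative token offsets with a leading 0 (one more entry than there are requests), and it uses plain Python lists as a stand-in for CPU tensors; the helper name `query_lens_from_start_loc` is hypothetical and is not taken from this PR's diff.

```python
def query_lens_from_start_loc(query_start_loc_cpu):
    """Derive per-request query lengths from cumulative start offsets.

    Everything here operates on host-side data, so nothing has to be
    read back from the device (which would force a synchronization).
    """
    return [
        end - start
        for start, end in zip(query_start_loc_cpu, query_start_loc_cpu[1:])
    ]


# Hypothetical batch: three requests with 4, 4 and 1 query tokens.
query_start_loc_cpu = [0, 4, 8, 9]
query_lens_cpu = query_lens_from_start_loc(query_start_loc_cpu)
max_query_len = max(query_lens_cpu)

print(query_lens_cpu, max_query_len)
```

The same quantities could be computed from the device tensor, but pulling a scalar such as the maximum length back to the host would block until the device catches up, which is the kind of implicit synchronization this PR sets out to avoid.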

github-actions bot commented Dec 8, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@linfeng-yuan linfeng-yuan force-pushed the fix_dsv32_fullgraph_and_mtp_performance branch from 8907b7c to 4370aa6 Compare December 9, 2025 02:26
@linfeng-yuan linfeng-yuan force-pushed the fix_dsv32_fullgraph_and_mtp_performance branch from 4370aa6 to 10716ff Compare December 9, 2025 02:26
@wangxiyuan
Collaborator

#4706 should wait for this PR to be merged first.
