
Conversation

@MatthewBonanni (Contributor) commented Sep 16, 2025

Purpose

#24453 added DCP (decode context parallel) support but did not support query_len > 1. This PR, which depends on a corresponding FlashAttention PR (vllm-project/flash-attention#92), implements a custom causal mask to take advantage of the FlashAttention MLA backend's support for query_len > 1, thereby enabling MTP (multi-token prediction).
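
As a rough illustration (not code from this PR or the FlashAttention change), the mask this enables can be sketched as follows, assuming that KV tokens are interleaved across DCP ranks by global position and that each request's query window starts at global position seq_len - query_len; the function name and sharding convention are assumptions for illustration only:

import torch

def dcp_causal_mask(query_len: int,
                    query_base_position: int,
                    num_local_kv: int,
                    dcp_rank: int,
                    dcp_world_size: int) -> torch.Tensor:
    # Hypothetical sketch: assumes the KV token at global position p is stored
    # on rank p % dcp_world_size; vLLM's actual DCP layout may differ.
    local_kv_positions = torch.arange(num_local_kv) * dcp_world_size + dcp_rank
    # Global positions of the query tokens in this decode step (the verified
    # token plus any MTP draft tokens).
    query_positions = query_base_position + torch.arange(query_len)
    # Query token q may attend only to KV tokens at positions <= q's position.
    return local_kv_positions.unsqueeze(0) <= query_positions.unsqueeze(1)

With query_len == 1 this reduces to ordinary DCP decode; with query_len > 1 each additional draft token is allowed to see one more position, which is the custom causal mask the FlashAttention MLA backend needs.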

Test Plan

Test Result



Signed-off-by: Matthew Bonanni <[email protected]>
@mergify mergify bot added the v1 label Sep 16, 2025
@MatthewBonanni MatthewBonanni changed the title Support MTP with context parallelism [Attention] Support MTP with context parallelism Sep 16, 2025
@MatthewBonanni MatthewBonanni changed the title [Attention] Support MTP with context parallelism [Attention] Support MTP with DCP Sep 16, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request aims to enable multi-token prediction (MTP) with context parallelism by supporting query_len > 1 in the FlashAttention MLA backend. The changes remove previous restrictions and add metadata for a custom causal mask.

I've found a critical issue where the new logic to compute query_base_positions in MLACommonMetadataBuilder is not being used by FlashAttnMLAMetadataBuilder because it overrides the _build_decode method. This will prevent the feature from working as intended. Please see my detailed comment.

Comment on lines +621 to +632
# Compute DCP query base positions if using DCP
query_base_positions = None

if self.dcp_world_size > 1:
query_lens = query_start_loc_cpu[1:] - query_start_loc_cpu[:-1]
query_base_positions = (seq_lens_cpu - query_lens).to(
seq_lens_device.device)

return MLACommonDecodeMetadata(
block_table=block_table_tensor,
seq_lens=seq_lens_device,
query_base_positions=query_base_positions,

Severity: critical

This change correctly computes query_base_positions for DCP. However, FlashAttnMLAMetadataBuilder in vllm/v1/attention/backends/mla/flashattn_mla.py overrides _build_decode and does not call this base implementation. As a result, query_base_positions will be None for the FlashAttention MLA backend, and the MTP with context parallelism feature will not work correctly.

To fix this, you should move this logic to FlashAttnMLAMetadataBuilder._build_decode or refactor it so that FlashAttnMLAMetadataBuilder can reuse this logic. For example, you could add the logic to FlashAttnMLAMetadataBuilder._build_decode and pass query_base_positions to the FlashAttnMLADecodeMetadata constructor.
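
For concreteness, a sketch of the kind of refactor being suggested, with the computation factored into a helper that both builders could call; the helper name and signature are hypothetical, and only the tensor math is taken from the snippet above:

from typing import Optional

import torch

def compute_query_base_positions(seq_lens_cpu: torch.Tensor,
                                 query_start_loc_cpu: torch.Tensor,
                                 device: torch.device,
                                 dcp_world_size: int) -> Optional[torch.Tensor]:
    # Only needed when decode context parallelism is active.
    if dcp_world_size <= 1:
        return None
    # Per-request query lengths from the cumulative query start locations.
    query_lens = query_start_loc_cpu[1:] - query_start_loc_cpu[:-1]
    # Position of each request's first query token: the number of tokens
    # already in the KV cache (seq_len - query_len).
    return (seq_lens_cpu - query_lens).to(device)

FlashAttnMLAMetadataBuilder._build_decode could then call such a helper and pass the result as query_base_positions when constructing FlashAttnMLADecodeMetadata, so the FlashAttention MLA path receives the same metadata as the common base implementation.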

@MatthewBonanni MatthewBonanni marked this pull request as draft September 17, 2025 15:32

mergify bot commented Sep 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @MatthewBonanni.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 24, 2025
@MatthewBonanni (Contributor, Author)

superseded by #25049
