[Attention] Use sparse prefill kernel for fp8 kv-cache in DeepSeek-v3.2 #27532
base: main
Conversation
Signed-off-by: Lucas Wilkinson <[email protected]>
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
def get(self, spec: "WorkspaceSpec") -> torch.Tensor:
    """Get a workspace tensor for the given spec.

    Args:
        spec: The workspace specification.

    Returns:
        A tensor view into the workspace buffer with the requested
        shape and dtype.
    """
    num_bytes = spec.num_bytes()
    current_workspace = self._ensure_workspace_size(num_bytes, spec.name)
    return current_workspace[:num_bytes].view(spec.dtype).reshape(spec.shape)
```
**Allocating workspaces fails due to invalid `view` call**

`WorkspaceManager.get` reinterprets the byte buffer with `current_workspace[:num_bytes].view(spec.dtype)`, but `Tensor.view` only accepts a shape, not a dtype. Passing a `torch.dtype` raises `TypeError: 'torch.dtype' object cannot be interpreted as an integer`, so every call to `reserve`/`get` will crash before returning a workspace. The manager needs to reshape using a size tuple and reinterpret the dtype separately (e.g. `view(-1).view(spec.dtype)` or `view(spec.dtype).reshape(...)`).
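For reference, here is a minimal standalone sketch of the reinterpret-then-reshape pattern this comment discusses. Note that recent PyTorch (>= 1.8) does ship a dtype overload of `Tensor.view`, so whether the flagged line actually crashes depends on the installed torch version; this is a sketch, not the PR's final fix.

```python
import torch

buf = torch.zeros(1024, dtype=torch.uint8)  # raw byte workspace buffer
shape = (4, 64)                             # stand-in for spec.shape
num_bytes = 4 * 64 * 2                      # bf16 is 2 bytes per element

# Slice the byte buffer, reinterpret the bytes as bf16 (dtype overload
# of view), then reshape to the requested shape.
view = buf[:num_bytes].view(torch.bfloat16).reshape(shape)
assert view.shape == shape and view.dtype == torch.bfloat16
```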
```python
# Process decode tokens
if num_decode_tokens > 0:
    attn_out = self._forward_fp8_kv(
        q[:num_decode_tokens],
        kv_cache,
        topk_indices_global[:num_decode_tokens],
        attn_metadata,
    )

if num_prefill_tokens > 0:
    decode_attn_out = attn_out
    attn_out = q.new_empty(
```
**Prefill-only batches reference `attn_out` before initialization**

In the fp8 path of `FlashMLASparseImpl.forward`, `attn_out` is only assigned inside the `if num_decode_tokens > 0` branch. The subsequent `if num_prefill_tokens > 0` branch unconditionally reads `decode_attn_out = attn_out`, which raises `UnboundLocalError` whenever a batch contains only prefill tokens. Prefill-only batches are common during initial context ingestion, so this path will always fail until `attn_out` is initialized for the prefill case.
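One hedged way to address this, sketched against the names in the diff above (the PR's actual fix may differ), is to allocate `attn_out` up front so a prefill-only batch never reads an unbound local:

```python
# Sketch: allocate attn_out unconditionally, then fill the decode
# prefix only when decode tokens exist. All names (q, num_actual_toks,
# self.num_heads, ...) are taken from the diff above and are
# assumptions about the surrounding scope.
attn_out = q.new_empty(
    (num_actual_toks, self.num_heads, self.kv_lora_rank),
    dtype=q.dtype,
    device=q.device,
)
if num_decode_tokens > 0:
    attn_out[:num_decode_tokens] = self._forward_fp8_kv(
        q[:num_decode_tokens],
        kv_cache,
        topk_indices_global[:num_decode_tokens],
        attn_metadata,
    )
```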
```python
if num_prefill_tokens > 0:
    decode_attn_out = attn_out
    attn_out = q.new_empty(
        (num_actual_toks, self.num_heads, self.kv_lora_rank),
        dtype=q.dtype,
        device=q.device,
    )
    attn_out[:num_prefill_tokens] = decode_attn_out[:num_prefill_tokens]
```
**Decode outputs stored into prefill slots**

When both decode and prefill tokens exist, the fp8 path copies decode attention results with `attn_out[:num_prefill_tokens] = decode_attn_out[:num_prefill_tokens]`. Decode tokens occupy the first `num_decode_tokens` entries, so this writes them into the wrong slice and fails whenever `num_prefill_tokens > num_decode_tokens` because the right-hand side is shorter than the target. The assignment should use `num_decode_tokens` to preserve decode outputs and avoid size mismatches.
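The corrected slice, per the comment (a one-line sketch against the names in the diff):

```python
# Preserve the decode prefix: decode tokens occupy the first
# num_decode_tokens rows of both tensors, so slice by that count.
attn_out[:num_decode_tokens] = decode_attn_out[:num_decode_tokens]
```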
When doing prefill, up-convert the kv-cache from fp8 to bf16 and call the bf16 prefill kernel instead of the decode kernel. This PR introduces global workspace management so that the bf16 workspace can overlap with the MoE workspace buffers.
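A rough sketch of the flow this description outlines, where every name (`bf16_prefill_kernel`, `workspace_mgr`, the per-block `scales`) is illustrative rather than the PR's actual API:

```python
import torch

def prefill_with_fp8_cache(q, kv_cache_fp8, scales, workspace_mgr, spec):
    # Borrow a bf16 view from the shared workspace (aliasing the MoE
    # buffers rather than allocating a dedicated one). spec.shape is
    # assumed to match the gathered cache pages.
    kv_bf16 = workspace_mgr.get(spec)
    # Up-convert the fp8 cache to bf16 and apply dequant scales, then
    # reuse the existing bf16 prefill kernel instead of the decode kernel.
    kv_bf16.copy_(kv_cache_fp8.to(torch.bfloat16))
    kv_bf16.mul_(scales)
    return bf16_prefill_kernel(q, kv_bf16)  # hypothetical kernel entry point
```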