[FlashInfer] Disable TRTLLM for block_size 16 and head_size 256 #28001
Conversation
Signed-off-by: Chen Zhang <[email protected]>
Code Review
This pull request correctly disables the TRTLLM attention backend for the problematic combination of head_size=256 and block_size=16, as reported in the linked issue. The changes are well-contained, modifying the use_trtllm_attention utility and updating its call sites.
My main feedback is to add a warning when this override occurs while a user has explicitly forced TRTLLM attention via an environment variable. This ensures consistency with how other configuration overrides are handled and improves the user's debugging experience.
```python
if head_size == 256 and block_size == 16:
    ## https://github.com/flashinfer-ai/flashinfer/issues/1993 reports that`
    # head size 256 and block size 16 is incorrect on blackwell.
    return False
```
When VLLM_USE_TRTLLM_ATTENTION is set to 1, the user is explicitly forcing the use of TRTLLM attention. If we override this setting due to a known bug, we should inform the user with a warning, similar to how other incompatible configurations are handled in this function. This improves transparency and helps with debugging. I've also cleaned up the comment for clarity.
Original:

```python
if head_size == 256 and block_size == 16:
    ## https://github.com/flashinfer-ai/flashinfer/issues/1993 reports that`
    # head size 256 and block size 16 is incorrect on blackwell.
    return False
```

Suggested change:

```python
if head_size == 256 and block_size == 16:
    # Per https://github.com/flashinfer-ai/flashinfer/issues/1993, this
    # combination is incorrect on Blackwell GPUs.
    if force_use_trtllm:
        logger.warning_once(
            "Disabling TRTLLM attention for head_size=256 and "
            "block_size=16 due to a known bug, even though "
            "VLLM_USE_TRTLLM_ATTENTION is set to 1."
        )
    return False
```
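For reference, here is a small, self-contained sketch of the pattern this suggestion follows: when the user explicitly forces a backend via an environment variable but the configuration is known to be broken, override the setting and warn once rather than silently ignoring the request. Names such as use_trtllm_attention_sketch and warning_once are illustrative stand-ins, not vLLM's actual helpers.

```python
import logging

logger = logging.getLogger("sketch")
_warned: set[str] = set()


def warning_once(msg: str) -> None:
    # Stand-in for vLLM's logger.warning_once: emit each distinct message once.
    if msg not in _warned:
        _warned.add(msg)
        logger.warning(msg)


def use_trtllm_attention_sketch(
    head_size: int, block_size: int, force_use_trtllm: bool
) -> bool:
    # Simplified stand-in for the guard under review (assumed signature).
    if head_size == 256 and block_size == 16:
        if force_use_trtllm:
            warning_once(
                "Disabling TRTLLM attention for head_size=256 and "
                "block_size=16 due to a known bug, even though "
                "VLLM_USE_TRTLLM_ATTENTION is set to 1."
            )
        return False
    return True
```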
💡 Codex Review
Here are some automated review suggestions for this pull request.
```python
if head_size == 256 and block_size == 16:
    ## https://github.com/flashinfer-ai/flashinfer/issues/1993 reports that`
    # head size 256 and block size 16 is incorrect on blackwell.
```
Disable TRTLLM without updating quantization support
The new early return for head_size == 256 and block_size == 16 prevents TRTLLM from being used, but the rest of the FlashInfer backend still treats this configuration as TRTLLM-capable: support_trtllm_attn is computed via can_use_trtllm_attention (which ignores head and block sizes), and supports_quant_query_input() continues to return True. During initialization the attention layer therefore enables FP8 query quantization, yet at runtime the metadata builder falls back q_data_type to the model dtype. When the quantized query tensor reaches FlashInferAttentionImpl.forward, the assertion attn_metadata.q_data_type == query.dtype trips and inference aborts for exactly the 256/16 combination this change was meant to unblock. can_use_trtllm_attention or supports_quant_query_input should incorporate the new restriction (or clear query_quant) so that query quantization is not attempted when TRTLLM is disabled by this guard.
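One hedged sketch of the kind of fix this finding points at, under the assumption that the quantization decision can consult the same head/block restriction. The function and class names here are illustrative, not vLLM's real API.

```python
def can_use_trtllm_attention_sketch(head_size: int, block_size: int) -> bool:
    # The real check also considers GPU capability, head counts, dtypes, etc.;
    # the point is that it should share the head_size/block_size restriction
    # with the runtime guard (flashinfer-ai/flashinfer#1993).
    if head_size == 256 and block_size == 16:
        return False
    return True


class FlashInferBackendSketch:
    def __init__(self, head_size: int, block_size: int) -> None:
        # Capability and quantization decisions derive from the same flag, so a
        # config that can never run TRTLLM never enables FP8 query input.
        self.support_trtllm_attn = can_use_trtllm_attention_sketch(
            head_size, block_size
        )

    def supports_quant_query_input(self) -> bool:
        return self.support_trtllm_attn
```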
Purpose
flashinfer-ai/flashinfer#1993 reports that this combination produces incorrect results, so this PR disables it.
Thanks to @vadiklyutiy for exploring this problem in #27704.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.