@zhuyuhua-v zhuyuhua-v commented Nov 14, 2025

Purpose

Use the aiter Triton kernel, instead of the aiter FMHA kernel, as the fallback path for Triton MHA.

Test Plan

server:

export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_RMSNORM=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=1
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1

export VLLM_ROCM_USE_AITER_MLA=0 # triton path

model_path="path_to_model/deepseek-ai/DeepSeek-V3"
vllm serve $model_path \
    --tensor-parallel-size 8 \
    --max-num-batched-tokens 32768 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --disable-log-requests \
    --gpu_memory_utilization 0.9 \
    --port 6789 \
    --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
    --block-size 16 \
    --async-scheduling \
    --enforce-eager
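
Loading DeepSeek-V3 across 8 GPUs can take a while, so it may help to wait until the server responds before starting the accuracy run. The following is a minimal sketch, not part of this PR: `wait_for_server` is a hypothetical helper name, the timeout and interval defaults are arbitrary, and it assumes `curl` is installed and that the server exposes the `/health` endpoint on the port used above.

```shell
# Hypothetical helper (not part of this PR): poll a URL until it answers
# with an HTTP success status or the timeout expires.
# Arguments: URL, timeout in seconds (default 1800), poll interval in seconds (default 10).
# The elapsed time only counts the sleeps, so the real wait can run slightly longer.
wait_for_server() {
    url="$1"
    timeout="${2:-1800}"
    interval="${3:-10}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -s: silent, -f: treat HTTP errors as failure, --max-time: cap each probe
        if curl -sf -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 1
}
```

With the server started in another shell, `wait_for_server http://127.0.0.1:6789/health && lm_eval ...` would hold the accuracy run until the endpoint responds, and skip it if the timeout expires first.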

accuracy:

model="path_to_model/deepseek-ai/DeepSeek-V3"
lm_eval \
--model local-completions \
--tasks gsm8k \
--seed 123 \
--model_args model=${model},base_url=http://127.0.0.1:6789/v1/completions \
--batch_size 100

Test Result

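| Tasks | Version | Filter           | n-shot | Metric      | Value  | Stderr   |
|-------|---------|------------------|--------|-------------|--------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.9477 | ± 0.0061 |
|       |         | strict-match     | 5      | exact_match | 0.9454 | ± 0.0063 |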

@zhuyuhua-v zhuyuhua-v changed the title from "use aiter triton kernel as triton mha fallback path" to "[rocm]use aiter triton kernel as triton mha fallback path" Nov 14, 2025
@zhuyuhua-v zhuyuhua-v marked this pull request as draft November 14, 2025 05:29