
Conversation


@lulina lulina commented Nov 28, 2025

What this PR does / why we need it?

This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations for transformer networks on Ascend hardware. For details about xlite, please refer to: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default aclgraph mode is as follows:

Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison

  • aclgraph: main(c4a71fc)
  • xlite-full: main(c4a71fc) + xlite-full
  • xlite-decode-only: main(c4a71fc) + xlite-decode-only
  • diff1: relative change of xlite-full vs. aclgraph (see the sketch after the table)
  • diff2: relative change of xlite-decode-only vs. aclgraph
| Concurrency | Item | TTFT Avg (ms) | TTFT P99 (ms) | TPOT Avg (ms) | TPOT P99 (ms) | QPS (req/s) | OutputSpeed (token/s) |
|---|---|---|---|---|---|---|---|
| 1 | aclgraph | 181.40 | 224.55 | 16.75 | 16.91 | 0.11 | 58.65 |
| 1 | xlite-full | 71.26 | 138.41 | 12.78 | 13.04 | 0.14 | 77.57 |
| 1 | xlite-decode-only | 181.47 | 212.77 | 12.81 | 12.90 | 0.14 | 76.25 |
| 1 | diff1 | -60.72% | -38.36% | -23.70% | -22.89% | 27.27% | 32.26% |
| 1 | diff2 | 0.04% | -5.25% | -23.52% | -23.71% | 27.27% | 30.01% |
| 16 | aclgraph | 246.36 | 677.04 | 20.15 | 21.59 | 1.46 | 749.95 |
| 16 | xlite-full | 143.08 | 772.47 | 14.81 | 15.21 | 1.99 | 1022.76 |
| 16 | xlite-decode-only | 251.65 | 594.64 | 17.27 | 18.60 | 1.70 | 873.42 |
| 16 | diff1 | -41.92% | 14.10% | -26.50% | -29.55% | 36.30% | 36.38% |
| 16 | diff2 | 2.15% | -12.17% | -14.29% | -13.85% | 16.44% | 16.46% |
| 32 | aclgraph | 308.78 | 1006.40 | 26.10 | 28.89 | 2.24 | 1158.40 |
| 32 | xlite-full | 188.75 | 1350.06 | 17.37 | 18.27 | 3.35 | 1732.53 |
| 32 | xlite-decode-only | 302.41 | 998.39 | 22.07 | 24.52 | 2.65 | 1367.90 |
| 32 | diff1 | -38.87% | 34.15% | -33.45% | -36.76% | 49.55% | 49.56% |
| 32 | diff2 | -2.06% | -0.80% | -15.44% | -15.13% | 18.30% | 18.09% |
| 48 | aclgraph | 374.15 | 2140.47 | 30.17 | 32.79 | 2.94 | 1510.22 |
| 48 | xlite-full | 239.59 | 1767.78 | 19.84 | 20.98 | 4.46 | 2292.57 |
| 48 | xlite-decode-only | 357.71 | 2255.69 | 26.73 | 29.66 | 3.33 | 1710.06 |
| 48 | diff1 | -35.96% | -17.41% | -34.24% | -36.02% | 51.70% | 51.80% |
| 48 | diff2 | -4.39% | 5.38% | -11.40% | -9.55% | 13.27% | 13.23% |
| 64 | aclgraph | 401.19 | 2344.22 | 34.18 | 37.70 | 3.47 | 1777.12 |
| 64 | xlite-full | 292.08 | 2481.62 | 22.35 | 24.03 | 5.26 | 2696.59 |
| 64 | xlite-decode-only | 409.60 | 2846.45 | 30.71 | 34.13 | 3.86 | 1978.84 |
| 64 | diff1 | -27.20% | 5.86% | -34.61% | -36.26% | 51.59% | 51.74% |
| 64 | diff2 | 2.10% | 21.42% | -10.15% | -9.47% | 11.24% | 11.35% |
| 100 | aclgraph | 461.18 | 2944.03 | 42.77 | 47.57 | 4.34 | 2231.93 |
| 100 | xlite-full | 399.83 | 3574.64 | 29.40 | 32.58 | 6.27 | 3222.64 |
| 100 | xlite-decode-only | 470.22 | 2993.68 | 40.51 | 45.60 | 4.60 | 2362.36 |
| 100 | diff1 | -13.30% | 21.42% | -31.26% | -31.51% | 44.47% | 44.39% |
| 100 | diff2 | 1.96% | 1.69% | -5.28% | -4.14% | 5.99% | 5.84% |
| 150 | aclgraph | 564.78 | 4217.38 | 54.85 | 61.22 | 5.10 | 2619.77 |
| 150 | xlite-full | 562.30 | 5659.82 | 38.84 | 43.61 | 7.13 | 3659.03 |
| 150 | xlite-decode-only | 561.76 | 4021.58 | 53.88 | 61.43 | 5.21 | 2675.94 |
| 150 | diff1 | -0.44% | 34.20% | -29.19% | -28.77% | 39.80% | 39.67% |
| 150 | diff2 | -0.53% | -4.64% | -1.77% | 0.34% | 2.16% | 2.14% |
| 200 | aclgraph | 692.50 | 5131.05 | 67.67 | 77.12 | 5.55 | 2838.86 |
| 200 | xlite-full | 703.36 | 7712.50 | 47.14 | 53.21 | 7.88 | 4025.61 |
| 200 | xlite-decode-only | 679.75 | 5389.56 | 67.34 | 77.04 | 5.60 | 2861.68 |
| 200 | diff1 | 1.57% | 50.31% | -30.34% | -31.00% | 41.98% | 41.80% |
| 200 | diff2 | -1.84% | 5.04% | -0.49% | -0.10% | 0.90% | 0.80% |
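For reference, the diff rows are consistent with a simple relative change against the aclgraph baseline (this is inferred from the numbers above, not stated explicitly in the PR); a minimal sketch:

```python
# Relative change of an xlite variant vs. the aclgraph baseline,
# consistent with the diff1/diff2 rows above.
def relative_change(xlite_value: float, aclgraph_value: float) -> float:
    """Percentage change of the xlite measurement over the aclgraph one."""
    return (xlite_value - aclgraph_value) / aclgraph_value * 100

# Concurrency 1, TTFT Avg: (71.26 - 181.40) / 181.40 -> -60.72%
print(f"{relative_change(71.26, 181.40):.2f}%")   # -60.72%
# Concurrency 1, OutputSpeed: (77.57 - 58.65) / 58.65 -> 32.26%
print(f"{relative_change(77.57, 58.65):.2f}%")    # 32.26%
```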

Test config:

export TASK_QUEUE_ENABLE=1
export VLLM_USE_V1=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
python -m vllm.entrypoints.openai.api_server \
	--model /mnt/nvme0n1/models/Qwen3-32B  \
	--tensor-parallel-size 8 \
	--gpu-memory-utilization 0.9 \
	--max-num-batched-tokens 8192 \
	--max-num-seqs=200 \
	--block-size 128 \
	--max-model-len 6656 \
	--trust-remote-code \
	--disable-log-requests \
	--served-model-name qwen \
	--no-enable-prefix-caching \
	--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
	--async-scheduling \
	--host ${ip} \
	--port ${port} > ${log} 2>&1 &
vllm bench serve --max-concurrency ${concurrency} --num-prompts ${num_prompts} \
	--host 127.0.0.1 --port 8080 --model qwen \
	--dataset-name random --backend openai-chat \
	--random-input-len 512 --random-output-len 512 --random-range-ratio 0.05 \
	--tokenizer /mnt/nvme0n1/models/Qwen3-32B \
	--endpoint /v1/chat/completions --ignore-eos

Does this PR introduce any user-facing change?

Yes. The xlite graph mode can be enabled by setting xlite_graph_config via --additional-config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions bot added the documentation (Improvements or additions to documentation) and module:core labels Nov 28, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Euler xlite graph wrapper, which provides significant performance improvements for transformer-based models on Ascend hardware. The changes include adding a new environment variable to enable xlite, a wrapper for the vLLM model, and model-specific configurations for Llama-like architectures.

My review focuses on the new xlite.py implementation. I've identified a critical correctness issue regarding the handling of model layer features, which could lead to unexpected behavior if not all layers are consistent. I've also pointed out a potential performance bottleneck related to device-wide synchronization that could be optimized for better throughput. Overall, this is a valuable contribution, and addressing these points will improve its robustness and performance.

Comment on lines 85 to 102
        mha_qkv_bias = [
            layer.self_attn.qkv_proj.bias for layer in layers
            if hasattr(layer.self_attn.qkv_proj, "bias")
            and layer.self_attn.qkv_proj.bias is not None
        ]
        q_norm = [
            layer.self_attn.q_norm.weight for layer in layers
            if hasattr(layer.self_attn, "q_norm")
        ]
        k_norm = [
            layer.self_attn.k_norm.weight for layer in layers
            if hasattr(layer.self_attn, "k_norm")
        ]
        if len(mha_qkv_bias) == 0:
            config.qkv_bias = False
        else:
            config.qkv_bias = True
            xlite_model.mha_qkv_bias = mha_qkv_bias

        if len(q_norm) == 0 or len(k_norm) == 0:
            config.qk_norm = False
        else:
            config.qk_norm = True
            xlite_model.mha_q_norm = q_norm
            xlite_model.mha_k_norm = k_norm

critical

The logic for detecting qkv_bias and qk_norm assumes that these features are either present on all layers or on none. If a model has these features on only a subset of layers, the code will proceed with an incomplete list of weights/biases. This could lead to crashes or incorrect behavior in the underlying C++ extension, as it might expect a complete list of parameters for all layers.

It's crucial to validate that these features are present consistently across all layers. I suggest adding explicit checks to ensure that if these features are present, they are present on all layers.

        num_layers = len(layers)
        mha_qkv_bias = [
            layer.self_attn.qkv_proj.bias for layer in layers
            if hasattr(layer.self_attn.qkv_proj, "bias")
            and layer.self_attn.qkv_proj.bias is not None
        ]
        if len(mha_qkv_bias) == 0:
            config.qkv_bias = False
        elif len(mha_qkv_bias) == num_layers:
            config.qkv_bias = True
            xlite_model.mha_qkv_bias = mha_qkv_bias
        else:
            raise ValueError(
                "Inconsistent qkv_bias settings across layers. "
                f"Found bias on {len(mha_qkv_bias)}/{num_layers} layers."
            )

        q_norm = [
            layer.self_attn.q_norm.weight for layer in layers
            if hasattr(layer.self_attn, "q_norm")
        ]
        k_norm = [
            layer.self_attn.k_norm.weight for layer in layers
            if hasattr(layer.self_attn, "k_norm")
        ]
        if len(q_norm) == 0 and len(k_norm) == 0:
            config.qk_norm = False
        elif len(q_norm) == num_layers and len(k_norm) == num_layers:
            config.qk_norm = True
            xlite_model.mha_q_norm = q_norm
            xlite_model.mha_k_norm = k_norm
        else:
            raise ValueError(
                "Inconsistent qk_norm settings across layers. "
                f"Found q_norm on {len(q_norm)}/{num_layers} layers and "
                f"k_norm on {len(k_norm)}/{num_layers} layers. "
                "Both must be present on all layers or none."
            )

# and it is necessary to synchronize with the current stream
# of torch.npu here to ensure the validity of the NPU tensors
# in the prepare input process.
torch.npu.synchronize()

high

Using torch.npu.synchronize() can introduce a significant performance overhead as it blocks the host until all kernels on all streams on the current device are complete. This is a very strong synchronization primitive.

While the comment mentions it's necessary for correctness with xlite's self-managed stream, it would be more efficient to use a more fine-grained synchronization mechanism if possible. For example, using torch.npu.Event to synchronize only the necessary streams.

Could you investigate if the xlite C++ extension provides an API to wait on a torch.npu.Event? If so, you could replace torch.npu.synchronize() with something like this:

# Before calling xlite_model.forward
event = torch.npu.Event()
event.record()
# Then pass the event to xlite and have it wait.
# e.g., self.xlite_rt.wait_event(event)

This would avoid stalling other unrelated operations on the NPU and could improve overall throughput, especially in concurrent scenarios.
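A minimal sketch of what that could look like with torch_npu events; the consumer stream and the point where xlite would wait are assumptions for illustration, not part of the xlite API:

```python
import torch
import torch_npu  # noqa: F401  (registers the torch.npu backend)

# Record an event on the stream that prepared the inputs, then let only
# the consumer stream wait on it instead of a device-wide synchronize.
ready = torch.npu.Event()
ready.record()  # records on the current stream

xlite_stream = torch.npu.Stream()  # stand-in for xlite's self-managed stream
xlite_stream.wait_event(ready)     # device-side wait; the host is not blocked
with torch.npu.stream(xlite_stream):
    # xlite_model.forward(...) would be launched here
    pass
```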

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


## Using XliteGraph

If you want to run Llama or Qwen dense series models with xlite graph mode, please set the environment variable VLLM_ASCEND_ENABLE_XLITE to 1.
Collaborator

I think it's better to describe installation of xlite first.

Collaborator

@wangxiyuan left a comment

We should add an e2e test and make sure xlite can be installed from PyPI before merging. Thanks.

scheduler_config = vllm_config.scheduler_config
max_batch_size = scheduler_config.max_num_seqs
max_seq_len = scheduler_config.max_model_len
config.max_m = scheduler_config.max_num_batched_tokens


It is recommended to encapsulate the scattered config assignments into an independent method to improve cohesion.
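For illustration, such a helper could look like the sketch below; the function name and return values are hypothetical, and only config.max_m and the scheduler fields come from the quoted snippet:

```python
def _init_scheduler_limits(vllm_config, config):
    """Group the scheduler-derived settings from vllm_config in one place."""
    scheduler_config = vllm_config.scheduler_config
    max_batch_size = scheduler_config.max_num_seqs
    max_seq_len = scheduler_config.max_model_len
    config.max_m = scheduler_config.max_num_batched_tokens
    return max_batch_size, max_seq_len, config
```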

@lulina lulina force-pushed the vllm_ascend_xlite branch from 2c54aa6 to a0bda7a Compare December 4, 2025 02:21
@lulina lulina force-pushed the vllm_ascend_xlite branch 7 times, most recently from 197da71 to a777c47 Compare December 5, 2025 00:45
github-actions bot commented Dec 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@lulina lulina force-pushed the vllm_ascend_xlite branch from b956a15 to 8d950cc Compare December 5, 2025 09:22