
Conversation


@lulina lulina commented Nov 28, 2025

What this PR does / why we need it?

This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations for transformer networks on Ascend hardware. For details about xlite, please refer to: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default aclgraph mode is as follows:

Qwen3 32B TPS 910B3(A2) Online Inference Performance Comparison

  • aclgraph: main(c4a71fc)
  • xlite-full: main(c4a71fc) + xlite-full
  • xlite-decode-only: main(c4a71fc) + xlite-decode-only
  • diff1: relative change of xlite-full vs. aclgraph (see the sketch after the table)
  • diff2: relative change of xlite-decode-only vs. aclgraph
| Concurrency | Item | TTFT Avg (ms) | TTFT P99 (ms) | TPOT Avg (ms) | TPOT P99 (ms) | QPS (req/s) | OutputSpeed (token/s) |
|---|---|---|---|---|---|---|---|
| 1 | aclgraph | 181.40 | 224.55 | 16.75 | 16.91 | 0.11 | 58.65 |
| 1 | xlite-full | 71.26 | 138.41 | 12.78 | 13.04 | 0.14 | 77.57 |
| 1 | xlite-decode-only | 181.47 | 212.77 | 12.81 | 12.90 | 0.14 | 76.25 |
| 1 | diff1 | -60.72% | -38.36% | -23.70% | -22.89% | 27.27% | 32.26% |
| 1 | diff2 | 0.04% | -5.25% | -23.52% | -23.71% | 27.27% | 30.01% |
| 16 | aclgraph | 246.36 | 677.04 | 20.15 | 21.59 | 1.46 | 749.95 |
| 16 | xlite-full | 143.08 | 772.47 | 14.81 | 15.21 | 1.99 | 1022.76 |
| 16 | xlite-decode-only | 251.65 | 594.64 | 17.27 | 18.60 | 1.70 | 873.42 |
| 16 | diff1 | -41.92% | 14.10% | -26.50% | -29.55% | 36.30% | 36.38% |
| 16 | diff2 | 2.15% | -12.17% | -14.29% | -13.85% | 16.44% | 16.46% |
| 32 | aclgraph | 308.78 | 1006.40 | 26.10 | 28.89 | 2.24 | 1158.40 |
| 32 | xlite-full | 188.75 | 1350.06 | 17.37 | 18.27 | 3.35 | 1732.53 |
| 32 | xlite-decode-only | 302.41 | 998.39 | 22.07 | 24.52 | 2.65 | 1367.90 |
| 32 | diff1 | -38.87% | 34.15% | -33.45% | -36.76% | 49.55% | 49.56% |
| 32 | diff2 | -2.06% | -0.80% | -15.44% | -15.13% | 18.30% | 18.09% |
| 48 | aclgraph | 374.15 | 2140.47 | 30.17 | 32.79 | 2.94 | 1510.22 |
| 48 | xlite-full | 239.59 | 1767.78 | 19.84 | 20.98 | 4.46 | 2292.57 |
| 48 | xlite-decode-only | 357.71 | 2255.69 | 26.73 | 29.66 | 3.33 | 1710.06 |
| 48 | diff1 | -35.96% | -17.41% | -34.24% | -36.02% | 51.70% | 51.80% |
| 48 | diff2 | -4.39% | 5.38% | -11.40% | -9.55% | 13.27% | 13.23% |
| 64 | aclgraph | 401.19 | 2344.22 | 34.18 | 37.70 | 3.47 | 1777.12 |
| 64 | xlite-full | 292.08 | 2481.62 | 22.35 | 24.03 | 5.26 | 2696.59 |
| 64 | xlite-decode-only | 409.60 | 2846.45 | 30.71 | 34.13 | 3.86 | 1978.84 |
| 64 | diff1 | -27.20% | 5.86% | -34.61% | -36.26% | 51.59% | 51.74% |
| 64 | diff2 | 2.10% | 21.42% | -10.15% | -9.47% | 11.24% | 11.35% |
| 100 | aclgraph | 461.18 | 2944.03 | 42.77 | 47.57 | 4.34 | 2231.93 |
| 100 | xlite-full | 399.83 | 3574.64 | 29.40 | 32.58 | 6.27 | 3222.64 |
| 100 | xlite-decode-only | 470.22 | 2993.68 | 40.51 | 45.60 | 4.60 | 2362.36 |
| 100 | diff1 | -13.30% | 21.42% | -31.26% | -31.51% | 44.47% | 44.39% |
| 100 | diff2 | 1.96% | 1.69% | -5.28% | -4.14% | 5.99% | 5.84% |
| 150 | aclgraph | 564.78 | 4217.38 | 54.85 | 61.22 | 5.10 | 2619.77 |
| 150 | xlite-full | 562.30 | 5659.82 | 38.84 | 43.61 | 7.13 | 3659.03 |
| 150 | xlite-decode-only | 561.76 | 4021.58 | 53.88 | 61.43 | 5.21 | 2675.94 |
| 150 | diff1 | -0.44% | 34.20% | -29.19% | -28.77% | 39.80% | 39.67% |
| 150 | diff2 | -0.53% | -4.64% | -1.77% | 0.34% | 2.16% | 2.14% |
| 200 | aclgraph | 692.50 | 5131.05 | 67.67 | 77.12 | 5.55 | 2838.86 |
| 200 | xlite-full | 703.36 | 7712.50 | 47.14 | 53.21 | 7.88 | 4025.61 |
| 200 | xlite-decode-only | 679.75 | 5389.56 | 67.34 | 77.04 | 5.60 | 2861.68 |
| 200 | diff1 | 1.57% | 50.31% | -30.34% | -31.00% | 41.98% | 41.80% |
| 200 | diff2 | -1.84% | 5.04% | -0.49% | -0.10% | 0.90% | 0.80% |
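For reference, the diff rows are consistent with a simple relative change against the aclgraph baseline (this is inferred from the numbers above, not stated explicitly in the PR); a minimal sketch:

```python
# Relative change of an xlite variant vs. the aclgraph baseline,
# consistent with the diff1/diff2 rows above.
def relative_change(xlite_value: float, aclgraph_value: float) -> float:
    """Percentage change of the xlite measurement over the aclgraph one."""
    return (xlite_value - aclgraph_value) / aclgraph_value * 100

# Concurrency 1, TTFT Avg: (71.26 - 181.40) / 181.40 -> -60.72%
print(f"{relative_change(71.26, 181.40):.2f}%")   # -60.72%
# Concurrency 1, OutputSpeed: (77.57 - 58.65) / 58.65 -> 32.26%
print(f"{relative_change(77.57, 58.65):.2f}%")    # 32.26%
```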

Test config:

export TASK_QUEUE_ENABLE=1
export VLLM_USE_V1=1
export OMP_PROC_BIND=false
export HCCL_OP_EXPANSION_MODE="AIV"
export VLLM_ASCEND_ENABLE_TOPK_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_FLASHCOMM=1
export VLLM_ASCEND_ENABLE_DENSE_OPTIMIZE=1
export VLLM_ASCEND_ENABLE_PREFETCH_MLP=1
python -m vllm.entrypoints.openai.api_server \
	--model /mnt/nvme0n1/models/Qwen3-32B  \
	--tensor-parallel-size 8 \
	--gpu-memory-utilization 0.9 \
	--max-num-batched-tokens 8192 \
	--max-num-seqs=200 \
	--block-size 128 \
	--max-model-len 6656 \
	--trust-remote-code \
	--disable-log-requests \
	--served-model-name qwen \
	--no-enable-prefix-caching \
	--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
	--async-scheduling \
	--host ${ip} \
	--port ${port} > ${log} 2>&1 &
vllm bench serve --max-concurrency ${concurrency} --num-prompts ${num_prompts} \
	--host 127.0.0.1 --port 8080 --model qwen \
	--dataset-name random --backend openai-chat \
	--random-input-len 512 --random-output-len 512 --random-range-ratio 0.05 \
	--tokenizer /mnt/nvme0n1/models/Qwen3-32B \
	--endpoint /v1/chat/completions --ignore-eos

Does this PR introduce any user-facing change?

Yes. The xlite graph mode can be enabled by setting xlite_graph_config via --additional-config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@github-actions bot added the documentation (Improvements or additions to documentation) and module:core labels Nov 28, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Euler xlite graph wrapper, which provides significant performance improvements for transformer-based models on Ascend hardware. The changes include adding a new environment variable to enable xlite, a wrapper for the vLLM model, and model-specific configurations for Llama-like architectures.

My review focuses on the new xlite.py implementation. I've identified a critical correctness issue regarding the handling of model layer features, which could lead to unexpected behavior if not all layers are consistent. I've also pointed out a potential performance bottleneck related to device-wide synchronization that could be optimized for better throughput. Overall, this is a valuable contribution, and addressing these points will improve its robustness and performance.

Comment on lines 85 to 102
        mha_qkv_bias = [
            layer.self_attn.qkv_proj.bias for layer in layers
            if hasattr(layer.self_attn.qkv_proj, "bias")
            and layer.self_attn.qkv_proj.bias is not None
        ]
        q_norm = [
            layer.self_attn.q_norm.weight for layer in layers
            if hasattr(layer.self_attn, "q_norm")
        ]
        k_norm = [
            layer.self_attn.k_norm.weight for layer in layers
            if hasattr(layer.self_attn, "k_norm")
        ]
        if len(mha_qkv_bias) == 0:
            config.qkv_bias = False
        else:
            config.qkv_bias = True
            xlite_model.mha_qkv_bias = mha_qkv_bias

        if len(q_norm) == 0 or len(k_norm) == 0:
            config.qk_norm = False
        else:
            config.qk_norm = True
            xlite_model.mha_q_norm = q_norm
            xlite_model.mha_k_norm = k_norm

critical

The logic for detecting qkv_bias and qk_norm assumes that these features are either present on all layers or on none. If a model has these features on only a subset of layers, the code will proceed with an incomplete list of weights/biases. This could lead to crashes or incorrect behavior in the underlying C++ extension, as it might expect a complete list of parameters for all layers.

It's crucial to validate that these features are present consistently across all layers. I suggest adding explicit checks to ensure that if these features are present, they are present on all layers.

        num_layers = len(layers)
        mha_qkv_bias = [
            layer.self_attn.qkv_proj.bias for layer in layers
            if hasattr(layer.self_attn.qkv_proj, "bias")
            and layer.self_attn.qkv_proj.bias is not None
        ]
        if len(mha_qkv_bias) == 0:
            config.qkv_bias = False
        elif len(mha_qkv_bias) == num_layers:
            config.qkv_bias = True
            xlite_model.mha_qkv_bias = mha_qkv_bias
        else:
            raise ValueError(
                "Inconsistent qkv_bias settings across layers. "
                f"Found bias on {len(mha_qkv_bias)}/{num_layers} layers."
            )

        q_norm = [
            layer.self_attn.q_norm.weight for layer in layers
            if hasattr(layer.self_attn, "q_norm")
        ]
        k_norm = [
            layer.self_attn.k_norm.weight for layer in layers
            if hasattr(layer.self_attn, "k_norm")
        ]
        if len(q_norm) == 0 and len(k_norm) == 0:
            config.qk_norm = False
        elif len(q_norm) == num_layers and len(k_norm) == num_layers:
            config.qk_norm = True
            xlite_model.mha_q_norm = q_norm
            xlite_model.mha_k_norm = k_norm
        else:
            raise ValueError(
                "Inconsistent qk_norm settings across layers. "
                f"Found q_norm on {len(q_norm)}/{num_layers} layers and "
                f"k_norm on {len(k_norm)}/{num_layers} layers. "
                "Both must be present on all layers or none."
            )

# and it is necessary to synchronize with the current stream
# of torch.npu here to ensure the validity of the NPU tensors
# in the prepare input process.
torch.npu.synchronize()

high

Using torch.npu.synchronize() can introduce a significant performance overhead as it blocks the host until all kernels on all streams on the current device are complete. This is a very strong synchronization primitive.

While the comment mentions it's necessary for correctness with xlite's self-managed stream, it would be more efficient to use a more fine-grained synchronization mechanism if possible. For example, using torch.npu.Event to synchronize only the necessary streams.

Could you investigate if the xlite C++ extension provides an API to wait on a torch.npu.Event? If so, you could replace torch.npu.synchronize() with something like this:

# Before calling xlite_model.forward
event = torch.npu.Event()
event.record()
# Then pass the event to xlite and have it wait.
# e.g., self.xlite_rt.wait_event(event)

This would avoid stalling other unrelated operations on the NPU and could improve overall throughput, especially in concurrent scenarios.
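A minimal sketch of what that could look like with torch_npu events; the consumer stream and the point where xlite would wait are assumptions for illustration, not part of the xlite API:

```python
import torch
import torch_npu  # noqa: F401  (registers the torch.npu backend)

# Record an event on the stream that prepared the inputs, then let only
# the consumer stream wait on it instead of a device-wide synchronize.
ready = torch.npu.Event()
ready.record()  # records on the current stream

xlite_stream = torch.npu.Stream()  # stand-in for xlite's self-managed stream
xlite_stream.wait_event(ready)     # device-side wait; the host is not blocked
with torch.npu.stream(xlite_stream):
    # xlite_model.forward(...) would be launched here
    pass
```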

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


## Using XliteGraph

If you want to run Llama or Qwen dense series models with xlite graph mode, please set the environment variable VLLM_ASCEND_ENABLE_XLITE to 1.
Collaborator

I think it's better to describe installation of xlite first.

Collaborator

@wangxiyuan left a comment

We should add an e2e test and make sure xlite can be installed from PyPI before merging. Thanks.

scheduler_config = vllm_config.scheduler_config
max_batch_size = scheduler_config.max_num_seqs
max_seq_len = scheduler_config.max_model_len
config.max_m = scheduler_config.max_num_batched_tokens


It is recommended to encapsulate the scattered config assignments into an independent method to improve cohesion.
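For illustration, such a helper could look like the sketch below; the function name and return values are hypothetical, and only config.max_m and the scheduler fields come from the quoted snippet:

```python
def _init_scheduler_limits(vllm_config, config):
    """Group the scheduler-derived settings from vllm_config in one place."""
    scheduler_config = vllm_config.scheduler_config
    max_batch_size = scheduler_config.max_num_seqs
    max_seq_len = scheduler_config.max_model_len
    config.max_m = scheduler_config.max_num_batched_tokens
    return max_batch_size, max_seq_len, config
```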

@lulina lulina force-pushed the vllm_ascend_xlite branch from 2c54aa6 to a0bda7a Compare December 4, 2025 02:21
@lulina lulina force-pushed the vllm_ascend_xlite branch 7 times, most recently from 197da71 to a777c47 Compare December 5, 2025 00:45
github-actions bot commented Dec 5, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@lulina lulina force-pushed the vllm_ascend_xlite branch from b956a15 to 8d950cc Compare December 5, 2025 09:22