[Feat] Add Euler xlite graph wrapper support #4526
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces support for the Euler xlite graph wrapper, which provides significant performance improvements for transformer-based models on Ascend hardware. The changes include adding a new environment variable to enable xlite, a wrapper for the vLLM model, and model-specific configurations for Llama-like architectures.
My review focuses on the new xlite.py implementation. I've identified a critical correctness issue regarding the handling of model layer features, which could lead to unexpected behavior if not all layers are consistent. I've also pointed out a potential performance bottleneck related to device-wide synchronization that could be optimized for better throughput. Overall, this is a valuable contribution, and addressing these points will improve its robustness and performance.
mha_qkv_bias = [
    layer.self_attn.qkv_proj.bias for layer in layers
    if hasattr(layer.self_attn.qkv_proj, "bias")
    and layer.self_attn.qkv_proj.bias is not None
]
q_norm = [
    layer.self_attn.q_norm.weight for layer in layers
    if hasattr(layer.self_attn, "q_norm")
]
k_norm = [
    layer.self_attn.k_norm.weight for layer in layers
    if hasattr(layer.self_attn, "k_norm")
]
if len(mha_qkv_bias) == 0:
    config.qkv_bias = False
else:
    config.qkv_bias = True
    xlite_model.mha_qkv_bias = mha_qkv_bias

if len(q_norm) == 0 or len(k_norm) == 0:
    config.qk_norm = False
else:
    config.qk_norm = True
    xlite_model.mha_q_norm = q_norm
    xlite_model.mha_k_norm = k_norm
The logic for detecting qkv_bias and qk_norm assumes that these features are either present on all layers or on none. If a model has these features on only a subset of layers, the code will proceed with an incomplete list of weights/biases. This could lead to crashes or incorrect behavior in the underlying C++ extension, as it might expect a complete list of parameters for all layers.
It's crucial to validate that these features are present consistently across all layers. I suggest adding explicit checks to ensure that if these features are present, they are present on all layers.
num_layers = len(layers)
mha_qkv_bias = [
    layer.self_attn.qkv_proj.bias for layer in layers
    if hasattr(layer.self_attn.qkv_proj, "bias")
    and layer.self_attn.qkv_proj.bias is not None
]
if len(mha_qkv_bias) == 0:
    config.qkv_bias = False
elif len(mha_qkv_bias) == num_layers:
    config.qkv_bias = True
    xlite_model.mha_qkv_bias = mha_qkv_bias
else:
    raise ValueError(
        "Inconsistent qkv_bias settings across layers. "
        f"Found bias on {len(mha_qkv_bias)}/{num_layers} layers."
    )
q_norm = [
    layer.self_attn.q_norm.weight for layer in layers
    if hasattr(layer.self_attn, "q_norm")
]
k_norm = [
    layer.self_attn.k_norm.weight for layer in layers
    if hasattr(layer.self_attn, "k_norm")
]
if len(q_norm) == 0 and len(k_norm) == 0:
    config.qk_norm = False
elif len(q_norm) == num_layers and len(k_norm) == num_layers:
    config.qk_norm = True
    xlite_model.mha_q_norm = q_norm
    xlite_model.mha_k_norm = k_norm
else:
    raise ValueError(
        "Inconsistent qk_norm settings across layers. "
        f"Found q_norm on {len(q_norm)}/{num_layers} layers and "
        f"k_norm on {len(k_norm)}/{num_layers} layers. "
        "Both must be present on all layers or none."
    )
vllm_ascend/xlite/xlite.py
Outdated
# and it is necessary to synchronize with the current stream
# of torch.npu here to ensure the validity of the NPU tensors
# in the prepare input process.
torch.npu.synchronize()
Using torch.npu.synchronize() can introduce a significant performance overhead as it blocks the host until all kernels on all streams on the current device are complete. This is a very strong synchronization primitive.
While the comment mentions it's necessary for correctness with xlite's self-managed stream, it would be more efficient to use a more fine-grained synchronization mechanism if possible. For example, using torch.npu.Event to synchronize only the necessary streams.
Could you investigate if the xlite C++ extension provides an API to wait on a torch.npu.Event? If so, you could replace torch.npu.synchronize() with something like this:
# Before calling xlite_model.forward
event = torch.npu.Event()
event.record()
# Then pass the event to xlite and have it wait.
# e.g., self.xlite_rt.wait_event(event)

This would avoid stalling other unrelated operations on the NPU and could improve overall throughput, especially in concurrent scenarios.
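A slightly more complete sketch of this event-based pattern, assuming xlite's runtime exposes some way to wait on an NPU event (the wait_event call below is hypothetical, not a confirmed xlite API):

import torch
import torch_npu  # noqa: F401  # registers the torch.npu device backend

def hand_off_to_xlite(xlite_rt):
    # Record an event on the current torch.npu stream once the
    # prepare-input tensors have been produced on that stream.
    event = torch.npu.Event()
    event.record(torch.npu.current_stream())
    # Hypothetical API: make xlite's self-managed stream wait on this event
    # instead of blocking the whole device with torch.npu.synchronize().
    xlite_rt.wait_event(event)

This only orders xlite's consumer stream after the producer stream's work, so other streams on the device can keep running.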
This pull request has conflicts, please resolve those before we can evaluate the pull request.
## Using XliteGraph

If you want to run Llama or Qwen dense series models with xlite graph mode, please set the environment variable VLLM_ASCEND_ENABLE_XLITE to 1.
I think it's better to describe the installation of xlite first.
wangxiyuan left a comment
We should add an e2e test and make sure xlite can be installed from PyPI before merging. Thanks.
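A minimal sketch of what such an e2e test could look like, assuming xlite is installable from PyPI and that vLLM's offline LLM entry point forwards additional_config to the Ascend plugin as this PR expects; the model name and test layout are illustrative only:

import pytest
from vllm import LLM, SamplingParams

@pytest.mark.parametrize("full_mode", [False, True])
def test_xlite_graph_mode(full_mode):
    # Assumes the xlite_graph_config key introduced in this PR.
    llm = LLM(
        model="Qwen/Qwen3-0.6B",
        additional_config={
            "xlite_graph_config": {
                "enabled": True,
                "full_mode": full_mode,
            }
        },
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(max_tokens=16))
    assert outputs and outputs[0].outputs[0].text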
vllm_ascend/xlite/xlite.py
Outdated
scheduler_config = vllm_config.scheduler_config
max_batch_size = scheduler_config.max_num_seqs
max_seq_len = scheduler_config.max_model_len
config.max_m = scheduler_config.max_num_batched_tokens
It is recommended to encapsulate the scattered config assignments into an independent method to improve cohesion.
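For example, a minimal sketch of that refactor, reusing the names visible in this diff (vllm_config, config); the helper name and attribute placement are illustrative:

def _build_xlite_config(self, config, vllm_config):
    """Collect the scattered scheduler-derived settings in one place."""
    scheduler_config = vllm_config.scheduler_config
    self.max_batch_size = scheduler_config.max_num_seqs
    self.max_seq_len = scheduler_config.max_model_len
    config.max_m = scheduler_config.max_num_batched_tokens
    return config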
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: lulina <[email protected]>
What this PR does / why we need it?
This patch adds support for the xlite graph wrapper to vllm_ascend. Xlite provides operator implementations of transformer networks on Ascend hardware. For details about xlite, please refer to the following link: https://gitee.com/openeuler/GVirt/blob/master/xlite/README.md
The latest performance comparison data between xlite and the default aclgraph mode is as follows:
[Table: Qwen3 32B online inference TPS comparison on 910B3 (A2), with test config]
Does this PR introduce any user-facing change?
Enable the xlite graph mode by setting xlite_graph_config:
--additional-config='{"xlite_graph_config": {"enabled": true}}' # Enabled for decode only
--additional-config='{"xlite_graph_config": {"enabled": true, "full_mode": true}}' # Enabled for prefill and decode
How was this patch tested?