
Conversation

kaixuanliu
Contributor

No description provided.

@kaixuanliu kaixuanliu marked this pull request as ready for review September 25, 2025 07:45
@kaixuanliu
Contributor Author

kaixuanliu commented Sep 25, 2025

I benchmarked the Qwen/Qwen3-4B-Instruct-2507 model: on Intel XPU this gives a ~10% improvement in end-to-end (E2E) time, while on A100 there is no noticeable improvement or regression. Please let me know if this manner of applying the rotary kernel is acceptable, and I will then add support for more models.

Signed-off-by: Liu, Kaixuan <[email protected]>
@kaixuanliu kaixuanliu marked this pull request as draft September 25, 2025 08:43
@kaixuanliu kaixuanliu marked this pull request as ready for review September 25, 2025 10:26
@Rocketknight1
Member

cc @ArthurZucker

Contributor

@MekkCyber MekkCyber left a comment


Thanks for this integration @kaixuanliu! I left a few nits to consider.

Comment on lines 517 to 519
global use_kernels
use_kernels = getattr(self, "use_kernels", False)

Contributor


It would be better to pass an attention kwarg, e.g. use_rotary_kernel, than to define a global variable like this.
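A minimal sketch of the reviewer's suggestion: thread the flag through the attention call as an explicit kwarg instead of mutating module-level state. All names here (Qwen3AttentionSketch, the stand-in rotary functions) are illustrative, not the actual transformers API.

```python
def apply_rotary_pos_emb(q, k):
    """Stand-in for the eager rotary implementation."""
    return ("eager", q, k)

def apply_rotary_kernel(q, k):
    """Stand-in for the hub-kernel rotary implementation."""
    return ("kernel", q, k)

class Qwen3AttentionSketch:
    def forward(self, q, k, use_rotary_kernel: bool = False, **kwargs):
        # The flag arrives as an explicit kwarg, so no global state
        # is needed and each call site can opt in independently.
        rotate = apply_rotary_kernel if use_rotary_kernel else apply_rotary_pos_emb
        return rotate(q, k)
```

Compared with a global, the kwarg keeps the choice local to each forward call and visible in the function signature.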

Contributor Author


You mean add a param called use_rotary_kernel to the kwargs here and pass it down to Qwen3Attention?

from ...cache_utils import Cache, DynamicCache
from ...generation import GenerationMixin
from ...integrations import use_kernel_forward_from_hub
from ...integrations.hub_kernels import rotary_kernel
Contributor


I think we need to lazily load the kernel; here we are loading it before we even know whether the user wants to use kernels.
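A sketch of the lazy-loading pattern being suggested: resolve the kernel only on first use, not at module import time. This is a hypothetical stand-in; the real integration would fetch the kernel from the hub (e.g. via the kernels library), and the function names here are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_rotary_kernel():
    # In the real integration this would download/resolve the hub kernel
    # on first call; lru_cache makes subsequent calls return the same object.
    def apply_rotary(q1, q2, cos, sin, out1, out2, conj):
        pass  # placeholder for the fused rotary op
    return apply_rotary

def maybe_get_kernel(use_kernels: bool):
    # The (potentially expensive) load only happens when the model
    # actually opts into kernels.
    return get_rotary_kernel() if use_kernels else None
```

With this shape, a plain `from ...integrations.hub_kernels import rotary_kernel` at the top of the modeling file is replaced by a call made inside the forward path, guarded by the user's choice.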

Contributor Author

@kaixuanliu kaixuanliu Sep 26, 2025


Thanks for your advice! I have updated the related code.

Comment on lines 125 to 148
def apply_rotary_kernel(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    """
    Rotary kernel implementation wrapper.
    Adapts the rotary kernel implementation to match the HuggingFace apply_rotary_pos_emb signature.
    """
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    q_rotated = q.clone()
    k_rotated = k.clone()

    # Get half dimension for rotation
    half_dim = q.shape[-1] // 2
    q1 = q_rotated[..., :half_dim]
    q2 = q_rotated[..., half_dim:]
    k1 = k_rotated[..., :half_dim]
    k2 = k_rotated[..., half_dim:]
    if cos.shape[-1] != half_dim:
        # Trim cos/sin to match half_dim
        cos = cos[..., :half_dim]
        sin = sin[..., :half_dim]

    # Apply rotary embedding using our kernel
    rotary_kernel.apply_rotary(q1, q2, cos, sin, q1, q2, False)
Contributor


Did you try to benchmark the performance with and without this kernel?

Contributor Author


Yes. On Intel XPU, a single rotary op takes 0.22 ms, and it drops to 0.1 ms after applying this patch, a more than 2x speedup.
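For reference, a minimal micro-benchmark harness of the kind that produces per-op numbers like these (a hypothetical sketch, not the harness actually used; on a GPU/XPU you would additionally synchronize the device around the timed region, e.g. with torch.cuda.synchronize() or torch.xpu.synchronize()):

```python
import time

def bench_ms(fn, *args, iters=100, warmup=10):
    """Return mean wall-clock time per call of fn(*args), in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm up caches / JIT before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1e3
```

Without device synchronization, asynchronous kernel launches would make wall-clock numbers meaningless on accelerators, which is why the synchronize step matters in the real measurement.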

Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: dots1, qwen3, qwen3_moe, qwen3_omni_moe
