
Conversation


@OsirisDuan OsirisDuan commented Nov 20, 2025

What this PR does / why we need it?

This PR introduces a new Triton kernel, `_causal_conv1d_update_kernel_no_cache_len_no_mtp`, to support efficient causal 1D convolution updates in Qwen3-Next. The kernel is integrated into `causal_conv1d_update_npu`, enabling better performance on Ascend NPU hardware.
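For context, the operation being optimized is a single decode-step update of a depthwise causal 1D convolution: the newly arrived tokens are convolved against a small per-channel filter while a cached window of past inputs is rolled forward. Below is a minimal PyTorch sketch of these semantics; the function name, shapes, and in-place state update are illustrative assumptions, not the vLLM Ascend implementation:

```python
import torch

def causal_conv1d_update_ref(x, conv_state, weight, bias=None):
    """Reference semantics of one causal conv1d update step (assumes width > 1).

    x:          (batch, dim, seq_len)   newly arrived tokens
    conv_state: (batch, dim, width - 1) cached tail of past inputs, updated in place
    weight:     (dim, width)            one depthwise filter per channel
    bias:       (dim,) or None
    """
    batch, dim, seq_len = x.shape
    width = weight.shape[1]
    # Prepend the cached context so every output position sees exactly
    # `width` past inputs (causal: no lookahead).
    x_full = torch.cat([conv_state, x], dim=-1)  # (batch, dim, width - 1 + seq_len)
    # Depthwise causal convolution: groups=dim applies each channel's filter
    # to that channel only; output length is exactly seq_len.
    out = torch.nn.functional.conv1d(
        x_full, weight.unsqueeze(1), bias=bias, groups=dim
    )
    # Roll the cache forward so the next call continues seamlessly.
    conv_state.copy_(x_full[..., -(width - 1):])
    return out

# Illustrative usage with hypothetical sizes: one new token per sequence.
x = torch.randn(2, 8, 1)       # (batch=2, dim=8, seq_len=1)
state = torch.zeros(2, 8, 3)   # width - 1 = 3 cached positions
w = torch.randn(8, 4)          # width = 4
y = causal_conv1d_update_ref(x, state, w)  # -> (2, 8, 1)
```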

Does this PR introduce any user-facing change?

No. This is an internal, implementation-level optimization: it improves end-to-end model inference performance but does not change the user-facing API or how the function is invoked.

How was this patch tested?

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new Triton kernel `_causal_conv1d_update_kernel_no_cache_len_no_mtp` for a specific case of causal 1D convolution, and integrates it into the existing `causal_conv1d_update_npu` function. While the addition is a good optimization, I've found a critical bug in the implementation of the new kernel where it fails to correctly write outputs for sequences longer than one token, causing data loss. My review includes a specific comment with a code suggestion to fix this issue.

Comment on lines +538 to +543
```python
tl.store(
    out_ptr
    + pid * out_batch_stride
    + (doffs + tl.arange(0, DIM_BLOCK)) * out_len,
    result,
)
```

critical

There is a critical bug in this loop. The `tl.store` operation writes to a memory location that does not depend on the loop variable `i`. As a result, for sequences with `seq_len > 1`, the output for each token is written to the same location, overwriting the previous one. For example, with `seq_len = 3`, all three iterations store to the same base offset, so only the result for the last token (`i = seq_len - 1`) is preserved, leading to incorrect convolution output. To fix this, include the token index `i` in the output pointer calculation so that each token's result is stored at its correct position.

Suggested change

```diff
 tl.store(
     out_ptr
     + pid * out_batch_stride
-    + (doffs + tl.arange(0, DIM_BLOCK)) * out_len,
+    + (doffs + tl.arange(0, DIM_BLOCK)) * out_len + i,
     result,
 )
```
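To make the overwrite concrete, here is a small NumPy emulation of the store loop for a single program instance and a one-element channel block; `emulate_store` and its shapes are hypothetical, purely for illustration:

```python
import numpy as np

def emulate_store(seq_len, out_len, add_token_index):
    """Emulate the kernel's store loop; -1 marks positions never written."""
    out = np.full(out_len, -1)
    for i in range(seq_len):
        result = i * 10      # stand-in for the i-th token's conv result
        addr = 0 * out_len   # base offset: does not depend on i
        if add_token_index:
            addr += i        # the suggested fix
        out[addr] = result
    return out

print(emulate_store(3, 3, add_token_index=False))  # [20 -1 -1]: tokens 0 and 1 lost
print(emulate_store(3, 3, add_token_index=True))   # [ 0 10 20]: each token preserved
```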

@OsirisDuan OsirisDuan changed the title Add _causal_conv1d_update_kernel_no_cache_len_no_mtp [task] Add causal_conv1d_update triton kernel Nov 20, 2025
@github-actions

This pull request has conflicts; please resolve them before we can evaluate it.

@OsirisDuan OsirisDuan force-pushed the 251120_QWen3-Next_conv1d_update branch from e6c44b3 to 6d1ffa3 on December 1, 2025 06:03
@OsirisDuan OsirisDuan changed the title [task] Add causal_conv1d_update triton kernel [feat] Add causal_conv1d_update triton kernel Dec 8, 2025