[V1][Spec Decode][Perf] Add fused Triton kernel to reduce overhead in EAGLE spec decoding #4

leo-cf-tian · 2025-05-15T20:40:59Z

github-actions · 2025-05-15T20:41:09Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

bnellnm and others added 30 commits May 14, 2025 13:11

Modularize fused experts and integrate PPLX kernels (vllm-project#15956)

f9c069c

[CI] Disable Failing Tests (vllm-project#18165)

8568650

[Frontend] decrease import time of vllm.multimodal (vllm-project#18031)

749f792

Co-authored-by: Aaron Pham <[email protected]>

[Kernel] Have rotary embeddings support tensors (vllm-project#18046)

d93c976

Signed-off-by: Lucas Wilkinson <[email protected]>

[V1] Structured Outputs + Thinking compatibility (vllm-project#16577)

2fc9075

Signed-off-by: Aaron Pham <[email protected]> Co-authored-by: Russell Bryant <[email protected]>

Add support for loading torchao models with AOPerModuleConfig (vllm-project#17826)

7974736

Signed-off-by: Jerry Zhang <[email protected]>

[CI] Fix race condition in test_kv_cache_events test (vllm-project#18169)

78aa341

Signed-off-by: Russell Bryant <[email protected]>

[V1] Support multiple kv connectors (vllm-project#17564)

2142035

Signed-off-by: mgoin <[email protected]> Signed-off-by: Nick Hill <[email protected]> Co-authored-by: Nick Hill <[email protected]>

Upload vllm index for the rc builds (vllm-project#18173)

09f106a

[Bugfix]: make most of test_openai_schema.py pass (vllm-project#17664)

f25e0d1

[v1] Support multiple KV cache groups in GPU model runner (vllm-project#17945)

e60f550

Signed-off-by: Chen Zhang <[email protected]>

[V1][Metrics] Remove unused code (vllm-project#18158)

65334ef

Signed-off-by: Mark McLoughlin <[email protected]>

[Chore] astral's ty (vllm-project#18116)

afe3236

Signed-off-by: Aaron Pham <[email protected]>

[Misc] add lobe-chat support (vllm-project#18177)

2dff093

Signed-off-by: reidliu41 <[email protected]> Co-authored-by: reidliu41 <[email protected]>

[Fix][ROCm] Enforce eager for all encoder-decoder models on ROCm (vllm-project#18154)

83f74c6

Signed-off-by: Luka Govedič <[email protected]>

Update deprecated type hinting in models (vllm-project#18132)

26d0419

Signed-off-by: Harry Mellor <[email protected]>

[Bugfix] Fix fp8 tests for triton_unified_attention for Triton 3.3 (vllm-project#18013)

e6b8e65

Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]>

Support custom implementations of VideoLoader backends. (vllm-project#18091)

4f07a64

[UT] Add ut for none hash (vllm-project#17892)

420caf7

Signed-off-by: Andy Xie <[email protected]>

[Model] Allow the use of sliding window in Qwen2 (vllm-project#17772)

dd2a945

Signed-off-by: inkcherry <[email protected]>

[Bugfix] Fix FusedMoEPrepareAndFinalize for cuda-disalike backends (vllm-project#18178)

70f8b96

Signed-off-by: Mengqing Cao <[email protected]>

[CI] don't skip fixed test_kv_cache_events() (vllm-project#18183)

de71fec

Signed-off-by: David Xia <[email protected]>

[V1] Update zmq socket creation in nixl connector (vllm-project#18148)

a8f5aec

Signed-off-by: Russell Bryant <[email protected]>

fix: typos (vllm-project#18151)

a9944aa

Signed-off-by: omahs <[email protected]>

Update deprecated type hinting in model_loader (vllm-project#18130)

07ad271

Signed-off-by: Harry Mellor <[email protected]>

add tools into TokenizeChatRequest (vllm-project#18187)

451da4b

Signed-off-by: yangxia <[email protected]>

[Kernel] [V1] Fix performance regression for triton unified attention (vllm-project#18161)

01c2233

Signed-off-by: Thomas Parnell <[email protected]> Co-authored-by: Lucas Wilkinson <[email protected]>

Adding "Basic Models Test" and "Multi-Modal Models Test (Extended) 3" in AMD Pipeline (vllm-project#18106)

566ec04

Signed-off-by: Alexei V. Ivanov <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>

Improve examples rendering in docs and GitHub (vllm-project#18203)

51ff154

Signed-off-by: Harry Mellor <[email protected]>

[Frontend] Fix chat template content format detection (vllm-project#18190)

2aa5470

Signed-off-by: Sebastian Schönnenbeck <[email protected]>

Abatom and others added 5 commits May 15, 2025 09:01

[Bugfix]Change the exception thrown by call_hf_processor from RuntimeError to ValueError (vllm-project#18181)

fadb8d5

Signed-off-by: Abatom <[email protected]>

[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 (vllm-project#18205)

9254052

Signed-off-by: tjtanaa <[email protected]>

[Misc] Avoid cuda graph log when sizes still match (vllm-project#18202)

e3f3aee

Signed-off-by: NickLucche <[email protected]>

triton kernel fusion for EAGLE

61c0b12

Signed-off-by: Leo Tian <[email protected]>

include all state updates

c89f9ca

Signed-off-by: Leo Tian <[email protected]>

leo-cf-tian closed this May 15, 2025


28 participants
