
Conversation

AsyaPronina
Contributor

@AsyaPronina AsyaPronina commented Aug 6, 2025

  • Adds a Speculative Decoding pipeline that works without Continuous Batching (mainly so it can run on NPU); a usage sketch follows below.
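
For context, a minimal usage sketch of the speculative decoding API this pipeline plugs into; the model paths and the NPU/CPU device split below are placeholder assumptions, and requesting NPU for the main and/or draft model is what routes generation through the new non-CB path:

#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Placeholder model paths -- adjust for your setup.
    std::string main_model_path  = "TinyLlama-1.1B-Chat-v1.0";
    std::string draft_model_path = "TinyLlama-1.1B-Chat-v1.0-draft";

    // Requesting NPU for the main model (and/or the draft model) selects
    // the Stateful, non-Continuous-Batching speculative decoding pipeline.
    ov::genai::LLMPipeline pipe(
        main_model_path, "NPU",
        ov::genai::draft_model(draft_model_path, "CPU"));

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_assistant_tokens = 5;  // initial number of draft (candidate) tokens

    std::string result = pipe.generate("Alan Turing was a", config);
    std::cout << result << std::endl;
    return 0;
}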

@AsyaPronina AsyaPronina marked this pull request as draft August 6, 2025 11:08
@github-actions github-actions bot added category: LLM LLM pipeline (stateful, static) category: speculative decoding Speculative decoding no-match-files labels Aug 6, 2025
@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 2 times, most recently from 3cba904 to cd6151a Compare August 6, 2025 11:17

void remove_last_generated_tokens(const std::size_t tokens_to_remove);

void trimm_kv_cache(const std::size_t tokens_to_remove);
Contributor

It must be trim_, I believe.

Contributor Author

Fixed, thanks!

@github-actions github-actions bot added the category: continuous batching Continuous batching label Aug 7, 2025
@github-actions github-actions bot removed the category: continuous batching Continuous batching label Aug 7, 2025
@Wovchena Wovchena requested a review from Copilot August 7, 2025 11:59
Copilot

This comment was marked as outdated.

@songbell

songbell commented Aug 8, 2025

I have a general question: can this scale to CPU/GPU as well, if we want to run speculative decoding in non-CB mode?

@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 3 times, most recently from 0a0a00d to 2092313 Compare August 10, 2025 01:38
@AsyaPronina
Contributor Author

AsyaPronina commented Aug 10, 2025

I have a general question: can this scale to CPU/GPU as well, if we want to run speculative decoding in non-CB mode?

Hello! It works on CPU now, but GenAI will choose this non-CB Speculative Decoding path only if NPU is specified for the main model, the draft model, or both. Do we need to change this to allow running on other devices?

@songbell

I have a general question: can this scale to CPU/GPU as well, if we want to run speculative decoding in non-CB mode?

Hello! It works on CPU now, but GenAI will choose this non-CB Speculative Decoding path only if NPU is specified for the main model, the draft model, or both. Do we need to change this to allow running on other devices?

I think it depends on whether we have scenarios where the non-CB path can out-perform the CB path. BTW, is there a plan to support EAGLE speculative decoding as well?

@AsyaPronina AsyaPronina marked this pull request as ready for review August 20, 2025 02:25
@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 2 times, most recently from 0a4bf49 to 164bdf2 Compare August 25, 2025 16:37
// FIXME: Do we need it?
// void StatefulLLMPipelineNPU::reset_kv_state() {
// m_pimpl->reset_kv_state();
// }
Contributor

Please don't forget to address it

Contributor Author

Thanks a lot!

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.



#include "openvino/genai/visual_language/pipeline.hpp"
#include "openvino/runtime/core.hpp"

#include "openvino/genai/generation_handle.hpp"

Copilot AI Sep 25, 2025

Duplicate include of 'openvino/genai/generation_handle.hpp' on line 14 and removed line. The include on line 14 should be removed as it's already included properly with the other includes.

Suggested change
#include "openvino/genai/generation_handle.hpp"


Contributor Author

Thank you, I think I will leave it as is.

Comment on lines 17 to 19
#include "visual_language/processor_config.hpp"

#include "openvino/genai/generation_handle.hpp"
#include "openvino/genai/streamer_base.hpp"

Copilot AI Sep 25, 2025

Duplicate include of 'openvino/genai/generation_handle.hpp' on line 14 and removed line. The include on line 14 should be removed as it's already included properly with the other includes.


Contributor Author

Thank you, I think I will leave it as is.

if (matches_num == m_candidates_num) {
m_candidates_num = std::min(m_candidates_num + 2, m_max_candidates_num);
} else {
m_candidates_num = static_cast<std::size_t>(std::max(static_cast<int64_t>(m_candidates_num) - 1, int64_t(1)));

Copilot AI Sep 25, 2025

The cast to int64_t and back to size_t is unnecessarily complex. Since we're ensuring the result is at least 1, this can be simplified to: m_candidates_num = std::max(m_candidates_num - 1, size_t(1));

Suggested change
m_candidates_num = static_cast<std::size_t>(std::max(static_cast<int64_t>(m_candidates_num) - 1, int64_t(1)));
m_candidates_num = std::max(m_candidates_num - 1, std::size_t(1));


Contributor Author

If m_candidates_num is unsigned, then m_candidates_num - 1 could wrap around, and std::max would then pick from {wrapped_number, 1}, which is not what is intended here.
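
For illustration, a minimal sketch of a clamp that cannot wrap around, assuming m_candidates_num is a std::size_t (the standalone function and its parameter names are just for the sketch):

#include <algorithm>
#include <cstddef>

// Adaptive candidate count: grow by 2 on full acceptance, shrink by 1 otherwise,
// clamping to [1, max_candidates_num] without risking unsigned wrap-around.
void adjust_candidates_num(std::size_t& candidates_num,
                           std::size_t max_candidates_num,
                           std::size_t matches_num) {
    if (matches_num == candidates_num) {
        candidates_num = std::min(candidates_num + 2, max_candidates_num);
    } else {
        // Decrement only while above the lower bound, so the subtraction never underflows.
        candidates_num = (candidates_num > 1) ? candidates_num - 1 : 1;
    }
}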

auto& main_perf_generated_tokens = m_main_request->raw_perf_metrics.m_batch_sizes.back();
main_perf_generated_tokens -= mismatched_candidates;
m_sd_metrics.update_draft_generated_len(0 /* request_id */, candidates_to_generate);
m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number / candidates_to_generate) * 100);

Copilot AI Sep 25, 2025

Integer division will always result in 0 or 1 before multiplication by 100. This should be: (accepted_tokens_number * 100.0) / candidates_to_generate to get the correct percentage calculation.

Suggested change
m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number / candidates_to_generate) * 100);
m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number * 100.0) / candidates_to_generate);


Contributor Author

Great catch! Fixed, thanks!
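
For reference, a small sketch of the corrected metric as a standalone helper (the function name and the division-by-zero guard are additions for the sketch, not part of the pipeline's actual API):

#include <cstddef>

// Acceptance rate in percent; dividing in double avoids the integer-division
// truncation flagged above.
double acceptance_rate_percent(std::size_t accepted_tokens_number,
                               std::size_t candidates_to_generate) {
    if (candidates_to_generate == 0) {
        return 0.0;  // no candidates were drafted this step
    }
    return static_cast<double>(accepted_tokens_number) * 100.0 /
           static_cast<double>(candidates_to_generate);
}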

# NOTE: ContinuousBatching backend uses `num_assistant_tokens` as is. Stateful backend uses `num_assistant_tokens`'s copy as initial
# value and adjusts it based on recent number of accepted tokens. If `num_assistant_tokens` is not set, it defaults to `5` for both
# backends.
# config.num_assistant_tokens = 5

Copilot AI Sep 25, 2025

[nitpick] The comment suggests this line is commented out, but the original version had this uncommented. This change makes the default behavior unclear to users who might expect the parameter to be set.

Suggested change
# config.num_assistant_tokens = 5
config.num_assistant_tokens = 5


Contributor Author

Fixed after Sofia's comment

@github-actions github-actions bot added the category: GGUF GGUF file reader label Sep 29, 2025
@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 2 times, most recently from 43205b4 to 8267424 Compare September 29, 2025 20:56
@dmatveev dmatveev added this to the 2025.4 milestone Sep 30, 2025
m_request.set_tensor("input_ids", new_input_ids);

auto attention_mask = m_request.get_tensor("attention_mask");
ov::Tensor new_attention_mask(attention_mask.get_element_type(), ov::Shape{BATCH_SIZE, m_num_processed_tokens + tokens_size});
Contributor Author

Create a follow-up for a performance improvement and removal of this allocation.

Contributor Author

Check why we can't just reshape: is this an old issue?

Contributor Author

The issue is reproduced here:

RuntimeError: Exception from src\core\src\runtime\tensor.cpp:85:
Check 'shape_size(new_shape) <= ov::shape_size(m_capacity)' failed at src\inference\src\dev\make_tensor.cpp:96:       
Could set new shape: [1,5]
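
For clarity, a minimal sketch of the grow-and-copy pattern from the snippet above, assuming batch size 1 and an int64 attention mask; allocating a fresh ov::Tensor sidesteps the capacity check that makes the plain reshape fail:

#include <algorithm>
#include <cstddef>
#include <cstdint>

#include "openvino/runtime/infer_request.hpp"

// Extend the attention mask by `tokens_size` new positions (all set to 1),
// copying the already-processed prefix from the old tensor.
void grow_attention_mask(ov::InferRequest& request,
                         std::size_t num_processed_tokens,
                         std::size_t tokens_size) {
    constexpr std::size_t BATCH_SIZE = 1;
    auto attention_mask = request.get_tensor("attention_mask");
    ov::Tensor new_attention_mask(attention_mask.get_element_type(),
                                  ov::Shape{BATCH_SIZE, num_processed_tokens + tokens_size});
    const auto* src = attention_mask.data<int64_t>();
    auto* dst = new_attention_mask.data<int64_t>();
    std::copy_n(src, num_processed_tokens, dst);                // keep the existing mask
    std::fill_n(dst + num_processed_tokens, tokens_size, 1);    // mark the new tokens
    request.set_tensor("attention_mask", new_attention_mask);
}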

prompt = "Alan Turing was a"

# Download and convert model:
main_opt_model, main_hf_tokenizer, main_model_path = download_and_convert_model(MODEL_UNDER_TEST["name"])
Collaborator

Please parametrize the model name.

Contributor Author

Fixed, thanks!

Comment on lines 29 to 30
# It seems like temporary directory from model downloading stage isn't removed after test
# launch for SmolLM2-360M model, that is why it is not used now.
Collaborator

There is no temporary directory for the models in the CI. Models are downloaded to the OV_CACHE path, which is shared between workflows.
You can use SmolLM2-360M here.

Contributor Author

Thank you!!

Contributor Author

Fixed, thanks!

@AsyaPronina
Contributor Author

Here is a commit that, for now, restricts StatefulSpeculativeLLMPipeline to work only if one of the models is requested to be executed on NPU: efcbb66

Please add a thumbs-up to approve it, or comment if you don't agree: @dmatveev @as-suvorov @sbalandi

Labels
category: continuous batching Continuous batching category: CPP API Changes in GenAI C++ public headers category: GGUF GGUF file reader category: llm_bench Label for tool/llm_bench folder category: LLM samples GenAI LLM samples category: LLM LLM pipeline (stateful, static) category: sampling Sampling / Decoding algorithms category: speculative decoding Speculative decoding no-match-files
9 participants