
Conversation

AsyaPronina
Contributor

@AsyaPronina AsyaPronina commented Aug 6, 2025

  • Adds a Speculative Decoding pipeline that works without Continuous Batching (mainly so it can run on NPU); a usage sketch follows below.
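
For context, a minimal usage sketch of the speculative decoding API this pipeline plugs into; the model paths and the NPU/CPU device split below are placeholder assumptions, and requesting NPU for the main and/or draft model is what routes generation through the new non-CB path:

#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Placeholder model paths -- adjust for your setup.
    std::string main_model_path  = "TinyLlama-1.1B-Chat-v1.0";
    std::string draft_model_path = "TinyLlama-1.1B-Chat-v1.0-draft";

    // Requesting NPU for the main model (and/or the draft model) selects
    // the Stateful, non-Continuous-Batching speculative decoding pipeline.
    ov::genai::LLMPipeline pipe(
        main_model_path, "NPU",
        ov::genai::draft_model(draft_model_path, "CPU"));

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    config.num_assistant_tokens = 5;  // initial number of draft (candidate) tokens

    std::string result = pipe.generate("Alan Turing was a", config);
    std::cout << result << std::endl;
    return 0;
}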

@AsyaPronina AsyaPronina marked this pull request as draft August 6, 2025 11:08
@github-actions github-actions bot added category: LLM LLM pipeline (stateful, static) category: speculative decoding Speculative decoding no-match-files labels Aug 6, 2025
@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 2 times, most recently from 3cba904 to cd6151a Compare August 6, 2025 11:17

void remove_last_generated_tokens(const std::size_t tokens_to_remove);

void trimm_kv_cache(const std::size_t tokens_to_remove);
Contributor

It must be trim_, I believe.

Contributor Author

Fixed, thanks!

@github-actions github-actions bot added the category: continuous batching Continuous batching label Aug 7, 2025
@github-actions github-actions bot removed the category: continuous batching Continuous batching label Aug 7, 2025
@Wovchena Wovchena requested a review from Copilot August 7, 2025 11:59
Copilot

This comment was marked as outdated.

@songbell

songbell commented Aug 8, 2025

I have a general question: can this scale to CPU/GPU as well, if we want to run speculative decoding in non-CB mode?

@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 3 times, most recently from 0a0a00d to 2092313 Compare August 10, 2025 01:38
@AsyaPronina
Contributor Author

AsyaPronina commented Aug 10, 2025

I have a general question: can this scale to CPU/GPU as well, if we want to run speculative decoding in non-CB mode?

Hello! It works on CPU now, but GenAI will choose this non-CB Speculative Decoding path only if NPU is specified for the main model, the draft model, or both. Do we need to change this to allow running on other devices?

@songbell

I have a general question: can this scale to CPU/GPU as well, if we want to run speculative decoding in non-CB mode?

Hello! It works on CPU now, but GenAI will choose this non-CB Speculative Decoding path only if NPU is specified for the main model, the draft model, or both. Do we need to change this to allow running on other devices?

I think it depends on whether we have scenarios where the non-CB path can out-perform the CB path. BTW, is there a plan to support EAGLE speculative decoding as well?

@AsyaPronina AsyaPronina marked this pull request as ready for review August 20, 2025 02:25
@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 2 times, most recently from 0a4bf49 to 164bdf2 Compare August 25, 2025 16:37
// FIXME: Do we need it?
// void StatefulLLMPipelineNPU::reset_kv_state() {
// m_pimpl->reset_kv_state();
// }
Contributor

Please don't forget to address it

Contributor Author

Thanks a lot!

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.



#include "openvino/genai/visual_language/pipeline.hpp"
#include "openvino/runtime/core.hpp"

#include "openvino/genai/generation_handle.hpp"

Copilot AI Sep 25, 2025

Duplicate include of 'openvino/genai/generation_handle.hpp' on line 14 and removed line. The include on line 14 should be removed as it's already included properly with the other includes.

Suggested change
#include "openvino/genai/generation_handle.hpp"


Contributor Author

Thank you, I think I will leave it as is.

Comment on lines 17 to 19
#include "visual_language/processor_config.hpp"

#include "openvino/genai/generation_handle.hpp"
#include "openvino/genai/streamer_base.hpp"

Copilot AI Sep 25, 2025

Duplicate include of 'openvino/genai/generation_handle.hpp' on line 14 and removed line. The include on line 14 should be removed as it's already included properly with the other includes.


Contributor Author

Thank you, I think I will leave it as is.

if (matches_num == m_candidates_num) {
m_candidates_num = std::min(m_candidates_num + 2, m_max_candidates_num);
} else {
m_candidates_num = static_cast<std::size_t>(std::max(static_cast<int64_t>(m_candidates_num) - 1, int64_t(1)));

Copilot AI Sep 25, 2025

The cast to int64_t and back to size_t is unnecessarily complex. Since we're ensuring the result is at least 1, this can be simplified to: m_candidates_num = std::max(m_candidates_num - 1, size_t(1));

Suggested change
m_candidates_num = static_cast<std::size_t>(std::max(static_cast<int64_t>(m_candidates_num) - 1, int64_t(1)));
m_candidates_num = std::max(m_candidates_num - 1, std::size_t(1));


Contributor Author

If m_candidates_num is unsigned, then m_candidates_num - 1 could wrap around, and std::max would then pick from {wrapped_number, 1}, which is not what is intended here.
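
For illustration, a minimal sketch of a clamp that cannot wrap around, assuming m_candidates_num is a std::size_t (the standalone function and its parameter names are just for the sketch):

#include <algorithm>
#include <cstddef>

// Adaptive candidate count: grow by 2 on full acceptance, shrink by 1 otherwise,
// clamping to [1, max_candidates_num] without risking unsigned wrap-around.
void adjust_candidates_num(std::size_t& candidates_num,
                           std::size_t max_candidates_num,
                           std::size_t matches_num) {
    if (matches_num == candidates_num) {
        candidates_num = std::min(candidates_num + 2, max_candidates_num);
    } else {
        // Decrement only while above the lower bound, so the subtraction never underflows.
        candidates_num = (candidates_num > 1) ? candidates_num - 1 : 1;
    }
}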

auto& main_perf_generated_tokens = m_main_request->raw_perf_metrics.m_batch_sizes.back();
main_perf_generated_tokens -= mismatched_candidates;
m_sd_metrics.update_draft_generated_len(0 /* request_id */, candidates_to_generate);
m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number / candidates_to_generate) * 100);

Copilot AI Sep 25, 2025

Integer division will always result in 0 or 1 before multiplication by 100. This should be: (accepted_tokens_number * 100.0) / candidates_to_generate to get the correct percentage calculation.

Suggested change
m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number / candidates_to_generate) * 100);
m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number * 100.0) / candidates_to_generate);


Contributor Author

Great catch! Fixed, thanks!
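
For reference, a small sketch of the corrected metric as a standalone helper (the function name and the division-by-zero guard are additions for the sketch, not part of the pipeline's actual API):

#include <cstddef>

// Acceptance rate in percent; dividing in double avoids the integer-division
// truncation flagged above.
double acceptance_rate_percent(std::size_t accepted_tokens_number,
                               std::size_t candidates_to_generate) {
    if (candidates_to_generate == 0) {
        return 0.0;  // no candidates were drafted this step
    }
    return static_cast<double>(accepted_tokens_number) * 100.0 /
           static_cast<double>(candidates_to_generate);
}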

# NOTE: ContinuousBatching backend uses `num_assistant_tokens` as is. Stateful backend uses `num_assistant_tokens`'s copy as initial
# value and adjusts it based on recent number of accepted tokens. If `num_assistant_tokens` is not set, it defaults to `5` for both
# backends.
# config.num_assistant_tokens = 5

Copilot AI Sep 25, 2025

[nitpick] The comment suggests this line is commented out, but the original version had this uncommented. This change makes the default behavior unclear to users who might expect the parameter to be set.

Suggested change
# config.num_assistant_tokens = 5
config.num_assistant_tokens = 5


Contributor Author

Fixed after Sofia's comment

@github-actions github-actions bot added the category: GGUF GGUF file reader label Sep 29, 2025
@AsyaPronina AsyaPronina force-pushed the spec_decode_on_npu branch 2 times, most recently from 43205b4 to 8267424 Compare September 29, 2025 20:56
@dmatveev dmatveev added this to the 2025.4 milestone Sep 30, 2025
m_request.set_tensor("input_ids", new_input_ids);

auto attention_mask = m_request.get_tensor("attention_mask");
ov::Tensor new_attention_mask(attention_mask.get_element_type(), ov::Shape{BATCH_SIZE, m_num_processed_tokens + tokens_size});
Contributor Author

Create a follow-up for a performance improvement and removal of this allocation.

Contributor Author

Check why we can't just reshape: is this an old issue?

Contributor Author

The issue is reproduced here:

RuntimeError: Exception from src\core\src\runtime\tensor.cpp:85:
Check 'shape_size(new_shape) <= ov::shape_size(m_capacity)' failed at src\inference\src\dev\make_tensor.cpp:96:       
Could set new shape: [1,5]
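
For clarity, a minimal sketch of the grow-and-copy pattern from the snippet above, assuming batch size 1 and an int64 attention mask; allocating a fresh ov::Tensor sidesteps the capacity check that makes the plain reshape fail:

#include <algorithm>
#include <cstddef>
#include <cstdint>

#include "openvino/runtime/infer_request.hpp"

// Extend the attention mask by `tokens_size` new positions (all set to 1),
// copying the already-processed prefix from the old tensor.
void grow_attention_mask(ov::InferRequest& request,
                         std::size_t num_processed_tokens,
                         std::size_t tokens_size) {
    constexpr std::size_t BATCH_SIZE = 1;
    auto attention_mask = request.get_tensor("attention_mask");
    ov::Tensor new_attention_mask(attention_mask.get_element_type(),
                                  ov::Shape{BATCH_SIZE, num_processed_tokens + tokens_size});
    const auto* src = attention_mask.data<int64_t>();
    auto* dst = new_attention_mask.data<int64_t>();
    std::copy_n(src, num_processed_tokens, dst);                // keep the existing mask
    std::fill_n(dst + num_processed_tokens, tokens_size, 1);    // mark the new tokens
    request.set_tensor("attention_mask", new_attention_mask);
}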

prompt = "Alan Turing was a"

# Download and convert model:
main_opt_model, main_hf_tokenizer, main_model_path = download_and_convert_model(MODEL_UNDER_TEST["name"])
Collaborator

Please parametrize the model name.

Contributor Author

Fixed, thanks!

Comment on lines 29 to 30
# It seems like temporary directory from model downloading stage isn't removed after test
# launch for SmolLM2-360M model, that is why it is not used now.
Collaborator

There is no temporary directory for the models in the CI. Models are downloaded to the OV_CACHE path, which is shared between workflows.
You can use SmolLM2-360M here.

Contributor Author

Thank you!!

Contributor Author

Fixed, thanks!

@AsyaPronina
Contributor Author

Here is a commit that, for now, restricts StatefulSpeculativeLLMPipeline to work only if one of the models is requested to be executed on NPU: efcbb66

Please add a thumbs-up to approve it, or comment if you don't agree: @dmatveev @as-suvorov @sbalandi

Labels
category: continuous batching Continuous batching category: CPP API Changes in GenAI C++ public headers category: GGUF GGUF file reader category: llm_bench Label for tool/llm_bench folder category: LLM samples GenAI LLM samples category: LLM LLM pipeline (stateful, static) category: sampling Sampling / Decoding algorithms category: speculative decoding Speculative decoding no-match-files
9 participants