Adding non-CB pipeline for Speculative Decoding #2544
base: master
Conversation
```cpp
void remove_last_generated_tokens(const std::size_t tokens_to_remove);

void trimm_kv_cache(const std::size_t tokens_to_remove);
```
It must be `trim_`, I believe.
Fixed, thanks!
I have a general question: can this scale to CPU/GPU as well, if we want to run speculative decoding in non-CB mode?
Hello! It works on CPU now, but GenAI will choose this non-CB Speculative Decoding path only if NPU is specified for the main model, the draft model, or both. Do we need to change it to allow running on other devices?
I think it depends on whether we have scenarios where the non-CB path can outperform the CB path. BTW, is there a plan to support EAGLE speculative decoding as well?
```cpp
// FIXME: Do we need it?
// void StatefulLLMPipelineNPU::reset_kv_state() {
//     m_pimpl->reset_kv_state();
// }
```
Please don't forget to address it
Thanks a lot!
Pull Request Overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
#include "openvino/genai/visual_language/pipeline.hpp" | ||
#include "openvino/runtime/core.hpp" | ||
|
||
#include "openvino/genai/generation_handle.hpp" |
Duplicate include of 'openvino/genai/generation_handle.hpp': it appears both on line 14 and on the removed line. The include on line 14 should be removed, as the header is already included properly with the other includes.
```cpp
#include "openvino/genai/generation_handle.hpp"
```
Thank you, I think I will leave it as is.
#include "visual_language/processor_config.hpp" | ||
|
||
#include "openvino/genai/generation_handle.hpp" | ||
#include "openvino/genai/streamer_base.hpp" |
Duplicate include of 'openvino/genai/generation_handle.hpp': it appears both on line 14 and on the removed line. The include on line 14 should be removed, as the header is already included properly with the other includes.
Thank you, I think I will leave it as is.
```cpp
if (matches_num == m_candidates_num) {
    m_candidates_num = std::min(m_candidates_num + 2, m_max_candidates_num);
} else {
    m_candidates_num = static_cast<std::size_t>(std::max(static_cast<int64_t>(m_candidates_num) - 1, int64_t(1)));
```
The cast to int64_t and back to size_t is unnecessarily complex. Since we're ensuring the result is at least 1, this can be simplified to: m_candidates_num = std::max(m_candidates_num - 1, size_t(1));
Suggested change:
```diff
- m_candidates_num = static_cast<std::size_t>(std::max(static_cast<int64_t>(m_candidates_num) - 1, int64_t(1)));
+ m_candidates_num = std::max(m_candidates_num - 1, std::size_t(1));
```
If `m_candidates_num` is unsigned, then `m_candidates_num - 1` could wrap around, and the result would be chosen from {wrapped_number, 1}, which is not intended here.
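A minimal standalone sketch of the wrap-around described above (the variable names are illustrative, not the PR's actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>

int main() {
    std::size_t candidates = 0;  // smallest value an unsigned counter can hold

    // Naive version: 0 - 1 wraps around to SIZE_MAX, so std::max picks the
    // wrapped value instead of clamping the result to 1.
    std::size_t naive = std::max(candidates - 1, std::size_t(1));

    // Version from the snippet above: decrement in a signed type, clamp to 1,
    // then cast back, so the result stays at 1.
    std::size_t safe = static_cast<std::size_t>(
        std::max(static_cast<int64_t>(candidates) - 1, int64_t(1)));

    std::cout << naive << "\n";  // 18446744073709551615 on a 64-bit platform
    std::cout << safe << "\n";   // 1
    return 0;
}
```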
auto& main_perf_generated_tokens = m_main_request->raw_perf_metrics.m_batch_sizes.back(); | ||
main_perf_generated_tokens -= mismatched_candidates; | ||
m_sd_metrics.update_draft_generated_len(0 /* request_id */, candidates_to_generate); | ||
m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number / candidates_to_generate) * 100); |
Integer division will always result in 0 or 1 before multiplication by 100. This should be: (accepted_tokens_number * 100.0) / candidates_to_generate to get the correct percentage calculation.
Suggested change:
```diff
- m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number / candidates_to_generate) * 100);
+ m_sd_metrics.update_acceptance_rate(0 /* request_id */, (accepted_tokens_number * 100.0) / candidates_to_generate);
```
Great catch! Fixed, thanks!
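For completeness, a small standalone illustration of the integer-division pitfall Copilot flagged (the values are made up for the example):

```cpp
#include <cstddef>
#include <iostream>

int main() {
    std::size_t accepted_tokens_number = 3;
    std::size_t candidates_to_generate = 5;

    // Integer division truncates 3 / 5 to 0 before the multiplication, so the
    // reported acceptance rate would always be 0 (or 100 when every candidate
    // is accepted).
    std::size_t wrong = (accepted_tokens_number / candidates_to_generate) * 100;

    // Promoting to floating point keeps the fractional part.
    double right = (accepted_tokens_number * 100.0) / candidates_to_generate;

    std::cout << wrong << "\n";  // 0
    std::cout << right << "\n";  // 60
    return 0;
}
```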
```python
# NOTE: ContinuousBatching backend uses `num_assistant_tokens` as is. Stateful backend uses `num_assistant_tokens`'s copy as initial
# value and adjusts it based on recent number of accepted tokens. If `num_assistant_tokens` is not set, it defaults to `5` for both
# backends.
# config.num_assistant_tokens = 5
```
[nitpick] The comment suggests this line is commented out, but the original version had this uncommented. This change makes the default behavior unclear to users who might expect the parameter to be set.
Suggested change:
```diff
- # config.num_assistant_tokens = 5
+ config.num_assistant_tokens = 5
```
Fixed after Sofia's comment
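For reference, the parameter described in that note is set on the generation config when running speculative decoding through the public API. A minimal C++ sketch of that usage (model paths and devices here are placeholders, not taken from the PR):

```cpp
#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Placeholder model directories; point these at exported OpenVINO models.
    std::string main_model_path  = "TinyLlama-1.1B-Chat-v1.0";
    std::string draft_model_path = "TinyLlama_v1.1";

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 100;
    // Number of draft ("assistant") tokens requested per step. The CB backend
    // uses this value as-is; the stateful backend takes it as a starting point
    // and adapts it to how many tokens were recently accepted. If it is not
    // set, both backends default to 5.
    config.num_assistant_tokens = 5;

    ov::genai::LLMPipeline pipe(main_model_path, "CPU",
                                ov::genai::draft_model(draft_model_path, "CPU"));
    std::string result = pipe.generate("Alan Turing was a", config);
    std::cout << result << std::endl;
    return 0;
}
```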
src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp (outdated; resolved)
src/cpp/src/speculative_decoding/speculative_decoding_stateful.cpp (outdated; resolved)
```cpp
m_request.set_tensor("input_ids", new_input_ids);

auto attention_mask = m_request.get_tensor("attention_mask");
ov::Tensor new_attention_mask(attention_mask.get_element_type(), ov::Shape{BATCH_SIZE, m_num_processed_tokens + tokens_size});
```
Create a follow-up for performance improvement and removal of the allocation.
Check why this can't just be a reshape: is it the old issue?
Issue is reproduced here:
```
RuntimeError: Exception from src\core\src\runtime\tensor.cpp:85:
Check 'shape_size(new_shape) <= ov::shape_size(m_capacity)' failed at src\inference\src\dev\make_tensor.cpp:96:
Could set new shape: [1,5]
```
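For illustration, the pattern under discussion (allocate a bigger mask and copy, because `set_shape` on the request's tensor hits the capacity check shown above) could be sketched roughly like this; the helper name and the int64 mask layout are assumptions, not code from the PR:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

#include "openvino/runtime/tensor.hpp"

// Grow a [1, seq_len] attention mask beyond its current length. Calling
// set_shape() on the tensor owned by the infer request fails once the new
// shape exceeds the tensor's capacity, so a fresh tensor is allocated and
// the existing values are copied over.
ov::Tensor grow_attention_mask(const ov::Tensor& old_mask, std::size_t extra_tokens) {
    const std::size_t old_len = old_mask.get_shape()[1];
    ov::Tensor new_mask(old_mask.get_element_type(), ov::Shape{1, old_len + extra_tokens});

    const auto* src = old_mask.data<int64_t>();
    auto* dst = new_mask.data<int64_t>();
    std::copy(src, src + old_len, dst);                                   // keep previous mask values
    std::fill(dst + old_len, dst + old_len + extra_tokens, int64_t{1});   // attend to the new tokens
    return new_mask;
}
```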
src/cpp/src/speculative_decoding/speculative_decoding_stateful.cpp (outdated; resolved)
src/cpp/src/speculative_decoding/speculative_decoding_stateful.cpp (outdated; resolved)
src/cpp/src/speculative_decoding/continuous_batching_for_speculative_decoding_impl.cpp (resolved)
prompt = "Alan Turing was a" | ||
|
||
# Download and convert model: | ||
main_opt_model, main_hf_tokenizer, main_model_path = download_and_convert_model(MODEL_UNDER_TEST["name"]) |
Please parametrize the model name.
Fixed, thanks!
```python
# It seems like temporary directory from model downloading stage isn't removed after test
# launch for SmolLM2-360M model, that is why it is not used now.
```
There is no temporary directory for the models in the CI. Models are downloaded to the OV_CACHE path, which is shared between workflows. You can use SmolLM2-360M here.
Thank you!!
Fixed, thanks!
Here is a commit that restricts StatefulSpeculativeLLMPipeline to work only if one of the models is requested to be executed on NPU, for now: efcbb66. Please add a thumbs-up to approve it, or comment if you don't agree: @dmatveev @as-suvorov @sbalandi
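For context, the device-based routing that commit introduces would look conceptually like the sketch below; this is only an illustration of the rule described above (the function name and the plain substring check are assumptions, not the PR's actual implementation):

```cpp
#include <string>

// The stateful (non-CB) speculative decoding path is selected only when the
// main model, the draft model, or both are requested on NPU; every other
// device combination keeps using the ContinuousBatching-based implementation.
bool use_stateful_speculative_pipeline(const std::string& main_device,
                                       const std::string& draft_device) {
    const bool main_on_npu  = main_device.find("NPU") != std::string::npos;
    const bool draft_on_npu = draft_device.find("NPU") != std::string::npos;
    return main_on_npu || draft_on_npu;
}
```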