Conversation

@cwzrad (Contributor) commented Aug 4, 2025

Details:

  • For the KV-cache model, support generating more than one token per inference, which is needed by speculative decoding.

  • Also update the KV cache according to the position IDs, for fast draft.
    If we have already saved 20 KV-cache entries, the next position ID should be 20. Assume in this case we have 3 input tokens; their position IDs are [20, 21, 22]. After inference we save 3 more KV-cache entries, so the count becomes 23. But if verification on the application side finds that the token at position 22 is not correct, the next inference comes with position IDs [22, 23, 24]: the start position increased by only 2, which tells us the last KV-cache entry from the previous inference is a dirty one and must be dropped (see the sketch after this list).

  • ...
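
As a rough illustration of the position-ID rule above, here is a minimal standalone sketch (not the actual NPUW code; the function name and parameters are hypothetical):

#include <cassert>
#include <cstdint>

// kv_cached_len: number of KV-cache entries saved so far (23 in the
// example above). first_position_id: position ID of the first token of
// the new request (22 after one candidate was rejected).
// Returns how many trailing KV-cache entries are dirty and must be
// dropped before the new tokens are appended.
std::int64_t dirty_kv_tail(std::int64_t kv_cached_len, std::int64_t first_position_id) {
    assert(first_position_id <= kv_cached_len);
    return kv_cached_len - first_position_id;  // 23 - 22 = 1 dirty entry
}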

Tickets:

@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Aug 4, 2025
@cwzrad cwzrad force-pushed the generate-more-token branch from 45a6890 to 621189b Compare August 4, 2025 08:37
@dmatveev (Contributor) commented Aug 6, 2025

@cwzrad if you bring this in for speculative decoding, please synchronize with @AsyaPronina to avoid duplication

@cwzrad (Contributor, Author) commented Aug 6, 2025

> @cwzrad if you bring this in for speculative decoding, please synchronize with @AsyaPronina to avoid duplication

@dmatveev yes, we have had a sync. Hopefully this NPUW change, together with her GenAI pipeline changes in openvinotoolkit/openvino.genai#2544, can make fast draft work, with some co-debugging.

@cwzrad cwzrad force-pushed the generate-more-token branch from 68964c1 to f74a821 Compare August 11, 2025 01:16
@cwzrad cwzrad marked this pull request as ready for review August 11, 2025 13:42
@cwzrad cwzrad requested review from a team as code owners August 11, 2025 13:42
@AsyaPronina (Contributor) commented Aug 25, 2025

Action items for @AsyaPronina:

  • Add alignment for input MAX_GENERATION_TOKEN_LEN
  • Sort out whether we need to return MAX_GENERATION_TOKEN_LEN for the prefill stage only (I think we do). This should work automatically with the 3-model pipeline.
  • Remove extra initialization in prepare_for_new_conversation().

@AsyaPronina AsyaPronina changed the title [NPUW]support generate more than 1 token per inference [NPUW] Support generate more than 1 token per inference Aug 25, 2025
@dmatveev dmatveev added this to the 2025.4 milestone Aug 26, 2025

There are a lot of changes in this file; I expect a thorough review from @AsyaPronina here


Reviewed -> finalized, thanks!

// number of candidates. To differentiate prefill and generate
// calls for main model, we just check that start position id
// is 0, meaning this is the first input prompt.
if (input_ids->get_shape()[INPUT_IDS_SEQ_LEN_DIM] > 1 && position_ids->data<int64_t>()[0] == 0) {

DM: How it will work with prefix caching?


Looks like that should work
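
For context, the condition in the hunk above distinguishes the two call types purely from the request shape: more than one input token combined with a start position of 0 means the first input prompt (prefill), while a multi-token request with a non-zero start position is a speculative generate step. A minimal standalone sketch of that rule (hypothetical helper, not the plugin's actual code):

#include <cstddef>
#include <cstdint>

// True only for the first input prompt (prefill): several tokens whose
// positions start at 0. Multi-token generate steps produced by
// speculative decoding have a start position greater than 0.
bool is_prefill(std::size_t input_seq_len, const std::int64_t* position_ids) {
    return input_seq_len > 1 && position_ids[0] == 0;
}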

{char{0x4c}, char{0x4c}, char{0x4d}, char{0x43}, char{0x4d}, char{0x4f}};

-const constexpr char* NPUW_SERIALIZATION_VERSION = "0.8";
+const constexpr char* NPUW_SERIALIZATION_VERSION = "0.10";

Why did we skip 0.9? Or does 0.9 come with the prefix caching?


Yes, it should!


So it doesn't make sense then. Would you set it back to 0.9 once the prefix caching is merged?


@AsyaPronina I'd recommend making it 0.9. First come, first served.


Fixed, thanks!

@AsyaPronina AsyaPronina added this pull request to the merge queue Sep 8, 2025
Merged via the queue into openvinotoolkit:master with commit 20efbd5 Sep 8, 2025
216 of 220 checks passed