Conversation

@cwzrad (Contributor) commented Aug 4, 2025

Details:

  • For the KV-cache model, support generating more than one token per inference, which is needed by speculative decoding.

  • Also update the KV cache according to the position IDs, for fast draft.
    If we have already saved 20 KV-cache entries, the next position ID should be 20. Assume in this case we have 3 input tokens; their position IDs are [20, 21, 22]. After inference we save 3 more KV-cache entries, so the count becomes 23. But if verification on the application side finds that the token at position 22 is not correct, the next inference comes with position IDs [22, 23, 24]: the start position increased by only 2, which tells us the last KV-cache entry from the previous inference is a dirty one and must be dropped (see the sketch after this list).

  • ...
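
As a rough illustration of the position-ID rule above, here is a minimal standalone sketch (not the actual NPUW code; the function name and parameters are hypothetical):

#include <cassert>
#include <cstdint>

// kv_cached_len: number of KV-cache entries saved so far (23 in the
// example above). first_position_id: position ID of the first token of
// the new request (22 after one candidate was rejected).
// Returns how many trailing KV-cache entries are dirty and must be
// dropped before the new tokens are appended.
std::int64_t dirty_kv_tail(std::int64_t kv_cached_len, std::int64_t first_position_id) {
    assert(first_position_id <= kv_cached_len);
    return kv_cached_len - first_position_id;  // 23 - 22 = 1 dirty entry
}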

Tickets:

@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Aug 4, 2025
@cwzrad cwzrad force-pushed the generate-more-token branch from 45a6890 to 621189b Compare August 4, 2025 08:37
@dmatveev (Contributor) commented Aug 6, 2025

@cwzrad if you bring this in for speculative decoding, please synchronize with @AsyaPronina to avoid duplication

@cwzrad (Contributor, Author) commented Aug 6, 2025

> @cwzrad if you bring this in for speculative decoding, please synchronize with @AsyaPronina to avoid duplication

@dmatveev yes, we have had a sync. Hopefully this NPUW change, together with her GenAI pipeline changes in openvinotoolkit/openvino.genai#2544, can make fast draft work, with some co-debugging.

@cwzrad cwzrad force-pushed the generate-more-token branch from 68964c1 to f74a821 Compare August 11, 2025 01:16
@cwzrad cwzrad marked this pull request as ready for review August 11, 2025 13:42
@cwzrad cwzrad requested review from a team as code owners August 11, 2025 13:42
@AsyaPronina (Contributor) commented Aug 25, 2025

Action items for @AsyaPronina:

  • Add alignment for input MAX_GENERATION_TOKEN_LEN
  • Sort out whether we need to return MAX_GENERATION_TOKEN_LEN for the prefill stage only (I think we do). This should work automatically with the 3-model pipeline.
  • Remove extra initialization in prepare_for_new_conversation().

@AsyaPronina AsyaPronina changed the title [NPUW]support generate more than 1 token per inference [NPUW] Support generate more than 1 token per inference Aug 25, 2025
@dmatveev dmatveev added this to the 2025.4 milestone Aug 26, 2025

There are a lot of changes in this file; I expect a thorough review from @AsyaPronina here


Reviewed -> finalized, thanks!

// number of candidates. To differentiate prefill and generate
// calls for main model, we just check that start position id
// is 0, meaning this is the first input prompt.
if (input_ids->get_shape()[INPUT_IDS_SEQ_LEN_DIM] > 1 && position_ids->data<int64_t>()[0] == 0) {

DM: How it will work with prefix caching?


Looks like that should work
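
For context, the condition in the hunk above distinguishes the two call types purely from the request shape: more than one input token combined with a start position of 0 means the first input prompt (prefill), while a multi-token request with a non-zero start position is a speculative generate step. A minimal standalone sketch of that rule (hypothetical helper, not the plugin's actual code):

#include <cstddef>
#include <cstdint>

// True only for the first input prompt (prefill): several tokens whose
// positions start at 0. Multi-token generate steps produced by
// speculative decoding have a start position greater than 0.
bool is_prefill(std::size_t input_seq_len, const std::int64_t* position_ids) {
    return input_seq_len > 1 && position_ids[0] == 0;
}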

{char{0x4c}, char{0x4c}, char{0x4d}, char{0x43}, char{0x4d}, char{0x4f}};

-const constexpr char* NPUW_SERIALIZATION_VERSION = "0.8";
+const constexpr char* NPUW_SERIALIZATION_VERSION = "0.10";

Why did we skip 0.9? Or does 0.9 come with the prefix caching?


Yes, it should!


So it doesn't make sense then. Would you set it back to 0.9 once the prefix caching is merged?


@AsyaPronina I'd recommend making it 0.9. First come, first served.


Fixed, thanks!

@AsyaPronina AsyaPronina added this pull request to the merge queue Sep 8, 2025
Merged via the queue into openvinotoolkit:master with commit 20efbd5 Sep 8, 2025
216 of 220 checks passed