
Conversation


@mengweiguo mengweiguo commented Nov 10, 2025

Description

The model qwen3-embedding-0.6B failed the wwb test on NPU due to a dynamic batch size. This PR adds batch_size option support for this model in llm-benchmark and wwb.

CVS-176378

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings November 10, 2025 02:42
@github-actions github-actions bot added the category: llm_bench (Label for tool/llm_bench folder) and category: WWB (PR changes WWB) labels Nov 10, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds batch_size support for embedding models, allowing users to control batch processing during embedding generation. The change enables batch size configuration through a command-line argument and propagates it through the evaluation pipeline to the embedding model configuration.

Key Changes:

  • Added --batch_size command-line argument for configuring batch processing
  • Integrated batch size parameter into embedding evaluator and model loader pipelines
  • Modified embedding evaluation logic to handle potential shape mismatches between gold and prediction data

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • tools/who_what_benchmark/whowhatbench/wwb.py: Added batch_size argument parser and propagated the parameter through evaluator creation and kwargs
  • tools/who_what_benchmark/whowhatbench/embeddings_evaluator.py: Added batch_size parameter to evaluator initialization and implemented batching logic for data processing
  • tools/who_what_benchmark/whowhatbench/model_loaders.py: Added batch_size configuration to the GenAI embedding pipeline loader
  • tools/llm_bench/llm_bench_utils/ov_utils.py: Added batch_size configuration to the GenAI text embedding model creation
  • tools/who_what_benchmark/whowhatbench/whowhat_metrics.py: Added trimming logic to handle shape mismatches between gold and prediction embeddings
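For orientation, a minimal sketch of how such a flag flows from the command line into the embedding pipeline configuration. The helper names add_batch_size_arg and apply_batch_size are illustrative; only the parser.add_argument line and the config.batch_size assignment mirror the diffs quoted later in this conversation.

    import argparse

    def add_batch_size_arg(parser: argparse.ArgumentParser) -> None:
        # Optional flag; None means "do not override the pipeline's default batch size".
        parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value')

    def apply_batch_size(config, **kwargs):
        # `config` stands in for the GenAI text-embedding pipeline config object;
        # keep the existing value when the caller did not supply batch_size.
        if kwargs.get("batch_size") is not None:
            config.batch_size = kwargs["batch_size"]
        return config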

@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from 1b58a22 to 5be2b0c on November 10, 2025 03:00
@mengweiguo mengweiguo requested a review from Copilot November 10, 2025 03:01
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.



@mengweiguo mengweiguo changed the title Add batch_size support for embedding model [NPU] Add batch_size support for embedding model Nov 10, 2025
if gold_data.shape[0] != prediction_data.shape[0]:
    print(f"Warning: gold_data rows = {gold_data.shape[0]}, prediction_data rows = {prediction_data.shape[0]}")

min_len = min(gold_data.shape[0], prediction_data.shape[0])
Collaborator

Why is this needed? Why are the reference batch and the predicted batch different?

Author

When batch_size is set, the target run may use only a subset of dataset_documents as input, like the line below.

   docs_to_embed = dataset_documents[: config.batch_size] if config.batch_size else dataset_documents

However, it seems the reference dataset is generated with the entire dataset_documents in wwb.
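For context, a minimal sketch of how the row-count mismatch can arise when the reference embeddings cover the full document list but the target run only embeds the first batch_size documents. embed_fn, the document count, and the embedding width are hypothetical stand-ins.

    import numpy as np

    EMBED_DIM = 1024  # hypothetical embedding width

    def embed_fn(docs):
        # Stand-in for the real embedding pipeline: one row per document.
        return np.random.rand(len(docs), EMBED_DIM)

    dataset_documents = [f"doc {i}" for i in range(32)]
    batch_size = 8

    gold_data = embed_fn(dataset_documents)                      # reference: all 32 documents
    prediction_data = embed_fn(dataset_documents[:batch_size])   # target: first batch only

    # Row counts differ, which is what the min_len trimming above works around.
    print(gold_data.shape[0], prediction_data.shape[0])          # 32 8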

Contributor

Sorry, I still don't understand why it's needed. Do you want to make it possible to compare ground-truth data that was generated without a batch against output that was generated with a batch? What is the reference dataset?

Author
@mengweiguo mengweiguo Nov 11, 2025

> Do you want to make it possible to compare ground-truth data that was generated without a batch against output that was generated with a batch? What is the reference dataset?

Yes, there may be a scenario where the reference data is generated without a batch and the target data is generated with a batch. In the wwb CI, it seems that the reference data is pre-generated without a batch setting.

Contributor

OK, how could we detect the situation when the batch size is not applied? Maybe we can pass the batch size to evaluate()?
Or maybe the data can be regenerated, if the issue is only in CI?

Author

> OK, how could we detect the situation when the batch size is not applied? Maybe we can pass the batch size to evaluate()? Or maybe the data can be regenerated, if the issue is only in CI?

I don't think I can control the CI behavior; what I can do is adapt to the CI requirements. So I took the "pass the batch size to evaluate()" approach to run inference with the proper dataset.

Contributor

Are you talking about the GenAI CI and the wwb tests? You can update the tests.

Collaborator

Scenario 1:

  1. Reference data generated with batch size 10
  2. Predictions are generated with batch size 5
  3. We compare reference[:5] and predicted[:5] data and get 0.9 similarity

Scenario 2:

  1. Reference data generated with batch size 5
  2. Predictions are generated with batch size 10
  3. We compare reference[:5] and predicted[:5] data and get 0.9 similarity

For scenario 2, generating predictions with batch_size > 5 is no longer meaningful; we would always get the same similarity.
I think this manipulation of the results is confusing. Could you please explain your use case for different batch sizes for the reference and the predictions?
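A small numeric sketch of this point, assuming the metric averages per-row cosine similarity after trimming both arrays to min_len; the helper below is an illustration, not the exact wwb implementation.

    import numpy as np

    def mean_cosine_similarity(gold: np.ndarray, pred: np.ndarray) -> float:
        # Trim both arrays to the shorter one, then average per-row cosine similarity.
        min_len = min(gold.shape[0], pred.shape[0])
        gold, pred = gold[:min_len], pred[:min_len]
        num = (gold * pred).sum(axis=1)
        den = np.linalg.norm(gold, axis=1) * np.linalg.norm(pred, axis=1)
        return float((num / den).mean())

    rng = np.random.default_rng(0)
    reference = rng.random((5, 16))        # reference generated with batch size 5
    predictions = rng.random((10, 16))     # predictions generated with batch size 10

    # Rows 5..9 of the predictions never influence the score once both sides are trimmed.
    print(mean_cosine_similarity(reference, predictions))
    print(mean_cosine_similarity(reference, predictions[:5]))  # identical value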

Author

I encountered the NPU WWB failure mentioned in https://jira.devtools.intel.com/browse/EISW-191591. From the log https://jira.devtools.intel.com/secure/attachment/5635768/WW41-2025.4.0-20128_job_LO_NPU_acc_wwb_wwb_ref_nat_vs_genai_WIN_NPU_LNL_5.log I assumed that the reference data may be pre-generated without a batch and shared between devices (GPU, CPU, NPU), but that may not be true. If the batch size can be set when generating the reference data for a device, the scenario of different batch sizes between reference and prediction will not exist. I will look into the WWB test. Thanks.

Author

Reverted this change.

"If the base/target model is a HuggingFace model ID, gguf-file should be a relative path.",
)

parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value')
Collaborator

Please add a test on CPU, at least in precommit.

Author

Added in test_rag.py

Collaborator

This is the wrong place for wwb tests. Please test wwb --batch_size in the wwb tests.

if padding_side:
    config.padding_side = padding_side

config.batch_size = kwargs.get("batch_size", config.batch_size)
Collaborator

Should this be documented and added to the help? And tests?

Author

Actually, this batch_size option is already present in the benchmark:

parser.add_argument('-bs', '--batch_size', type=int, default=1, required=False, help='Batch size value')

@github-actions github-actions bot added the category: GGUF (GGUF file reader) and category: RAG (RAG pipeline components) labels Nov 10, 2025
passages.append(data[0])

batch_size = self.batch_size or len(data[0])
data_input = data[0][:batch_size]
Contributor

What will be the behavior if the chunk of input data is less than the batch size?

Author
@mengweiguo mengweiguo Nov 11, 2025

Added the check.

Contributor
@sbalandi sbalandi Nov 11, 2025

I meant inside the plugin; there's no point in the line min(batch_size, len(data[0])), since it will just take the maximum number of elements the list can provide. But if we set the batch to 10 and send 8 items as input, what will the plugin do?

Author
@mengweiguo mengweiguo Nov 12, 2025

An exception is thrown by the text-embedding pipeline if the batch size and the data size do not match. I also added an assert check, as below.

+            assert batch_size <= len(data[0]), \
+                f"batch_size ({batch_size}) cannot be greater than data length ({len(data[0])})"

I don't know if it is redundant.

Contributor

Let's discuss this before making changes.

If I understand correctly, the plugin will crash if we say the batch is 10 but provide 7 items as input.
We can't always control the input data; a dataset can potentially contain chunks of different lengths. I'd suggest not raising an exception, but adding logic so that if the real input batch is smaller, wwb duplicates the data up to the batch size.
For example:
batch size = 5
input passages ['a', 'b', 'c']
this is not enough for us, so we make the input data ['a', 'b', 'c', 'a', 'b']

@mengweiguo @as-suvorov What do you think about this?

Collaborator

Good point. Currently, if we fix TextEmbeddingPipeline with batch_size=10, the pipeline fails if the number of documents passed != 10. But it's not plugin-related; it's a GenAI implementation limitation. We plan to fix it in the next release. I like the data duplication approach.
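A minimal sketch of the data-duplication idea, under the stated assumption that the pipeline requires exactly batch_size documents per call; the helper name pad_to_batch_size is illustrative.

    from itertools import cycle, islice

    def pad_to_batch_size(passages, batch_size):
        # Truncate when the chunk is larger than the batch; otherwise repeat
        # the passages cyclically until the batch is full.
        if len(passages) >= batch_size:
            return passages[:batch_size]
        return list(islice(cycle(passages), batch_size))

    # Example from the comment above: batch size 5, input ['a', 'b', 'c'].
    print(pad_to_batch_size(['a', 'b', 'c'], 5))  # ['a', 'b', 'c', 'a', 'b']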

Copilot AI review requested due to automatic review settings November 11, 2025 02:15
@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from f99dee5 to f42abcf on November 11, 2025 02:15
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.



@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from f42abcf to 7e59fcb on November 11, 2025 02:29
@mengweiguo mengweiguo marked this pull request as ready for review November 11, 2025 08:40
Copilot AI review requested due to automatic review settings November 11, 2025 08:40
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.



Copilot AI review requested due to automatic review settings November 12, 2025 01:56
@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from afaef0d to fba9cbf on November 12, 2025 01:56
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.



"If the base/target model is a HuggingFace model ID, gguf-file should be a relative path.",
)

parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wrong place for wwb tests. Please test wwb --batch_size in wwb tests.

@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from fba9cbf to af6a54d on November 13, 2025 02:19
@github-actions github-actions bot removed the category: GGUF (GGUF file reader) and category: RAG (RAG pipeline components) labels Nov 13, 2025
"--genai",
])

def test_embeddings_with_batch(model_id, model_type, tmp_path):
Collaborator

Did you run this test locally?
The model_id and model_type parameters seem to be missing, so I expect this test to fail.
Let's parametrize this test with a batch_size parameter instead of a hardcoded value.

Author

Updated, and the tests pass locally. Is it necessary to introduce setup_module/teardown_module so that model conversion is done only once?
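One possible shape for such a parametrized test, invoking the wwb CLI as a subprocess. Only --batch_size, --genai, and the test name come from this PR; the model ID, the other flags, and the entry-point invocation are assumptions, not the actual test code.

    import subprocess

    import pytest

    @pytest.mark.parametrize("batch_size", [1, 2, 4])
    def test_embeddings_with_batch(tmp_path, batch_size):
        # Hypothetical invocation: flags other than --batch_size and --genai are assumed.
        result = subprocess.run(
            ["wwb",
             "--base-model", "BAAI/bge-small-en-v1.5",   # assumed small embedding model
             "--model-type", "text-embedding",
             "--gt-data", str(tmp_path / "gt.csv"),
             "--num-samples", "1",
             "--batch_size", str(batch_size),
             "--genai"],
            capture_output=True, text=True,
        )
        assert result.returncode == 0, result.stderr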

'-bs', '--batch_size',
type=int,
default=None,
help='Batch size value')
Collaborator

@sbalandi Do we want to propagate batch_size to other types of tasks? I'm wondering whether we need to make this parameter task-specific, like embeds_batch_size or potentially rag_batch_size.

Contributor

Yes, I would make it more specific.
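If the flag is made task-specific as suggested, one possible shape is sketched below; the final name chosen in the PR may differ.

    import argparse

    parser = argparse.ArgumentParser()
    # Task-specific variant of the flag; applies to the text-embedding task only.
    parser.add_argument('--embeds_batch_size', type=int, default=None,
                        help='Batch size for the text-embedding task')
    args = parser.parse_args(['--embeds_batch_size', '4'])
    print(args.embeds_batch_size)  # 4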

Copilot AI review requested due to automatic review settings November 13, 2025 15:09
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


