
Conversation


@mengweiguo mengweiguo commented Nov 10, 2025

Description

The model qwen3-embedding-0.6B failed the wwb test on NPU due to a dynamic batch size. This PR adds batch_size option support for this model in llm-benchmark and wwb.

CVS-176378

Checklist:

  • Tests have been updated or added to cover the new code.
  • This patch fully addresses the ticket.
  • I have made corresponding changes to the documentation.

Copilot AI review requested due to automatic review settings November 10, 2025 02:42
@github-actions github-actions bot added the category: llm_bench (Label for tool/llm_bench folder) and category: WWB (PR changes WWB) labels Nov 10, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds batch_size support for embedding models, allowing users to control batch processing during embedding generation. The change enables batch size configuration through a command-line argument and propagates it through the evaluation pipeline to the embedding model configuration.

Key Changes:

  • Added --batch_size command-line argument for configuring batch processing
  • Integrated batch size parameter into embedding evaluator and model loader pipelines
  • Modified embedding evaluation logic to handle potential shape mismatches between gold and prediction data

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Summary per file:

  • tools/who_what_benchmark/whowhatbench/wwb.py: Added batch_size argument parser and propagated the parameter through evaluator creation and kwargs
  • tools/who_what_benchmark/whowhatbench/embeddings_evaluator.py: Added batch_size parameter to evaluator initialization and implemented batching logic for data processing
  • tools/who_what_benchmark/whowhatbench/model_loaders.py: Added batch_size configuration to the GenAI embedding pipeline loader
  • tools/llm_bench/llm_bench_utils/ov_utils.py: Added batch_size configuration to the GenAI text embedding model creation
  • tools/who_what_benchmark/whowhatbench/whowhat_metrics.py: Added trimming logic to handle shape mismatches between gold and prediction embeddings
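For orientation, a minimal sketch of how such a flag flows from the command line into the embedding pipeline configuration. The helper names add_batch_size_arg and apply_batch_size are illustrative; only the parser.add_argument line and the config.batch_size assignment mirror the diffs quoted later in this conversation.

    import argparse

    def add_batch_size_arg(parser: argparse.ArgumentParser) -> None:
        # Optional flag; None means "do not override the pipeline's default batch size".
        parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value')

    def apply_batch_size(config, **kwargs):
        # `config` stands in for the GenAI text-embedding pipeline config object;
        # keep the existing value when the caller did not supply batch_size.
        if kwargs.get("batch_size") is not None:
            config.batch_size = kwargs["batch_size"]
        return config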

@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from 1b58a22 to 5be2b0c on November 10, 2025 03:00
@mengweiguo mengweiguo requested a review from Copilot November 10, 2025 03:01
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.



@mengweiguo mengweiguo changed the title Add batch_size support for embedding model [NPU] Add batch_size support for embedding model Nov 10, 2025
if gold_data.shape[0] != prediction_data.shape[0]:
    print(f"Warning: gold_data rows = {gold_data.shape[0]}, prediction_data rows = {prediction_data.shape[0]}")

min_len = min(gold_data.shape[0], prediction_data.shape[0])
Collaborator

Why is this needed? Why are the reference batch and the predicted batch different?

Author

When batch_size is set, the target run may use only a subset of dataset_documents as input, like the line below.

   docs_to_embed = dataset_documents[: config.batch_size] if config.batch_size else dataset_documents

However, it seems the reference dataset is generated with the entire dataset_documents in wwb.
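For context, a minimal sketch of how the row-count mismatch can arise when the reference embeddings cover the full document list but the target run only embeds the first batch_size documents. embed_fn, the document count, and the embedding width are hypothetical stand-ins.

    import numpy as np

    EMBED_DIM = 1024  # hypothetical embedding width

    def embed_fn(docs):
        # Stand-in for the real embedding pipeline: one row per document.
        return np.random.rand(len(docs), EMBED_DIM)

    dataset_documents = [f"doc {i}" for i in range(32)]
    batch_size = 8

    gold_data = embed_fn(dataset_documents)                      # reference: all 32 documents
    prediction_data = embed_fn(dataset_documents[:batch_size])   # target: first batch only

    # Row counts differ, which is what the min_len trimming above works around.
    print(gold_data.shape[0], prediction_data.shape[0])          # 32 8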

Contributor

Sorry, I still don't understand why it's needed. Do you want to make it possible to compare ground-truth data that was generated without a batch against output that was generated with a batch? What is the reference dataset?

Author
@mengweiguo mengweiguo Nov 11, 2025

> Do you want to make it possible to compare ground-truth data that was generated without a batch against output that was generated with a batch? What is the reference dataset?

Yes, there may be a scenario where the reference data is generated without a batch and the target data is generated with a batch. In the wwb CI, it seems that the reference data is pre-generated without a batch setting.

Contributor

OK, how could we detect the situation when the batch size is not applied? Maybe we can pass the batch size to evaluate()?
Or maybe the data can be regenerated, if the issue is only in CI?

Author

> OK, how could we detect the situation when the batch size is not applied? Maybe we can pass the batch size to evaluate()? Or maybe the data can be regenerated, if the issue is only in CI?

I don't think I can control the CI behavior; what I can do is adapt to the CI requirements. So I took the "pass the batch size to evaluate()" approach to run inference with the proper dataset.

Contributor

Are you talking about the GenAI CI and the wwb tests? You can update the tests.

Collaborator

Scenario 1:

  1. Reference data generated with batch size 10
  2. Predictions are generated with batch size 5
  3. We compare reference[:5] and predicted[:5] data and get 0.9 similarity

Scenario 2:

  1. Reference data generated with batch size 5
  2. Predictions are generated with batch size 10
  3. We compare reference[:5] and predicted[:5] data and get 0.9 similarity

For scenario 2, generating predictions with batch_size > 5 is no longer meaningful; we would always get the same similarity.
I think this manipulation of the results is confusing. Could you please explain your use case for different batch sizes for the reference and the predictions?
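A small numeric sketch of this point, assuming the metric averages per-row cosine similarity after trimming both arrays to min_len; the helper below is an illustration, not the exact wwb implementation.

    import numpy as np

    def mean_cosine_similarity(gold: np.ndarray, pred: np.ndarray) -> float:
        # Trim both arrays to the shorter one, then average per-row cosine similarity.
        min_len = min(gold.shape[0], pred.shape[0])
        gold, pred = gold[:min_len], pred[:min_len]
        num = (gold * pred).sum(axis=1)
        den = np.linalg.norm(gold, axis=1) * np.linalg.norm(pred, axis=1)
        return float((num / den).mean())

    rng = np.random.default_rng(0)
    reference = rng.random((5, 16))        # reference generated with batch size 5
    predictions = rng.random((10, 16))     # predictions generated with batch size 10

    # Rows 5..9 of the predictions never influence the score once both sides are trimmed.
    print(mean_cosine_similarity(reference, predictions))
    print(mean_cosine_similarity(reference, predictions[:5]))  # identical value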

Author

I encountered the NPU WWB failure mentioned in https://jira.devtools.intel.com/browse/EISW-191591. From the log https://jira.devtools.intel.com/secure/attachment/5635768/WW41-2025.4.0-20128_job_LO_NPU_acc_wwb_wwb_ref_nat_vs_genai_WIN_NPU_LNL_5.log I assumed that the reference data may be pre-generated without a batch and shared between devices (GPU, CPU, NPU), but that may not be true. If the batch size can be set when generating the reference data for a device, the scenario of different batch sizes between reference and prediction will not exist. I will look into the WWB test. Thanks.

Author

Reverted this change.

"If the base/target model is a HuggingFace model ID, gguf-file should be a relative path.",
)

parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value')
Collaborator

Please add a test on CPU, at least in precommit.

Author

Added in test_rag.py

Collaborator

This is the wrong place for wwb tests. Please test wwb --batch_size in the wwb tests.

if padding_side:
    config.padding_side = padding_side

config.batch_size = kwargs.get("batch_size", config.batch_size)
Collaborator

Should this be documented and added to the help? And tests?

Author

Actually, this batch_size option is already present in the benchmark:

parser.add_argument('-bs', '--batch_size', type=int, default=1, required=False, help='Batch size value')

@github-actions github-actions bot added the category: GGUF (GGUF file reader) and category: RAG (RAG pipeline components) labels Nov 10, 2025
passages.append(data[0])

batch_size = self.batch_size or len(data[0])
data_input = data[0][:batch_size]
Contributor

What will be the behavior if the chunk of input data is less than the batch size?

Author
@mengweiguo mengweiguo Nov 11, 2025

Added the check.

Contributor
@sbalandi sbalandi Nov 11, 2025

I meant inside the plugin; there's no point in the line min(batch_size, len(data[0])), since it will just take the maximum number of elements the list can provide. But if we set the batch to 10 and send 8 items as input, what will the plugin do?

Author
@mengweiguo mengweiguo Nov 12, 2025

An exception is thrown by the text-embedding pipeline if the batch size and the data size do not match. I also added an assert check, as below.

+            assert batch_size <= len(data[0]), \
+                f"batch_size ({batch_size}) cannot be greater than data length ({len(data[0])})"

I don't know if it is redundant.

Contributor

Let's discuss this before making changes.

If I understand correctly, the plugin will crash if we say the batch is 10 but provide 7 items as input.
We can't always control the input data; a dataset can potentially contain chunks of different lengths. I'd suggest not raising an exception, but adding logic so that if the real input batch is smaller, wwb duplicates the data up to the batch size.
For example:
batch size = 5
input passages ['a', 'b', 'c']
this is not enough for us, so we make the input data ['a', 'b', 'c', 'a', 'b']

@mengweiguo @as-suvorov What do you think about this?

Collaborator

Good point. Currently, if we fix TextEmbeddingPipeline with batch_size=10, the pipeline fails if the number of documents passed != 10. But it's not plugin-related; it's a GenAI implementation limitation. We plan to fix it in the next release. I like the data duplication approach.
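A minimal sketch of the data-duplication idea, under the stated assumption that the pipeline requires exactly batch_size documents per call; the helper name pad_to_batch_size is illustrative.

    from itertools import cycle, islice

    def pad_to_batch_size(passages, batch_size):
        # Truncate when the chunk is larger than the batch; otherwise repeat
        # the passages cyclically until the batch is full.
        if len(passages) >= batch_size:
            return passages[:batch_size]
        return list(islice(cycle(passages), batch_size))

    # Example from the comment above: batch size 5, input ['a', 'b', 'c'].
    print(pad_to_batch_size(['a', 'b', 'c'], 5))  # ['a', 'b', 'c', 'a', 'b']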

Copilot AI review requested due to automatic review settings November 11, 2025 02:15
@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from f99dee5 to f42abcf on November 11, 2025 02:15
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.



@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from f42abcf to 7e59fcb on November 11, 2025 02:29
@mengweiguo mengweiguo marked this pull request as ready for review November 11, 2025 08:40
Copilot AI review requested due to automatic review settings November 11, 2025 08:40
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.



Copilot AI review requested due to automatic review settings November 12, 2025 01:56
@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from afaef0d to fba9cbf on November 12, 2025 01:56
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.



"If the base/target model is a HuggingFace model ID, gguf-file should be a relative path.",
)

parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's wrong place for wwb tests. Please test wwb --batch_size in wwb tests.

@mengweiguo mengweiguo force-pushed the add-batch-size-support branch from fba9cbf to af6a54d on November 13, 2025 02:19
@github-actions github-actions bot removed the category: GGUF (GGUF file reader) and category: RAG (RAG pipeline components) labels Nov 13, 2025
"--genai",
])

def test_embeddings_with_batch(model_id, model_type, tmp_path):
Collaborator

Did you run this test locally?
The model_id and model_type parameters seem to be missing, so I expect this test to fail.
Let's parametrize this test with a batch_size parameter instead of a hardcoded value.

Author

Updated, and the tests pass locally. Is it necessary to introduce setup_module/teardown_module so that model conversion is done only once?
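One possible shape for such a parametrized test, invoking the wwb CLI as a subprocess. Only --batch_size, --genai, and the test name come from this PR; the model ID, the other flags, and the entry-point invocation are assumptions, not the actual test code.

    import subprocess

    import pytest

    @pytest.mark.parametrize("batch_size", [1, 2, 4])
    def test_embeddings_with_batch(tmp_path, batch_size):
        # Hypothetical invocation: flags other than --batch_size and --genai are assumed.
        result = subprocess.run(
            ["wwb",
             "--base-model", "BAAI/bge-small-en-v1.5",   # assumed small embedding model
             "--model-type", "text-embedding",
             "--gt-data", str(tmp_path / "gt.csv"),
             "--num-samples", "1",
             "--batch_size", str(batch_size),
             "--genai"],
            capture_output=True, text=True,
        )
        assert result.returncode == 0, result.stderr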

'-bs', '--batch_size',
type=int,
default=None,
help='Batch size value')
Collaborator

@sbalandi Do we want to propagate batch_size to other types of tasks? I'm wondering whether we need to make this parameter task-specific, like embeds_batch_size or potentially rag_batch_size.

Contributor

Yes, I would make it more specific.
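If the flag is made task-specific as suggested, one possible shape is sketched below; the final name chosen in the PR may differ.

    import argparse

    parser = argparse.ArgumentParser()
    # Task-specific variant of the flag; applies to the text-embedding task only.
    parser.add_argument('--embeds_batch_size', type=int, default=None,
                        help='Batch size for the text-embedding task')
    args = parser.parse_args(['--embeds_batch_size', '4'])
    print(args.embeds_batch_size)  # 4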

Copilot AI review requested due to automatic review settings November 13, 2025 15:09
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.


