[NPU] Add batch_size support for embedding model
#2986
Conversation
Pull Request Overview
This PR adds batch_size support for embedding models, allowing users to control batch processing during embedding generation. The change enables batch size configuration through a command-line argument and propagates it through the evaluation pipeline to the embedding model configuration.
Key Changes:
- Added --batch_size command-line argument for configuring batch processing
- Integrated the batch size parameter into the embedding evaluator and model loader pipelines
- Modified embedding evaluation logic to handle potential shape mismatches between gold and prediction data
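As a rough illustration of how that propagation could look (a minimal Python sketch with assumed helper names, not the actual PR diff), the batch size flows from the CLI arguments into the pipeline kwargs and finally into the embedding config:

import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # Flag described in this PR; default None means "no batch size requested".
    parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value')
    return parser

def apply_batch_size(config, **kwargs):
    # Hypothetical loader helper: overwrite the pipeline batch size only when provided.
    config["batch_size"] = kwargs.get("batch_size", config.get("batch_size"))
    return config

args = build_parser().parse_args(['-bs', '4'])
print(apply_batch_size({"batch_size": 1}, batch_size=args.batch_size))  # {'batch_size': 4}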
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tools/who_what_benchmark/whowhatbench/wwb.py | Added batch_size argument parser and propagated the parameter through evaluator creation and kwargs |
| tools/who_what_benchmark/whowhatbench/embeddings_evaluator.py | Added batch_size parameter to evaluator initialization and implemented batching logic for data processing |
| tools/who_what_benchmark/whowhatbench/model_loaders.py | Added batch_size configuration to the GenAI embedding pipeline loader |
| tools/llm_bench/llm_bench_utils/ov_utils.py | Added batch_size configuration to the GenAI text embedding model creation |
| tools/who_what_benchmark/whowhatbench/whowhat_metrics.py | Added trimming logic to handle shape mismatches between gold and prediction embeddings |
Force-pushed from 1b58a22 to 5be2b0c
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
if gold_data.shape[0] != prediction_data.shape[0]:
    print(f"Warning: gold_data rows = {gold_data.shape[0]}, prediction_data rows = {prediction_data.shape[0]}")

min_len = min(gold_data.shape[0], prediction_data.shape[0])
Why is this needed? Why would the reference batch and the predicted batch be different?
When batch_size is set, the embeddings may be generated from only a subset of dataset_documents, something like the line below.
docs_to_embed = dataset_documents[: config.batch_size] if config.batch_size else dataset_documents
However, it seems the reference dataset in wwb is generated from the entire dataset_documents.
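For illustration only (hypothetical values, assuming config.batch_size is 2), the slicing above behaves like this:

dataset_documents = ["doc0", "doc1", "doc2", "doc3"]
docs_to_embed = dataset_documents[:2]  # ["doc0", "doc1"], while the wwb reference covers all four documents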
Sorry, I still don't understand why this is needed. Do you want to make it possible to compare ground-truth data generated without batching against output generated with batching? What is the reference dataset?
Do you want to make it possible to compare ground-truth data generated without batching against output generated with batching? What is the reference dataset?
Yes, there can be a scenario where the reference data is generated without batching and the target data is generated with batching. In the wwb CI, it seems that the reference data is pre-generated without any batch setting.
OK, how could we detect the situation when the batch size is not applied? Maybe we can pass the batch size to evaluate()?
Or maybe the data can be regenerated, if this only matters for CI?
OK, how could we detect the situation when the batch size is not applied? Maybe we can pass the batch size to evaluate()? Or maybe the data can be regenerated, if this only matters for CI?
I don't think I can control the CI behavior; all I can do is adapt to the CI requirements. So I took the "pass the batch size to evaluate()" approach so that inference runs with the proper dataset.
Do you mean the GenAI CI and the wwb tests? You can update the tests.
Scenario 1:
- Reference data generated with batch size 10
- Predictions generated with batch size 5
- We compare reference[:5] and predicted[:5] and get 0.9 similarity
Scenario 2:
- Reference data generated with batch size 5
- Predictions generated with batch size 10
- We compare reference[:5] and predicted[:5] and get 0.9 similarity
For scenario 2, generating predictions with batch_size > 5 is no longer meaningful; we would always get the same similarity.
I think this manipulation of the results is confusing. Could you please explain your use case for different batch sizes for reference and predictions?
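To make the trimming concrete, here is a small numeric illustration (made-up shapes, not data from the PR):

import numpy as np

gold_data = np.ones((10, 8))        # reference embeddings: 10 rows
prediction_data = np.ones((5, 8))   # predictions generated with a smaller batch: 5 rows
min_len = min(gold_data.shape[0], prediction_data.shape[0])
# Only the first min_len rows enter the similarity metric; the remaining
# reference rows (or, in scenario 2, the extra predictions) never affect it.
print(gold_data[:min_len].shape, prediction_data[:min_len].shape)  # (5, 8) (5, 8)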
I encountered the NPU WWB failure mentioned in https://jira.devtools.intel.com/browse/EISW-191591. From the log https://jira.devtools.intel.com/secure/attachment/5635768/WW41-2025.4.0-20128_job_LO_NPU_acc_wwb_wwb_ref_nat_vs_genai_WIN_NPU_LNL_5.log I assumed that the reference data might be pre-generated without batching and shared between devices (GPU, CPU, NPU). That may not be true. If the batch size can be set when generating the reference data for a device, the scenario of different batch sizes between reference and prediction will not exist. I will look into the WWB tests. Thanks.
Reverted this change.
| "If the base/target model is a HuggingFace model ID, gguf-file should be a relative path.", | ||
| ) | ||
|
|
||
| parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value') |
Please add a test on CPU, at least in precommit.
Added in test_rag.py
That's the wrong place for wwb tests. Please test wwb --batch_size in the wwb tests.
if padding_side:
    config.padding_side = padding_side

config.batch_size = kwargs.get("batch_size", config.batch_size)
Should this be documented and added to the help text? And covered by tests?
Actually, the batch_size option is already there in the benchmark:
parser.add_argument('-bs', '--batch_size', type=int, default=1, required=False, help='Batch size value')
passages.append(data[0])

batch_size = self.batch_size or len(data[0])
data_input = data[0][:batch_size]
What will the behavior be if the chunk of input data is smaller than the batch size?
Added the check.
I meant inside the plugin. There is no point in min(batch_size, len(data[0])); it just takes the maximum number of elements the list can give. But if we set batch 10 and send 8 as input, what will the plugin do?
An exception is thrown if the batch size and the data size don't match in the text embedding pipeline. I also added an assert check, as below:
assert batch_size <= len(data[0]), \
    f"batch_size ({batch_size}) cannot be greater than data length ({len(data[0])})"
I'm not sure whether it's redundant.
Let's discuss this before making changes.
If I understand correctly, the plugin will crash if we say the batch is 10 but provide 7 inputs.
We can't always control the input data; a dataset can potentially contain chunks of different lengths. I'd suggest not raising an exception, but instead adding logic so that if the real input batch is smaller, wwb duplicates data to pad it to the end.
For example:
batch size = 5
input passages ['a', 'b', 'c']
That is not enough, so we make the input data ['a', 'b', 'c', 'a', 'b'].
@mengweiguo @as-suvorov What do you think about this?
Good point. Currently, if we fix TextEmbeddingPipeline with batch_size=10, the pipeline fails when the number of documents passed != 10. But this is not plugin-related; it's a GenAI implementation limitation, and we plan to fix it in the next release. I like the data-duplication approach.
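A minimal sketch of the data-duplication idea (an assumption about how wwb might pad a short batch, not the final implementation):

from itertools import cycle, islice

def pad_to_batch(passages, batch_size):
    # If the real input is shorter than the fixed batch, repeat passages
    # from the beginning until the batch is full; otherwise leave them as-is.
    if len(passages) >= batch_size:
        return list(passages)
    return list(islice(cycle(passages), batch_size))

print(pad_to_batch(['a', 'b', 'c'], 5))  # ['a', 'b', 'c', 'a', 'b']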
Force-pushed from f99dee5 to f42abcf
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Force-pushed from f42abcf to 7e59fcb
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
Force-pushed from afaef0d to fba9cbf
Pull Request Overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| "If the base/target model is a HuggingFace model ID, gguf-file should be a relative path.", | ||
| ) | ||
|
|
||
| parser.add_argument('-bs', '--batch_size', type=int, default=None, help='Batch size value') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's wrong place for wwb tests. Please test wwb --batch_size in wwb tests.
Force-pushed from fba9cbf to af6a54d
| "--genai", | ||
| ]) | ||
|
|
||
| def test_embeddings_with_batch(model_id, model_type, tmp_path): |
Did you run this test locally?
The model_id and model_type parameters seem to be missing; I expect this test to fail.
Let's parametrize this test with a batch_size parameter instead of a hardcoded value.
Updated, and the tests passed locally. Is it necessary to introduce setup_module/teardown_module so the model conversion is done only once?
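A hedged sketch of the parametrized test shape discussed above; run_wwb and its arguments are placeholders, not the real test_rag.py helpers:

import pytest

def run_wwb(args):
    # Hypothetical stand-in for invoking the wwb CLI with the given arguments;
    # in the real tests this would shell out to wwb and return its exit code.
    return 0

@pytest.mark.parametrize("batch_size", [1, 4])
def test_embeddings_with_batch(tmp_path, batch_size):
    # Parametrize over batch_size instead of hardcoding a single value.
    retcode = run_wwb(["--batch_size", str(batch_size), "--output", str(tmp_path)])
    assert retcode == 0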
'-bs', '--batch_size',
type=int,
default=None,
help='Batch size value')
@sbalandi do we want to propagate batch_size to other types of tasks? I'm wondering whether we need to make this parameter task-specific, like embeds_batch_size or potentially rag_batch_size.
Yes, I would make it more specific.
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Description
The model qwen3-embedding-0.6B failed the wwb test on NPU due to a dynamic batch size. This PR adds batch_size option support for this model in llm-benchmark and wwb.
CVS-176378
Checklist: