Support VLM calibration with image-text data #755

Conversation
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff            @@
##             main     #755    +/-   ##
==========================================
- Coverage   74.66%   73.17%   -1.50%
==========================================
  Files         192      193       +1
  Lines       18975    19352     +377
==========================================
- Hits        14167    14160       -7
- Misses       4808     5192     +384
```
So, we only support image calibration for Nemotron VL? If yes, why?
```diff
 # limitations under the License.

-"""Utility functions for getting samples and forward loop function for different vlm datasets."""
+"""Utility functions for getting samples and dataloader for different VLM calibration datasets.
```
@ajrasane could you review this change?
@Edwardf0t1 do you have experiments evaluating the accuracy impact of using the new dataset?
At this time, only Nemotron VL has been tested. We can extend the logic to support other VLMs later. Note that different VLMs may have different forward functions; e.g., the way the vision encoder interacts with the language decoder can vary across models. Do you have a preferred VL model you'd like us to support next? For instance, Qwen3-VL?
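For illustration, a minimal sketch of the kind of forward loop this implies, assuming the calibration batches already contain model-ready tensors; the names (`full_model`, `calib_dataloader`) mirror the diff, but the body is not the PR's exact code:

```python
import torch

def vlm_forward_loop(full_model, calib_dataloader):
    """Drive calibration through the full VLM so image features flow from the
    vision encoder into the language decoder."""
    with torch.no_grad():
        for batch in calib_dataloader:
            # Nemotron VL consumes pixel_values alongside input_ids; other VLMs
            # (e.g., Qwen3-VL) may expect differently named or shaped vision
            # inputs, which is why per-model handling is needed.
            full_model(**batch)
```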
Tested on two benchmarks, DocVQA and InfoVQA, for Nemotron Nano VL v2 with the vLLM backend:
Image-text calibration is only marginally better in these cases, but the calibration flow in this PR should be ready. The follow-up experiments can be …
```diff
 [PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.

+#### VLM calibration with image-text pairs (e.g., Nemotron VL)
```
Do you feel this can fall into: https://github.com/NVIDIA/Model-Optimizer/tree/main/examples/vlm_ptq?
```sh
    --qformat nvfp4 \
    --export_path <quantized_ckpt_path> \
    --trust_remote_code \
    --calib_with_images \
```
qq: Can the user choose which VLM dataset to use, or do we just provide one option?
```python
calib_dataloader = None
first_text_speech_dataset = None
if model_type == "mllama":
    # ...
if getattr(args, "calib_with_images", False):
```
why use getattr here?
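For context, a small self-contained illustration (not the PR's code) of what `getattr` with a default buys: an `argparse.Namespace` built without the new flag, e.g. by an older entry point, falls back to `False` instead of raising:

```python
from argparse import Namespace

args_old = Namespace(qformat="nvfp4")  # parsed before --calib_with_images existed
args_new = Namespace(qformat="nvfp4", calib_with_images=True)

# Direct attribute access raises on the old namespace:
#   args_old.calib_with_images  -> AttributeError

# getattr with a default degrades gracefully:
assert getattr(args_old, "calib_with_images", False) is False
assert getattr(args_new, "calib_with_images", False) is True
```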
```python
):
    """Auto search quantization of multiple formats."""
# ...
if getattr(args, "calib_with_images", False):
```
same here. And why not just use assert?
```python
)
elif is_nemotron_vl_model and getattr(args, "calib_with_images", False):
    # For Nemotron VL image calibration, we need an AutoProcessor to build multimodal inputs.
    try:
```
do we need this try except?
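For reference, a sketch of the guarded load the diff appears to perform; the fallback behavior is an assumption, and the checkpoint id is only an example:

```python
from transformers import AutoProcessor

ckpt_path = "nvidia/Nemotron-Nano-VL-12B-V2"  # example id; use the real checkpoint path

try:
    # Nemotron VL bundles an image processor with its tokenizer; loading can
    # fail for checkpoints that ship only a tokenizer.
    processor = AutoProcessor.from_pretrained(ckpt_path, trust_remote_code=True)
except Exception as err:
    # Assumption: fall back to text-only calibration rather than aborting.
    print(f"AutoProcessor unavailable ({err}); falling back to text-only calibration.")
    processor = None
```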
```python
tokenizer.padding_side = "left"

# Quantize only the language model, but keep the full_model for calibration forward.
language_model_lineage = get_language_model_from_vl(full_model)
```
please avoid duplicating code with the section below
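For context, a sketch of the split being discussed: quantize only the extracted language model while calibrating through the full VLM. `mtq.quantize` is ModelOpt's real entry point; the helper's return shape is assumed, so it is injected rather than reimplemented:

```python
import modelopt.torch.quantization as mtq

def quantize_language_model_only(full_model, quant_cfg, forward_loop,
                                 get_language_model_from_vl):
    """Quantize just the language decoder of a VLM.

    `get_language_model_from_vl` is the helper named in the diff; its
    implementation is not shown in this excerpt, so it is passed in here.
    """
    lineage = get_language_model_from_vl(full_model)
    language_model = lineage[-1]  # assumption: the lineage ends at the decoder
    # forward_loop should drive the *full* model so vision features reach the
    # quantized layers (see the closure sketch below).
    return mtq.quantize(language_model, quant_cfg, forward_loop=forward_loop)
```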
```python
# Those kwargs must be consumed by the *full* VLM model, not the extracted language_model.
if getattr(args, "calib_with_images", False) and is_nemotron_vl_model:

    def calibrate_full_model(_model):
```
can we make these helper functions and move them out of hf_ptq?
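In that spirit, a factory version of the closure that could live outside hf_ptq.py — a sketch, assuming batches are dicts of tensors:

```python
import torch

def make_full_model_forward_loop(full_model, calib_dataloader, device="cuda"):
    """Build a forward loop that calibrates via the full VLM.

    mtq.quantize() hands the loop the module being quantized (the language
    model), but multimodal kwargs such as pixel_values must be consumed by
    the full VLM, so the closure deliberately ignores its argument.
    """
    def calibrate_full_model(_model):
        with torch.no_grad():
            for batch in calib_dataloader:
                batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v
                         for k, v in batch.items()}
                full_model(**batch)
    return calibrate_full_model
```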
```python
# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# inputs = processor(text=[prompt], images=[pil_image], ...)

def _collate_fn(examples: list[dict[str, Any]]) -> dict[str, torch.Tensor] | dict[str, Any]:
```
why do we need to introduce these when the original one does not?
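For reference, a sketch of what such a collate function might look like, following the commented-out recipe in the diff (chat template, then processor); the dataset field names and checkpoint id are assumptions:

```python
from typing import Any

import torch
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "nvidia/Nemotron-Nano-VL-12B-V2", trust_remote_code=True  # example id
)

def _collate_fn(examples: list[dict[str, Any]]) -> dict[str, torch.Tensor] | dict[str, Any]:
    # Assumption: each example carries a PIL image under "image" and a user
    # prompt under "text".
    images = [ex["image"] for ex in examples]
    prompts = [
        processor.tokenizer.apply_chat_template(
            [{"role": "user", "content": ex["text"]}],
            tokenize=False,
            add_generation_prompt=True,
        )
        for ex in examples
    ]
    # The processor packs text + images into model-ready tensors
    # (input_ids, attention_mask, pixel_values, ...).
    return processor(text=prompts, images=images, padding=True, return_tensors="pt")
```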
jingyu-ml left a comment:
LGTM. I only reviewed the dataset processing part, which behaves as expected, loading the dataset on demand rather than downloading the entire dataset.
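For context, the on-demand behavior described here matches Hugging Face `datasets` streaming — a sketch, with the dataset id assumed:

```python
from itertools import islice

from datasets import load_dataset

# streaming=True fetches samples lazily instead of downloading the full dataset.
ds = load_dataset("nvidia/Nemotron-VLM-Dataset-v2", split="train", streaming=True)
calib_samples = list(islice(ds, 512))  # take only as many samples as calibration needs
```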
What does this PR do?
Type of change: New feature
Overview:
The primary goal of this PR is to allow the model optimizer to use image-text pair data during the calibration phase of quantization, which is likely to help improve the accuracy of quantized VLMs like Nemotron VL, particularly on visual understanding tasks, compared to text-only calibration data.
- Add utilities to load image-text calibration samples from Nemotron-VLM-Dataset-v2, keeping the main quantization script (hf_ptq.py) clean.
- Tested the Nemotron-Nano-VL-12B-V2 model with image data.
- This PR complements #347, and we will consolidate the llm_ptq and vlm_ptq examples in follow-up PRs.
Usage
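A hypothetical invocation, reconstructed from the flags visible in the diff above; the script name and `--pyt_ckpt_path` follow the existing llm_ptq example, and the exact argument set may differ:

```sh
python hf_ptq.py \
    --pyt_ckpt_path <nemotron_vl_ckpt_path> \
    --qformat nvfp4 \
    --export_path <quantized_ckpt_path> \
    --trust_remote_code \
    --calib_with_images
```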