## Summary

Despite issue #40136 being marked as resolved, the significant accuracy regression in the Qwen2.5-VL-7B-Instruct model persists in the latest Transformers version, 4.56.2. Our testing shows a ~25% relative accuracy drop on the MMMU Literature benchmark, consistent with the regression reported in the original issue.
## Problem Description

The Qwen2.5-VL-7B-Instruct model shows inconsistent and degraded performance on multimodal evaluation benchmarks when using recent Transformers versions (4.54.0+), despite PR #40490 claiming to fix this issue.
### Observed Results
| Transformers Version | MMMU Literature Accuracy | Relative Change |
|---------------------|--------------------------|-----------------|
| 4.53.3 | 93.33% ± 4.63% | Baseline |
| 4.56.2 | 70.00% ± 8.51% | -25.0% relative |
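The relative change reported in the table can be verified with a few lines of arithmetic (accuracy values taken directly from the table above):

```python
# Sanity-check the relative accuracy drop between the two Transformers versions.
baseline = 0.9333   # MMMU Literature accuracy with transformers 4.53.3
regressed = 0.7000  # MMMU Literature accuracy with transformers 4.56.2

# Relative drop = absolute drop divided by the baseline accuracy.
relative_drop = (baseline - regressed) / baseline
print(f"{relative_drop:.1%}")  # -> 25.0%
```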
## Reproduction Steps

### Environment Setup

```shell
uv pip install lm-eval torch torchvision accelerate Pillow transformers==4.56.2
```

### Evaluation Command

```shell
lm_eval \
  --model hf-multimodal \
  --model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True" \
  --tasks mmmu_val_literature \
  --num_fewshot 0 \
  --batch_size 8 \
  --verbosity INFO
```
## Impact
This regression affects:
- Production systems using Qwen2.5-VL for visual question answering
- Research benchmarks and evaluations
- Any multimodal applications relying on Qwen2.5-VL models
The ~25% relative accuracy drop represents a significant degradation that makes the model substantially less reliable for downstream applications.
## Expected Behavior
The model should maintain consistent accuracy across Transformers versions, as observed with v4.53.3 (93.33% accuracy).
## Actual Behavior
The model shows degraded performance in v4.56.2 (70.00% accuracy), indicating the original issue was not fully resolved.
## Additional Context

- Original Issue: #40136 "Qwen2.5-VL-7B-Instruct: Significant accuracy regression on MMMU benchmark with transformers >=4.54.0" (marked as resolved)
- Claimed Fix: PR #40490 "[qwen-vl] fix position ids"
- Hardware: NVIDIA H100 GPU with CUDA support
- Framework: lm-eval with hf-multimodal backend
- Model: Qwen/Qwen2.5-VL-7B-Instruct
- Task: MMMU Literature benchmark
- Evaluation Consistency: Fixed random seeds used across all tests
## Request
Please reopen investigation into this regression as the issue appears to persist despite the claimed resolution. The consistent ~25% accuracy drop indicates a systematic issue that needs addressing.
Test Environment Details:
- Python 3.12
- CUDA-enabled environment
- Multiple test runs with identical configurations
- Fixed random seeds for reproducibility
## Raw Test Logs

### Transformers 4.56.2 Results

hf-multimodal (pretrained=Qwen/Qwen2.5-VL-7B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
| Tasks |Version|Filter|n-shot|Metric| |Value| |Stderr|
|----------|------:|------|-----:|------|---|----:|---|-----:|
|Literature| 0|none | 0|acc |↑ | 0.7|± |0.0851|
### Transformers 4.53.3 Results
hf-multimodal (pretrained=Qwen/Qwen2.5-VL-7B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8
| Tasks |Version|Filter|n-shot|Metric| |Value | |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|Literature| 0|none | 0|acc |↑ |0.9333|± |0.0463|
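As a rough significance check (not part of the original logs, and assuming the two runs are independent with approximately normal errors), the observed drop can be compared against the combined standard errors from the two tables above:

```python
import math

# Accuracy and stderr values taken from the two result tables above.
acc_old, se_old = 0.9333, 0.0463  # transformers 4.53.3
acc_new, se_new = 0.70, 0.0851    # transformers 4.56.2

diff = acc_old - acc_new                    # absolute accuracy drop
se_diff = math.sqrt(se_old**2 + se_new**2)  # combined stderr, assuming independent runs
z = diff / se_diff                          # drop measured in combined-stderr units
print(f"drop = {diff:.4f}, combined stderr = {se_diff:.4f}, z = {z:.2f}")
```

With these numbers the drop is roughly 2.4 combined standard errors, which supports treating it as a real regression rather than benchmark noise, though the small sample size (30 Literature questions) warrants caution.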
## System Info

- transformers version: 4.56.2
- Platform: Linux-5.14.0-611.el9.x86_64-x86_64-with-glibc2.34
- Python version: 3.12.11
- Huggingface_hub version: 0.35.1
- Safetensors version: 0.6.2
- Accelerate version: 1.10.1
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: No
- Using GPU in script?: No
- GPU type: NVIDIA H100 80GB HBM3