Qwen2.5-VL-7B-Instruct Accuracy Regression Still Persists in v4.56.2 #41180

@rahul-tuli

Description

Summary

Despite issue #40136 being marked as resolved, the significant accuracy regression in the Qwen2.5-VL-7B-Instruct model persists in the latest Transformers version, 4.56.2. Our testing shows a ~25% relative accuracy drop on the MMMU Literature benchmark, a larger drop than the one reported in the original issue.

Problem Description

The Qwen2.5-VL-7B-Instruct model shows inconsistent and degraded performance on multimodal evaluation benchmarks when using recent Transformers versions (4.54.0+), despite PR #40490, which was intended to fix this issue.

Observed Results

| Transformers Version | MMMU Literature Accuracy | Relative Change |
|---------------------|--------------------------|-----------------|
| 4.53.3              | 93.33% ± 4.63%          | Baseline        |
| 4.56.2              | 70.00% ± 8.51%          | -25.0% relative |
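For clarity, the "Relative Change" column above is the fractional drop from the 4.53.3 baseline. A quick sketch of the arithmetic, using the accuracies from the table:

```python
# Accuracies taken from the table above.
baseline = 0.9333   # Transformers 4.53.3
regressed = 0.7000  # Transformers 4.56.2

relative_change = (regressed - baseline) / baseline
print(f"{relative_change:+.1%}")  # ≈ -25.0% relative
```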

Reproduction Steps

Environment Setup

uv pip install lm-eval torch torchvision accelerate Pillow transformers==4.56.2

Evaluation Command

lm_eval \
    --model hf-multimodal \
    --model_args "pretrained=Qwen/Qwen2.5-VL-7B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True" \
    --tasks mmmu_val_literature \
    --num_fewshot 0 \
    --batch_size 8 \
    --verbosity INFO

Impact

This regression affects:

  • Production systems using Qwen2.5-VL for visual question answering
  • Research benchmarks and evaluations
  • Any multimodal applications relying on Qwen2.5-VL models

The ~25% relative accuracy drop represents a significant degradation that makes the model substantially less reliable for downstream applications.

Expected Behavior

The model should maintain consistent accuracy across Transformers versions, as observed with v4.53.3 (93.33% accuracy).

Actual Behavior

The model shows degraded performance in v4.56.2 (70.00% accuracy), indicating the original issue was not fully resolved.

Additional Context

Request

Please reopen the investigation into this regression, as the issue appears to persist despite the reported fix. The consistent ~25% relative accuracy drop indicates a systematic issue that needs addressing.


Test Environment Details:

  • Python 3.12
  • CUDA-enabled environment
  • Multiple test runs with identical configurations
  • Fixed random seeds for reproducibility

Raw Test Logs

Transformers 4.56.2 Results

hf-multimodal (pretrained=Qwen/Qwen2.5-VL-7B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8

|  Tasks   |Version|Filter|n-shot|Metric|   |Value|   |Stderr|
|----------|------:|------|-----:|------|---|----:|---|-----:|
|Literature|      0|none  |     0|acc   |↑  |  0.7|±  |0.0851|

Transformers 4.53.3 Results

hf-multimodal (pretrained=Qwen/Qwen2.5-VL-7B-Instruct,dtype=bfloat16,add_bos_token=True,convert_img_format=True), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 8

|  Tasks   |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|----------|------:|------|-----:|------|---|-----:|---|-----:|
|Literature|      0|none  |     0|acc   |↑  |0.9333|±  |0.0463|
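As a rough sanity check that the gap exceeds run-to-run noise, a back-of-envelope two-sample z-statistic can be computed from the accuracies and stderrs in the two log tables above (assuming the two runs are independent; this is a sketch, not a rigorous significance test given the small MMMU Literature split):

```python
import math

# Accuracy and stderr values from the two log tables above.
acc_new, se_new = 0.7000, 0.0851   # transformers 4.56.2
acc_old, se_old = 0.9333, 0.0463   # transformers 4.53.3

diff = acc_old - acc_new
se_diff = math.sqrt(se_new**2 + se_old**2)  # combined standard error
z = diff / se_diff
print(f"z ≈ {z:.2f}")  # roughly 2.4, i.e. beyond the reported error bars
```

The point is only that the observed 23-point absolute drop is larger than the combined error bars suggest by chance.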

System Info

  • transformers version: 4.56.2
  • Platform: Linux-5.14.0-611.el9.x86_64-x86_64-with-glibc2.34
  • Python version: 3.12.11
  • Huggingface_hub version: 0.35.1
  • Safetensors version: 0.6.2
  • Accelerate version: 1.10.1
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.8.0+cu128 (CUDA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: No
  • Using GPU in script?: No
  • GPU type: NVIDIA H100 80GB HBM3
