Fix Qwen2.5VL temporal grid positions #45400
zucchini-nlp wants to merge 9 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
    grid_thw[2].item() // spatial_merge_size,
)

image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
```
fix repo from qwen2-vl, here and after this
[For maintainers] Suggested jobs to run (before merge): run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

run-slow: qwen2_vl, qwen2_5_vl, glm4v, qwen3_vl, ernie4_5_vl_moe

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl"]
```python
def get_rope_index(
    self,
    input_ids: torch.LongTensor,
    mm_token_type_ids: torch.IntTensor,
```
same thing, just a bit shorter and easier to follow. Copied from 'qwen3-vl'
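For intuition, here is a hypothetical torch-free sketch (not the actual transformers implementation; all token values are made up) of the idea behind passing `mm_token_type_ids` into `get_rope_index`: the type ids mark which positions are vision tokens, so text and vision spans can be assigned positions separately.

```python
# Hypothetical sketch: mm_token_type_ids marks vision tokens (1) vs text
# tokens (0), letting rope index construction treat the spans differently.
# Token ids and values below are invented for illustration.
input_ids = [11, 77, 77, 77, 77, 12, 13]   # 77 = placeholder vision token
mm_token_type_ids = [0, 1, 1, 1, 1, 0, 0]  # 1 marks the vision span

# text tokens advance the 1D position counter; vision tokens would instead
# get 3D grid positions computed from grid_thw
text_positions = [i for i, t in enumerate(mm_token_type_ids) if t == 0]
vision_span = [i for i, t in enumerate(mm_token_type_ids) if t == 1]
```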
vasqu left a comment
Imo, looks good. I just have a few remarks to get some details in + let's really check all models please, like glm image as well. No need to be sparse about running tests here.
1 concern: we change one integration test, just wanna make sure this is a proper fix and not just aligning the test with this fix.
```python
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
```
Let's move this comment above; I was thinking it should go directly on the first arange for temporal.
```diff
 )
-position_temporal = torch.full((image_seq_length,), start_position, device=device, dtype=torch.long)
-position_temporal = position_temporal * time_interval
+position_temporal = torch.arange(llm_grid_t, device=device, dtype=torch.long) * time_interval
```
Is temporal the only one with `dtype=long`?
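For intuition, a minimal pure-Python sketch (torch-free, with made-up grid sizes) of why the old `torch.full` version was broken: it gave every vision token the same temporal position, while the fixed `arange` version assigns one position per temporal grid step.

```python
# Made-up example: 3 frames, 12 vision tokens total, interval of 2
llm_grid_t, image_seq_length = 3, 12
time_interval, start_position = 2, 0

# old: a constant per-token value -> all frames share one temporal position
old_temporal = [start_position * time_interval] * image_seq_length

# fixed: one position per temporal grid step, scaled by the time interval
fixed_temporal = [t * time_interval for t in range(llm_grid_t)]
```

With the old form the model cannot distinguish frames at all; the fixed form gives each frame its own temporal coordinate before the later repeat step expands it to every token of that frame.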
```python
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
```
I remember some device sync stuff under compile: would it make more sense to add the `+ start_position` outside the arange for everyone, i.e. width and height as well (e.g. `position_height = torch.arange(0, llm_grid_h, device=device) + start_position`)?
```python
# Important to add start positions after applying `time_interval`, order matters
position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
position_width = position_width.repeat(llm_grid_h * llm_grid_t)
position_height = position_height.repeat_interleave(llm_grid_w).repeat(llm_grid_t)
```
We should note that the repeat patterns are important as well imo
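As a torch-free illustration of why the repeat patterns matter: `repeat_interleave` keeps runs of equal values together while `repeat` tiles the whole sequence, and combining them in this exact order is what lays out the t x h x w grid. The helpers and numbers below are invented for the sketch; they only mimic the corresponding torch calls.

```python
# Made-up grid: 2 frames, 2 rows, 3 columns
llm_grid_t, llm_grid_h, llm_grid_w = 2, 2, 3
time_interval, start_position = 5, 10

def repeat_interleave(seq, n):
    # [a, b] -> [a, a, ..., b, b, ...], like torch.Tensor.repeat_interleave
    return [x for x in seq for _ in range(n)]

def repeat(seq, n):
    # [a, b] -> [a, b, a, b, ...], like torch.Tensor.repeat
    return seq * n

# temporal: one value per frame, expanded to every token of that frame,
# with start_position added only after scaling by time_interval
position_temporal = [t * time_interval for t in range(llm_grid_t)]
position_temporal = [p + start_position
                     for p in repeat_interleave(position_temporal, llm_grid_h * llm_grid_w)]
# width cycles fastest; height cycles per row, then tiles per frame
position_width = repeat(list(range(llm_grid_w)), llm_grid_h * llm_grid_t)
position_height = repeat(repeat_interleave(list(range(llm_grid_h)), llm_grid_w), llm_grid_t)
```

Swapping `repeat` and `repeat_interleave` for any axis would scramble which (t, h, w) coordinate each flattened token receives, even though all three lists keep the same length.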
Not the same as the other ones?
```python
def test_vision_position_ids(self):
    """
    Tests that vision position ids are built correctly for images and for videos.
```
Let's add a reference to the issue.
```python
{
    (None, None): [
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows an indoor tennis court with a person standing on the service line, preparing to serve. The individual is wearing athletic attire, including a white',
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows two individuals playing tennis on an indoor court. The player in the foreground, dressed in a white shirt and black shorts, is preparing to',
```
So this is intentional, was it changed before and we just went along?
What does this PR do?
Fixes #45381, but it is weird; I remember checking position ids by value as well in qwen2.5 to verify that time-interval works 🤔

update: I know why. The integration test we have uses `second_grid_ts = 0.083`, which rounds to `0.0`, so the multiplication is zero no matter what value we get for vision positions. Great!

For most models we didn't see any diff because each frame is separated by a timestamp and is processed separately. Only the first two Qwen releases have bulk processing for all frames at once.
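As a numeric illustration of why that interval hides the bug (truncation on the integer cast is assumed here, mirroring a `.long()` conversion):

```python
# With a tiny per-grid interval, scaled temporal positions all collapse to 0
# once cast to integers, so wrong positions look identical to correct ones.
second_grid_ts = 0.083                              # value from the test
scaled = [t * second_grid_ts for t in range(3)]     # [0.0, 0.083, 0.166]
as_long = [int(p) for p in scaled]                  # integer cast truncates
```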
In any case, worth adding a fast test with expected positions, will do so
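A sketch of what such a fast test could assert, using a hypothetical helper that mirrors the repeat logic discussed above (this is not the actual test added in the PR):

```python
def build_grid_positions(t, h, w, time_interval=1, start_position=0):
    # Hypothetical helper reproducing the temporal/height/width repeat
    # patterns for a single t x h x w vision grid, in pure Python.
    temporal = [start_position + ti * time_interval
                for ti in range(t) for _ in range(h * w)]
    height = [hi for hi in range(h) for _ in range(w)] * t
    width = list(range(w)) * (h * t)
    return temporal, height, width

# a single 1 x 2 x 2 image grid: temporal is constant, h/w enumerate the patch
temporal, height, width = build_grid_positions(1, 2, 2)
```

Checking positions by exact value like this (with a time interval that does not round to zero) is what the current integration test could not do.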