
Fix Qwen2.5VL temporal grid positions #45400

Open
zucchini-nlp wants to merge 9 commits into huggingface:main from zucchini-nlp:qwen-time-positions

Conversation

@zucchini-nlp (Member) commented Apr 13, 2026

What does this PR do?

Fixes #45381, but it is weird: I remember checking the position ids by value in qwen2.5 as well, to verify that the time interval works 🤔

Update: I know why now. The integration test we have uses second_grid_its = 0.083, which rounds down to 0.0, so the multiplication is zero no matter what value we get for the vision positions. Great!

For most models we didn't see any diff because each frame is separated by a timestamp and is processed separately. Only the first two Qwen releases process all frames in bulk at once.

In any case, it is worth adding a fast test with expected positions; will do so.
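To make the rounding point concrete, here is a minimal plain-Python sketch of why that integration test could not catch the bug. The tokens_per_second value and the temporal grid size are illustrative assumptions, not the actual config values:

```python
# Sketch: the per-frame time interval truncates to zero, so temporal positions
# come out as zero regardless of what the vision positions compute upstream.
second_per_grid_ts = 0.083   # value used by the integration test
tokens_per_second = 2        # illustrative assumption, not the real config value
llm_grid_t = 4               # hypothetical temporal grid size

time_interval = second_per_grid_ts * tokens_per_second   # 0.166

# Mimic a .long() cast of frame_index * time_interval:
positions = [int(i * time_interval) for i in range(llm_grid_t)]
print(positions)  # [0, 0, 0, 0] -> every frame collapses to position 0
```

With every temporal position truncating to 0, a test that only checks these values can never distinguish a correct implementation from a broken one.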

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp added the "for patch" label (tag issues / labels that should be included in the next patch) on Apr 13, 2026
grid_thw[2].item() // spatial_merge_size,
)

image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
Member (Author) commented:
fix repo from qwen2-vl, here and after this
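For context, a hedged sketch of the grid arithmetic in the snippet above. The grid_thw values and spatial_merge_size = 2 are made up for illustration; the variable names follow the snippet:

```python
# How the merged grid sizes and the placeholder sequence length are derived
# from one (t, h, w) patch-grid entry; all values here are hypothetical.
spatial_merge_size = 2
grid_thw = (4, 12, 16)  # hypothetical (temporal, height, width) patch counts

llm_grid_t = grid_thw[0]
llm_grid_h = grid_thw[1] // spatial_merge_size
llm_grid_w = grid_thw[2] // spatial_merge_size

# One placeholder token per merged patch:
image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
print(image_seq_length)  # 6 * 8 * 4 = 192
```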

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

@zucchini-nlp (Member, Author) commented:

run-slow: qwen2_vl, qwen2_5_vl, glm4v, qwen3_vl, ernie4_5_vl_moe

@github-actions
Contributor

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl"]
quantizations: []

@zucchini-nlp changed the title from "Qwen2.5VL temporal grid postions" to "Fix Qwen2.5VL temporal grid positions" on Apr 13, 2026
@github-actions
Contributor

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 02a2c776 workflow commit (merge commit)
PR 7039d95c branch commit (from PR)
main def8e6a2 base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@zucchini-nlp zucchini-nlp requested a review from vasqu April 13, 2026 12:40
Comment on lines 880 to -884
def get_rope_index(
self,
input_ids: torch.LongTensor,
mm_token_type_ids: torch.IntTensor,
Member (Author) commented:

Same thing, just a bit shorter and easier to follow. Copied from qwen3-vl.

@vasqu (Contributor) left a review:

Imo this looks good. I just have a few remarks to get some details in. And let's really check all models, please, glm image included; no need to be sparse about running tests here.

One concern: we change one integration test. I just want to make sure this is a proper fix, and that we are not merely aligning the test with this fix.

Comment on lines +981 to +982
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
Contributor commented:

Let's move this comment above; I was thinking it should sit directly on the first arange for the temporal positions.

)
-position_temporal = torch.full((image_seq_length,), start_position, device=device, dtype=torch.long)
-position_temporal = position_temporal * time_interval
+position_temporal = torch.arange(llm_grid_t, device=device, dtype=torch.long) * time_interval
Contributor commented:

Is temporal the only one with dtype=torch.long?
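To illustrate the behavioural change in the hunk above, here is a plain-Python comparison. The real code uses torch tensors; the sizes and values below are made up:

```python
# Old behaviour: every vision token got the same start_position, later scaled
# by time_interval, so it was constant across frames. New behaviour: an arange
# over the temporal grid scaled by time_interval, so positions advance per frame.
llm_grid_t, start_position, time_interval = 3, 5, 2
image_seq_length = llm_grid_t * 4   # pretend 4 spatial tokens per frame

old = [start_position * time_interval] * image_seq_length  # constant everywhere
new = [t * time_interval for t in range(llm_grid_t)]       # one entry per frame

print(old[:4])  # [10, 10, 10, 10]
print(new)      # [0, 2, 4]
```

The old version never encoded which frame a token came from; the new per-frame arange does, before start_position is added on top.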


# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
Contributor commented:

I remember some device-sync issues under compile: would it make more sense to apply the + start_position outside the arange for everyone, i.e. for width and height as well (e.g. position_height = torch.arange(0, llm_grid_h, device=device) + start_position)?

# Important to add start positions after applying `time_interval`, order matters
position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
position_width = position_width.repeat(llm_grid_h * llm_grid_t)
position_height = position_height.repeat_interleave(llm_grid_w).repeat(llm_grid_t)
Contributor commented:

Imo we should note that the repeat patterns are important as well.
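To make those repeat patterns concrete, a plain-Python sketch of the ordering they produce, with list-based stand-ins for repeat / repeat_interleave and made-up grid sizes:

```python
# Width must cycle fastest, height next, temporal slowest, so that token k maps
# to (t, h, w) in row-major order over the grid.
llm_grid_t, llm_grid_h, llm_grid_w = 2, 2, 3

def repeat_interleave(xs, n):   # [a, b] -> [a, a, b, b] for n=2
    return [x for x in xs for _ in range(n)]

def repeat(xs, n):              # [a, b] -> [a, b, a, b] for n=2
    return xs * n

pos_w = repeat(list(range(llm_grid_w)), llm_grid_h * llm_grid_t)
pos_h = repeat(repeat_interleave(list(range(llm_grid_h)), llm_grid_w), llm_grid_t)
pos_t = repeat_interleave(list(range(llm_grid_t)), llm_grid_h * llm_grid_w)

# First frame: w cycles 0,1,2 twice; h is 0,0,0,1,1,1; t stays 0.
print(pos_t[:6], pos_h[:6], pos_w[:6])
```

Swapping repeat and repeat_interleave on any axis silently scrambles which position triple each token receives, which is why the pattern deserves a comment in the code.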

Contributor commented:

Not the same as the other ones?


def test_vision_position_ids(self):
"""
Tests that vision position ids are built correctly for images and for videos.
Contributor commented:

Let's add a reference to the issue.

{
(None, None): [
'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows an indoor tennis court with a person standing on the service line, preparing to serve. The individual is wearing athletic attire, including a white',
'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows two individuals playing tennis on an indoor court. The player in the foreground, dressed in a white shirt and black shorts, is preparing to',
Contributor commented:

So this is intentional? Was it changed before and we just went along with it?


Labels

for patch: Tag issues / labels that should be included in the next patch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

transformers==5.3.0, qwen2.5-vl video input vision_position_ids seems to be wrong

3 participants