Fix Qwen2.5VL temporal grid positions #45400
zucchini-nlp wants to merge 9 commits into huggingface:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
    grid_thw[2].item() // spatial_merge_size,
)

image_seq_length = llm_grid_h * llm_grid_w * llm_grid_t
```
fix repo from qwen2-vl, here and after this
[For maintainers] Suggested jobs to run (before merge): run-slow: ernie4_5_vl_moe, glm46v, glm4v, glm4v_moe, glm_image, glm_ocr, paddleocr_vl, qwen2_5_vl, qwen2_vl, qwen3_5, qwen3_5_moe, qwen3_vl, qwen3_vl_moe

run-slow: qwen2_vl, qwen2_5_vl, glm4v, qwen3_vl, ernie4_5_vl_moe

This comment contains models: ["models/ernie4_5_vl_moe", "models/glm4v", "models/qwen2_5_vl", "models/qwen2_vl", "models/qwen3_vl"]
```python
def get_rope_index(
    self,
    input_ids: torch.LongTensor,
    mm_token_type_ids: torch.IntTensor,
```
same thing, just a bit shorter and easier to follow. Copied from 'qwen3-vl'
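For intuition, here is a hypothetical torch-free sketch (not the actual transformers implementation; all token values are made up) of the idea behind passing `mm_token_type_ids` into `get_rope_index`: the type ids mark which positions are vision tokens, so text and vision spans can be assigned positions separately.

```python
# Hypothetical sketch: mm_token_type_ids marks vision tokens (1) vs text
# tokens (0), letting rope index construction treat the spans differently.
# Token ids and values below are invented for illustration.
input_ids = [11, 77, 77, 77, 77, 12, 13]   # 77 = placeholder vision token
mm_token_type_ids = [0, 1, 1, 1, 1, 0, 0]  # 1 marks the vision span

# text tokens advance the 1D position counter; vision tokens would instead
# get 3D grid positions computed from grid_thw
text_positions = [i for i, t in enumerate(mm_token_type_ids) if t == 0]
vision_span = [i for i, t in enumerate(mm_token_type_ids) if t == 1]
```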
vasqu left a comment
Imo, looks good. I just have a few remarks to get some details in + let's really check all models please, like glm image as well. No need to be sparse about running tests here.
1 concern: we change one integration test, just wanna make sure this is a proper fix and not just aligning the test with this fix.
```python
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
```
Let's move this comment above; I was thinking it should go directly on the first arange for temporal.
```diff
 )
-position_temporal = torch.full((image_seq_length,), start_position, device=device, dtype=torch.long)
-position_temporal = position_temporal * time_interval
+position_temporal = torch.arange(llm_grid_t, device=device, dtype=torch.long) * time_interval
```
Is temporal the only one with `dtype=long`?
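For intuition, a minimal pure-Python sketch (torch-free, with made-up grid sizes) of why the old `torch.full` version was broken: it gave every vision token the same temporal position, while the fixed `arange` version assigns one position per temporal grid step.

```python
# Made-up example: 3 frames, 12 vision tokens total, interval of 2
llm_grid_t, image_seq_length = 3, 12
time_interval, start_position = 2, 0

# old: a constant per-token value -> all frames share one temporal position
old_temporal = [start_position * time_interval] * image_seq_length

# fixed: one position per temporal grid step, scaled by the time interval
fixed_temporal = [t * time_interval for t in range(llm_grid_t)]
```

With the old form the model cannot distinguish frames at all; the fixed form gives each frame its own temporal coordinate before the later repeat step expands it to every token of that frame.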
```python
# Repeat the positions per each grid and per video frame. Add start position for temporal grid
# Important to add start positions after applying `time_interval`, order matters
position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
```
I remember some device sync stuff under compile: would it make more sense to add the `+ start_position` outside the arange for everyone, i.e. width and height as well (e.g. `position_height = torch.arange(0, llm_grid_h, device=device) + start_position`)?
```python
# Important to add start positions after applying `time_interval`, order matters
position_temporal = position_temporal.repeat_interleave(llm_grid_h * llm_grid_w) + start_position
position_width = position_width.repeat(llm_grid_h * llm_grid_t)
position_height = position_height.repeat_interleave(llm_grid_w).repeat(llm_grid_t)
```
We should note that the repeat patterns are important as well imo
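As a torch-free illustration of why the repeat patterns matter: `repeat_interleave` keeps runs of equal values together while `repeat` tiles the whole sequence, and combining them in this exact order is what lays out the t x h x w grid. The helpers and numbers below are invented for the sketch; they only mimic the corresponding torch calls.

```python
# Made-up grid: 2 frames, 2 rows, 3 columns
llm_grid_t, llm_grid_h, llm_grid_w = 2, 2, 3
time_interval, start_position = 5, 10

def repeat_interleave(seq, n):
    # [a, b] -> [a, a, ..., b, b, ...], like torch.Tensor.repeat_interleave
    return [x for x in seq for _ in range(n)]

def repeat(seq, n):
    # [a, b] -> [a, b, a, b, ...], like torch.Tensor.repeat
    return seq * n

# temporal: one value per frame, expanded to every token of that frame,
# with start_position added only after scaling by time_interval
position_temporal = [t * time_interval for t in range(llm_grid_t)]
position_temporal = [p + start_position
                     for p in repeat_interleave(position_temporal, llm_grid_h * llm_grid_w)]
# width cycles fastest; height cycles per row, then tiles per frame
position_width = repeat(list(range(llm_grid_w)), llm_grid_h * llm_grid_t)
position_height = repeat(repeat_interleave(list(range(llm_grid_h)), llm_grid_w), llm_grid_t)
```

Swapping `repeat` and `repeat_interleave` for any axis would scramble which (t, h, w) coordinate each flattened token receives, even though all three lists keep the same length.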
Not the same as the other ones?
```python
def test_vision_position_ids(self):
    """
    Tests that vision position ids are built correctly for images and for videos.
```
Let's add a reference to the issue.
```python
{
    (None, None): [
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows an indoor tennis court with a person standing on the service line, preparing to serve. The individual is wearing athletic attire, including a white',
        'system\nYou are a helpful assistant.\nuser\nWhat is shown in this video?\nassistant\nThe video shows two individuals playing tennis on an indoor court. The player in the foreground, dressed in a white shirt and black shorts, is preparing to',
```
So this is intentional, was it changed before and we just went along?
What does this PR do?
Fixes #45381, but it is weird; I remember checking position ids by value as well in qwen2.5 to verify that time-interval works 🤔

update: I know why. The integration test we have uses `second_grid_ts = 0.083`, which rounds to `0.0`, so the multiplication is zero no matter what value we get for vision positions. Great!

For most models we didn't see any diff because each frame is separated by a timestamp and is processed separately. Only the first two Qwen releases have bulk processing for all frames at once.
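As a numeric illustration of why that interval hides the bug (truncation on the integer cast is assumed here, mirroring a `.long()` conversion):

```python
# With a tiny per-grid interval, scaled temporal positions all collapse to 0
# once cast to integers, so wrong positions look identical to correct ones.
second_grid_ts = 0.083                              # value from the test
scaled = [t * second_grid_ts for t in range(3)]     # [0.0, 0.083, 0.166]
as_long = [int(p) for p in scaled]                  # integer cast truncates
```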
In any case, worth adding a fast test with expected positions, will do so
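A sketch of what such a fast test could assert, using a hypothetical helper that mirrors the repeat logic discussed above (this is not the actual test added in the PR):

```python
def build_grid_positions(t, h, w, time_interval=1, start_position=0):
    # Hypothetical helper reproducing the temporal/height/width repeat
    # patterns for a single t x h x w vision grid, in pure Python.
    temporal = [start_position + ti * time_interval
                for ti in range(t) for _ in range(h * w)]
    height = [hi for hi in range(h) for _ in range(w)] * t
    width = list(range(w)) * (h * t)
    return temporal, height, width

# a single 1 x 2 x 2 image grid: temporal is constant, h/w enumerate the patch
temporal, height, width = build_grid_positions(1, 2, 2)
```

Checking positions by exact value like this (with a time interval that does not round to zero) is what the current integration test could not do.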