
fix: skip multimodal samples that exceed seq_len instead of truncating#2064

Open
hallerite wants to merge 1 commit into main from fix/vlm-truncation-safety

Conversation


@hallerite commented Mar 21, 2026

Summary

  • prepare_sample() now returns None for multimodal samples that exceed seq_len
  • prepare_batch() skips None samples
  • Text-only samples continue to truncate normally

Problem: Truncating a multimodal sample drops `image_pad` tokens from `input_ids` while `pixel_values` and `image_grid_thw` are passed through unchanged. This causes `ValueError: Image features and image tokens do not match`, or silent training on corrupt data.

Fix: Skip the sample entirely with a warning log. The right long-term fix is to ensure seq_len covers your longest VLM samples.

Related to #2013, which takes a different approach (trim pixel_values to match surviving tokens). Our approach is simpler and model-agnostic: no hardcoded token IDs or merge sizes.
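The skip behavior described above can be sketched as follows. This is a minimal illustration only: the rollout is assumed to be a dict, and field access details are assumptions based on this PR description, not the repo's actual code.

```python
import logging

logger = logging.getLogger(__name__)

def prepare_sample(rollout, seq_len):
    """Sketch: skip overlong multimodal samples, truncate overlong text-only ones."""
    input_ids = rollout["input_ids"]
    is_multimodal = rollout.get("pixel_values") is not None

    if len(input_ids) > seq_len:
        if is_multimodal:
            # Truncating would drop image_pad tokens from input_ids while
            # pixel_values stays full-size, desyncing image features from
            # image tokens. Skip the sample entirely instead.
            logger.warning(
                "Skipping multimodal sample: length %d > seq_len %d",
                len(input_ids), seq_len,
            )
            return None
        # Text-only samples truncate as before.
        input_ids = input_ids[:seq_len]

    return {**rollout, "input_ids": input_ids}
```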

Test plan

  • test_prepare_sample_skips_multimodal_exceeding_seq_len — multimodal > seq_len returns None
  • test_prepare_sample_keeps_multimodal_within_seq_len — multimodal <= seq_len works normally
  • test_prepare_sample_still_truncates_text_only — text-only truncation unchanged
  • All existing batch/trajectory tests pass
  • RL color-codeword integration test (3 steps, Qwen3-VL-4B)
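The first and third unit tests above might look roughly like this. The stand-in `prepare_sample` and the dict-based rollout are assumptions for illustration; the real implementation and rollout type live in the PR's diff.

```python
# Stand-in for this PR's prepare_sample: returns None for overlong
# multimodal samples, truncates overlong text-only ones.
def prepare_sample(rollout, seq_len):
    if len(rollout["input_ids"]) > seq_len:
        if rollout.get("pixel_values") is not None:
            return None
        rollout = {**rollout, "input_ids": rollout["input_ids"][:seq_len]}
    return rollout

def test_prepare_sample_skips_multimodal_exceeding_seq_len():
    rollout = {
        "input_ids": list(range(16)),  # longer than seq_len below
        "pixel_values": object(),      # presence marks the sample multimodal
    }
    assert prepare_sample(rollout, seq_len=8) is None

def test_prepare_sample_still_truncates_text_only():
    rollout = {"input_ids": list(range(16)), "pixel_values": None}
    out = prepare_sample(rollout, seq_len=8)
    assert len(out["input_ids"]) == 8
```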

🤖 Generated with Claude Code


Note

Medium Risk
Changes batching behavior for VLM training by dropping overlong multimodal samples, which can affect training dynamics and effective batch size if seq_len is misconfigured. Logic is localized but sits on the training data path, so regressions could surface as reduced utilization or unexpected sample loss.

Overview
Prevents corrupt VLM training data by skipping multimodal samples whose tokenized length exceeds seq_len instead of truncating them (which can desync image tokens from pixel_values).

prepare_sample now returns None for these cases and prepare_batch filters them out while emitting warning logs; text-only samples continue to truncate normally. Documentation and unit tests were updated to reflect and enforce the new skip behavior.

Written by Cursor Bugbot for commit b314a39.

@hallerite force-pushed the fix/vlm-truncation-safety branch from d214b39 to 9c4ab23 on March 21, 2026, 22:33
@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


The flagged excerpt from `prepare_batch` (the assignment target is reconstructed for readability and its name is an assumption):

```python
prepared = [
    (idx, sample)
    for idx, rollout in zip(idxs, rollouts)
    if (sample := prepare_sample(rollout, seq_len)) is not None
]
```


Skipping all samples causes empty batch crash

Medium Severity

When every sample in a batch is multimodal and exceeds seq_len, prepare_batch now filters them all out, producing empty micro-batch lists per worker. The training loop then crashes with an IndexError at micro_batches[0]["input_ids"]. Before this change, truncation always produced at least one micro batch, so this is a new crash path. The SinglePacker path has no upstream length validation, making it reachable in practice.
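One way to close this crash path is to fail fast with a clear message when filtering empties the batch. This is a hedged sketch, not the repo's code; `prepare_sample` here is a stand-in and all names are assumptions.

```python
# Stand-in for this PR's prepare_sample: skips overlong multimodal
# samples, truncates overlong text-only ones.
def prepare_sample(rollout, seq_len):
    if len(rollout["input_ids"]) > seq_len:
        if rollout.get("pixel_values") is not None:
            return None
        rollout = {**rollout, "input_ids": rollout["input_ids"][:seq_len]}
    return rollout

def prepare_batch(rollouts, idxs, seq_len):
    prepared = [
        (idx, sample)
        for idx, rollout in zip(idxs, rollouts)
        if (sample := prepare_sample(rollout, seq_len)) is not None
    ]
    if not prepared:
        # Fail loudly here instead of letting the training loop crash
        # later with IndexError on micro_batches[0]["input_ids"].
        raise ValueError(
            f"All {len(rollouts)} samples exceeded seq_len={seq_len}; "
            "raise seq_len to cover your longest VLM samples."
        )
    return prepared
```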


@hallerite force-pushed the fix/vlm-truncation-safety branch from 9c4ab23 to 8ab17e4 on March 21, 2026, 23:44
Truncating a multimodal sample breaks the alignment between image_pad
tokens in input_ids and the pixel_values tensor. Instead, skip such
samples with a warning. Text-only samples continue to truncate normally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hallerite force-pushed the fix/vlm-truncation-safety branch from 8ab17e4 to b314a39 on March 22, 2026, 00:27
