Multi image support for GRPO/RLOO #4113
base: drop-image_split_sizes
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
)
trainer = GRPOTrainer(
    model=model_id,
    reward_funcs="trl-internal-testing/tiny-Qwen2ForSequenceClassification-2.5",
we don't support visual reward models, so it doesn't really make sense to test this case, where the image is dropped and a warning is raised.
# VLM reward models aren't supported yet, so we drop the image and raise a warning if needed
for prompt in prompts:
    for turn in prompt:
        if isinstance(turn["content"], list):
            logger.warning_once("Visual reward models aren't supported yet; dropping image.")
            turn["content"] = " ".join(
                e["text"] for e in turn["content"] if e["type"] == "text"
            )
from
[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What color is the sky?"}]}]
to
[{"role": "user", "content": "What color is the sky?"}]
plus raise a warning.
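For reference, a self-contained sketch of that flattening on the example above (a plain logging logger stands in for TRL's internal logger, which has warning_once):

import logging

logger = logging.getLogger(__name__)

prompts = [[{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What color is the sky?"}]}]]

for prompt in prompts:
    for turn in prompt:
        if isinstance(turn["content"], list):
            logger.warning("Visual reward models aren't supported yet; dropping image.")  # warning_once in TRL
            turn["content"] = " ".join(e["text"] for e in turn["content"] if e["type"] == "text")

print(prompts)  # [[{'role': 'user', 'content': 'What color is the sky?'}]]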
# We don't yet support visual reward models/functions, so we keep a copy of the original text-only prompts for
# later use in the reward computation. If images are present, we insert {"type": "image"} as required by the
# VLM chat template.
original_prompts = copy.deepcopy(prompts)
instead of keeping the original prompt, we just drop the image later and raise a warning; see https://github.com/huggingface/trl/pull/4113/files#r2364899902
# important because rewards will be normalized per group, and completions are distributed. We will later slice
# rewards_per_func to extract each process's subset.
- rewards_per_func = self._calculate_rewards(inputs, original_prompts, completions, completion_ids_list)
+ rewards_per_func = self._calculate_rewards(inputs, prompts, completions, completion_ids_list)
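The gather matters because advantages are normalized per group of num_generations completions, and a group may span processes; each process then slices out its own rows. A toy illustration of that normalization (values and group size made up):

import torch

num_generations = 4
rewards = torch.tensor([1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 10.0, 10.0])  # gathered from all processes

# Normalize within each group of num_generations completions
mean_grouped = rewards.view(-1, num_generations).mean(dim=1).repeat_interleave(num_generations)
std_grouped = rewards.view(-1, num_generations).std(dim=1).repeat_interleave(num_generations)
advantages = (rewards - mean_grouped) / (std_grouped + 1e-4)

process_slice = slice(0, 4)  # each process then keeps only its own subset
print(advantages[process_slice])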
if self._logs["images"]:
    table["images"] = []
    for image_list in self._logs["images"]:
        # Convert images to wandb Image objects for proper visualization
        table["images"].append([wandb.Image(image) for image in image_list])
boundaries = [0, *accumulate(batch["num_images"])]  # [3, 4, 5] -> [0, 3, 7, 12]
sections = [sum(lengths[boundaries[i] : boundaries[i + 1]]) for i in range(len(batch["num_images"]))]
split_values = list(torch.split(batch["pixel_values"], sections, dim=0))
image_grid_thw = list(torch.split(batch["image_grid_thw"], batch["num_images"], dim=0))
return {**batch, "pixel_values": split_values, "image_grid_thw": image_grid_thw}
instead of keeping image_grid_thw as is, we need to split it depending on the number of images. It gets concatenated later in _get_per_token_logps_and_entropies (see line 807)
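A toy run of that splitting logic, with made-up shapes (two samples with 2 and 1 images; lengths is assumed to be the per-image patch count, i.e. image_grid_thw.prod(-1)):

from itertools import accumulate

import torch

batch = {
    "num_images": [2, 1],                                                # two samples, with 2 and 1 images
    "image_grid_thw": torch.tensor([[1, 2, 2], [1, 2, 4], [1, 4, 4]]),  # one (t, h, w) grid per image
    "pixel_values": torch.randn(28, 8),                                  # 4 + 8 + 16 = 28 patch rows
}
lengths = batch["image_grid_thw"].prod(-1).tolist()  # [4, 8, 16] patches per image

boundaries = [0, *accumulate(batch["num_images"])]  # [2, 1] -> [0, 2, 3]
sections = [sum(lengths[boundaries[i] : boundaries[i + 1]]) for i in range(len(batch["num_images"]))]
split_values = list(torch.split(batch["pixel_values"], sections, dim=0))
image_grid_thw = list(torch.split(batch["image_grid_thw"], batch["num_images"], dim=0))

print([v.shape for v in split_values])    # [torch.Size([12, 8]), torch.Size([16, 8])]
print([g.shape for g in image_grid_thw])  # [torch.Size([2, 3]), torch.Size([1, 3])]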
trl/trainer/grpo_trainer.py (outdated)
model_inputs["image_grid_thw"] = torch.cat(image_grid_thw[start : start + batch_size])
start_pixel_idx = 0 if start == 0 else torch.cat(image_grid_thw[:start]).prod(-1).sum().item()
end_pixel_idx = torch.cat(image_grid_thw[: start + batch_size]).prod(-1).sum().item()
See https://github.com/huggingface/trl/pull/4113/files#r2364904060; image_grid_thw is not a tensor anymore but a list of tensors.
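Continuing the toy shapes above: since grid.prod(-1) is the patch count per image, concatenating the per-sample grids up to a given sample and summing yields the pixel_values row offsets for a micro-batch (values below are illustrative):

import torch

# Per-sample grids, as produced by the split above
image_grid_thw = [torch.tensor([[1, 2, 2], [1, 2, 4]]), torch.tensor([[1, 4, 4]])]
start, batch_size = 1, 1  # micro-batch covering only the second sample

start_pixel_idx = 0 if start == 0 else torch.cat(image_grid_thw[:start]).prod(-1).sum().item()
end_pixel_idx = torch.cat(image_grid_thw[: start + batch_size]).prod(-1).sum().item()
print(start_pixel_idx, end_pixel_idx)  # 12 28 -> rows 12..28 of pixel_values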
LGTM, with a question about whether raising an error vs. a warning is best when images + text are being passed to the reward function.
# Because of the way the tiny models are initialized, the gradient does not flow properly through the
# vision parts of the model, so we skip them. Ideally, we should fix the init of these models.
params_to_skip = (
    # "model.vision_tower.",
These are commented out - restore?
self.assertIsNotNone(trainer.state.log_history[-1]["train_loss"])

for n, param in previous_trainable_params.items():
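The quoted hunk ends mid-loop; the standard TRL pattern for this check, written out here as an assumption (not quoted from the PR), skips the params_to_skip prefixes and asserts the rest changed:

# Assumed continuation, following TRL's usual test pattern
for n, param in previous_trainable_params.items():
    if n.startswith(params_to_skip):
        continue
    new_param = trainer.model.get_parameter(n)
    self.assertFalse(torch.equal(param, new_param), f"Parameter {n} has not changed.")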
Does the same comment for GRPO apply here? https://github.com/huggingface/trl/pull/4113/files#diff-96dca172e696190fc3e1469166e88aface95ebae959284c6806f2e25d2217c16R1587
for prompt in prompts:
    for turn in prompt:
        if isinstance(turn["content"], list):
            logger.warning_once("Visual reward models aren't supported yet; dropping image.")
Would raising an error be better than a warning? Otherwise I could imagine the warning being missed and the training "failing silently" because the reward is only computed on the text part.
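For comparison, the stricter behavior being suggested could look like this (a sketch, not part of the PR):

for prompt in prompts:
    for turn in prompt:
        if isinstance(turn["content"], list):
            raise NotImplementedError(
                "Visual reward models aren't supported yet; the reward would silently be computed "
                "on the text part only. Use a text-only reward model or drop the images yourself."
            )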
table["images"] = []
for image_list in self._logs["images"]:
    # Convert images to wandb Image objects for proper visualization
    table["images"].append([wandb.Image(image) for image in image_list])
At some point it would be nice to also add the trackio variant for table images.
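A hypothetical sketch of that trackio variant, assuming trackio's advertised drop-in wandb compatibility (whether trackio exposes an Image wrapper with this exact signature should be checked against its docs):

import trackio  # assumed drop-in replacement for wandb

if self._logs["images"]:
    table["images"] = []
    for image_list in self._logs["images"]:
        # Mirror the wandb branch with trackio's (assumed) Image wrapper
        table["images"].append([trackio.Image(image) for image in image_list])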
This PR is the second in a sequence of PRs (after #4111) that aims to refactor the generation part of GRPO/RLOO to allow for easier customization.
While refactoring, I realized that clean multi-image support helps achieve a cleaner separation between functions.
Try with:
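(The snippet is cut off here; a hypothetical minimal example of what a multi-image GRPO run could look like. The dataset contents, the "images" column name, and the model id are illustrative assumptions, not taken from the PR.)

from datasets import Dataset
from PIL import Image
from trl import GRPOConfig, GRPOTrainer

blue = Image.new("RGB", (64, 64), color="blue")
gray = Image.new("RGB", (64, 64), color="gray")

# Conversational prompts with two images per sample (hypothetical schema)
dataset = Dataset.from_list(
    [
        {
            "prompt": [
                {
                    "role": "user",
                    "content": [
                        {"type": "image"},
                        {"type": "image"},
                        {"type": "text", "text": "Which of the two images is brighter?"},
                    ],
                }
            ],
            "images": [blue, gray],
        }
    ]
    * 8
)

def reward_len(completions, **kwargs):
    # Toy text-only reward: prefer short completions
    return [-len(completion[0]["content"]) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-multi-image", max_completion_length=32),
    train_dataset=dataset,
)
trainer.train()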