[OpenEnv] multi step training issue in GRPO trainer #4543

@akshay-gup

Description

Reproduction

Hi, I have a question/concern about using GRPOTrainer for multi-step agent training in vLLM server mode.

In `grpo_trainer.py`, inside the server-mode generation path, there is this line:

```python
# At this point, we only get 1 copy of each prompt, so we need to repeat them num_generations times
all_prompt_ids = [ids for ids in all_prompt_ids for _ in range(self.num_generations)]
```
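
To make the duplication concrete, here is a minimal standalone sketch of what that comprehension does (the token ids and `num_generations` value are made up):

```python
# Toy values, not TRL internals: two dataset examples, two generations each.
num_generations = 2

# In server mode, we only get one copy of each prompt's token ids back.
all_prompt_ids = [[101, 7], [101, 9]]

# Each prompt is repeated num_generations times to pair with its G completions.
all_prompt_ids = [ids for ids in all_prompt_ids for _ in range(num_generations)]

print(all_prompt_ids)  # [[101, 7], [101, 7], [101, 9], [101, 9]]
```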

This works correctly when:

  • using normal one-shot generation, or
  • using a rollout_func that produces single-step completions (one completion per generation).

However, for multi-step agents, the trajectory is produced over multiple turns, and each turn has a different prefix (because environment/tool responses get appended to the prompt at every step).
In that case, each rollout has its own sequence of step-wise prefixes, which leads to different prompt_ids per rollout.
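
A toy sketch of the divergence (everything here is invented for illustration: `tokenize` stands in for a real tokenizer, and the env responses are hard-coded):

```python
def tokenize(text):
    # Stand-in for a real tokenizer.
    return [ord(c) for c in text]

def rollout(prompt, env_responses):
    """One rollout: at each step the env/tool response is appended to the context,
    so the prefix the policy conditioned on grows differently per rollout."""
    prefixes = []
    context = prompt
    for env_out in env_responses:
        prefixes.append(tokenize(context))  # behavior-policy prefix at this step
        context += env_out                  # env output extends the prompt
    return prefixes

# Two rollouts start from the same prompt but take different env branches:
r1 = rollout("Q:", [" obsA", " obsB"])
r2 = rollout("Q:", [" obsC", " obsD"])

print(r1[0] == r2[0])  # True  -- step-0 prefixes match
print(r1[1] == r2[1])  # False -- step-1 prefixes already differ
```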

Because server-mode GRPO duplicates the original prompt and pairs it with all G completions, we effectively lose all per-step prefixes. This breaks the assumptions behind importance sampling and causes the IS ratio to be computed against the wrong behavior policy.

In other words:

  • multi-step trajectories require different prompt_ids for each rollout, because the agent conditions on different environment states at each decision point,
  • but the trainer forces one shared prompt per dataset example, duplicated num_generations times,
  • so step-wise importance sampling becomes mathematically incorrect in server mode.
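
As a toy numeric illustration of that last point (the "policy" below is a fabricated lookup table, not a model): the behavior log-prob of a sampled token depends on the prefix it was actually sampled from, so pairing the completion with the shared original prompt queries the wrong term.

```python
import math

# Fabricated behavior "policy": per-token log-probs keyed by (prefix, token).
behavior = {
    ((1,), 5): math.log(0.5),    # token 5 after the bare shared prompt (1,)
    ((1, 9), 5): math.log(0.1),  # token 5 after prompt + env observation (1, 9)
}

# The completion token 5 was actually sampled after prefix (1, 9), so the
# true behavior-policy term in the IS ratio is log 0.1 ...
true_lp = behavior[((1, 9), 5)]

# ... but server mode pairs the completion with the duplicated shared
# prompt (1,), so the ratio is computed from log 0.5 instead.
wrong_lp = behavior[((1,), 5)]

print(true_lp == wrong_lp)  # False
```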

Is multi-step agent training intended to be supported in server mode?
Or should the trainer be extended to accept per-rollout (or per-step) prefixes instead of forcing all rollouts to share the same prompt_ids?
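
One possible shape for such an extension (entirely hypothetical; this is not the current rollout_func contract) would be to return one `prompt_ids` entry per rollout rather than one per dataset example:

```python
# Hypothetical rollout_func return shape (not the current TRL API): every
# rollout carries its own prefix, so nothing needs to be duplicated later.
def rollout_func(prompts, num_generations):
    out = {"prompt_ids": [], "completion_ids": [], "logprobs": []}
    for prompt in prompts:
        for g in range(num_generations):
            # In a real agent this prefix would grow with env/tool output and
            # differ per rollout; here we fake divergence with a marker token.
            out["prompt_ids"].append(prompt + [1000 + g])
            out["completion_ids"].append([42])
            out["logprobs"].append([-0.5])
    return out

result = rollout_func([[101, 7]], num_generations=2)
print(result["prompt_ids"])  # [[101, 7, 1000], [101, 7, 1001]]
```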

Thanks!

System Info

  • Platform: Linux-5.15.0-152-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • TRL version: 0.23.1
  • PyTorch version: 2.8.0
  • accelerator(s): 8× NVIDIA RTX A6000
  • Transformers version: 4.56.2
  • Accelerate version: 1.10.1
  • Accelerate config: not found
  • Datasets version: 4.2.0
  • HF Hub version: 0.35.2
  • bitsandbytes version: 0.48.1
  • DeepSpeed version: 0.18.0
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.109.1
  • PEFT version: 0.17.1
  • vLLM version: 0.11.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete

Labels

🏋 GRPO (Related to GRPO), 🐛 bug (Something isn't working)
