[OpenEnv] multi step training issue in GRPO trainer #4543

@akshay-gup

Description

Reproduction

Hi, I have a question/concern about using GRPOTrainer for multi-step agent training in vLLM server mode.

In `grpo_trainer.py`, inside the server-mode generation path, there is this line:

```python
# At this point, we only get 1 copy of each prompt, so we need to repeat them num_generations times
all_prompt_ids = [ids for ids in all_prompt_ids for _ in range(self.num_generations)]
```
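
To make the duplication concrete, here is a minimal standalone sketch of what that comprehension does (the token ids and `num_generations` value are made up):

```python
# Toy values, not TRL internals: two dataset examples, two generations each.
num_generations = 2

# In server mode, we only get one copy of each prompt's token ids back.
all_prompt_ids = [[101, 7], [101, 9]]

# Each prompt is repeated num_generations times to pair with its G completions.
all_prompt_ids = [ids for ids in all_prompt_ids for _ in range(num_generations)]

print(all_prompt_ids)  # [[101, 7], [101, 7], [101, 9], [101, 9]]
```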

This works correctly when:

  • using normal one-shot generation, or
  • using a rollout_func that produces single-step completions (one completion per generation).

However, for multi-step agents, the trajectory is produced over multiple turns, and each turn has a different prefix (because environment/tool responses get appended to the prompt at every step).
In that case, each rollout has its own sequence of step-wise prefixes, which leads to different prompt_ids per rollout.
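
A toy sketch of the divergence (everything here is invented for illustration: `tokenize` stands in for a real tokenizer, and the env responses are hard-coded):

```python
def tokenize(text):
    # Stand-in for a real tokenizer.
    return [ord(c) for c in text]

def rollout(prompt, env_responses):
    """One rollout: at each step the env/tool response is appended to the context,
    so the prefix the policy conditioned on grows differently per rollout."""
    prefixes = []
    context = prompt
    for env_out in env_responses:
        prefixes.append(tokenize(context))  # behavior-policy prefix at this step
        context += env_out                  # env output extends the prompt
    return prefixes

# Two rollouts start from the same prompt but take different env branches:
r1 = rollout("Q:", [" obsA", " obsB"])
r2 = rollout("Q:", [" obsC", " obsD"])

print(r1[0] == r2[0])  # True  -- step-0 prefixes match
print(r1[1] == r2[1])  # False -- step-1 prefixes already differ
```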

Because server-mode GRPO duplicates the original prompt and pairs it with all G completions, we effectively lose all per-step prefixes. This breaks the assumptions behind importance sampling and causes the IS ratio to be computed against the wrong behavior policy.

In other words:

  • multi-step trajectories require different prompt_ids for each rollout, because the agent conditions on different environment states at each decision point,
  • but the trainer forces one shared prompt per dataset example, duplicated num_generations times,
  • so step-wise importance sampling becomes mathematically incorrect in server mode.
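
As a toy numeric illustration of that last point (the "policy" below is a fabricated lookup table, not a model): the behavior log-prob of a sampled token depends on the prefix it was actually sampled from, so pairing the completion with the shared original prompt queries the wrong term.

```python
import math

# Fabricated behavior "policy": per-token log-probs keyed by (prefix, token).
behavior = {
    ((1,), 5): math.log(0.5),    # token 5 after the bare shared prompt (1,)
    ((1, 9), 5): math.log(0.1),  # token 5 after prompt + env observation (1, 9)
}

# The completion token 5 was actually sampled after prefix (1, 9), so the
# true behavior-policy term in the IS ratio is log 0.1 ...
true_lp = behavior[((1, 9), 5)]

# ... but server mode pairs the completion with the duplicated shared
# prompt (1,), so the ratio is computed from log 0.5 instead.
wrong_lp = behavior[((1,), 5)]

print(true_lp == wrong_lp)  # False
```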

Is multi-step agent training intended to be supported in server mode?
Or should the trainer be extended to accept per-rollout (or per-step) prefixes instead of forcing all rollouts to share the same prompt_ids?
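
One possible shape for such an extension (entirely hypothetical; this is not the current rollout_func contract) would be to return one `prompt_ids` entry per rollout rather than one per dataset example:

```python
# Hypothetical rollout_func return shape (not the current TRL API): every
# rollout carries its own prefix, so nothing needs to be duplicated later.
def rollout_func(prompts, num_generations):
    out = {"prompt_ids": [], "completion_ids": [], "logprobs": []}
    for prompt in prompts:
        for g in range(num_generations):
            # In a real agent this prefix would grow with env/tool output and
            # differ per rollout; here we fake divergence with a marker token.
            out["prompt_ids"].append(prompt + [1000 + g])
            out["completion_ids"].append([42])
            out["logprobs"].append([-0.5])
    return out

result = rollout_func([[101, 7]], num_generations=2)
print(result["prompt_ids"])  # [[101, 7, 1000], [101, 7, 1001]]
```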

Thanks!

System Info

  • Platform: Linux-5.15.0-152-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • TRL version: 0.23.1
  • PyTorch version: 2.8.0
  • accelerator(s): 8× NVIDIA RTX A6000
  • Transformers version: 4.56.2
  • Accelerate version: 1.10.1
  • Accelerate config: not found
  • Datasets version: 4.2.0
  • HF Hub version: 0.35.2
  • bitsandbytes version: 0.48.1
  • DeepSpeed version: 0.18.0
  • Diffusers version: not installed
  • Liger-Kernel version: not installed
  • LLM-Blender version: not installed
  • OpenAI version: 1.109.1
  • PEFT version: 0.17.1
  • vLLM version: 0.11.0

Checklist

  • I have checked that my issue isn't already filed (see open issues)
  • I have included my system information
  • Any code provided is minimal, complete, and reproducible (more on MREs)
  • Any code provided is properly formatted in code blocks (no screenshots, more on code blocks)
  • Any traceback provided is complete

Labels

🏋 GRPO (Related to GRPO), 🐛 bug (Something isn't working)
