Reproduction
Hi, I have a question/concern about using GRPOTrainer for multi-step agent training in vLLM server mode.
In `grpo_trainer.py`, inside the server-mode generation path, there is this line:

```python
# At this point, we only get 1 copy of each prompt, so we need to repeat them num_generations times
all_prompt_ids = [ids for ids in all_prompt_ids for _ in range(self.num_generations)]
```

This works correctly when:
- using normal one-shot generation, or
- using a `rollout_func` that produces single-step completions (one completion per generation).
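As a toy illustration (standalone sketch, not TRL code), the repeat expands the single tokenized prompt per dataset example into `num_generations` copies, which is valid exactly when every completion really was sampled from that shared prompt:

```python
# Toy illustration of the trainer's expansion (hypothetical token IDs).
num_generations = 3

# One tokenized prompt per dataset example (2 examples here).
all_prompt_ids = [[101, 7592], [101, 2129]]

# Each prompt is duplicated num_generations times, pairing one copy
# with each of the G completions for that example.
expanded = [ids for ids in all_prompt_ids for _ in range(num_generations)]

# In one-shot generation this is fine: every completion for an example
# was genuinely conditioned on that exact prompt.
assert len(expanded) == len(all_prompt_ids) * num_generations
assert expanded[0] == expanded[1] == expanded[2]
```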
However, for multi-step agents, the trajectory is produced over multiple turns, and each turn has a different prefix (because environment/tool responses get appended to the prompt at every step).
In that case, each rollout has its own sequence of step-wise prefixes, which leads to different prompt_ids per rollout.
Because server-mode GRPO duplicates the original prompt and pairs it with all G completions, we effectively lose all per-step prefixes. This breaks the assumptions behind importance sampling and causes the IS ratio to be computed against the wrong behavior policy.
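A minimal sketch of the mismatch (hypothetical names and token IDs, not the TRL API): each multi-step rollout appends its own tool/environment observation before the next turn, so the conditioning prefix diverges per rollout, while the trainer pairs every completion with the bare prompt repeated:

```python
# Toy multi-step rollout: each rollout sees a different observation,
# so its final conditioning prefix differs (hypothetical token IDs).
base_prompt = [101, 7592]          # shared initial prompt tokens
tool_outputs = [[11], [22], [33]]  # per-rollout environment/tool responses

# What each rollout actually conditioned on at the final turn:
rollout_prefixes = [base_prompt + obs for obs in tool_outputs]

# What server-mode GRPO uses instead: base_prompt duplicated G times.
num_generations = 3
trainer_prefixes = [base_prompt for _ in range(num_generations)]

# The true per-rollout prefixes are all lost:
assert all(p != base_prompt for p in rollout_prefixes)
assert trainer_prefixes == [base_prompt] * num_generations
```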
In other words:
- multi-step trajectories require different `prompt_ids` for each rollout, because the agent conditions on different environment states at each decision point,
- but the trainer forces one shared prompt per dataset example, duplicated `num_generations` times,
- so step-wise importance sampling becomes mathematically incorrect in server mode.
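To make the importance-sampling problem concrete, here is a toy numerical sketch (made-up log-probabilities, not real model outputs): the behavior policy generated a completion token conditioned on prompt + tool output, but if log-probs are recomputed against the bare prompt, the ratio no longer estimates the intended policy ratio:

```python
import math

# Hypothetical per-token log-probs:
logp_behavior = -0.5    # log pi_old(token | prompt + tool_output): true behavior policy
logp_wrong_prefix = -2.0  # log pi_old(token | prompt): recomputed under the wrong prefix
logp_current = -0.6     # log pi_theta(token | prompt + tool_output)

# Correct IS ratio: numerator and denominator condition on the same true prefix.
correct_ratio = math.exp(logp_current - logp_behavior)

# Broken ratio: denominator evaluated under the wrong (bare-prompt) prefix,
# so it no longer estimates pi_theta / pi_behavior for this token.
wrong_ratio = math.exp(logp_current - logp_wrong_prefix)

assert not math.isclose(correct_ratio, wrong_ratio)
```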
Is multi-step agent training intended to be supported in server mode?
Or should the trainer be extended to accept per-rollout (or per-step) prefixes instead of forcing all rollouts to share the same prompt_ids?
Thanks!
System Info
- Platform: Linux-5.15.0-152-generic-x86_64-with-glibc2.35
- Python version: 3.10.12
- TRL version: 0.23.1
- PyTorch version: 2.8.0
- accelerator(s): 8× NVIDIA RTX A6000
- Transformers version: 4.56.2
- Accelerate version: 1.10.1
- Accelerate config: not found
- Datasets version: 4.2.0
- HF Hub version: 0.35.2
- bitsandbytes version: 0.48.1
- DeepSpeed version: 0.18.0
- Diffusers version: not installed
- Liger-Kernel version: not installed
- LLM-Blender version: not installed
- OpenAI version: 1.109.1
- PEFT version: 0.17.1
- vLLM version: 0.11.0
Checklist
- I have checked that my issue isn't already filed (see open issues)
- I have included my system information
- Any code provided is minimal, complete, and reproducible (more on MREs)
- Any code provided is properly formatted in code blocks, (no screenshot, more on code blocks)
- Any traceback provided is complete