
Yes, I get that the implementation details may not be easy to follow. Maybe the important point here is that we don't really need the repetition in the sampling: when the trainer doesn't need to generate, it simply ignores the sampled data. See https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_trainer.py#L1007-L1008

So if you have steps_per_generation=2, then:

  • Sample 1 → generate
  • Sample 1 → ignored
  • Sample 2 → generate

And so on.
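To make the schedule above concrete, here is a minimal simulation. It is not TRL's code: `repeat_sampler`, `filler_sampler`, `fake_generate` and `train_loop` are hypothetical names. It assumes (as I understand recent TRL versions to do, but treat it as an assumption) that each generation batch is split into `steps_per_generation` chunks, one per optimizer step, and that on non-generation steps the freshly sampled batch is not looked at. It checks that replacing the repeated batches with filler data leaves the inputs the trainer consumes unchanged.

```python
from typing import Iterator, List


def repeat_sampler(batches: List[List[int]], repeat: int) -> Iterator[List[int]]:
    """Hypothetical sampler: yields each generation batch `repeat` times in a row."""
    for batch in batches:
        for _ in range(repeat):
            yield batch


def filler_sampler(batches: List[List[int]], repeat: int) -> Iterator[List[int]]:
    """Hypothetical sampler: yields each batch once, then placeholder batches."""
    for batch in batches:
        yield batch
        for _ in range(repeat - 1):
            yield [-1] * len(batch)  # never used by the trainer below


def fake_generate(batch: List[int]) -> List[str]:
    # Stand-in for generation + scoring: one "completion" per prompt index.
    return [f"completion_{i}" for i in batch]


def train_loop(sampler: Iterator[List[int]], steps_per_generation: int) -> List[List[str]]:
    """Generate only every `steps_per_generation` steps; otherwise ignore the
    sampled batch and take the next chunk of the buffered generations.
    Assumes the batch size is divisible by `steps_per_generation`."""
    buffered: List[List[str]] = []
    consumed: List[List[str]] = []
    for step, batch in enumerate(sampler):
        if step % steps_per_generation == 0:
            completions = fake_generate(batch)
            chunk = len(completions) // steps_per_generation
            # One chunk per optimizer step until the next generation.
            buffered = [completions[i * chunk:(i + 1) * chunk] for i in range(steps_per_generation)]
        # On non-generation steps, `batch` is simply not read.
        consumed.append(buffered[step % steps_per_generation])
    return consumed


gen_batches = [[0, 1, 2, 3], [4, 5, 6, 7]]
with_repeat = train_loop(repeat_sampler(gen_batches, 2), steps_per_generation=2)
with_filler = train_loop(filler_sampler(gen_batches, 2), steps_per_generation=2)
print(with_repeat)
print(with_repeat == with_filler)  # True: the repeated batches are never read
```

Since the second yield of each batch is discarded, the repetition only keeps the sampler's step count aligned with the trainer's; its content does not matter.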

@qgallouedec
Answer selected by kdricci
@kdricci

Category
Q&A
2 participants