Great work!
I was wondering how training directly with L-GRPO would do. Because the length penalty only applies to prompts with 100% accuracy, it seems it could be fine to train entirely with it. Also, have you tried other reward strategies, such as directly multiplying the reward by 1 - L_i/L_max?
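To make the multiplicative idea concrete, here is a rough sketch of what I have in mind. It is not taken from the repo: the function name `length_scaled_rewards`, the binary 0/1 correctness reward, and treating L_max as a fixed length budget (rather than, say, the longest response in the group) are all my own assumptions.

```python
def length_scaled_rewards(correct, lengths, max_length):
    """Scale each correctness reward by (1 - L_i / L_max).

    Assumptions (hypothetical, not from the repo): the base reward is
    1.0 for a correct response and 0.0 otherwise, and max_length is a
    fixed length budget shared by all responses.
    """
    rewards = []
    for is_correct, length in zip(correct, lengths):
        base = 1.0 if is_correct else 0.0
        # Clamp at 0 so a response at or beyond max_length gets no reward
        # instead of a negative scaling factor.
        factor = max(0.0, 1.0 - length / max_length)
        rewards.append(base * factor)
    return rewards
```

With this shaping, every correct response is discounted by its length, not only those from prompts that are already at 100% accuracy, so shorter correct answers always score higher than longer correct ones.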
Thanks