Why have 2 stages?

Great work!

I was wondering how training directly with L-GRPO would do. Since the length penalty only applies to prompts with 100% accuracy, it seems like it could be fine to train entirely with it. Also, have you tried other reward strategies, such as directly multiplying the reward by 1 - L_i / L_max?
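To make the second question concrete, here is a rough sketch of what I mean. The function names are made up for illustration, and I am assuming L_max is a fixed length budget and that the correctness reward is 0/1. The gated variant is only my reading of "the penalty applies only to prompts with 100% accuracy", not the paper's actual L-GRPO reward.

```python
def multiplicative_length_reward(correct, lengths, max_len):
    """Scale each rollout's 0/1 correctness reward by (1 - L_i / L_max).

    Hypothetical helper: max_len is assumed to be a fixed length budget,
    and lengths above it are clipped so the factor stays in [0, 1].
    """
    return [
        float(c) * (1.0 - min(length, max_len) / max_len)
        for c, length in zip(correct, lengths)
    ]


def gated_length_reward(correct, lengths, max_len):
    """Apply the same scaling only when every rollout for the prompt is correct.

    Assumption: this mimics a penalty that is active only at 100% group
    accuracy; otherwise the plain 0/1 correctness reward is returned.
    """
    if all(correct):
        return multiplicative_length_reward(correct, lengths, max_len)
    return [float(c) for c in correct]


# One prompt, two rollouts of 50 and 100 tokens, budget of 100 tokens.
print(multiplicative_length_reward([True, False], [50, 100], 100))
print(gated_length_reward([True, False], [50, 100], 100))
```

With the ungated version, a correct rollout is penalized for length even when the prompt is not yet solved reliably, which is the case I would be curious to see compared against the gated one.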

Thanks