Why have 2 stages? #1

@faresobeid

Great work!

I was wondering how training directly with L-GRPO from the start would do. Since the length penalty only applies to prompts with 100% accuracy, it seems like it could be fine to train entirely with it. Also, have you tried other reward strategies, such as directly multiplying the reward by 1 - L_i/L_max?
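
To make the multiplicative idea concrete, here is a rough sketch of what I mean. Everything in it is an assumption for illustration and not taken from the repo: the function names, the binary 0/1 correctness reward, the clamp at zero for responses longer than `l_max`, and the "all rollouts correct" gate, which is just my reading of how the length penalty is restricted to prompts with 100% accuracy.

```python
from typing import List


def length_scaled_rewards(correct: List[bool], lengths: List[int], l_max: int) -> List[float]:
    """Hypothetical reward r_i * (1 - L_i / L_max), with r_i = 1 if correct else 0."""
    if l_max <= 0:
        raise ValueError("l_max must be positive")
    rewards = []
    for is_correct, length in zip(correct, lengths):
        base = 1.0 if is_correct else 0.0
        # Assumption: clamp at 0 so responses longer than l_max are not rewarded negatively.
        scale = max(0.0, 1.0 - length / l_max)
        rewards.append(base * scale)
    return rewards


def gated_length_scaled_rewards(correct: List[bool], lengths: List[int], l_max: int) -> List[float]:
    """Apply the length scaling only when every rollout for the prompt is correct.

    Otherwise fall back to the plain 0/1 correctness reward.
    """
    if correct and all(correct):
        return length_scaled_rewards(correct, lengths, l_max)
    return [1.0 if c else 0.0 for c in correct]


if __name__ == "__main__":
    # One prompt, three rollouts, all correct: shorter answers keep more reward.
    print(gated_length_scaled_rewards([True, True, True], [100, 200, 400], 400))
    # One rollout is wrong, so no length scaling is applied for this prompt.
    print(gated_length_scaled_rewards([True, False, True], [100, 200, 400], 400))
```

The ungated version would penalize length on every correct rollout, including on hard prompts, which might be exactly the thing the two-stage setup is meant to avoid.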

Thanks
