Multi-turn PPO Training Feature Requests #264

@yuxuandexter

Description

Summary

During multi-turn PPO training, we’d like better hyperparameter customization and support for additional PPO features in the training implementation.


Feature 1: More Diverse KL Divergence Approximation

Allow users to flexibly choose how KL divergence is calculated in compute_kl_divergence (in tunix/rl/common.py).
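A minimal sketch of what a selectable estimator could look like. The kl_method argument and the k1/k2/k3 names below follow the commonly used per-token KL approximators and are illustrative additions, not the current tunix signature:

```python
import jax.numpy as jnp

def compute_kl_divergence(per_token_logps, ref_per_token_logps, kl_method="k3"):
  """Sketch: approximate per-token KL(pi || pi_ref) with a selectable estimator.

  kl_method (proposed argument, not the current tunix API):
    "k1": log-ratio (unbiased, high variance)
    "k2": 0.5 * (log-ratio)^2 (biased, low variance)
    "k3": exp(-log_ratio) - 1 + log_ratio (unbiased, low variance, nonnegative)
  """
  log_ratio = per_token_logps - ref_per_token_logps  # log pi - log pi_ref
  if kl_method == "k1":
    return log_ratio
  elif kl_method == "k2":
    return 0.5 * jnp.square(log_ratio)
  elif kl_method == "k3":
    return jnp.exp(-log_ratio) - 1.0 + log_ratio
  else:
    raise ValueError(f"Unknown kl_method: {kl_method}")
```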


Feature 2: Entropy in Policy Loss
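As a sketch of what this could look like: an entropy bonus subtracted from the policy-gradient loss, weighted by a coefficient. The function shape and the entropy_coef hyperparameter below are illustrative assumptions, not existing tunix code:

```python
import jax
import jax.numpy as jnp

def entropy_from_logits(logits):
  # Per-token entropy of the categorical distribution over the vocabulary.
  log_probs = jax.nn.log_softmax(logits, axis=-1)
  return -jnp.sum(jnp.exp(log_probs) * log_probs, axis=-1)

def policy_loss_with_entropy(pg_loss, logits, completion_mask, entropy_coef=0.01):
  # Sketch: entropy_coef is an illustrative new hyperparameter. Subtracting the
  # masked mean entropy from the PG loss encourages exploration.
  entropy = entropy_from_logits(logits)
  mean_entropy = jnp.sum(entropy * completion_mask) / jnp.maximum(
      jnp.sum(completion_mask), 1.0)
  return pg_loss - entropy_coef * mean_entropy, mean_entropy
```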


Feature 3: Asymmetric & Dual Clipping in Policy Loss
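A minimal sketch of what asymmetric plus dual clipping could look like, following the dual-clip PPO formulation (separate lower/upper clip ranges plus an extra bound on the loss when the advantage is negative). The eps_low, eps_high, and dual_clip_c parameters are illustrative, not existing tunix hyperparameters:

```python
import jax.numpy as jnp

def dual_clip_pg_loss(log_ratio, advantages,
                      eps_low=0.2, eps_high=0.28, dual_clip_c=3.0):
  # Sketch: parameter names are illustrative.
  # Asymmetric clipping: separate lower/upper ratio bounds (eps_low, eps_high).
  ratio = jnp.exp(log_ratio)
  unclipped = ratio * advantages
  clipped = jnp.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
  ppo_loss = -jnp.minimum(unclipped, clipped)
  # Dual clipping: for negative advantages, additionally bound the loss by
  # -dual_clip_c * A so a very large ratio cannot dominate the update.
  dual_clipped = -dual_clip_c * advantages
  return jnp.where(advantages < 0, jnp.minimum(ppo_loss, dual_clipped), ppo_loss)
```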


Feature 4: Support Custom completion_mask in process_ids

In tunix/rl/common.py, the process_ids function always calls make_completion_mask, generating a new mask from the first eos token. This discards any pre-built completion mask passed in.
Multi-turn PPO training uses full trajectories that often contain multiple eos tokens. The default behavior stops at the first eos, effectively truncating the trajectory and invalidating multi-turn context.
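A sketch of the requested fallback behavior: honor a caller-supplied mask and only derive one from the first eos token when none is given. The function name and signature below are illustrative, not the actual process_ids / make_completion_mask signatures in tunix/rl/common.py:

```python
import jax.numpy as jnp

def make_completion_mask_multi_turn(completion_ids, eos_token_id,
                                    completion_mask=None):
  # Sketch: completion_mask is the proposed optional argument. If the caller
  # already built a mask (e.g. for a multi-turn trajectory with several eos
  # tokens), use it unchanged.
  if completion_mask is not None:
    return completion_mask
  # Otherwise fall back to the current single-turn default: keep everything up
  # to and including the first eos, mask out everything after it.
  is_eos = completion_ids == eos_token_id
  after_first_eos = jnp.cumsum(is_eos, axis=-1) - is_eos.astype(jnp.int32)
  return (after_first_eos == 0).astype(jnp.int32)
```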

Feature 5: Add More Metrics and Organize Logs by Category in W&B

  • Log richer metrics:
    • old_value → min, mean, max
    • advantage → min, mean, max
    • return → min, mean, max
    • entropy, vf_loss, etc.
  • Organize metrics into logical groups
    • train/old_value/min, train/old_value/mean, train/old_value/max
    • train/advantage/mean, train/advantage/min, train/advantage/max
    • eval/return/mean, actor/entropy, critic/vf_loss, etc.
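A sketch of how the grouped names above could be logged: W&B splits metric names on "/" into panel sections, so these prefixes map directly to train/, eval/, actor/, and critic/ groups. The helper below is illustrative and assumes a W&B run has already been initialized:

```python
import numpy as np
import wandb  # assumes wandb.init(...) has already been called

def log_grouped_metrics(step, old_values, advantages, returns, entropy, vf_loss):
  # Sketch: helper name and arguments are illustrative, not existing tunix code.
  def stats(prefix, x):
    x = np.asarray(x)
    return {f"{prefix}/min": x.min(),
            f"{prefix}/mean": x.mean(),
            f"{prefix}/max": x.max()}

  metrics = {}
  metrics.update(stats("train/old_value", old_values))
  metrics.update(stats("train/advantage", advantages))
  metrics.update(stats("eval/return", returns))
  metrics["actor/entropy"] = float(entropy)
  metrics["critic/vf_loss"] = float(vf_loss)
  wandb.log(metrics, step=step)
```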
