
Commit 0726977

docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index (#4580)
Co-authored-by: Quentin Gallouédec <[email protected]>
1 parent 9731d08 commit 0726977


docs/source/paper_index.md

Lines changed: 38 additions & 0 deletions
@@ -98,6 +98,44 @@ trainer = GRPOTrainer(
)
```

### Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

**📜 Paper**: https://huggingface.co/papers/2506.01939

A minority of high-entropy tokens act as reasoning "forks" in the chain-of-thought (CoT) path, driving exploration and the performance gains of RLVR, while the low-entropy majority contributes little or can even impede learning. RLVR mainly adjusts these high-entropy tokens and largely preserves the base model's overall entropy patterns. This yields an 80/20 rule: restricting gradient updates to the 20% of tokens with the highest entropy matches or surpasses full-gradient updates for Qwen3 models.

The paper's main results use vanilla DAPO (⚠️ Dynamic Sampling is not supported in TRL). To replicate them, use the following configuration:

```python
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import get_soft_overlong_punishment

training_args = GRPOConfig(
    # --- vanilla DAPO parameters (80/20 rule: section 5.2) --- #
    # Overlong Filtering
    mask_truncated_completions=True,
    # Token-level Loss
    loss_type="dapo",
    # Clip-Higher
    epsilon_high=0.28,  # DAPO paper: section 4.1
    epsilon=0.2,  # DAPO paper: section 4.1
    # Other parameters used
    per_device_train_batch_size=512,  # mini-batch size for training in the paper, DAPO paper: section 4.1
    num_generations=16,  # number of sample responses in the paper, DAPO paper: section 4.1
    max_completion_length=20480,  # maximum number of tokens for generation in the paper, DAPO paper: section 4.1
    beta=0.0,  # section 2.3, DAPO paper
    # --- Gradients on the highest-entropy tokens --- #
    top_entropy_quantile=0.2,
)
# Soft Overlong Punishment
sop_reward = get_soft_overlong_punishment(max_completion_len=20480, soft_punish_cache=4096)  # DAPO paper: section 4.1
trainer = GRPOTrainer(
    ...,
    args=training_args,
    reward_funcs=[..., sop_reward],
)
```

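For intuition about what `top_entropy_quantile=0.2` selects, below is a minimal, hypothetical sketch of a top-entropy-quantile mask. It is not TRL's internal implementation; the `top_entropy_mask` helper and the toy tensors are illustrative assumptions. In practice, `GRPOTrainer` applies this selection for you when `top_entropy_quantile` is set.

```python
import torch

def top_entropy_mask(entropies: torch.Tensor, completion_mask: torch.Tensor, quantile: float = 0.2) -> torch.Tensor:
    """Keep only the `quantile` fraction of completion tokens with the highest entropy."""
    valid = entropies[completion_mask.bool()]  # ignore padding when computing the threshold
    threshold = torch.quantile(valid.float(), 1.0 - quantile)
    return (entropies >= threshold) & completion_mask.bool()

# Toy example: a batch of 2 completions, 5 tokens each, no padding.
entropies = torch.tensor([[0.10, 2.30, 0.20, 1.80, 0.05],
                          [0.40, 0.30, 2.90, 0.10, 1.50]])
completion_mask = torch.ones_like(entropies)
mask = top_entropy_mask(entropies, completion_mask, quantile=0.2)
print(mask)  # True only for the ~20% highest-entropy tokens; the rest contribute no gradient
```
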
### Dr. GRPO: Understanding R1-Zero-Like Training: A Critical Perspective

**📜 Paper**: https://huggingface.co/papers/2503.20783
