
Commit 26b497e

Fix typos (#4109)
1 parent d22bdb8 commit 26b497e

2 files changed: +6, -6 lines changed


docs/source/grpo_trainer.md

Lines changed: 5 additions & 5 deletions
@@ -352,7 +352,7 @@ The [`GRPOTrainer`] supports using custom reward functions instead of dense rewa
 - `completions` (contains the generated completions),
 - `completions_ids` (contains the tokenized completions),
 - `trainer_state` ([`~transformers.TrainerState`]): The current state of the trainer. This can be used to implement dynamic reward functions, such as curriculum learning, where the reward is adjusted based on the training progress.
-- All columns names (but `prompt`) that the dataset may have. For example, if the dataset contains a column named `ground_truth`, the function will be called with `ground_truth` as a keyword argument.
+- All column names (but `prompt`) that the dataset may have. For example, if the dataset contains a column named `ground_truth`, the function will be called with `ground_truth` as a keyword argument.

 The easiest way to comply with this requirement is to use `**kwargs` in the function signature.
 - Depending on the dataset format, the input will vary:
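For orientation, a minimal sketch of a reward function that satisfies this requirement by absorbing extra dataset columns through `**kwargs` might look as follows; the exact-match scoring and the `ground_truth` column are illustrative choices, not part of this commit.

```python
# Minimal sketch: extra dataset columns (e.g. `ground_truth`) arrive as keyword
# arguments; **kwargs swallows any columns the function does not use explicitly.
def reward_func(completions, ground_truth, **kwargs):
    # Assumes standard (string) completions; scores 1.0 for an exact match, else 0.0.
    return [
        1.0 if completion.strip() == truth.strip() else 0.0
        for completion, truth in zip(completions, ground_truth)
    ]
```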
@@ -381,7 +381,7 @@ You can test it as follows:
 [2.0, 4.0]
 ```

-#### Example 1.1: Reward longer completions (based in the number of characters)
+#### Example 1.1: Reward longer completions (based on the number of characters)

 Same as the previous example, but this time the reward function is based on the number of characters instead of tokens.

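Under the description above, the character-based variant from the corrected heading could be sketched as below; the `[6.0, 12.0]` test output visible at the top of the next hunk would then correspond to completions of 6 and 12 characters.

```python
# Sketch of Example 1.1: the reward is the completion's character count,
# so longer completions receive larger rewards.
def reward_func(completions, **kwargs):
    # Assumes standard (string) completions.
    return [float(len(completion)) for completion in completions]
```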
@@ -401,10 +401,10 @@ You can test it as follows:
 [6.0, 12.0]
 ```

-#### Example 2: Reward completions with specific format
+#### Example 2: Reward completions with a specific format

 Below is an example of a reward function that checks if the completion has a specific format. This example is inspired by the _format reward_ function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
-It is designed for conversational format, where prompts and completions consist of structured messages.
+It is designed for a conversational format, where prompts and completions consist of structured messages.

 ```python
 import re
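The hunk cuts off after `import re`; as a rough sketch, a format-checking reward for conversational completions along the lines described above might look like this. The `<think>`/`<answer>` tag pattern is an assumption modeled on the DeepSeek-R1-style format reward, not something shown in this diff.

```python
import re

# Sketch: each conversational completion is a list of message dicts, and the reward
# checks whether the message content matches a fixed tag pattern (assumed here).
def format_reward_func(completions, **kwargs):
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(pattern, content, re.DOTALL) else 0.0 for content in contents]
```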
@@ -513,7 +513,7 @@ trainer = GRPOTrainer(
 trainer.train()
 ```

-In this example, the `math_reward_func` and `coding_reward_func` are designed to work with a mixed dataset that contains both math and coding problems. The `task` column in the dataset is used to determine which reward function to apply to each problem. If there is no relevant reward function for a sample in the dataset, the reward function will return `None` and the [`GRPOTrainer`] will continue with the valid functions and tasks. This allows the [`GRPOTrainer`] to handle multiple reward functions with different applicability.
+In this example, the `math_reward_func` and `coding_reward_func` are designed to work with a mixed dataset that contains both math and coding problems. The `task` column in the dataset is used to determine which reward function to apply to each problem. If there is no relevant reward function for a sample in the dataset, the reward function will return `None`, and the [`GRPOTrainer`] will continue with the valid functions and tasks. This allows the [`GRPOTrainer`] to handle multiple reward functions with different applicability.

 Note that the [`GRPOTrainer`] will ignore the `None` rewards returned by the reward functions and only consider the rewards returned by the relevant functions. This ensures that the model is trained on the relevant tasks and ignores the tasks for which there is no relevant reward function.

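To make the `None`-return behavior concrete, a pair of task-conditional reward functions could be sketched as follows; the scoring bodies are placeholders, and only the gating on the `task` column and the `None` returns reflect the behavior described in this hunk.

```python
# Sketch: two reward functions share a mixed dataset with a `task` column.
# Each returns None for samples outside its task, which GRPOTrainer ignores.
def math_reward_func(completions, task, **kwargs):
    return [
        # Placeholder scoring; a real implementation would check the math answer.
        (1.0 if completion.strip() else 0.0) if t == "math" else None
        for completion, t in zip(completions, task)
    ]

def coding_reward_func(completions, task, **kwargs):
    return [
        # Placeholder scoring; a real implementation would run or lint the code.
        (1.0 if completion.strip() else 0.0) if t == "coding" else None
        for completion, t in zip(completions, task)
    ]
```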
docs/source/rloo_trainer.md

Lines changed: 1 addition & 1 deletion
@@ -145,7 +145,7 @@ While training and evaluating, we record the following reward metrics:
 - `completions/mean_terminated_length`: The average length of generated completions that terminate with EOS.
 - `completions/min_terminated_length`: The minimum length of generated completions that terminate with EOS.
 - `completions/max_terminated_length`: The maximum length of generated completions that terminate with EOS.
-- `completions/clipped_ratio` : The ratio of truncated (clipped) completions.
+- `completions/clipped_ratio`: The ratio of truncated (clipped) completions.
 - `reward/{reward_func_name}/mean`: The average reward from a specific reward function.
 - `reward/{reward_func_name}/std`: The standard deviation of the reward from a specific reward function.
 - `reward`: The overall average reward after applying reward weights.
