11 changes: 4 additions & 7 deletions docs/source/community_tutorials.md
@@ -29,13 +29,10 @@ Community tutorials are made by active members of the Hugging Face community who
<details>
<summary>⚠️ Deprecated features notice for "How to fine-tune a smol-LM with Hugging Face, TRL, and the smoltalk Dataset" (click to expand)</summary>

<Tip warning={true}>

The tutorial uses two deprecated features:
- `SFTTrainer(..., tokenizer=tokenizer)`: Use `SFTTrainer(..., processing_class=tokenizer)` instead, or simply omit it (it will be inferred from the model).
- `setup_chat_format(model, tokenizer)`: Use `SFTConfig(..., chat_template_path="Qwen/Qwen3-0.6B")`, where `chat_template_path` specifies the model whose chat template you want to copy.

</Tip>
> [!WARNING]
> The tutorial uses two deprecated features:
> - `SFTTrainer(..., tokenizer=tokenizer)`: Use `SFTTrainer(..., processing_class=tokenizer)` instead, or simply omit it (it will be inferred from the model).
> - `setup_chat_format(model, tokenizer)`: Use `SFTConfig(..., chat_template_path="Qwen/Qwen3-0.6B")`, where `chat_template_path` specifies the model whose chat template you want to copy.
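
Taken together, a minimal sketch of the updated usage (the base model and dataset below are illustrative placeholders, not the ones from the tutorial):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Illustrative model and dataset; substitute the ones used in the tutorial
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
dataset = load_dataset("trl-lib/Capybara", split="train")

training_args = SFTConfig(
    output_dir="smol-lm-sft",
    # Replaces setup_chat_format: copy the chat template from the referenced model
    chat_template_path="Qwen/Qwen3-0.6B",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # replaces the deprecated tokenizer=... argument
)
trainer.train()
```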

</details>

114 changes: 48 additions & 66 deletions docs/source/dataset_formats.md
@@ -289,31 +289,28 @@ prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the

For examples of prompt-only datasets, refer to the [Prompt-only datasets collection](https://huggingface.co/collections/trl-lib/prompt-only-datasets-677ea25245d20252cea00368).

<Tip>

While both the prompt-only and language modeling types are similar, they differ in how the input is handled. In the prompt-only type, the prompt represents a partial input that expects the model to complete or continue, while in the language modeling type, the input is treated as a complete sentence or sequence. These two types are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each type:

```python
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

# Example for prompt-only type
prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(prompt_only_example, tokenizer)
# Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'}

# Example for language modeling type
lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}
apply_chat_template(lm_example, tokenizer)
# Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'}
```

- The prompt-only output includes a `'<|assistant|>\n'`, indicating the beginning of the assistant’s turn and expecting the model to generate a completion.
- In contrast, the language modeling output treats the input as a complete sequence and terminates it with `'<|endoftext|>'`, signaling the end of the text and not expecting any additional content.

</Tip>
> [!TIP]
> While both the prompt-only and language modeling types are similar, they differ in how the input is handled. In the prompt-only type, the prompt represents a partial input that expects the model to complete or continue, while in the language modeling type, the input is treated as a complete sentence or sequence. These two types are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each type:
>
> ```python
> from transformers import AutoTokenizer
> from trl import apply_chat_template
>
> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
>
> # Example for prompt-only type
> prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
> apply_chat_template(prompt_only_example, tokenizer)
> # Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'}
>
> # Example for language modeling type
> lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]}
> apply_chat_template(lm_example, tokenizer)
> # Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'}
> ```
>
> - The prompt-only output includes a `'<|assistant|>\n'`, indicating the beginning of the assistant’s turn and expecting the model to generate a completion.
> - In contrast, the language modeling output treats the input as a complete sequence and terminates it with `'<|endoftext|>'`, signaling the end of the text and not expecting any additional content.

#### Prompt-completion

@@ -408,12 +405,9 @@ Choosing the right dataset type depends on the task you are working on and the s
| [`SFTTrainer`] | [Language modeling](#language-modeling) or [Prompt-completion](#prompt-completion) |
| [`XPOTrainer`] | [Prompt-only](#prompt-only) |

<Tip>

TRL trainers only support standard dataset formats, [for now](https://github.com/huggingface/trl/issues/2071). If you have a conversational dataset, you must first convert it into a standard format.
For more information on how to work with conversational datasets, refer to the [Working with conversational datasets in TRL](#working-with-conversational-datasets-in-trl) section.

</Tip>
> [!TIP]
> TRL trainers only support standard dataset formats, [for now](https://github.com/huggingface/trl/issues/2071). If you have a conversational dataset, you must first convert it into a standard format.
> For more information on how to work with conversational datasets, refer to the [Working with conversational datasets in TRL](#working-with-conversational-datasets-in-trl) section.

## Working with conversational datasets in TRL

@@ -465,27 +459,21 @@ dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer})
# 'completion': ['It is blue.<|end|>\n<|endoftext|>', 'In the sky.<|end|>\n<|endoftext|>']}
```

<Tip warning={true}>

We recommend using the [`apply_chat_template`] function instead of calling `tokenizer.apply_chat_template` directly. Handling chat templates for non-language modeling datasets can be tricky and may result in errors, such as mistakenly placing a system prompt in the middle of a conversation.
For additional examples, see [#1930 (comment)](https://github.com/huggingface/trl/pull/1930#issuecomment-2292908614). The [`apply_chat_template`] is designed to handle these intricacies and ensure the correct application of chat templates for various tasks.

</Tip>

<Tip warning={true}>

It's important to note that chat templates are model-specific. For example, if you use the chat template from [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with the above example, you get a different output:

```python
apply_chat_template(example, AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct"))
# Output:
# {'prompt': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n',
# 'completion': 'It is blue.<|im_end|>\n'}
```

Always use the chat template associated with the model you're working with. Using the wrong template can lead to inaccurate or unexpected results.

</Tip>
> [!WARNING]
> We recommend using the [`apply_chat_template`] function instead of calling `tokenizer.apply_chat_template` directly. Handling chat templates for non-language modeling datasets can be tricky and may result in errors, such as mistakenly placing a system prompt in the middle of a conversation.
> For additional examples, see [#1930 (comment)](https://github.com/huggingface/trl/pull/1930#issuecomment-2292908614). The [`apply_chat_template`] function is designed to handle these intricacies and ensure the correct application of chat templates for various tasks.

> [!WARNING]
> It's important to note that chat templates are model-specific. For example, if you use the chat template from [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with the above example, you get a different output:
>
> ```python
> apply_chat_template(example, AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct"))
> # Output:
> # {'prompt': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n',
> # 'completion': 'It is blue.<|im_end|>\n'}
> ```
>
> Always use the chat template associated with the model you're working with. Using the wrong template can lead to inaccurate or unexpected results.

## Using any dataset with TRL: preprocessing and conversion

@@ -715,13 +703,10 @@ dataset = unpair_preference_dataset(dataset)
'label': True}
```

<Tip warning={true}>

Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad.
Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad.
This can be ensured by checking absolute rating of each completion, e.g. from a reward model.

</Tip>
> [!WARNING]
> Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad.
> Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad.
> This can be ensured by checking the absolute rating of each completion, e.g. from a reward model.
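
One hedged way to run such a check (the reward model, threshold, and column names below are illustrative assumptions, not part of TRL):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative reward model; any sequence-classification reward model can be substituted
reward_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_name)

def absolute_rating(prompt: str, completion: str) -> float:
    # Score a single prompt/completion pair with the reward model
    inputs = reward_tokenizer(prompt, completion, return_tensors="pt")
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

# Assuming a standard (non-conversational) preference dataset with string
# "prompt", "chosen", and "rejected" columns, keep only pairs where the chosen
# completion clears a (freely chosen) threshold and the rejected one does not.
threshold = 0.0
dataset = dataset.filter(
    lambda ex: absolute_rating(ex["prompt"], ex["chosen"]) > threshold
    and absolute_rating(ex["prompt"], ex["rejected"]) <= threshold
)
```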

### From preference to language modeling dataset

@@ -856,13 +841,10 @@ dataset = unpair_preference_dataset(dataset)
'label': True}
```

<Tip warning={true}>

Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad.
Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad.
This can be ensured by checking absolute rating of each completion, e.g. from a reward model.

</Tip>
> [!WARNING]
> Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad.
> Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad.
> This can be ensured by checking the absolute rating of each completion, e.g. from a reward model.

### From unpaired preference to language modeling dataset

7 changes: 2 additions & 5 deletions docs/source/deepspeed_integration.md
@@ -1,10 +1,7 @@
# DeepSpeed Integration

<Tip warning={true}>

Section under construction. Feel free to contribute!

</Tip>
> [!WARNING]
> Section under construction. Feel free to contribute!

TRL supports training with DeepSpeed, a library that implements advanced training optimization techniques. These include optimizer state partitioning, offloading, gradient partitioning, and more.
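
As a minimal sketch of how this typically fits together (the config path, model, and dataset below are illustrative assumptions):

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # illustrative dataset

training_args = SFTConfig(
    output_dir="sft-deepspeed",
    bf16=True,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    # Inherited from transformers.TrainingArguments: path to a standard DeepSpeed
    # JSON config (e.g. ZeRO-3 with offloading); the path here is illustrative.
    deepspeed="configs/ds_zero3.json",
)
trainer = SFTTrainer(model="Qwen/Qwen2.5-0.5B", args=training_args, train_dataset=dataset)
trainer.train()

# Launch with a distributed launcher, for example:
#   accelerate launch train_sft.py
# DeepSpeed can also be configured through `accelerate config` instead of the
# `deepspeed` argument above.
```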

23 changes: 8 additions & 15 deletions docs/source/distributing_training.md
@@ -1,8 +1,7 @@
# Distributing Training

<Tip warning={true}>
Section under construction. Feel free to contribute!
</Tip>
> [!WARNING]
> Section under construction. Feel free to contribute!

## Multi-GPU Training with TRL

@@ -49,11 +48,8 @@ Example, these configurations are equivalent, and should yield the same results:
| 1 | 4 | 8 | Lower memory usage, slower training |
| 8 | 4 | 1 | Multi-GPU to get the best of both worlds |

<Tip>

Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration) guide for more details.

</Tip>
> [!TIP]
> Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration) guide for more details.
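
As a rough sketch of the equivalence in the table above (values are illustrative): the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps × number of GPUs`, so both setups below target an effective batch size of 32.

```python
from trl import SFTConfig

# Single GPU: smaller memory footprint per step, more accumulation steps (slower)
single_gpu_args = SFTConfig(
    output_dir="sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
)

# 8 GPUs (e.g. `accelerate launch --num_processes 8 train.py`): no accumulation needed
multi_gpu_args = SFTConfig(
    output_dir="sft",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=1,
)
```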

## Context Parallelism

@@ -176,13 +172,10 @@ These results show that **Context Parallelism (CP) scales effectively with more
<img src="https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/context_parallelism_s_it_plot.png" alt="CP seconds/iteration" width="45%"/>
</div>

<Tip>

Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs.

You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism).

</Tip>
> [!TIP]
> Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs.
>
> You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism).


**Further Reading on Context Parallelism**
14 changes: 4 additions & 10 deletions docs/source/experimental.md
@@ -2,11 +2,8 @@

The `trl.experimental` namespace provides a minimal, clearly separated space for fast iteration on new ideas.

<Tip warning={true}>

**Stability contract:** Anything under `trl.experimental` may change or be removed in *any* release (including patch versions) without prior deprecation. Do not rely on these APIs for production workloads.

</Tip>
> [!WARNING]
> **Stability contract:** Anything under `trl.experimental` may change or be removed in *any* release (including patch versions) without prior deprecation. Do not rely on these APIs for production workloads.

## Current Experimental Features

@@ -95,11 +92,8 @@ training_args = GRPOConfig(
)
```

<Tip warning={true}>

To leverage GSPO-token, the user will need to provide the per-token advantage \\( \hat{A_{i,t}} \\) for each token \\( t \\) in the sequence \\( i \\) (i.e., make \\( \hat{A_{i,t}} \\) varies with \\( t \\)—which isn't the case here, \\( \hat{A_{i,t}}=\hat{A_{i}} \\)). Otherwise, GSPO-Token gradient is just equivalent to the original GSPO implementation.

</Tip>
> [!WARNING]
> To leverage GSPO-Token, the user needs to provide a per-token advantage \\( \hat{A_{i,t}} \\) for each token \\( t \\) in the sequence \\( i \\) (i.e., make \\( \hat{A_{i,t}} \\) vary with \\( t \\), which isn't the case here, where \\( \hat{A_{i,t}}=\hat{A_{i}} \\)). Otherwise, the GSPO-Token gradient is simply equivalent to the original GSPO implementation.

### GRPO With Replay Buffer
