diff --git a/docs/source/community_tutorials.md b/docs/source/community_tutorials.md index df86bc85209..7cb1dda8a0d 100644 --- a/docs/source/community_tutorials.md +++ b/docs/source/community_tutorials.md @@ -29,13 +29,10 @@ Community tutorials are made by active members of the Hugging Face community who
⚠️ Deprecated features notice for "How to fine-tune a smol-LM with Hugging Face, TRL, and the smoltalk Dataset" (click to expand) - - -The tutorial uses two deprecated features: -- `SFTTrainer(..., tokenizer=tokenizer)`: Use `SFTTrainer(..., processing_class=tokenizer)` instead, or simply omit it (it will be inferred from the model). -- `setup_chat_format(model, tokenizer)`: Use `SFTConfig(..., chat_template_path="Qwen/Qwen3-0.6B")`, where `chat_template_path` specifies the model whose chat template you want to copy. - - +> [!WARNING] +> The tutorial uses two deprecated features: +> - `SFTTrainer(..., tokenizer=tokenizer)`: Use `SFTTrainer(..., processing_class=tokenizer)` instead, or simply omit it (it will be inferred from the model). +> - `setup_chat_format(model, tokenizer)`: Use `SFTConfig(..., chat_template_path="Qwen/Qwen3-0.6B")`, where `chat_template_path` specifies the model whose chat template you want to copy.
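To make the migration concrete, here is a minimal sketch of the tutorial's training setup with both deprecated calls replaced by their current equivalents. The model and dataset identifiers are illustrative placeholders rather than the tutorial's exact values, and a TRL release that exposes `chat_template_path` in `SFTConfig` is assumed.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

# Placeholder model and dataset; substitute the ones used in the tutorial.
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
dataset = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations", split="train")

training_args = SFTConfig(
    output_dir="smollm2-sft",  # placeholder output directory
    # Replaces setup_chat_format(model, tokenizer): copy the chat template of this model.
    chat_template_path="Qwen/Qwen3-0.6B",
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # replaces the deprecated tokenizer=tokenizer
)
trainer.train()
```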
diff --git a/docs/source/dataset_formats.md b/docs/source/dataset_formats.md index 8a105ff5e34..ac1ea6a45a2 100644 --- a/docs/source/dataset_formats.md +++ b/docs/source/dataset_formats.md @@ -289,31 +289,28 @@ prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the For examples of prompt-only datasets, refer to the [Prompt-only datasets collection](https://huggingface.co/collections/trl-lib/prompt-only-datasets-677ea25245d20252cea00368). - - -While both the prompt-only and language modeling types are similar, they differ in how the input is handled. In the prompt-only type, the prompt represents a partial input that expects the model to complete or continue, while in the language modeling type, the input is treated as a complete sentence or sequence. These two types are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each type: - -```python -from transformers import AutoTokenizer -from trl import apply_chat_template - -tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct") - -# Example for prompt-only type -prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]} -apply_chat_template(prompt_only_example, tokenizer) -# Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'} - -# Example for language modeling type -lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]} -apply_chat_template(lm_example, tokenizer) -# Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'} -``` - -- The prompt-only output includes a `'<|assistant|>\n'`, indicating the beginning of the assistant’s turn and expecting the model to generate a completion. -- In contrast, the language modeling output treats the input as a complete sequence and terminates it with `'<|endoftext|>'`, signaling the end of the text and not expecting any additional content. - - +> [!TIP] +> While both the prompt-only and language modeling types are similar, they differ in how the input is handled. In the prompt-only type, the prompt represents a partial input that expects the model to complete or continue, while in the language modeling type, the input is treated as a complete sentence or sequence. These two types are processed differently by TRL. Below is an example showing the difference in the output of the `apply_chat_template` function for each type: +> +> ```python +> from transformers import AutoTokenizer +> from trl import apply_chat_template +> +> tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct") +> +> # Example for prompt-only type +> prompt_only_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]} +> apply_chat_template(prompt_only_example, tokenizer) +> # Output: {'prompt': '<|user|>\nWhat color is the sky?<|end|>\n<|assistant|>\n'} +> +> # Example for language modeling type +> lm_example = {"messages": [{"role": "user", "content": "What color is the sky?"}]} +> apply_chat_template(lm_example, tokenizer) +> # Output: {'text': '<|user|>\nWhat color is the sky?<|end|>\n<|endoftext|>'} +> ``` +> +> - The prompt-only output includes a `'<|assistant|>\n'`, indicating the beginning of the assistant’s turn and expecting the model to generate a completion. 
+> - In contrast, the language modeling output treats the input as a complete sequence and terminates it with `'<|endoftext|>'`, signaling the end of the text and not expecting any additional content. #### Prompt-completion @@ -408,12 +405,9 @@ Choosing the right dataset type depends on the task you are working on and the s | [`SFTTrainer`] | [Language modeling](#language-modeling) or [Prompt-completion](#prompt-completion) | | [`XPOTrainer`] | [Prompt-only](#prompt-only) | - - -TRL trainers only support standard dataset formats, [for now](https://github.com/huggingface/trl/issues/2071). If you have a conversational dataset, you must first convert it into a standard format. -For more information on how to work with conversational datasets, refer to the [Working with conversational datasets in TRL](#working-with-conversational-datasets-in-trl) section. - - +> [!TIP] +> TRL trainers only support standard dataset formats, [for now](https://github.com/huggingface/trl/issues/2071). If you have a conversational dataset, you must first convert it into a standard format. +> For more information on how to work with conversational datasets, refer to the [Working with conversational datasets in TRL](#working-with-conversational-datasets-in-trl) section. ## Working with conversational datasets in TRL @@ -465,27 +459,21 @@ dataset = dataset.map(apply_chat_template, fn_kwargs={"tokenizer": tokenizer}) # 'completion': ['It is blue.<|end|>\n<|endoftext|>', 'In the sky.<|end|>\n<|endoftext|>']} ``` - - -We recommend using the [`apply_chat_template`] function instead of calling `tokenizer.apply_chat_template` directly. Handling chat templates for non-language modeling datasets can be tricky and may result in errors, such as mistakenly placing a system prompt in the middle of a conversation. -For additional examples, see [#1930 (comment)](https://github.com/huggingface/trl/pull/1930#issuecomment-2292908614). The [`apply_chat_template`] is designed to handle these intricacies and ensure the correct application of chat templates for various tasks. - - - - - -It's important to note that chat templates are model-specific. For example, if you use the chat template from [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with the above example, you get a different output: - -```python -apply_chat_template(example, AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")) -# Output: -# {'prompt': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n', -# 'completion': 'It is blue.<|im_end|>\n'} -``` - -Always use the chat template associated with the model you're working with. Using the wrong template can lead to inaccurate or unexpected results. - - +> [!WARNING] +> We recommend using the [`apply_chat_template`] function instead of calling `tokenizer.apply_chat_template` directly. Handling chat templates for non-language modeling datasets can be tricky and may result in errors, such as mistakenly placing a system prompt in the middle of a conversation. +> For additional examples, see [#1930 (comment)](https://github.com/huggingface/trl/pull/1930#issuecomment-2292908614). The [`apply_chat_template`] is designed to handle these intricacies and ensure the correct application of chat templates for various tasks. + +> [!WARNING] +> It's important to note that chat templates are model-specific. 
For example, if you use the chat template from [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with the above example, you get a different output: +> +> ```python +> apply_chat_template(example, AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")) +> # Output: +> # {'prompt': '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n', +> # 'completion': 'It is blue.<|im_end|>\n'} +> ``` +> +> Always use the chat template associated with the model you're working with. Using the wrong template can lead to inaccurate or unexpected results. ## Using any dataset with TRL: preprocessing and conversion @@ -715,13 +703,10 @@ dataset = unpair_preference_dataset(dataset) 'label': True} ``` - - -Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad. -Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad. -This can be ensured by checking absolute rating of each completion, e.g. from a reward model. - - +> [!WARNING] +> Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad. +> Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad. +> This can be ensured by checking absolute rating of each completion, e.g. from a reward model. ### From preference to language modeling dataset @@ -856,13 +841,10 @@ dataset = unpair_preference_dataset(dataset) 'label': True} ``` - - -Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad. -Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad. -This can be ensured by checking absolute rating of each completion, e.g. from a reward model. - - +> [!WARNING] +> Keep in mind that the `"chosen"` and `"rejected"` completions in a preference dataset can be both good or bad. +> Before applying [`unpair_preference_dataset`], please ensure that all `"chosen"` completions can be labeled as good and all `"rejected"` completions as bad. +> This can be ensured by checking absolute rating of each completion, e.g. from a reward model. ### From unpaired preference to language modeling dataset diff --git a/docs/source/deepspeed_integration.md b/docs/source/deepspeed_integration.md index 0f6980656a3..a605787972e 100644 --- a/docs/source/deepspeed_integration.md +++ b/docs/source/deepspeed_integration.md @@ -1,10 +1,7 @@ # DeepSpeed Integration - - -Section under construction. Feel free to contribute! - - +> [!WARNING] +> Section under construction. Feel free to contribute! TRL supports training with DeepSpeed, a library that implements advanced training optimization techniques. These include optimizer state partitioning, offloading, gradient partitioning, and more. diff --git a/docs/source/distributing_training.md b/docs/source/distributing_training.md index 88d06f58fad..fc187eb40b3 100644 --- a/docs/source/distributing_training.md +++ b/docs/source/distributing_training.md @@ -1,8 +1,7 @@ # Distributing Training - -Section under construction. Feel free to contribute! - +> [!WARNING] +> Section under construction. Feel free to contribute! 
## Multi-GPU Training with TRL @@ -49,11 +48,8 @@ Example, these configurations are equivalent, and should yield the same results: | 1 | 4 | 8 | Lower memory usage, slower training | | 8 | 4 | 1 | Multi-GPU to get the best of both worlds | - - -Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration) guide for more details. - - +> [!TIP] +> Having one model per GPU can lead to high memory usage, which may not be feasible for large models or low-memory GPUs. In such cases, you can leverage [DeepSpeed](https://github.com/deepspeedai/DeepSpeed), which provides optimizations like model sharding, Zero Redundancy Optimizer, mixed precision training, and offloading to CPU or NVMe. Check out our [DeepSpeed Integration](deepspeed_integration) guide for more details. ## Context Parallelism @@ -176,13 +172,10 @@ These results show that **Context Parallelism (CP) scales effectively with more CP seconds/iteration - - -Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs. - -You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism). - - +> [!TIP] +> Accelerate also supports **N-Dimensional Parallelism (ND-parallelism)**, which enables you to combine different parallelization strategies to efficiently distribute model training across multiple GPUs. +> +> You can learn more and explore configuration examples in the [Accelerate ND-parallelism guide](https://github.com/huggingface/accelerate/blob/main/examples/torch_native_parallelism/README.md#nd-parallelism). **Further Reading on Context Parallelism** diff --git a/docs/source/experimental.md b/docs/source/experimental.md index b00b89efc82..595084c1aca 100644 --- a/docs/source/experimental.md +++ b/docs/source/experimental.md @@ -2,11 +2,8 @@ The `trl.experimental` namespace provides a minimal, clearly separated space for fast iteration on new ideas. - - -**Stability contract:** Anything under `trl.experimental` may change or be removed in *any* release (including patch versions) without prior deprecation. Do not rely on these APIs for production workloads. - - +> [!WARNING] +> **Stability contract:** Anything under `trl.experimental` may change or be removed in *any* release (including patch versions) without prior deprecation. Do not rely on these APIs for production workloads. ## Current Experimental Features @@ -95,11 +92,8 @@ training_args = GRPOConfig( ) ``` - - -To leverage GSPO-token, the user will need to provide the per-token advantage \\( \hat{A_{i,t}} \\) for each token \\( t \\) in the sequence \\( i \\) (i.e., make \\( \hat{A_{i,t}} \\) varies with \\( t \\)—which isn't the case here, \\( \hat{A_{i,t}}=\hat{A_{i}} \\)). Otherwise, GSPO-Token gradient is just equivalent to the original GSPO implementation. 
- - +> [!WARNING] +> To leverage GSPO-token, the user will need to provide the per-token advantage \\( \hat{A_{i,t}} \\) for each token \\( t \\) in the sequence \\( i \\) (i.e., make \\( \hat{A_{i,t}} \\) vary with \\( t \\)—which isn't the case here, \\( \hat{A_{i,t}}=\hat{A_{i}} \\)). Otherwise, the GSPO-Token gradient is simply equivalent to the original GSPO implementation. ### GRPO With Replay Buffer diff --git a/docs/source/grpo_trainer.md b/docs/source/grpo_trainer.md index f53e4e2ced8..a8d058d4194 100644 --- a/docs/source/grpo_trainer.md +++ b/docs/source/grpo_trainer.md @@ -76,17 +76,12 @@ $$\hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}$$ This approach gives the method its name: **Group Relative Policy Optimization (GRPO)**. - +> [!TIP] +> It was shown in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that scaling by \\( \text{std}(\mathbf{r}) \\) may cause a question-level difficulty bias. You can disable this scaling by setting `scale_rewards=False` in [`GRPOConfig`]. -It was shown in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that scaling by \\( \text{std}(\mathbf{r}) \\) may cause a question-level difficulty bias. You can disable this scaling by setting `scale_rewards=False` in [`GRPOConfig`]. - - - - - -[Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)](https://huggingface.co/papers/2508.08221) showed that calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping. You can use this scaling strategy by setting `scale_rewards="batch"` in [`GRPOConfig`]. - - +> [!TIP] +> +> [Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)](https://huggingface.co/papers/2508.08221) showed that calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping. You can use this scaling strategy by setting `scale_rewards="batch"` in [`GRPOConfig`]. ### Estimating the KL divergence @@ -105,17 +100,11 @@ $$ where the first term represents the scaled advantage and the second term penalizes deviations from the reference policy through KL divergence. - - -Note that compared to the original formulation in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300), we don't scale by \\( \frac{1}{|o_i|} \\) because it was shown in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that this introduces a response-level length bias. More details in [loss types](#loss-types). - - - - +> [!TIP] +> Note that compared to the original formulation in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300), we don't scale by \\( \frac{1}{|o_i|} \\) because it was shown in the paper [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783) that this introduces a response-level length bias. More details in [loss types](#loss-types). -Note that compared to the original formulation in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300), we use \\( \beta = 0.0 \\) by default, meaning that the KL divergence term is not used.
This choice is motivated by several recent studies (e.g., [Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model](https://huggingface.co/papers/2503.24290)) which have shown that the KL divergence term is not essential for training with GRPO. As a result, it has become common practice to exclude it (e.g. [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783), [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://huggingface.co/papers/2503.14476)). If you wish to include the KL divergence term, you can set `beta` in [`GRPOConfig`] to a non-zero value. - - +> [!TIP] +> Note that compared to the original formulation in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300), we use \\( \beta = 0.0 \\) by default, meaning that the KL divergence term is not used. This choice is motivated by several recent studies (e.g., [Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model](https://huggingface.co/papers/2503.24290)) which have shown that the KL divergence term is not essential for training with GRPO. As a result, it has become common practice to exclude it (e.g. [Understanding R1-Zero-Like Training: A Critical Perspective](https://huggingface.co/papers/2503.20783), [DAPO: An Open-Source LLM Reinforcement Learning System at Scale](https://huggingface.co/papers/2503.14476)). If you wish to include the KL divergence term, you can set `beta` in [`GRPOConfig`] to a non-zero value. In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted \\( \mu \\), can be set with `num_iterations` in [`GRPOConfig`]) by leveraging the **clipped surrogate objective**: @@ -198,11 +187,8 @@ pip install trl[vllm] We support two ways of using vLLM during training: **server mode** and **colocate mode**. - - -By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling) - - +> [!TIP] +> By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting `vllm_importance_sampling_correction=False`. For more information, see [Truncated Importance Sampling](paper_index#truncated-importance-sampling) #### 🔌 Option 1: Server mode @@ -224,11 +210,8 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and comm ) ``` - - -Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable. - - +> [!WARNING] +> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable. 
#### 🧩 Option 2: Colocate mode @@ -244,30 +227,24 @@ training_args = GRPOConfig( ) ``` - - -Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors. - -We provide a [HF Space](https://huggingface.co/spaces/trl-lib/recommend-vllm-memory) to help estimate the recommended GPU memory utilization based on your model configuration and experiment settings. Simply use it as follows to get `vllm_gpu_memory_utilization` recommendation: - - - -If the recommended value does not work in your environment, we suggest adding a small buffer (e.g., +0.05 or +0.1) to the recommended value to ensure stability. - -If you still find you are getting out-of-memory errors set `vllm_enable_sleep_mode` to True and the vllm parameters and cache will be offloaded during the optimization step. For more information, see [Reducing Memory Usage with vLLM Sleep Mode](reducing_memory_usage#vllm-sleep-mode). - - - - - -By default, GRPO uses `MASTER_ADDR=localhost` and `MASTER_PORT=12345` for vLLM, but you can override these values by setting the environment variables accordingly. - - +> [!TIP] +> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`GRPOConfig`] to avoid underutilization or out-of-memory errors. +> +> We provide a [HF Space](https://huggingface.co/spaces/trl-lib/recommend-vllm-memory) to help estimate the recommended GPU memory utilization based on your model configuration and experiment settings. Simply use it as follows to get `vllm_gpu_memory_utilization` recommendation: +> +> +> +> If the recommended value does not work in your environment, we suggest adding a small buffer (e.g., +0.05 or +0.1) to the recommended value to ensure stability. +> +> If you still find you are getting out-of-memory errors set `vllm_enable_sleep_mode` to True and the vllm parameters and cache will be offloaded during the optimization step. For more information, see [Reducing Memory Usage with vLLM Sleep Mode](reducing_memory_usage#vllm-sleep-mode). + +> [!TIP] +> By default, GRPO uses `MASTER_ADDR=localhost` and `MASTER_PORT=12345` for vLLM, but you can override these values by setting the environment variables accordingly. For more information, see [Speeding up training with vLLM](speeding_up_training#vllm-for-fast-generation-in-online-methods). @@ -563,11 +540,8 @@ Tested with: - **Qwen2.5-VL** — e.g., `Qwen/Qwen2.5-VL-3B-Instruct` - **SmolVLM2** — e.g., `HuggingFaceTB/SmolVLM2-2.2B-Instruct` - - -Compatibility with all VLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes. - - +> [!TIP] +> Compatibility with all VLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes. ### Quick Start @@ -593,11 +567,8 @@ accelerate launch \ ### Configuration Tips - - -VLM training may fail if image tokens are truncated. We highly recommend disabling truncation by setting `max_prompt_length` to `None`. - - +> [!WARNING] +> VLM training may fail if image tokens are truncated. We highly recommend disabling truncation by setting `max_prompt_length` to `None`. 
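For reference, a minimal sketch of a [`GRPOConfig`] with truncation disabled as recommended above (the output directory is a placeholder and every other option is left at its default):

```python
from trl import GRPOConfig

# Disable prompt truncation so image tokens are never cut off; expect higher
# memory usage for long multimodal prompts.
training_args = GRPOConfig(
    output_dir="grpo-vlm",  # placeholder
    max_prompt_length=None,
)
```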
- Use LoRA on vision-language projection layers - Enable 4-bit quantization to reduce memory usage diff --git a/docs/source/jobs_training.md b/docs/source/jobs_training.md index b5d871158e1..0cf49ce0e4d 100644 --- a/docs/source/jobs_training.md +++ b/docs/source/jobs_training.md @@ -83,14 +83,11 @@ To run successfully, the script needs: * **TRL installed**: Use the `--with trl` flag or the `dependencies` argument. uv installs these dependencies automatically before running the script. * **An authentication token**: Required to push the trained model (or perform other authenticated operations). Provide it with the `--secrets HF_TOKEN` flag or the `secrets` argument. - - -When training with Jobs, be sure to: - -* **Set a sufficient timeout**. Jobs time out after 30 minutes by default. If your job exceeds the timeout, it will fail and all progress will be lost. See [Setting a custom timeout](https://huggingface.co/docs/huggingface_hub/guides/jobs#setting-a-custom-timeout). -* **Push the model to the Hub**. The Jobs environment is ephemeral—files are deleted when the job ends. If you don’t push the model, it will be lost. - - +> [!WARNING] +> When training with Jobs, be sure to: +> +> * **Set a sufficient timeout**. Jobs time out after 30 minutes by default. If your job exceeds the timeout, it will fail and all progress will be lost. See [Setting a custom timeout](https://huggingface.co/docs/huggingface_hub/guides/jobs#setting-a-custom-timeout). +> * **Push the model to the Hub**. The Jobs environment is ephemeral—files are deleted when the job ends. If you don’t push the model, it will be lost. You can also run a script directly from a URL: @@ -175,8 +172,6 @@ run_uv_job( - - TRL example scripts are fully uv-compatible, so you can run a complete training workflow directly on Jobs. You can customize training with standard script arguments plus hardware and secrets: @@ -198,7 +193,6 @@ hf jobs uv run \ ```python from huggingface_hub import run_uv_job - run_uv_job( "https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/examples/scripts/prm.py", flavor="a100-large", @@ -214,11 +208,8 @@ run_uv_job( - See the full list of examples in [Maintained examples](example_overview#maintained-examples). - - ### Docker Images An up-to-date Docker image with all TRL dependencies is available at [huggingface/trl](https://hub.docker.com/r/huggingface/trl) and can be used directly with Hugging Face Jobs: diff --git a/docs/source/judges.md b/docs/source/judges.md index d3fd1634161..1f3d0a0ab28 100644 --- a/docs/source/judges.md +++ b/docs/source/judges.md @@ -1,10 +1,7 @@ # Judges - - -TRL Judges is an experimental API which is subject to change at any time. - - +> [!WARNING] +> TRL Judges is an experimental API which is subject to change at any time. TRL provides judges to easily compare two completions. diff --git a/docs/source/kernels_hub.md b/docs/source/kernels_hub.md index 50a8195e0fa..3d4f79e21c2 100644 --- a/docs/source/kernels_hub.md +++ b/docs/source/kernels_hub.md @@ -43,11 +43,8 @@ Or using the TRL CLI: trl sft ... --attn_implementation kernels-community/flash-attn ``` - - -Now you can leverage faster attention backends with a pre-optimized kernel for your hardware configuration from the Hub, speeding up both development and training. - - +> [!TIP] +> Now you can leverage faster attention backends with a pre-optimized kernel for your hardware configuration from the Hub, speeding up both development and training. 
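As an illustration of the same idea when the trainer instantiates the model itself, the sketch below forwards the Hub kernel through `model_init_kwargs`. The model and dataset identifiers are placeholders, and it assumes the `kernels` package is installed so the attention implementation can be fetched from the Hub.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

training_args = SFTConfig(
    output_dir="sft-hub-kernel",  # placeholder
    # Forward the Hub-hosted attention kernel to the model constructor.
    model_init_kwargs={"attn_implementation": "kernels-community/flash-attn"},
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # placeholder model
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```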
## Comparing Attention Implementations diff --git a/docs/source/liger_kernel_integration.md b/docs/source/liger_kernel_integration.md index a5c71d85c20..db2129c516f 100644 --- a/docs/source/liger_kernel_integration.md +++ b/docs/source/liger_kernel_integration.md @@ -1,10 +1,7 @@ # Liger Kernel Integration - - -Section under construction. Feel free to contribute! - - +> [!WARNING] +> Section under construction. Feel free to contribute! [Liger Kernel](https://github.com/linkedin/Liger-Kernel) is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU training throughput by 20% and reduce memory usage by 60%. That way, we can **4x** our context length, as described in the benchmark below. They have implemented Hugging Face compatible `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, with more to come. The kernel works out of the box with [FlashAttention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed). diff --git a/docs/source/nash_md_trainer.md b/docs/source/nash_md_trainer.md index ac097a3a28a..941e25cccbe 100644 --- a/docs/source/nash_md_trainer.md +++ b/docs/source/nash_md_trainer.md @@ -85,11 +85,8 @@ Instead of a judge, you can chose to use a reward model -- see [Reward Bench](ht ) ``` - - -Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training. - - +> [!WARNING] +> Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training. ### Encourage EOS token generation diff --git a/docs/source/paper_index.md b/docs/source/paper_index.md index 142f9c09462..3ddfe06f72f 100644 --- a/docs/source/paper_index.md +++ b/docs/source/paper_index.md @@ -1,10 +1,7 @@ # Paper Index - - -Section under construction. Feel free to contribute! - - +> [!WARNING] +> Section under construction. Feel free to contribute! ## Group Relative Policy Optimization diff --git a/docs/source/prm_trainer.md b/docs/source/prm_trainer.md index fe69f838b56..ad411488ea0 100644 --- a/docs/source/prm_trainer.md +++ b/docs/source/prm_trainer.md @@ -2,11 +2,8 @@ [![](https://img.shields.io/badge/All_models-PRM-blue)](https://huggingface.co/models?other=prm,trl) - - -PRM Trainer is an experimental API which is subject to change at any time. - - +> [!WARNING] +> PRM Trainer is an experimental API which is subject to change at any time. ## Overview diff --git a/docs/source/reducing_memory_usage.md b/docs/source/reducing_memory_usage.md index 8df2d1dd6d5..b268ee4475c 100644 --- a/docs/source/reducing_memory_usage.md +++ b/docs/source/reducing_memory_usage.md @@ -1,10 +1,7 @@ # Reducing Memory Usage - - -Section under construction. Feel free to contribute! - - +> [!WARNING] +> Section under construction. Feel free to contribute! ## Truncation @@ -71,11 +68,8 @@ To help you choose an appropriate value, we provide a utility to visualize the s ## Packing - - -This technique applies only to SFT. - - +> [!TIP] +> This technique applies only to SFT. [Truncation](#truncation) has several drawbacks: @@ -90,11 +84,8 @@ Packing, introduced in [Raffel et al., 2020](https://huggingface.co/papers/1910. Packing reduces padding by merging several sequences in one row when possible. 
We use an advanced method to be near-optimal in the way we pack the dataset. To enable packing, use `packing=True` in the [`SFTConfig`]. - - -In TRL 0.18 and earlier, packing used a more aggressive method that reduced padding to almost nothing, but had the downside of breaking sequence continuity for a large fraction of the dataset. To revert to this strategy, use `packing_strategy="wrapped"` in `SFTConfig`. - - +> [!TIP] +> In TRL 0.18 and earlier, packing used a more aggressive method that reduced padding to almost nothing, but had the downside of breaking sequence continuity for a large fraction of the dataset. To revert to this strategy, use `packing_strategy="wrapped"` in `SFTConfig`. ```python from trl import SFTConfig @@ -102,11 +93,8 @@ from trl import SFTConfig training_args = SFTConfig(..., packing=True, max_length=512) ``` - - -Packing may cause batch contamination, where adjacent sequences influence one another. This can be problematic for some applications. For more details, see [#1230](https://github.com/huggingface/trl/issues/1230). - - +> [!WARNING] +> Packing may cause batch contamination, where adjacent sequences influence one another. This can be problematic for some applications. For more details, see [#1230](https://github.com/huggingface/trl/issues/1230). ## Liger for reducing peak memory usage @@ -158,11 +146,8 @@ Padding-free batching is an alternative approach for reducing memory usage. In t Padding-free batching - - -It's highly recommended to use padding-free batching with **FlashAttention 2** or **FlashAttention 3**. Otherwise, you may encounter batch contamination issues. - - +> [!WARNING] +> It's highly recommended to use padding-free batching with **FlashAttention 2** or **FlashAttention 3**. Otherwise, you may encounter batch contamination issues. @@ -197,21 +182,19 @@ from trl import SFTConfig training_args = SFTConfig(..., activation_offloading=True) ``` - - -When using activation offloading with models that use Liger kernels, you must disable Liger cross entropy due to compatibility issues. The issue occurs specifically with `use_liger_kernel=True` because Liger cross entropy performs in-place operations which conflict with activation offloading. The default setting (`use_liger_kernel=False`) works: - -```python -# When using activation offloading with a model that uses Liger kernels: -from trl import SFTConfig - -training_args = SFTConfig( - activation_offloading=True, - use_liger_kernel=False, # Disable Liger cross entropy - # Other parameters... -) -``` - +> [!WARNING] +> When using activation offloading with models that use Liger kernels, you must disable Liger cross entropy due to compatibility issues. The issue occurs specifically with `use_liger_kernel=True` because Liger cross entropy performs in-place operations which conflict with activation offloading. The default setting (`use_liger_kernel=False`) works: +> +> ```python +> # When using activation offloading with a model that uses Liger kernels: +> from trl import SFTConfig +> +> training_args = SFTConfig( +> activation_offloading=True, +> use_liger_kernel=False, # Disable Liger cross entropy +> # Other parameters... +> ) +> ``` Under the hood, activation offloading implements PyTorch's [`saved_tensors_hooks`](https://pytorch.org/tutorials/intermediate/autograd_saved_tensors_hooks_tutorial.html#hooks-for-autograd-saved-tensors) to intercept activations during the forward pass. 
It intelligently manages which tensors to offload based on size and context, avoiding offloading output tensors which would be inefficient. For performance optimization, it can optionally use CUDA streams to overlap computation with CPU-GPU transfers. diff --git a/docs/source/rloo_trainer.md b/docs/source/rloo_trainer.md index e00bd7ece25..891a0bcb0f0 100644 --- a/docs/source/rloo_trainer.md +++ b/docs/source/rloo_trainer.md @@ -84,23 +84,20 @@ $$ where \\( \beta > 0 \\) controls the strength of the KL penalty. - - -In a purely online setting (`num_iterations = 1`, default), the data are generated by the current policy. In this case, the KL penalty is computed directly using the current policy. -In the more general setting (e.g., multiple gradient steps per batch), the data are instead generated by an earlier snapshot \\( \pi_{\text{old}} \\). To keep the penalty consistent with the sampling distribution, the KL is defined with respect to this policy: - -$$ -\mathbb{D}_{\mathrm{KL}}\!\left[\pi_{\text{old}} \,\|\, \pi_{\text{ref}}\right]. -$$ - -Equivalently, for a sampled sequence $o$, the Monte Carlo estimate is - -$$ -\mathbb{D}_{\mathrm{KL}}\!\left[\pi_{\text{old}} \|\pi_{\mathrm{ref}}\right] = \sum_{t=1}^T \log \frac{\pi_{\text{old}}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})} -$$ - - +> [!TIP] +> In a purely online setting (`num_iterations = 1`, default), the data are generated by the current policy. In this case, the KL penalty is computed directly using the current policy. +> +> In the more general setting (e.g., multiple gradient steps per batch), the data are instead generated by an earlier snapshot \\( \pi_{\text{old}} \\). To keep the penalty consistent with the sampling distribution, the KL is defined with respect to this policy: +> +> $$ +> \mathbb{D}_{\mathrm{KL}}\!\left[\pi_{\text{old}} \,\|\, \pi_{\text{ref}}\right]. +> $$ +> +> Equivalently, for a sampled sequence $o$, the Monte Carlo estimate is +> +> $$ +> \mathbb{D}_{\mathrm{KL}}\!\left[\pi_{\text{old}} \|\pi_{\mathrm{ref}}\right] = \sum_{t=1}^T \log \frac{\pi_{\text{old}}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\mathrm{ref}}(o_{i,t} \mid q, o_{i,<t})} +> $$ ### Computing the advantage @@ -195,11 +192,8 @@ In this mode, vLLM runs in a separate process (and using separate GPUs) and comm ) ``` - - -Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable. - - +> [!WARNING] +> Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the `CUDA_VISIBLE_DEVICES` environment variable. #### 🧩 Option 2: Colocate mode @@ -215,30 +209,24 @@ training_args = RLOOConfig( ) ``` - - -Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors. - -We provide a [HF Space](https://huggingface.co/spaces/trl-lib/recommend-vllm-memory) to help estimate the recommended GPU memory utilization based on your model configuration and experiment settings. Simply use it as follows to get `vllm_gpu_memory_utilization` recommendation: - - - -If the recommended value does not work in your environment, we suggest adding a small buffer (e.g., +0.05 or +0.1) to the recommended value to ensure stability.
- -If you still find you are getting out-of-memory errors set `vllm_enable_sleep_mode` to True and the vllm parameters and cache will be offloaded during the optimization step. For more information, see [Reducing Memory Usage with vLLM Sleep Mode](reducing_memory_usage#vllm-sleep-mode). - - - - - -By default, RLOO uses `MASTER_ADDR=localhost` and `MASTER_PORT=12345` for vLLM, but you can override these values by setting the environment variables accordingly. - - +> [!TIP] +> Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the `vllm_gpu_memory_utilization` parameter in [`RLOOConfig`] to avoid underutilization or out-of-memory errors. +> +> We provide a [HF Space](https://huggingface.co/spaces/trl-lib/recommend-vllm-memory) to help estimate the recommended GPU memory utilization based on your model configuration and experiment settings. Simply use it as follows to get `vllm_gpu_memory_utilization` recommendation: +> +> +> +> If the recommended value does not work in your environment, we suggest adding a small buffer (e.g., +0.05 or +0.1) to the recommended value to ensure stability. +> +> If you still find you are getting out-of-memory errors set `vllm_enable_sleep_mode` to True and the vllm parameters and cache will be offloaded during the optimization step. For more information, see [Reducing Memory Usage with vLLM Sleep Mode](reducing_memory_usage#vllm-sleep-mode). + +> [!TIP] +> By default, RLOO uses `MASTER_ADDR=localhost` and `MASTER_PORT=12345` for vLLM, but you can override these values by setting the environment variables accordingly. For more information, see [Speeding up training with vLLM](speeding_up_training#vllm-for-fast-generation-in-online-methods). @@ -534,11 +522,8 @@ Tested with: - **Qwen2.5-VL** — e.g., `Qwen/Qwen2.5-VL-3B-Instruct` - **SmolVLM2** — e.g., `HuggingFaceTB/SmolVLM2-2.2B-Instruct` - - -Compatibility with all VLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes. - - +> [!TIP] +> Compatibility with all VLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes. ### Quick Start @@ -564,11 +549,8 @@ accelerate launch \ ### Configuration Tips - - -VLM training may fail if image tokens are truncated. We highly recommend disabling truncation by setting `max_prompt_length` to `None`. - - +> [!WARNING] +> VLM training may fail if image tokens are truncated. We highly recommend disabling truncation by setting `max_prompt_length` to `None`. - Use LoRA on vision-language projection layers - Enable 4-bit quantization to reduce memory usage diff --git a/docs/source/sft_trainer.md b/docs/source/sft_trainer.md index bf5bdffdd68..76db2512bbc 100644 --- a/docs/source/sft_trainer.md +++ b/docs/source/sft_trainer.md @@ -105,11 +105,9 @@ $$ where \\( y_t \\) is the target token at timestep \\( t \\), and the model is trained to predict the next token given the previous ones. In practice, padding tokens are masked out during loss computation. - - -[On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification](https://huggingface.co/papers/2508.05629) proposes an alternative loss function, called **Dynamic Fine-Tuning (DFT)**, which aims to improve generalization by rectifying the reward signal. 
This method can be enabled by setting `loss_type="dft"` in the [`SFTConfig`]. For more details, see [Paper Index - Dynamic Fine-Tuning](paper_index#on-the-generalization-of-sft-a-reinforcement-learning-perspective-with-reward-rectification). - - +> [!TIP] +> +> [On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification](https://huggingface.co/papers/2508.05629) proposes an alternative loss function, called **Dynamic Fine-Tuning (DFT)**, which aims to improve generalization by rectifying the reward signal. This method can be enabled by setting `loss_type="dft"` in the [`SFTConfig`]. For more details, see [Paper Index - Dynamic Fine-Tuning](paper_index#on-the-generalization-of-sft-a-reinforcement-learning-perspective-with-reward-rectification). ### Label shifting and masking @@ -180,11 +178,8 @@ To train on completion only, use a [prompt-completion](dataset_formats#prompt-co ![train_on_completion](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/train_on_completion.png) - - -Training on completion only is compatible with training on assistant messages only. In this case, use a [conversational](dataset_formats#conversational) [prompt-completion](dataset_formats#prompt-completion) dataset and set `assistant_only_loss=True` in the [`SFTConfig`]. - - +> [!TIP] +> Training on completion only is compatible with training on assistant messages only. In this case, use a [conversational](dataset_formats#conversational) [prompt-completion](dataset_formats#prompt-completion) dataset and set `assistant_only_loss=True` in the [`SFTConfig`]. ### Train adapters with PEFT @@ -224,15 +219,12 @@ trainer = SFTTrainer( trainer.train() ``` - - -When training adapters, you typically use a higher learning rate (≈1e‑4) since only new parameters are being learned. - -```python -SFTConfig(learning_rate=1e-4, ...) -``` - - +> [!TIP] +> When training adapters, you typically use a higher learning rate (≈1e‑4) since only new parameters are being learned. +> +> ```python +> SFTConfig(learning_rate=1e-4, ...) +> ``` ### Train with Liger Kernel @@ -315,17 +307,14 @@ trainer = SFTTrainer( trainer.train() ``` - - -For VLMs, truncating may remove image tokens, leading to errors during training. To avoid this, set `max_length=None` in the [`SFTConfig`]. This allows the model to process the full sequence length without truncating image tokens. - -```python -SFTConfig(max_length=None, ...) -``` - -Only use `max_length` when you've verified that truncation won't remove image tokens for the entire dataset. - - +> [!TIP] +> For VLMs, truncating may remove image tokens, leading to errors during training. To avoid this, set `max_length=None` in the [`SFTConfig`]. This allows the model to process the full sequence length without truncating image tokens. +> +> ```python +> SFTConfig(max_length=None, ...) +> ``` +> +> Only use `max_length` when you've verified that truncation won't remove image tokens for the entire dataset. ## SFTTrainer diff --git a/docs/source/speeding_up_training.md b/docs/source/speeding_up_training.md index 57586295f8f..e6da1d18dc4 100644 --- a/docs/source/speeding_up_training.md +++ b/docs/source/speeding_up_training.md @@ -1,10 +1,7 @@ # Speeding Up Training - - -Section under construction. Feel free to contribute! - - +> [!WARNING] +> Section under construction. Feel free to contribute! 
## vLLM for fast generation in online methods @@ -47,21 +44,18 @@ training_args = GRPOConfig(..., use_vllm=True) You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration). - - -When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`. - -Set GPUs **0-3** for vLLM generation: -```sh -CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model -``` - -And GPUs **4-7** for training: -```sh -CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py -``` - - +> [!WARNING] +> When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`. +> +> Set GPUs **0-3** for vLLM generation: +> ```sh +> CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model +> ``` +> +> And GPUs **4-7** for training: +> ```sh +> CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py +> ``` @@ -82,21 +76,18 @@ training_args = RLOOConfig(..., use_vllm=True) You can customize the server configuration by passing additional arguments. For more information, see [vLLM integration](vllm_integration). - - -When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`. - -Set GPUs **0-3** for vLLM generation: -```sh -CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model -``` - -And GPUs **4-7** for training: -```sh -CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py -``` - - +> [!WARNING] +> When using vLLM, ensure that the GPUs assigned for training and generation are separate to avoid resource conflicts. For instance, if you plan to use 4 GPUs for training and another 4 for vLLM generation, you can specify GPU allocation using `CUDA_VISIBLE_DEVICES`. +> +> Set GPUs **0-3** for vLLM generation: +> ```sh +> CUDA_VISIBLE_DEVICES=0,1,2,3 trl vllm-serve --model +> ``` +> +> And GPUs **4-7** for training: +> ```sh +> CUDA_VISIBLE_DEVICES=4,5,6,7 accelerate launch train.py +> ``` diff --git a/docs/source/vllm_integration.md b/docs/source/vllm_integration.md index 9240aed62ce..b2838215d4c 100644 --- a/docs/source/vllm_integration.md +++ b/docs/source/vllm_integration.md @@ -2,19 +2,15 @@ This document will guide you through the process of using vLLM with TRL for faster generation in online methods like GRPO and Online DPO. We first summarize a tl;dr on how to use vLLM with TRL, and then we will go into the details of how it works under the hood. Let's go! 🔥 - - -TRL currently only supports vLLM versions `0.10.0`, `0.10.1`, and `0.10.2`. Please ensure you have one of these versions installed to avoid compatibility issues. - - +> [!WARNING] +> TRL currently only supports vLLM versions `0.10.0`, `0.10.1`, and `0.10.2`. Please ensure you have one of these versions installed to avoid compatibility issues. ## 🚀 How can I use vLLM with TRL to speed up training? 💡 **Note**: Resources required for this specific example: a single node with 8 GPUs. - -vLLM server and TRL trainer must use different CUDA devices to avoid conflicts. 
- +> [!WARNING] +> vLLM server and TRL trainer must use different CUDA devices to avoid conflicts. First, install vLLM using the following command: diff --git a/docs/source/xpo_trainer.md b/docs/source/xpo_trainer.md index 7bee60400ea..3c9fd69a2b1 100644 --- a/docs/source/xpo_trainer.md +++ b/docs/source/xpo_trainer.md @@ -84,11 +84,8 @@ Instead of a judge, you can chose to use a reward model -- see [Reward Bench](ht ) ``` - - -Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training. - - +> [!WARNING] +> Make sure that the SFT model and reward model use the _same_ chat template and the same tokenizer. Otherwise, you may find the model completions are scored incorrectly during training. ### Encourage EOS token generation diff --git a/trl/trainer/judges.py b/trl/trainer/judges.py index 9be9394bb6d..5c8c80c3726 100644 --- a/trl/trainer/judges.py +++ b/trl/trainer/judges.py @@ -206,11 +206,8 @@ class PairRMJudge(BasePairwiseJudge): >>> print(results) # [0, 1] (indicating the first completion is preferred for the first prompt and the second) ``` - - - This class requires the llm-blender library to be installed. Install it with: `pip install llm-blender`. - - + > [!TIP] + > This class requires the llm-blender library to be installed. Install it with: `pip install llm-blender`. """ def __init__(self):