2 changes: 1 addition & 1 deletion books/nlp/src/SUMMARY.md
@@ -30,7 +30,7 @@
- [Hard prompts]() <!-- (llms/prompting/hard.md) -->
- [Fine-tuning]() <!-- (llms/fine_tuning/README.md) -->
- [Supervised Fine-Tuning]() <!-- (llms/fine_tuning/sft.md) -->
- [RLHF]() <!-- (llms/fine_tuning/rlhf.md) -->
- [RLHF](llms/fine_tuning/rlhf.md)
- [DPO]() <!-- (llms/fine_tuning/dpo.md) -->
- [GRPO]() <!-- (llms/fine_tuning/grpo.md) -->
- [PEFT]() <!-- (llms/fine_tuning/peft.md) -->
8 changes: 4 additions & 4 deletions books/nlp/src/llms/efficient_inference/kv_cache.md
@@ -182,12 +182,12 @@ simultaneously or working with very long context windows.

#### References & Useful Links <!-- markdownlint-disable-line MD001 -->

1. [Liu, Zirui, et al. "Kivi: A tuning-free asymmetric 2bit quantization for kv
cache." arXiv preprint arXiv:2402.02750 (2024).](https://arxiv.org/pdf/2402.02750)
1. [_Liu, Zirui, et al. "Kivi: A tuning-free asymmetric 2bit quantization for kv
cache." arXiv preprint arXiv:2402.02750 (2024)._](https://arxiv.org/pdf/2402.02750)
1. [_Raschka, Sebastian. Build a Large Language Model (From Scratch). Simon and
Schuster, 2024._](https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167)
1. [Rajan, R. "KV Cache - Understanding the Mechanism behind it." R4J4N Blogs,
r4j4n.github.io/blogs/posts/kv/. Accessed 27 Feb. 2025.](https://r4j4n.github.io/blogs/posts/kv/)
1. [_Rajan, R. "KV Cache - Understanding the Mechanism behind it." R4J4N Blogs,
r4j4n.github.io/blogs/posts/kv/. Accessed 27 Feb. 2025._](https://r4j4n.github.io/blogs/posts/kv/)

<!-- Contributions -->

93 changes: 93 additions & 0 deletions books/nlp/src/llms/fine_tuning/rlhf.md
@@ -1 +1,94 @@
<!-- markdownlint-disable-file MD033 -->

# RLHF

<!-- Header -->

{{ #aipr_header }}

<!-- Main Body -->

## Intro and Motivation for RLHF

Through pre-training on a large corpus of text, an LLM acquires foundational
knowledge and is able to perform several tasks remarkably well. After pre-training,
we can fine-tune the LLM to improve its performance on a particular domain of text
and/or a particular task (e.g., instruction fine-tuning). However, such fine-tuning
does not address the fact that language admits many possible responses to the same
prompt, some of which are preferred over others (e.g., because they are less harmful).
The remaining question is how we can further adjust the LLM so that it responds in
the ways that are preferred.

Concretely, this phase of post-training is called _Alignment_, where the objective
is to adjust the LLM so that it responds in the preferred manner. Preferences range
from more serious concerns, such as producing less toxic or harmful output, to more
stylistic ones, such as tone, verbosity, and playfulness.

One of the better-known methods for alignment fine-tuning is Reinforcement Learning
from Human Feedback (RLHF), which is the main topic of this pocket reference. In the
next sections, we provide a brief overview of what RLHF involves and its limitations.

## Reinforcement Learning for Alignment of LLMs

Just how reinforcement learning techniques apply to LLMs may not be immediately
apparent. In this section, we clarify specifically how these techniques can be
leveraged for alignment fine-tuning of LLMs.

### A brief primer on reinforcement learning

In reinforcement learning, there is an environment and a so-called agent that
interacts with it. The agent observes the current state of the environment and takes
an action, which in turn alters the environment's state and produces a quantifiable
reward for the agent. The main objective of reinforcement learning is to maximize
the cumulative reward by learning an optimal policy, that is, a strategy for choosing
actions based on the current state of the environment.
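
As a toy illustration of this interaction loop, here is a minimal sketch in which an
agent acts at random in a made-up two-state environment; the `ToyEnvironment` class,
its states, and its reward values are hypothetical and exist only to show the
observe-act-reward cycle.

```python
import random

class ToyEnvironment:
    """A made-up environment with two states and two actions."""

    def __init__(self):
        self.state = 0  # start in state 0

    def step(self, action: int) -> tuple[int, float]:
        # Action 1 moves us to the "good" state 1, which pays a higher reward.
        self.state = 1 if action == 1 else 0
        reward = 1.0 if self.state == 1 else 0.1
        return self.state, reward

env = ToyEnvironment()
total_reward = 0.0
state = env.state
for _ in range(10):
    action = random.choice([0, 1])    # a (very naive) random policy
    state, reward = env.step(action)  # environment transitions and emits a reward
    total_reward += reward

print(f"total reward collected: {total_reward:.1f}")
```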

Methods for finding the optimal policy depend on the structure of the reinforcement
learning problem, such as whether or not the state and action spaces are finite.

#### Policy Gradient Methods

For LLM alignment fine-tuning, we make use of a family of reinforcement learning
techniques called _Policy Gradient Methods_. These methods represent the policy as a
smooth parametric function mapping each state to a probability distribution over
actions, and then optimize its parameters directly via gradient ascent on the
expected reward.
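
To make this concrete, the sketch below implements REINFORCE, the simplest policy
gradient method, for a hypothetical one-step problem with finite state and action
spaces. The tabular softmax policy and the `reward` function are stand-ins for the
neural-network policy and reward signal used with LLMs; all numbers are arbitrary.

```python
import numpy as np

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))  # tabular policy parameters (logits)
learning_rate = 0.1
rng = np.random.default_rng(0)

def policy(state: int) -> np.ndarray:
    """Softmax over this state's action logits."""
    logits = theta[state]
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reward(state: int, action: int) -> float:
    # Hypothetical reward: action 1 is the better choice in every state.
    return 1.0 if action == 1 else 0.0

for _ in range(500):
    state = int(rng.integers(n_states))
    probs = policy(state)
    action = int(rng.choice(n_actions, p=probs))
    # REINFORCE: ascend reward * grad log pi(action | state); for a softmax
    # policy the gradient w.r.t. the logits is one_hot(action) - probs.
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0
    theta[state] += learning_rate * reward(state, action) * grad_log_prob

print(policy(0))  # probability mass should now concentrate on action 1
```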

#### Proximal Policy Optimization (PPO)

### Steps to perform RLHF

Performing RLHF on an LLM involves the following two high-level steps:

1. **Building a reward model**
2. **Fine-tuning the LLM using PPO**

#### Building a reward model

- The reward model is still an LLM, typically initialized from the fine-tuned base model.
- Its next-token prediction head is replaced with a classification head that outputs a
  scalar score (the reward) for a prompt-response pair.
- Following the Bradley-Terry model, it is trained to predict which of two generated
  responses a human annotator preferred (see the sketch below).
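
As a rough sketch of how the preference objective looks in code, the snippet below
computes the Bradley-Terry pairwise loss, assuming each response has already been
encoded into a fixed-size hidden vector; the `reward_head` linear layer and the
random tensors are stand-ins for a real LLM backbone and its pooled hidden states.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the LLM backbone: pretend each response is already encoded
# as a fixed-size hidden vector (hidden_size is arbitrary here).
hidden_size, batch_size = 16, 4
reward_head = torch.nn.Linear(hidden_size, 1)  # scalar reward score

chosen_hidden = torch.randn(batch_size, hidden_size)    # preferred responses
rejected_hidden = torch.randn(batch_size, hidden_size)  # dispreferred responses

chosen_rewards = reward_head(chosen_hidden).squeeze(-1)
rejected_rewards = reward_head(rejected_hidden).squeeze(-1)

# Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected), so we
# minimize the negative log-likelihood of the human preference labels.
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
loss.backward()

print(f"pairwise preference loss: {loss.item():.4f}")
```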

#### Fine-tuning the LLM using PPO
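
As a rough illustration of what this step optimizes, the sketch below computes a
PPO-style clipped surrogate loss on the ratio between the current policy's and the
sampling policy's per-token log-probabilities, plus a simple KL penalty against a
frozen reference model that keeps the fine-tuned LLM close to its starting point.
All tensors are random placeholders rather than real model outputs, and `clip_eps`
and `kl_coef` are arbitrary values.

```python
import torch

torch.manual_seed(0)
seq_len, clip_eps, kl_coef = 8, 0.2, 0.1

# Placeholder per-token log-probabilities of the sampled response under: the
# current policy (requires gradients), the policy that sampled the response,
# and a frozen reference model. In practice these come from LLM forward passes.
logprobs_current = torch.randn(seq_len, requires_grad=True)
logprobs_sampling = torch.randn(seq_len)
logprobs_reference = torch.randn(seq_len)

# Placeholder advantages, derived from the reward model's score in real RLHF.
advantages = torch.randn(seq_len)

# PPO clipped surrogate objective.
ratio = torch.exp(logprobs_current - logprobs_sampling)
clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
policy_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()

# Crude per-token KL estimate that discourages drifting from the reference model.
kl_penalty = (logprobs_current - logprobs_reference).mean()

loss = policy_loss + kl_coef * kl_penalty
loss.backward()

print(f"PPO loss: {loss.item():.4f}")
```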

## Limitations

## Alternatives

### DPO

#### References & Useful Links <!-- markdownlint-disable-line MD001 -->

1. [_Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction.
Second edition, MIT Press, 2018._](http://www.incompleteideas.net/book/the-book.html)
1. [_Raschka, Sebastian. "LLM Training: RLHF and Its Alternatives." Sebastian
Raschka's Magazine, accessed 8 Mar. 2025,
magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives._](https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives)
1. [_Raschka, Sebastian. "Tips for LLM Pretraining and Evaluating Reward Models"
Sebastian Raschka's Magazine, accessed 8 Mar. 2025,
magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms._](https://magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms)
1. [_Lambert, Nathan, et al. "Rewardbench: Evaluating reward models for language
modeling." arXiv preprint arXiv:2403.13787 (2024)._](https://arxiv.org/abs/2403.13787)

<!-- Contributions -->

{{ #author nerdai }}