Learning RLHF (PPO) with Code (Hugging Face TRL)
I briefly learned PPO in my RL and Modern NLP courses but never quite grasped it, so over the past couple of days I read through the source code at TRL - Transformer Reinforcement Learning (huggingface.co) and the paper [2307.04964] Secrets of RLHF in Large Language Models Part I: PPO (arxiv.org). It is much clearer to me now. (There also seem to be quite a few errors in the paper...)
First, let’s go over the PPO workflow:
We start from an initial state with an SFT Model and its "clone brother", the RL Model, both referred to as policies, plus a Value Model. PPO uses a procedure called GAE, which requires the Advantage, the TD-Error, and the Return. From these quantities we finally derive the PPO-clip Loss, the LM Loss, and the MSE Loss.
RL Prerequisites
Policy Gradient
The core objective of RL is to optimize the RL Model (\(\pi_{\theta}^{\text{RL}}\)). Based on the Policy Gradient and the Log-likelihood Trick, the goal is to maximize the Return under the RL policy (which can initially be thought of as the Reward; strictly it is the total discounted reward over multiple time steps), using Gradient Ascent:
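In the usual notation, the resulting gradient has the form

\[
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t}\nabla_{\theta}\log \pi_{\theta}(a_{t}\mid s_{t})\,\Phi_{t}\right],
\]

where \(\Phi_{t}\) is some weighting of the trajectory's rewards (this is my shorthand for the family of equations in the paper, not a quote of them).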
Estimating this with Monte-Carlo sampling leads to high variance, so subtracting a baseline is a better approach. As covered in RL courses, the baseline is generally the expected return, i.e. the V-value.
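Written out (again in my own notation), subtracting the value baseline gives

\[
\nabla_{\theta} J(\theta) = \mathbb{E}\left[\sum_{t}\nabla_{\theta}\log \pi_{\theta}(a_{t}\mid s_{t})\left(\sum_{t'\ge t}\gamma^{\,t'-t} r_{t'} - V(s_{t})\right)\right],
\]

which leaves the expectation unchanged but reduces the variance of the estimate.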
Here, I’m not sure why the second and third \(\Phi_{t}\) are not the total discounted reward.
Advantage, V-Value, and Q-Value
This was my first time seeing the relationship between the Advantage and the Q-Value spelled out. After expanding the Bellman equation by one step, it does indeed check out. In class I had always learned it directly as \(A_{t}=r_{t}+\gamma V(s_{t+1})-V(s_{t})\):
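For reference, the one-step expansion is the standard identity

\[
A(s_{t},a_{t}) = Q(s_{t},a_{t}) - V(s_{t}) = \mathbb{E}\big[r_{t} + \gamma V(s_{t+1})\big] - V(s_{t}),
\]

so a single-sample estimate of the advantage is exactly \(r_{t}+\gamma V(s_{t+1})-V(s_{t})\).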
A2C (Advantage Actor-Critic)
According to Policy Gradient, we can train both a policy model and a value model. The value model should be as close as possible to the total discounted reward (return), hence the use of MSE loss, while the policy model should aim to maximize the return, hence equation (5) is used:
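Concretely, the two objectives take roughly this standard A2C form (my notation, not the paper's):

\[
L_{\text{value}} = \big(V_{\phi}(s_{t}) - R_{t}\big)^{2}, \qquad
L_{\text{policy}} = -\log \pi_{\theta}(a_{t}\mid s_{t})\,\hat{A}_{t},
\]

where \(R_{t}\) is the return and \(\hat{A}_{t}\) is the advantage, treated as a constant with respect to \(\theta\).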
Generalized Advantage Estimation
For multi-time step advantage calculation, see here:
With a small \(k\), the bias is large. With a large \(k\), the variance is large. A trade-off is needed, leading to the use of GAE:
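In standard GAE notation, with the TD error \(\delta_{t}\):

\[
\delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t}), \qquad
\hat{A}_{t}^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l}.
\]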
This can be simplified into a backward recursion using Horner's method (Qin Jiushao's algorithm), \(\hat{A}_{t} = \delta_{t} + \gamma\lambda\,\hat{A}_{t+1}\):
This is how it’s implemented in the TRL code.
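A minimal sketch of that backward loop (my simplification of the idea rather than the exact TRL code; rewards and values are per-token tensors of the same length):

import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    # Backward (Horner-style) recursion: A_t = delta_t + gamma * lam * A_{t+1}
    T = rewards.shape[-1]
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(T)):
        next_value = values[..., t + 1] if t < T - 1 else 0.0
        delta = rewards[..., t] + gamma * next_value - values[..., t]
        last_gae = delta + gamma * lam * last_gae
        advantages[..., t] = last_gae
    returns = advantages + values  # R_t = A_t + V(s_t), used later as the value target
    return advantages, returns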
PPO improves upon the A2C foundation.
RLHF-PPO
PPO adds a constraint on top of A2C to prevent overly large updates to the policy model, ensuring the updated policy "does not deviate too much" from the previous one. The underlying principles are covered in RL courses with lengthy proofs, so they are not repeated here.
Reward
This is part of the training process. The RLHF’s Reward Function is pre-trained:
In PPO, total reward is calculated as follows:
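In the commonly used form (my notation, with \(r_{\psi}\) denoting the reward model's score):

\[
r_{\text{total}}(x,y) = r_{\psi}(x,y) - \beta\,\mathrm{KL}\!\left(\pi_{\theta}^{\text{RL}}(y\mid x)\,\big\|\,\pi^{\text{SFT}}(y\mid x)\right).
\]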
Note: the KL here and the KL in the later Update Policy section are not the same. Here, the KL is computed between the old policy (before each Update Policy step) and the original SFT model, whereas later the KL is between the policy before and after the update! In the code, the reward-model score is added to the action that generates the last token, with all other positions being 0, keeping the reward sparse.
Update Policy
TRPO
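For reference, TRPO maximizes the surrogate objective under an explicit KL constraint (standard form, with \(r_{t}(\theta)=\pi_{\theta}(a_{t}\mid s_{t})/\pi_{\theta_{\text{old}}}(a_{t}\mid s_{t})\)):

\[
\max_{\theta}\ \mathbb{E}_{t}\big[r_{t}(\theta)\,\hat{A}_{t}\big]
\quad \text{s.t.} \quad
\mathbb{E}_{t}\big[\mathrm{KL}\big(\pi_{\theta_{\text{old}}}(\cdot\mid s_{t})\,\|\,\pi_{\theta}(\cdot\mid s_{t})\big)\big] \le \delta.
\]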
PPO-CLIP
It's essentially a one-sided clip: for each sample only one side matters. When the advantage is positive (a good action), there is no extra reward for pushing the ratio above \(1+\epsilon\); when it is negative (a bad action), there is no extra reward for pushing it below \(1-\epsilon\). Detailed explanation:
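For reference, the standard PPO-clip objective, with the same ratio \(r_{t}(\theta)\) as above:

\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_{t}\Big[\min\big(r_{t}(\theta)\,\hat{A}_{t},\ \mathrm{clip}\big(r_{t}(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{t}\big)\Big].
\]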
Update Value
The code also clips the value, then takes the max loss, which is somewhat unusual:
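Schematically, the clipped value loss looks like this (a paraphrase of the idea, not the exact code):

\[
L^{V} = \mathbb{E}_{t}\Big[\max\Big(\big(V_{\theta}(s_{t}) - R_{t}\big)^{2},\ \big(\mathrm{clip}\big(V_{\theta}(s_{t}),\,V_{\text{old}}(s_{t})-\epsilon,\,V_{\text{old}}(s_{t})+\epsilon\big) - R_{t}\big)^{2}\Big)\Big].
\]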
Mixing Pretraining Gradients
This part comes from InstructGPT and does not seem to be implemented in TRL. It corresponds to the LM loss in the workflow:
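As written in the InstructGPT paper (up to notation), the combined objective is

\[
\text{objective}(\phi) = \mathbb{E}_{(x,y)\sim D_{\pi_{\phi}^{\text{RL}}}}\Big[r_{\theta}(x,y) - \beta\log\big(\pi_{\phi}^{\text{RL}}(y\mid x)/\pi^{\text{SFT}}(y\mid x)\big)\Big] + \gamma\,\mathbb{E}_{x\sim D_{\text{pretrain}}}\big[\log \pi_{\phi}^{\text{RL}}(x)\big],
\]

where the last term is the pretraining (LM) loss mixed into the PPO update.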
Pseudocode
The pseudocode and A2C differ slightly from the workflow diagram; in the TRL code I did not see an explicit experience buffer or sampling step:
Code Implementation
Overall
When using it, you simply call the ppo_trainer.step() function. Its internal structure is detailed in the TRL GitHub repository:
trl/trl/trainer/ppo_trainer.py at v0.7.1 · huggingface/trl (github.com)
Upon creating the trainer, you must pass in an SFT model. The trainer clones it into a reference model (the un-updated SFT) and a policy model, which is an AutoModelForCausalLMWithValueHead or AutoModelForSeq2SeqLMWithValueHead. The value head acts as the Value Model.
Before entering step, batched query and response token tensors are fed in, along with the reward for each response. You need to:
for batch in ppo_trainer.dataloader:
    query_tensors = get_query_tensors(batch)  # placeholder: tokenized prompts from the batch
    # placeholder: sample a response for each query with the current policy LM
    response_tensors = generate_response_tensors_given_query_with_policy_LM(query_tensors)
    # placeholder: score each (query, response) pair with the reward model
    rewards = get_reward_score_for_the_response(query_tensors, response_tensors)
    # one PPO optimization step over this batch
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
Inside the step function, the process begins with two significant operations, around lines 665 and 680:
These operations involve model forward passes and reward computations, as highlighted below:
How to Do Model Forward
The batched_forward_pass() function computes the RL model's log-probabilities \(\pi_{\theta_{\text{old}}}^{\text{RL}}(y_{w}\vert x)\) (all_logprobs), the SFT model's log-probabilities \(\pi^{\text{SFT}}(y_{w}\vert x)\) (ref_logprobs), and the values \(V(x)\) (values).
trl/trl/trainer/ppo_trainer.py at v0.7.1 · huggingface/trl (github.com)
The logprobs computation takes the logits, i.e. the scores over the whole vocabulary at each position of the sequence, applies a softmax, and picks out the probabilities of the tokens that actually appear in the response. The alignment is shifted by one: logits at positions 0 through the second-to-last are matched against the decoder input tokens from index 1 onward.
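Conceptually, the gather step looks like this (a simplified sketch, not the exact TRL helper):

import torch
import torch.nn.functional as F

# logits: (batch, seq_len, vocab_size); input_ids: (batch, seq_len)
# Position t of the logits predicts token t + 1, hence the shift by one.
log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
token_logprobs = torch.gather(
    log_probs, 2, input_ids[:, 1:].unsqueeze(-1)
).squeeze(-1)  # (batch, seq_len - 1): log-prob of each realized token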
Additionally, a mask (Line 959) is applied so that the first token is ignored and values are computed only for tokens from index 1 up to, but not including, the padding tokens.
In Line 946, logits, _, values = model(**input_kwargs) — what does values mean here? Does the forward function in Hugging Face Transformers have this return value? It turns out that TRL uses an LMWithValueHead model here, which gives the model a v_head attribute, a linear layer.
Value Head
The value model shares the policy model's underlying parameters, with a linear layer on top serving as the value head.
trl/trl/models/modeling_value_head.py at v0.7.1 · huggingface/trl (github.com)
The self.v_head() call takes the last hidden state and predicts the value.
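As a sketch of the idea (names are illustrative, not the exact TRL class):

import torch.nn as nn

class SimpleValueHead(nn.Module):
    # Maps the transformer's last hidden state to a scalar value per token.
    def __init__(self, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.summary = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state):
        # (batch, seq_len, hidden_size) -> (batch, seq_len)
        return self.summary(self.dropout(last_hidden_state)).squeeze(-1)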
Reward
The compute_reward() function then calculates the rewards from the policy log-probabilities and SFT log-probabilities computed above. The reward is computed per token (each token is treated as an action) as \(r-\beta\times \mathrm{KL}\), where \(r\) is non-zero only for the last token, by design.
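A minimal sketch of that computation for a single sequence (variable names are mine):

import torch

def per_token_rewards(logprobs, ref_logprobs, score, kl_coef):
    # reward_t = -beta * (log pi_RL - log pi_SFT); the reward-model score is
    # added only on the last token, keeping the task reward sparse (as noted above).
    kl = logprobs - ref_logprobs     # per-token log-ratio (KL penalty term)
    rewards = -kl_coef * kl
    rewards[-1] += score             # scalar reward-model score on the final token
    return rewards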
Advantage (GAE)
The Generalized Advantage Estimation (GAE) calculates values, advantages, and returns, incorporating a mask for calculation.
PPO Update
All data is first stored as the "old" outputs, and then the trainer iterates over minibatches for self.config.ppo_epochs epochs.
The minibatch dictionary mini_batch_dict collects the information computed before the PPO update.
The training of minibatches combines the Policy and Value losses for joint updates.
The loss function is specially tailored: the value loss is clipped and averaged over the unmasked positions, while the policy-gradient loss is the standard clipped surrogate.
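Roughly, the combined minibatch loss has this shape (a simplification of the idea; mask handling and advantage whitening are omitted):

import torch

def ppo_loss(old_logprobs, logprobs, advantages, returns, values, old_values,
             cliprange=0.2, cliprange_value=0.2, vf_coef=0.1):
    # Policy part: clipped surrogate objective (negated, since we minimize)
    ratio = torch.exp(logprobs - old_logprobs)
    pg_loss = torch.max(
        -advantages * ratio,
        -advantages * torch.clamp(ratio, 1 - cliprange, 1 + cliprange),
    ).mean()
    # Value part: clipped squared error, taking the max of the two terms
    values_clipped = torch.clamp(values, old_values - cliprange_value,
                                 old_values + cliprange_value)
    vf_loss = 0.5 * torch.max((values - returns) ** 2,
                              (values_clipped - returns) ** 2).mean()
    return pg_loss + vf_coef * vf_loss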
This is how the step function ties together the pieces of the PPO algorithm: computing log-probabilities and values, turning rewards and advantages into policy and value losses, and updating both models so that the RL model's behaviour moves toward the desired outcomes.