RLHF Training Pipeline

RLHF Pipeline

Reinforcement Learning from Human Feedback (RLHF) is the dominant technique for aligning large language models (GPT-4, Claude, LLaMA-2-Chat) with human preferences. It transforms a pre-trained LLM into one that produces helpful, harmless, and honest outputs.

The pipeline has 3 stages:

  1. SFT — Supervised fine-tuning on curated demonstrations
  2. Reward Model — Train a reward model from human preference comparisons
  3. PPO — Optimize the policy against the reward model using Proximal Policy Optimization, with a KL penalty to prevent diverging too far from the reference policy

How to Use

  • Press Run to animate the full 3-stage pipeline
  • Press Step to advance one phase at a time
  • Change scenario to see helpfulness vs. safety alignment
  • Adjust β to control how much the policy can diverge from reference
  • Adjust PPO steps to control training duration

Reward Model Training

Uses the Bradley-Terry model for pairwise preferences:

  1. Given a prompt x, sample two responses y_w (preferred) and y_l (rejected)
  2. A human labels which response is better
  3. Train the reward model r(x, y) so that r(x, y_w) > r(x, y_l)

PPO Loop

for step in 1..N:
  # Sample responses from the current policy
  y ~ π_θ(y | x)
  # Score with the reward model
  reward = r(x, y)
  # Compute KL penalty against the reference policy
  kl = KL(π_θ || π_ref)
  # Objective with KL penalty
  J(θ) = E[ r(x, y) − β · kl ]
  # Update policy via gradient ascent
  θ ← θ + α ∇_θ J(θ)
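
The loop above can be sketched concretely. The snippet below is a minimal toy, not real PPO: it uses a categorical policy over three fixed candidate responses, made-up reward scores, and exact gradients of the simplified objective J(θ) = E[r] − β·KL (no clipped surrogate, no sampling, no neural network), with the demo's default of 8 steps.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

rewards = [1.0, 0.2, -0.5]      # r(x, y) for each candidate (illustrative)
beta, alpha = 0.1, 0.5          # KL coefficient and learning rate
logits = [0.0, 0.0, 0.0]        # policy parameters, start uniform
ref = softmax([0.0, 0.0, 0.0])  # frozen reference policy pi_ref

for step in range(8):           # N = 8 optimization steps
    pi = softmax(logits)
    avg_reward = sum(p * r for p, r in zip(pi, rewards))
    divergence = kl(pi, ref)
    # Exact gradient of J = E[r] - beta*KL w.r.t. the softmax logits
    grad = [p * ((r - avg_reward) - beta * (math.log(p / q) - divergence))
            for p, r, q in zip(pi, rewards, ref)]
    logits = [z + alpha * g for z, g in zip(logits, grad)]  # gradient ascent

pi = softmax(logits)
print(pi)  # probability mass shifts toward the highest-reward response
```

Because β > 0, the optimum is not a point mass on the best response: the KL term keeps the policy anchored to the reference.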

Reward Model Loss

L_RM = −log σ( r(x, y_w) − r(x, y_l) )

Bradley-Terry loss: training pushes the reward of the preferred response above that of the rejected one.
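
This loss is easy to compute directly. A minimal sketch, assuming the reward model emits scalar scores for the preferred and rejected responses:

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# A correct ranking (preferred scored higher) gives a small loss;
# an inverted ranking gives a large one.
print(bt_loss(2.0, 0.5))  # margin +1.5 -> ~0.20
print(bt_loss(0.5, 2.0))  # margin -1.5 -> ~1.70
```

At zero margin the loss is log 2, and it falls toward zero as the preferred response's score pulls ahead.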

PPO Objective

J(θ) = E_{x,y}[ r(x, y) − β · KL(π_θ || π_ref) ]

Maximize reward while staying close to the reference policy via KL penalty.

KL Divergence

KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )

Measures how much the current policy π_θ has diverged from the reference π_ref. Higher β penalizes divergence more strongly.
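
For discrete distributions this is a one-liner. A small sketch with illustrative probabilities (the three-outcome distributions are made up):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) over a discrete support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.7, 0.2, 0.1]         # pi_theta over three outcomes (illustrative)
reference = [1/3, 1/3, 1/3]      # pi_ref, uniform
print(kl_divergence(policy, policy))     # 0.0 for identical distributions
print(kl_divergence(policy, reference))  # > 0 once the policy drifts
```

KL divergence is always non-negative and equals zero only when the two distributions match, which is why it works as a drift penalty.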

Advantage

A(y) = r(x, y) − mean(r(x, y_i))

Positive advantage means this response is better than average; negative means worse.
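
A mean baseline over a sampled batch is a one-line computation. A sketch with made-up reward scores:

```python
def advantages(rewards):
    """A(y_i) = r(x, y_i) - mean of the sampled batch (a simple baseline)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four sampled responses to the same prompt (illustrative scores)
print(advantages([2.0, 1.0, 0.5, 0.5]))  # [1.0, 0.0, -0.5, -0.5]
```

With a mean baseline the advantages always sum to zero, so updates reallocate probability among the sampled responses rather than inflating all of them.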

RLHF Metrics

The dashboard tracks the current phase, PPO step, average reward, KL divergence, reward-model accuracy, and policy entropy.