Controls
RLHF Pipeline
Reinforcement Learning from Human Feedback (RLHF) is the dominant technique for aligning large language models (e.g. GPT-4, Claude, LLaMA-2-Chat) with human preferences. It turns a pre-trained LLM into one that produces helpful, harmless, and honest outputs.
The pipeline has three stages:
- SFT — Supervised fine-tuning on curated demonstrations
- Reward Model — Train a reward model from human preference comparisons
- PPO — Optimize the policy against the reward model using Proximal Policy Optimization, with a KL penalty to prevent diverging too far from the reference policy
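The three stages above can be sketched as a single pipeline. This is a minimal, hypothetical skeleton: every function name here is an illustrative placeholder (with stub bodies), not a real library API.

```python
def sft(base_model, demonstrations):
    """Stage 1: supervised fine-tuning on curated demonstrations."""
    return base_model  # stub: would minimize cross-entropy on the demonstrations

def train_reward_model(preference_pairs):
    """Stage 2: fit a reward model r(x, y) from human preference comparisons."""
    return lambda x, y: 0.0  # stub: would minimize the Bradley-Terry loss

def ppo(policy, reward_model, prompts, beta=0.1, steps=8):
    """Stage 3: optimize the policy against r(x, y) with a KL penalty."""
    for _ in range(steps):
        pass  # stub: sample responses, score them, apply a KL-penalized PPO update
    return policy

def rlhf_pipeline(base_model, demonstrations, preference_pairs, prompts):
    policy = sft(base_model, demonstrations)
    reward_model = train_reward_model(preference_pairs)
    return ppo(policy, reward_model, prompts)
```

Each stage is expanded with its objective in the sections below.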
How to Use
- Press Run to animate the full 3-stage pipeline
- Press Step to advance one phase at a time
- Switch the scenario to compare helpfulness-focused vs. safety-focused alignment
- Adjust β to control how far the policy can diverge from the reference policy
- Adjust PPO steps to control training duration
Reward Model Training
Uses the Bradley-Terry model for pairwise preferences:
- Given a prompt x, sample two responses y_w (preferred) and y_l (rejected)
- A human labels which response is better
- Train a reward model r(x, y) so that r(x, y_w) > r(x, y_l)
PPO Loop
for step in 1..N:
    # Sample responses from current policy
    y ~ πθ(y | x)
    # Score with reward model
    reward = r(x, y)
    # Compute KL penalty
    kl = KL(πθ || πref)
    # PPO objective with KL constraint
    J(θ) = E[r(x, y) − β · kl]
    # Update policy via gradient ascent
    θ ← θ + α ∇θ J(θ)
Reward Model Loss
L_RM = −log σ(r(x, y_w) − r(x, y_l))
Bradley-Terry loss: the reward of the preferred response should exceed that of the rejected one.
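The loss above can be computed directly. A minimal sketch using the identity −log σ(d) = log(1 + e^(−d)); the function name is illustrative:

```python
import math

def bradley_terry_loss(r_w, r_l):
    """Reward-model loss -log sigmoid(r_w - r_l).

    r_w: reward of the preferred response y_w
    r_l: reward of the rejected response y_l
    """
    # log1p(exp(-d)) equals -log(sigmoid(d)) and is stable for large positive d
    return math.log1p(math.exp(-(r_w - r_l)))
```

When the reward model cannot distinguish the pair (r_w = r_l), the loss is log 2; it falls toward zero as the margin r_w − r_l grows.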
PPO Objective
J(θ) = Ex,y[r(x, y) − β · KL(πθ || πref)]
Maximize reward while staying close to the reference policy via KL penalty.
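In practice the expectation is estimated from a batch of sampled responses. A hypothetical Monte Carlo sketch (names are illustrative):

```python
def ppo_objective(rewards, kls, beta):
    """Estimate J(θ) = E[r(x, y) − β · KL(πθ || πref)] from a batch.

    rewards: reward-model scores r(x, y) for each sampled response
    kls: per-response KL penalty estimates
    beta: KL penalty coefficient β
    """
    n = len(rewards)
    return sum(r - beta * kl for r, kl in zip(rewards, kls)) / n
```

Raising β shrinks the objective for any response that drifts from the reference policy, so the optimizer favors high-reward responses that stay close to πref.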
KL Divergence
KL(P || Q) = Σ_i P(i) log(P(i) / Q(i))
Measures how much the current policy πθ has diverged from the reference πref. Higher β penalizes divergence more strongly.
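For discrete distributions the sum above is straightforward to compute. A minimal sketch (0 · log 0 is treated as 0 by convention):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = Σ_i P(i) · log(P(i) / Q(i)) for discrete distributions.

    p, q: sequences of probabilities over the same support, each summing to 1.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

KL is zero only when the two distributions match, and grows as πθ concentrates mass where πref does not, which is exactly what the β penalty discourages.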
Advantage
A(y) = r(x, y) − mean(r(x, y_i))
Positive advantage means this response is better than average; negative means worse.
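With a batch-mean baseline, the advantages of a group of sampled responses can be computed as follows (a minimal sketch; the function name is illustrative):

```python
def advantages(rewards):
    """A(y_i) = r(x, y_i) − mean over the sampled responses.

    rewards: reward-model scores for responses sampled for the same prompt x.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

Subtracting the mean centers the scores, so the policy update pushes probability toward above-average responses and away from below-average ones without changing the gradient's expectation.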
RLHF Metrics
| Metric | Value |
| --- | --- |
| Phase | Idle |
| PPO Step | - / 8 |
| Avg Reward | - |
| KL Divergence | - |
| RM Accuracy | - |
| Policy Entropy | - |