RLHF Training Pipeline

RLHF Pipeline

Reinforcement Learning from Human Feedback (RLHF) is the dominant technique for aligning large language models (GPT-4, Claude, LLaMA-2-Chat) with human preferences. It transforms a pre-trained LLM into one that produces helpful, harmless, and honest outputs.

The pipeline has 3 stages:

  1. SFT — Supervised fine-tuning on curated demonstrations
  2. Reward Model — Train a reward model from human preference comparisons
  3. PPO — Optimize the policy against the reward model using Proximal Policy Optimization, with a KL penalty to prevent diverging too far from the reference policy

How to Use

  • Press Run to animate the full 3-stage pipeline
  • Press Step to advance one phase at a time
  • Change scenario to see helpfulness vs. safety alignment
  • Adjust β to control how much the policy can diverge from reference
  • Adjust PPO steps to control training duration

Reward Model Training

Uses the Bradley-Terry model for pairwise preferences:

  1. Given a prompt x, sample two responses y_w (preferred) and y_l (rejected)
  2. A human labels which response is better
  3. Train the reward model r(x, y) so that r(x, y_w) > r(x, y_l)

PPO Loop

for step in 1..N:
  # Sample responses from the current policy
  y ~ π_θ(y | x)
  # Score with the reward model
  reward = r(x, y)
  # Compute KL penalty against the reference policy
  kl = KL(π_θ || π_ref)
  # Objective with KL penalty
  J(θ) = E[ r(x, y) − β · kl ]
  # Update policy via gradient ascent
  θ ← θ + α ∇_θ J(θ)
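
The loop above can be sketched concretely. The snippet below is a minimal toy, not real PPO: it uses a categorical policy over three fixed candidate responses, made-up reward scores, and exact gradients of the simplified objective J(θ) = E[r] − β·KL (no clipped surrogate, no sampling, no neural network), with the demo's default of 8 steps.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

rewards = [1.0, 0.2, -0.5]      # r(x, y) for each candidate (illustrative)
beta, alpha = 0.1, 0.5          # KL coefficient and learning rate
logits = [0.0, 0.0, 0.0]        # policy parameters, start uniform
ref = softmax([0.0, 0.0, 0.0])  # frozen reference policy pi_ref

for step in range(8):           # N = 8 optimization steps
    pi = softmax(logits)
    avg_reward = sum(p * r for p, r in zip(pi, rewards))
    divergence = kl(pi, ref)
    # Exact gradient of J = E[r] - beta*KL w.r.t. the softmax logits
    grad = [p * ((r - avg_reward) - beta * (math.log(p / q) - divergence))
            for p, r, q in zip(pi, rewards, ref)]
    logits = [z + alpha * g for z, g in zip(logits, grad)]  # gradient ascent

pi = softmax(logits)
print(pi)  # probability mass shifts toward the highest-reward response
```

Because β > 0, the optimum is not a point mass on the best response: the KL term keeps the policy anchored to the reference.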

Reward Model Loss

L_RM = −log σ( r(x, y_w) − r(x, y_l) )

Bradley-Terry loss: training pushes the reward of the preferred response above that of the rejected one.
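
This loss is easy to compute directly. A minimal sketch, assuming the reward model emits scalar scores for the preferred and rejected responses:

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: -log sigmoid(r(x, y_w) - r(x, y_l))."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# A correct ranking (preferred scored higher) gives a small loss;
# an inverted ranking gives a large one.
print(bt_loss(2.0, 0.5))  # margin +1.5 -> ~0.20
print(bt_loss(0.5, 2.0))  # margin -1.5 -> ~1.70
```

At zero margin the loss is log 2, and it falls toward zero as the preferred response's score pulls ahead.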

PPO Objective

J(θ) = E_{x,y}[ r(x, y) − β · KL(π_θ || π_ref) ]

Maximize reward while staying close to the reference policy via KL penalty.

KL Divergence

KL(P || Q) = Σ_i P(i) log( P(i) / Q(i) )

Measures how much the current policy π_θ has diverged from the reference π_ref. Higher β penalizes divergence more strongly.
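
For discrete distributions this is a one-liner. A small sketch with illustrative probabilities (the three-outcome distributions are made up):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)) over a discrete support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

policy = [0.7, 0.2, 0.1]         # pi_theta over three outcomes (illustrative)
reference = [1/3, 1/3, 1/3]      # pi_ref, uniform
print(kl_divergence(policy, policy))     # 0.0 for identical distributions
print(kl_divergence(policy, reference))  # > 0 once the policy drifts
```

KL divergence is always non-negative and equals zero only when the two distributions match, which is why it works as a drift penalty.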

Advantage

A(y) = r(x, y) − mean(r(x, y_i))

Positive advantage means this response is better than average; negative means worse.
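
A mean baseline over a sampled batch is a one-line computation. A sketch with made-up reward scores:

```python
def advantages(rewards):
    """A(y_i) = r(x, y_i) - mean of the sampled batch (a simple baseline)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four sampled responses to the same prompt (illustrative scores)
print(advantages([2.0, 1.0, 0.5, 0.5]))  # [1.0, 0.0, -0.5, -0.5]
```

With a mean baseline the advantages always sum to zero, so updates reallocate probability among the sampled responses rather than inflating all of them.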

RLHF Metrics

The dashboard tracks the current phase, PPO step, average reward, KL divergence, reward-model accuracy, and policy entropy.