Grid World (interactive demo)

Reinforcement Learning

Reinforcement Learning (RL) is a paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties and learns a policy that maximizes cumulative reward over time.
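
Below is a minimal sketch of that interaction loop on a toy one-dimensional grid. The environment, rewards, and step() API here are illustrative stand-ins, not the demo's actual Grid World:

  import random

  class ToyGrid:
      """Toy 1-D grid: start at cell 0, reach the last cell for +1 reward."""
      def __init__(self, size=5):
          self.size = size

      def reset(self):
          self.pos = 0
          return self.pos

      def step(self, action):              # action: 0 = left, 1 = right
          move = 1 if action == 1 else -1
          self.pos = max(0, min(self.size - 1, self.pos + move))
          done = self.pos == self.size - 1
          reward = 1.0 if done else -0.01  # small step penalty, goal reward
          return self.pos, reward, done

  # One episode with a purely random policy: the agent acts, the environment
  # returns a reward and the next state, and the return is accumulated.
  env = ToyGrid()
  state, done, total = env.reset(), False, 0.0
  while not done:
      action = random.randrange(2)
      state, reward, done = env.step(action)
      total += reward
  print("episode return:", total)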

How to Use

  • Pick an environment and algorithm to get started
  • Press Train to run episodes automatically
  • Press Step to run a single episode
  • Toggle Q-Values/Policy to see what the agent learned
  • Adjust epsilon to control exploration vs exploitation

Q-Learning (Off-Policy)

Q-Learning updates Q-values using the maximum future Q-value, regardless of the action actually taken. This makes it off-policy — it learns the optimal policy even while exploring randomly.

Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
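
As a rough sketch of this update in tabular form (not the demo's actual code; alpha, gamma, and the four grid actions are example assumptions):

  from collections import defaultdict

  alpha, gamma = 0.1, 0.9      # learning rate and discount factor (example values)
  n_actions = 4                # e.g. up, down, left, right
  Q = defaultdict(float)       # Q[(state, action)] -> value, defaults to 0.0

  def q_learning_update(s, a, r, s_next, done):
      # Off-policy target: bootstrap from the best next action, regardless of
      # which action the (possibly exploratory) behavior policy takes next.
      best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in range(n_actions))
      Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])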

SARSA (On-Policy)

SARSA updates Q-values using the actual next action chosen by the policy (including exploration). This makes it on-policy — it learns the value of the policy it is actually following.

Q(s,a) ← Q(s,a) + α[r + γ Q(s′,a′) - Q(s,a)]
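
A comparable sketch of the SARSA update (again illustrative, using the same Q-table layout as above); note that it needs the next action a′ that the behavior policy actually selected:

  from collections import defaultdict

  alpha, gamma = 0.1, 0.9
  Q = defaultdict(float)

  def sarsa_update(s, a, r, s_next, a_next, done):
      # On-policy target: bootstrap from the action actually chosen in s_next,
      # so the cost of exploration is reflected in the learned values.
      next_value = 0.0 if done else Q[(s_next, a_next)]
      Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])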

Epsilon-Greedy

With probability ε, choose a random action (explore). Otherwise, choose the action with the highest Q-value (exploit).
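
A minimal sketch of epsilon-greedy selection, assuming the same (state, action)-keyed Q-table as above:

  import random

  def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
      if random.random() < epsilon:
          return random.randrange(n_actions)                      # explore
      return max(range(n_actions), key=lambda a: Q[(state, a)])   # exploit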

Bellman Equation

The optimal Q-function satisfies the Bellman optimality equation:

Q*(s,a) = E[r + γ max_a′ Q*(s′,a′) | s, a]
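
When the transition model is known, this expectation can be computed directly. The sketch below assumes a hypothetical model P mapping (state, action) to (probability, next state, reward) triples; Q-Learning approximates the same backup from sampled transitions instead:

  def bellman_backup(Q, P, s, a, n_actions, gamma=0.9):
      # Expected one-step return, assuming optimal behavior from s' onward.
      return sum(prob * (r + gamma * max(Q[(s2, a2)] for a2 in range(n_actions)))
                 for prob, s2, r in P[(s, a)])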

Q-Learning Update

ΔQ = α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
  • α (alpha) — Learning rate, controls step size
  • γ (gamma) — Discount factor, weights future vs immediate reward
  • ε (epsilon) — Exploration rate
  • r — Immediate reward received
  • s, a — Current state and action
  • s′, a′ — Next state and action

SARSA Update

ΔQ = α[r + γ Q(s′,a′) - Q(s,a)]

The key difference: Q-Learning uses max over all actions at s′, while SARSA uses the action a′ actually chosen by the current policy.
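
A tiny numeric illustration of that difference on a single transition (the values are made up): suppose the greedy action at s′ is worth 0.8, but the epsilon-greedy policy happened to pick an exploratory action worth 0.2.

  gamma, r = 0.9, 0.0
  q_next = {"greedy": 0.8, "explored": 0.2}

  q_learning_target = r + gamma * max(q_next.values())   # 0.72: uses the max
  sarsa_target      = r + gamma * q_next["explored"]     # 0.18: uses the a' taken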

Training Metrics

The metrics panel reports the current episode, steps taken, total reward, epsilon, the maximum Q-value in the table, whether training has converged, and the training status.

Q-Table / Policy

Train the agent to see Q-values and policy arrows here.