Reinforcement Learning
Reinforcement Learning (RL) is a paradigm in which an agent learns to make decisions by interacting with an environment. After each action the agent receives a reward or penalty, and over many episodes it learns a policy (a mapping from states to actions) that maximizes cumulative reward over time.
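As a minimal sketch of that interaction loop in code (the `reset`/`step`/`act`/`learn` names are illustrative, not this demo's actual API):

```python
def run_episode(env, agent):
    """One episode of the generic agent-environment loop."""
    state = env.reset()                 # start a fresh episode
    total_reward, done = 0.0, False
    while not done:
        action = agent.act(state)                     # agent picks an action
        next_state, reward, done = env.step(action)   # environment responds
        agent.learn(state, action, reward, next_state, done)  # update from experience
        state = next_state
        total_reward += reward
    return total_reward
```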
How to Use
- Pick an environment and algorithm to get started
- Press Train to run episodes automatically
- Press Step to run a single episode
- Toggle Q-Values/Policy to see what the agent learned
- Adjust epsilon to control exploration vs exploitation
Q-Learning (Off-Policy)
Q-Learning updates Q-values using the maximum future Q-value, regardless of the action actually taken. This makes it off-policy — it learns the optimal policy even while exploring randomly.
Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
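A minimal tabular sketch of this update, assuming Q is a dict keyed by (state, action) with n_actions actions per state (illustrative names, not the demo's code):

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> value, defaults to 0.0

def q_learning_update(Q, s, a, r, s_next, alpha, gamma, n_actions):
    # Off-policy: bootstrap from the best action at s_next,
    # regardless of which action the behavior policy takes next.
    best_next = max(Q[(s_next, b)] for b in range(n_actions))
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```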
SARSA (On-Policy)
SARSA updates Q-values using the actual next action chosen by the policy (including exploration). This makes it on-policy — it learns the value of the policy it is actually following.
Q(s,a) ← Q(s,a) + α[r + γ Q(s′,a′) - Q(s,a)]
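The corresponding sketch for SARSA differs only in the bootstrap term: a_next is the action the policy actually selected at s′ (same hypothetical Q dict as in the Q-Learning sketch above):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from a_next, the action the epsilon-greedy
    # policy actually chose at s_next.
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```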
Epsilon-Greedy
With probability ε, choose a random action (explore). Otherwise, choose the action with the highest Q-value (exploit).
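A small sketch of this rule, assuming the current state's Q-values are given as a plain list:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: list of Q-values for the actions available in the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore: random action
    return max(range(len(q_values)), key=q_values.__getitem__)    # exploit: greedy action
```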
Bellman Equation
The optimal Q-function satisfies the Bellman optimality equation:
Q*(s,a) = E[r + γ max_a′ Q*(s′,a′)]
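For a tabular MDP that expectation can be written out directly. The sketch below assumes a hypothetical transitions table mapping (s, a) to a list of (probability, reward, next_state) outcomes; repeatedly sweeping this backup over all state-action pairs is value iteration, whose fixed point is Q*:

```python
def bellman_optimality_rhs(Q, transitions, s, a, gamma, n_actions):
    # Expected reward plus discounted value of the best next action,
    # averaged over the possible next states.
    return sum(
        p * (r + gamma * max(Q[(s_next, b)] for b in range(n_actions)))
        for (p, r, s_next) in transitions[(s, a)]
    )
```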
Q-Learning Update
ΔQ = α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
- α (alpha) — Learning rate, controls step size
- γ (gamma) — Discount factor, weights future vs immediate reward
- ε (epsilon) — Exploration rate
- r — Immediate reward received
- s, a — Current state and action
- s′, a′ — Next state and action
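A one-step example with hypothetical numbers, just to make the arithmetic concrete:

```python
alpha, gamma = 0.1, 0.9     # learning rate and discount factor
q_sa, r      = 0.5, 1.0     # current estimate Q(s,a) and immediate reward
max_q_next   = 2.0          # best Q-value among the actions at s'

delta_q = alpha * (r + gamma * max_q_next - q_sa)
# = 0.1 * (1.0 + 1.8 - 0.5) = 0.23, so Q(s,a) moves from 0.5 to 0.73
```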
SARSA Update
ΔQ = α[r + γ Q(s′,a′) - Q(s,a)]
The key difference: Q-Learning uses the max over all actions at s′, while SARSA uses the action a′ actually chosen by the current policy.
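On the same transition the two targets can therefore differ sharply; the numbers below are hypothetical:

```python
gamma, r = 0.9, 0.0
q_next = [0.2, 1.5, -0.3]    # Q(s', a') for the three actions available at s'
a_next = 0                   # the action the epsilon-greedy policy actually chose

q_learning_target = r + gamma * max(q_next)      # uses the max:       0.9 * 1.5 = 1.35
sarsa_target      = r + gamma * q_next[a_next]   # uses the chosen a': 0.9 * 0.2 = 0.18
```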
Training Metrics
| Metric | Value |
| --- | --- |
| Episode | 0 |
| Steps | 0 |
| Total Reward | 0.00 |
| Epsilon | 0.10 |
| Max Q | 0.00 |
| Converged | No |
| Status | Ready |
Q-Table / Policy
Train the agent to see Q-values and policy arrows here.