Grid World (interactive demo)

Reinforcement Learning

Reinforcement Learning (RL) is a paradigm where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties and learns a policy that maximizes cumulative reward over time.
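
Below is a minimal sketch of that interaction loop on a toy one-dimensional grid. The environment, rewards, and step() API here are illustrative stand-ins, not the demo's actual Grid World:

  import random

  class ToyGrid:
      """Toy 1-D grid: start at cell 0, reach the last cell for +1 reward."""
      def __init__(self, size=5):
          self.size = size

      def reset(self):
          self.pos = 0
          return self.pos

      def step(self, action):              # action: 0 = left, 1 = right
          move = 1 if action == 1 else -1
          self.pos = max(0, min(self.size - 1, self.pos + move))
          done = self.pos == self.size - 1
          reward = 1.0 if done else -0.01  # small step penalty, goal reward
          return self.pos, reward, done

  # One episode with a purely random policy: the agent acts, the environment
  # returns a reward and the next state, and the return is accumulated.
  env = ToyGrid()
  state, done, total = env.reset(), False, 0.0
  while not done:
      action = random.randrange(2)
      state, reward, done = env.step(action)
      total += reward
  print("episode return:", total)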

How to Use

  • Pick an environment and algorithm to get started
  • Press Train to run episodes automatically
  • Press Step to run a single episode
  • Toggle Q-Values/Policy to see what the agent learned
  • Adjust epsilon to control exploration vs exploitation

Q-Learning (Off-Policy)

Q-Learning updates Q-values using the maximum future Q-value, regardless of the action actually taken. This makes it off-policy — it learns the optimal policy even while exploring randomly.

Q(s,a) ← Q(s,a) + α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
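
As a rough sketch of this update in tabular form (not the demo's actual code; alpha, gamma, and the four grid actions are example assumptions):

  from collections import defaultdict

  alpha, gamma = 0.1, 0.9      # learning rate and discount factor (example values)
  n_actions = 4                # e.g. up, down, left, right
  Q = defaultdict(float)       # Q[(state, action)] -> value, defaults to 0.0

  def q_learning_update(s, a, r, s_next, done):
      # Off-policy target: bootstrap from the best next action, regardless of
      # which action the (possibly exploratory) behavior policy takes next.
      best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in range(n_actions))
      Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])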

SARSA (On-Policy)

SARSA updates Q-values using the actual next action chosen by the policy (including exploration). This makes it on-policy — it learns the value of the policy it is actually following.

Q(s,a) ← Q(s,a) + α[r + γ Q(s′,a′) - Q(s,a)]
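
A comparable sketch of the SARSA update (again illustrative, using the same Q-table layout as above); note that it needs the next action a′ that the behavior policy actually selected:

  from collections import defaultdict

  alpha, gamma = 0.1, 0.9
  Q = defaultdict(float)

  def sarsa_update(s, a, r, s_next, a_next, done):
      # On-policy target: bootstrap from the action actually chosen in s_next,
      # so the cost of exploration is reflected in the learned values.
      next_value = 0.0 if done else Q[(s_next, a_next)]
      Q[(s, a)] += alpha * (r + gamma * next_value - Q[(s, a)])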

Epsilon-Greedy

With probability ε, choose a random action (explore). Otherwise, choose the action with the highest Q-value (exploit).
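
A minimal sketch of epsilon-greedy selection, assuming the same (state, action)-keyed Q-table as above:

  import random

  def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
      if random.random() < epsilon:
          return random.randrange(n_actions)                      # explore
      return max(range(n_actions), key=lambda a: Q[(state, a)])   # exploit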

Bellman Equation

The optimal Q-function satisfies the Bellman optimality equation:

Q*(s,a) = E[r + γ max_a′ Q*(s′,a′) | s, a]
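
When the transition model is known, this expectation can be computed directly. The sketch below assumes a hypothetical model P mapping (state, action) to (probability, next state, reward) triples; Q-Learning approximates the same backup from sampled transitions instead:

  def bellman_backup(Q, P, s, a, n_actions, gamma=0.9):
      # Expected one-step return, assuming optimal behavior from s' onward.
      return sum(prob * (r + gamma * max(Q[(s2, a2)] for a2 in range(n_actions)))
                 for prob, s2, r in P[(s, a)])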

Q-Learning Update

ΔQ = α[r + γ max_a′ Q(s′,a′) - Q(s,a)]
  • α (alpha) — Learning rate, controls step size
  • γ (gamma) — Discount factor, weights future vs immediate reward
  • ε (epsilon) — Exploration rate
  • r — Immediate reward received
  • s, a — Current state and action
  • s′, a′ — Next state and action

SARSA Update

ΔQ = α[r + γ Q(s′,a′) - Q(s,a)]

The key difference: Q-Learning uses max over all actions at s′, while SARSA uses the action a′ actually chosen by the current policy.
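
A tiny numeric illustration of that difference on a single transition (the values are made up): suppose the greedy action at s′ is worth 0.8, but the epsilon-greedy policy happened to pick an exploratory action worth 0.2.

  gamma, r = 0.9, 0.0
  q_next = {"greedy": 0.8, "explored": 0.2}

  q_learning_target = r + gamma * max(q_next.values())   # 0.72: uses the max
  sarsa_target      = r + gamma * q_next["explored"]     # 0.18: uses the a' taken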

Training Metrics

The metrics panel reports the current episode, steps taken, total reward, epsilon, the maximum Q-value in the table, whether training has converged, and the training status.

Q-Table / Policy

Train the agent to see Q-values and policy arrows here.