Self-Attention
Self-attention allows each token in a sequence to look at every other token and decide how much to focus on it. This is the core mechanism in Transformer models like BERT and GPT.
Key Insights
- Parallelizable — unlike RNNs, all positions are processed simultaneously
- Long-range dependencies — any token can attend to any other regardless of distance
- Multi-head — multiple attention heads capture different relationships
How to Use
- Enter a sentence and click Compute Attention
- Hover over cells to see attention weights for a token pair
- Switch heads to see different attention patterns
- Adjust temperature to sharpen or flatten attention
Q/K/V Computation
- Each token embedding xᵢ is projected into three vectors:
  - Query: qᵢ = W_Q xᵢ
  - Key: kᵢ = W_K xᵢ
  - Value: vᵢ = W_V xᵢ
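A minimal NumPy sketch of these projections; the shapes, the random weights, and the 6-token batch are illustrative assumptions, not the demo's actual parameters:

```python
import numpy as np

d_model, d_k = 64, 64                  # embedding and query/key dimensions (assumed equal here)
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_model, d_k))  # learned projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

X = rng.normal(size=(6, d_model))      # 6 token embeddings, one row per token

Q = X @ W_Q                            # queries: q_i = W_Q x_i
K = X @ W_K                            # keys:    k_i = W_K x_i
V = X @ W_V                            # values:  v_i = W_V x_i
```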
Scaled Dot-Product Steps
- Compute scores: sᵢⱼ = qᵢ · kⱼ
- Scale: sᵢⱼ / √dₖ
- Apply softmax row-wise to get the attention weights
- A weighted sum of the values gives the output (sketched in the code below)
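A compact NumPy sketch of these four steps; the function name and the max-subtraction trick for numerical stability are additions for illustration, not part of the formula above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays, one row per token.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # steps 1-2: s_ij = q_i · k_j, scaled by √d_k
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability; does not change the result
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # step 3: row-wise softmax
    return weights @ V, weights                      # step 4: weighted sum of the value vectors
```

The returned `weights` matrix is exactly what the heatmap in the demo visualizes: row i holds how much token i attends to every other token.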
Multi-Head Attention
Multiple heads run in parallel with different learned projections, then their outputs are concatenated and projected back.
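A sketch of the whole multi-head computation in NumPy. The shapes, the `split_heads` helper, and the output projection `W_O` are assumptions about how the pieces fit together, not the demo's exact implementation:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    # Assumed shapes: X is (seq_len, d_model), each W_* is (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_Q)
    K = split_heads(X @ W_K)
    V = split_heads(X @ W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # per-head (seq, seq) scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax per head
    heads = weights @ V                                     # (n_heads, seq_len, d_head)

    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate head outputs
    return concat @ W_O                                          # project back to d_model

# Example with the demo's settings: 6 tokens, 4 heads, dimension 64.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 64))
W_Q, W_K, W_V, W_O = (rng.normal(size=(64, 64)) * 0.1 for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=4).shape)  # (6, 64)
```

With 4 heads and dimension 64, each head attends in its own 16-dimensional subspace, which is why different heads can latch onto different relationships.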
Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Here dₖ is the dimension of the key vectors. Scaling keeps the dot products from growing with dₖ; without it, large scores push the softmax toward near one-hot weights and vanishing gradients.
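A quick numerical check of why the scaling matters; the dimensions and random vectors are arbitrary illustrations. For roughly unit-variance embeddings, the dot product qᵢ · kⱼ has variance on the order of dₖ, so unscaled scores tend to saturate the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=d_k)
keys = rng.normal(size=(6, d_k))

raw = keys @ q                # std grows like √d_k (≈ 8 here)
scaled = raw / np.sqrt(d_k)   # back to roughly unit scale

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

print(softmax(raw))     # typically very peaked, close to one-hot
print(softmax(scaled))  # noticeably smoother distribution over the 6 keys
```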
Softmax
softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ
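A numerically stable implementation, with the temperature knob from the controls above. Dividing the scores by a temperature T before the softmax is one standard way to implement that control; whether the demo does exactly this is an assumption. T < 1 sharpens the distribution, T > 1 flattens it:

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtracting the max avoids overflow in exp()
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax([2.0, 1.0, 0.5]))                   # roughly [0.63, 0.23, 0.14]
print(softmax([2.0, 1.0, 0.5], temperature=0.5))  # sharper
print(softmax([2.0, 1.0, 0.5], temperature=5.0))  # flatter
```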
Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Sinusoidal positional encodings inject word order information into the model, since attention itself is permutation-invariant.
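A NumPy sketch that builds this sin/cos table; the function name and the choice of `max_len` are illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2) dimension indices
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=6, d_model=64)
print(pe.shape)  # (6, 64); added to the token embeddings before attention
```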
Attention Metrics
| Metric | Value |
| --- | --- |
| Tokens | 6 |
| Heads | 4 |
| Dimension | 64 |
| Max Attention | - |
| Min Attention | - |
| Status | Ready |