Transformer Attention
(Panel defaults: Encoder Only, 12 layers, 4 heads, d=64)
Self-attention allows each token to look at every other token and decide how much to focus on it. This visualization shows N stacked layers, with one layer expandable at a time to show full internals. Toggle between three architectures:
- BERT (encoder-only) — 12 layers of bidirectional self-attention
- GPT-2 (decoder-only) — 12 layers of causal self-attention (upper triangle masked)
- Original Transformer (encoder-decoder) — 6 encoder + 6 decoder layers; decoder has masked self-attention plus cross-attention to encoder
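The difference between bidirectional and causal self-attention comes down to a mask applied to the score matrix before softmax. A minimal NumPy sketch of the causal (upper-triangle) mask used in the decoder modes; the zero scores stand in for QKᵀ/√dₖ and are illustrative only:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
# Strict upper triangle = future positions; set to -inf so softmax zeroes them.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.zeros((seq_len, seq_len))   # stand-in for QK^T / sqrt(d_k)
scores[mask] = -np.inf
weights = softmax(scores)
# Row i attends only to positions 0..i; with equal scores, uniformly.
```

With all-zero scores, the first token attends only to itself (weight 1.0), and token i spreads weight uniformly over positions 0..i. In the encoder-only (BERT) mode, no mask is applied, so every token can attend to every position.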
How to Use
- Toggle architecture via buttons in the panel header
- Click a collapsed layer to expand it and see the Q/K/V projections, the attention heatmap, and the concatenation step
- Click tokens to select which query to highlight
- Click chips to expand and see matrices/heatmaps
- Switch heads via colored squares in the head box
- In Original Transformer mode, expand encoder and decoder layers independently
Scaled Dot-Product Attention
- Embed: Look up each token in the embedding matrix, add positional encoding
- Project: Multiply by WQ, WK, WV to get Query, Key, Value
- Score: Compute qᵢ·kⱼ for all token pairs
- Scale: Divide by √dₖ
- Softmax: Normalize each row → attention weights
- Attend: Weighted sum of Values = output
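The steps above (after the embed step, i.e. starting from embeddings with positions already added) can be sketched in NumPy; the function and variable names are illustrative, not the visualization's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model) embeddings with positional encoding added."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # Project
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # Score + Scale
    weights = softmax(scores, axis=-1)     # Softmax: each row sums to 1
    return weights @ V, weights            # Attend: weighted sum of Values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 32))               # 5 tokens, d_model=32 (illustrative)
Wq = rng.normal(size=(32, 64)) * 0.1
Wk = rng.normal(size=(32, 64)) * 0.1
Wv = rng.normal(size=(32, 64)) * 0.1
out, weights = scaled_dot_product_attention(X, Wq, Wk, Wv)
```

Each row of `weights` is the attention distribution for one query token, which is what the heatmap in the expanded layer shows.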
Multi-Head Attention
Multiple heads run in parallel with different learned WQ, WK, WV. Outputs are concatenated and projected through WO.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
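The parallel-heads, concat-and-project scheme can be sketched as follows; dimensions follow the panel defaults (4 heads, dₖ=64), but the names and the d_model choice are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Wq/Wk/Wv: per-head projection lists; Wo: (n_heads*d_k, d_model)."""
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):        # each head runs independently
        Q, K, V = X @ q, X @ k, X @ v
        d_k = Q.shape[-1]
        w = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(w @ V)
    return np.concatenate(heads, axis=-1) @ Wo   # concat, then project via WO

rng = np.random.default_rng(1)
d_model, d_k, n_heads = 256, 64, 4         # 4 heads, d_k=64 per head
X = rng.normal(size=(5, d_model))
Wq = [rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(n_heads)]
Wk = [rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(n_heads)]
Wv = [rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_k, d_model)) * 0.05
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
```

Because each head has its own WQ, WK, WV, the heads can attend to different relationships; the colored squares in the head box switch which head's weights the heatmap displays.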
Live Computation
Select a token and navigate heads to see live computations here.