
Transformer Attention

Self-attention allows each token to look at every other token and decide how much to focus on it. This visualization shows N stacked layers, with one layer expandable at a time to show full internals. Toggle between three architectures:

  • BERT (encoder-only) — 12 layers of bidirectional self-attention
  • GPT-2 (decoder-only) — 12 layers of causal self-attention (upper triangle masked)
  • Original Transformer (encoder-decoder) — 6 encoder + 6 decoder layers; decoder has masked self-attention plus cross-attention to encoder
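The "upper triangle masked" causal attention used by the decoder-only mode can be sketched with a small helper (a minimal illustration, not from the visualization's source; the function name is ours):

```python
import numpy as np

def causal_mask(n):
    # Boolean mask for causal self-attention: entry (i, j) is True when
    # position i must NOT attend to position j, i.e. when j > i.
    # Masked entries are set to -inf before the softmax so their weight is 0.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

mask = causal_mask(4)
# Row 0 may only see itself; row 3 may see all four positions.
```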

How to Use

  • Toggle architecture via buttons in the panel header
  • Click a collapsed layer to expand it and see Q/K/V projections, attention heatmap, and concat
  • Click tokens to select which query to highlight
  • Click chips to expand and see matrices/heatmaps
  • Switch heads via colored squares in the head box
  • Original mode: expand encoder or decoder layers independently

Scaled Dot-Product Attention

  1. Embed: Look up each token in the embedding matrix, add positional encoding
  2. Project: Multiply by W_Q, W_K, W_V to get Query, Key, Value
  3. Score: Compute q_i · k_j for all token pairs
  4. Scale: Divide by √d_k
  5. Softmax: Normalize each row → attention weights
  6. Attend: Weighted sum of Values = output
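Steps 3–6 can be sketched in a few lines of NumPy (a hedged illustration, assuming Q, K, V have already been produced by the projections in step 2; shapes and the d_k = 64 size follow the panel settings):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (n, d_k), V: (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # steps 3-4: score, then scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # step 5: row-wise softmax
    return weights @ V, weights                     # step 6: weighted sum of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(3, 64))
V = rng.normal(size=(3, 64))
out, w = scaled_dot_product_attention(Q, K, V)
# Each row of w sums to 1: it is the attention distribution for one query token.
```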

Multi-Head Attention

Multiple heads run in parallel with different learned W_Q, W_K, W_V. Outputs are concatenated and projected through W_O.
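A compact sketch of the split-attend-concat-project pattern (our own illustrative code, not the visualization's internals; for simplicity each weight matrix is d_model × d_model and d_model = 4 heads × d_k = 64, matching the panel settings):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=4):
    # X: (n, d_model); all weight matrices: (d_model, d_model)
    n, d_model = X.shape
    d_k = d_model // n_heads

    def split(M):
        # (n, d_model) -> (n_heads, n, d_k): give each head its own slice
        return M.reshape(n, n_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # per-head scaled scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                      # per-head softmax
    heads = w @ V                                      # (n_heads, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_O                                # final output projection

rng = np.random.default_rng(1)
d_model = 256  # 4 heads x d_k=64
X = rng.normal(size=(3, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.05 for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```

Real implementations typically fuse the per-head projections into single matrix multiplies, as done here, rather than looping over heads.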

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Live Computation

Select a token and navigate heads to see live computations here.