Self-Attention

Self-attention allows each token in a sequence to look at every other token and decide how much to focus on it. This is the core mechanism in Transformer models like BERT and GPT.

Each attention head learns different relationships — one might track syntax, another might link pronouns to their referents.

How to Use

  • Navigate heads with the carousel — arrows, dots, or click a card
  • Click a token to highlight what it attends to
  • Expand blocks by clicking to see internal computations
  • Adjust controls — heads, dimensions, temperature

Scaled Dot-Product Attention

  1. Embed: Look up each token in the embedding matrix, add positional encoding
  2. Project: Multiply by WQ, WK, WV to get Query, Key, Value
  3. Score: Compute qi·kj for all token pairs
  4. Scale: Divide by √dk
  5. Softmax: Normalize each row → attention weights
  6. Attend: Weighted sum of Values = output
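The six steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the visualization's actual code; the matrix sizes (8-dimensional embeddings, head dimension 4) and variable names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    # Project: multiply embeddings by learned matrices to get Q, K, V
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Score and scale: q_i · k_j for every token pair, divided by √dk
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax: each row becomes a probability distribution over tokens
    weights = softmax(scores, axis=-1)
    # Attend: output is the attention-weighted sum of Values
    return weights @ V, weights

# Toy example: 3 tokens, embedding dim 8, head dim 4 (illustrative sizes)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))        # token embeddings (+ positional encoding)
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(weights.sum(axis=-1))        # each row of the weights sums to 1
```

Clicking a token in the demo highlights one row of `weights`: how much that token attends to every other token.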

Multi-Head Attention

Multiple heads run in parallel with different learned WQ, WK, WV. Outputs are concatenated and projected through WO.

Attention(Q,K,V) = softmax(QKᵀ / √dk)V
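A rough sketch of the multi-head wrapper, assuming each head keeps its own (WQ, WK, WV) triple; dimensions and names are again illustrative, not the demo's internals:

```python
import numpy as np

def multi_head_attention(X, heads, Wo):
    # heads: list of (Wq, Wk, Wv) triples, one per attention head
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = e / e.sum(axis=-1, keepdims=True)
        outputs.append(weights @ V)
    # Concatenate the head outputs along the feature axis, project through WO
    return np.concatenate(outputs, axis=-1) @ Wo

rng = np.random.default_rng(1)
n_heads, d_model, d_k = 4, 256, 64     # illustrative sizes
X = rng.normal(size=(5, d_model))      # 5 tokens
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_k, d_model))
out = multi_head_attention(X, heads, Wo)   # shape (5, 256)
```

Because each head has its own projections, the rows of `weights` differ per head, which is what the head carousel lets you compare.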

Live Computation

Select a token and navigate heads to see live computations here.