Attention Heatmap

Interactive heatmap of per-token attention weights, with controls for selecting the attention head and adjusting the softmax temperature.

Self-Attention

Self-attention lets each token in a sequence attend to every token, including itself, and decide how much to focus on each one. It is the core mechanism in Transformer models such as BERT and GPT.

Key Insights

  • Parallelizable — unlike RNNs, all positions are processed simultaneously
  • Long-range dependencies — any token can attend to any other regardless of distance
  • Multi-head — multiple attention heads capture different relationships

How to Use

  • Enter a sentence and click Compute Attention
  • Hover over cells to see attention weights for a token pair
  • Switch heads to see different attention patterns
  • Adjust temperature to sharpen or flatten attention (see the sketch after this list)
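
How the temperature control maps onto the math is an assumption here, but temperature scaling conventionally divides the attention scores by the temperature before the softmax: values below 1 sharpen the weights toward the largest score, values above 1 flatten them toward uniform. A minimal NumPy sketch:

import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.1])        # raw attention scores for one query token
for temperature in (0.5, 1.0, 2.0):
    z = scores / temperature
    weights = np.exp(z) / np.exp(z).sum()      # softmax of the rescaled scores
    print(temperature, weights.round(3))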

Q/K/V Computation

Each token embedding x_i is projected into three vectors (sketched in code after this list):

  • Query: q_i = W_Q x_i
  • Key: k_i = W_K x_i
  • Value: v_i = W_V x_i
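
A minimal NumPy sketch of these projections; the sizes (d_model = 64, d_k = 16) and the random weight matrices are illustrative stand-ins for learned parameters:

import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 64, 16                  # illustrative sizes, not necessarily this demo's config
X = rng.normal(size=(6, d_model))      # a 6-token sentence, one embedding per row

# Random stand-ins for the learned projection matrices W_Q, W_K, W_V
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q = X @ W_Q    # queries q_i, one per token
K = X @ W_K    # keys    k_i
V = X @ W_V    # values  v_i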

Scaled Dot-Product Steps

  1. Compute scores: s_ij = q_i · k_j
  2. Scale: s_ij / √d_k
  3. Apply softmax row-wise to get weights
  4. Weighted sum of values gives the output (see the sketch after this list)
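
Putting the four steps together in one function, as a sketch with toy random inputs; the resulting weights matrix is what a heatmap like the one above displays:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # steps 1-2: dot products, scaled by √d_k
    scores -= scores.max(axis=-1, keepdims=True)       # stabilize exp; the softmax is unchanged
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # step 3: row-wise softmax
    return weights @ V, weights                        # step 4: weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 16)) for _ in range(3))  # toy inputs: 6 tokens, d_k = 16
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape)                                    # (6, 6): one attention row per query token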

Multi-Head Attention

Multiple heads run in parallel with different learned projections, then their outputs are concatenated and projected back.
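
A minimal sketch of that wiring, with random stand-ins for the learned per-head projections and the final output projection (named W_O here); 4 heads of size 16 on a 64-dimensional model are illustrative choices:

import numpy as np

def attention(Q, K, V):
    # single-head scaled dot-product attention, as in the previous sketch
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q/W_K/W_V hold one (d_model, d_head) projection per head; W_O maps back to d_model
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O          # concatenate heads, project back

rng = np.random.default_rng(0)
n_heads, d_model, d_head = 4, 64, 16                     # illustrative sizes
X = rng.normal(size=(6, d_model))
W_Q, W_K, W_V = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(n_heads * d_head, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape) # (6, 64): back to the model dimension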

Attention Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where d_k is the dimension of the key vectors. Scaling by √d_k keeps the dot products from growing with the key dimension, which would otherwise saturate the softmax and produce near-one-hot attention weights.
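
A quick numerical check of that claim, with arbitrary sample sizes: for queries and keys whose components have zero mean and unit variance, the raw dot product has a standard deviation of roughly √d_k, while the scaled score stays near 1 at every dimension.

import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.normal(size=(10000, d_k))                 # unit-variance query components
    k = rng.normal(size=(10000, d_k))                 # unit-variance key components
    dots = (q * k).sum(axis=-1)                       # raw dot products q · k
    print(d_k, round(dots.std(), 1), round((dots / np.sqrt(d_k)).std(), 2))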

Softmax

softmax(z_i) = e^(z_i) / Σ_j e^(z_j)
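
The same formula in code, with the standard max-subtraction trick added for numerical stability (an implementation detail that leaves the result unchanged):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtracting the row max leaves the ratios unchanged
    e = np.exp(z)                           # ...but keeps exp from overflowing on large scores
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))   # non-negative weights that sum to 1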

Positional Encoding

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Sinusoidal positional encodings inject word order information into the model, since attention itself is permutation-invariant.
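
A direct NumPy rendering of the two formulas above, as a sketch; in a Transformer the resulting matrix is simply added to the token embeddings:

import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    # assumes d_model is even; column pair (2i, 2i+1) shares one frequency
    pos = np.arange(n_positions)[:, None]               # positions 0..n-1 as a column
    i = np.arange(d_model // 2)[None, :]                # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d_model)   # pos / 10000^(2i/d)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                        # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                        # PE(pos, 2i+1)
    return pe

print(sinusoidal_positional_encoding(6, 64).shape)      # (6, 64)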

Attention Metrics

  • Tokens: 6
  • Heads: 4
  • Dimension: 64
  • Max Attention: -
  • Min Attention: -
  • Status: Ready

Token Detail

Hover over a cell in the heatmap to see attention details.