Self-Attention
Self-attention allows each token in a sequence to look at every other token and decide how much to focus on it. This is the core mechanism in Transformer models like BERT and GPT.
Key Insights
- Parallelizable — unlike RNNs, all positions are processed simultaneously
- Long-range dependencies — any token can attend to any other regardless of distance
- Multi-head — multiple attention heads capture different relationships
How to Use
- Enter a sentence and click Compute Attention
- Hover over cells to see attention weights for a token pair
- Switch heads to see different attention patterns
- Adjust temperature to sharpen or flatten attention
Q/K/V Computation
- Each token embedding xᵢ is projected into three vectors:
  - Query: qᵢ = W_Q xᵢ
  - Key: kᵢ = W_K xᵢ
  - Value: vᵢ = W_V xᵢ
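A minimal NumPy sketch of these projections; the shapes, the random weights, and the 6-token batch are illustrative assumptions, not the demo's actual parameters:

```python
import numpy as np

d_model, d_k = 64, 64                  # embedding and query/key dimensions (assumed equal here)
rng = np.random.default_rng(0)

W_Q = rng.normal(size=(d_model, d_k))  # learned projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

X = rng.normal(size=(6, d_model))      # 6 token embeddings, one row per token

Q = X @ W_Q                            # queries: q_i = W_Q x_i
K = X @ W_K                            # keys:    k_i = W_K x_i
V = X @ W_V                            # values:  v_i = W_V x_i
```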
Scaled Dot-Product Steps
- Compute scores: sᵢⱼ = qᵢ · kⱼ
- Scale: sᵢⱼ / √dₖ
- Apply softmax row-wise to get the attention weights
- A weighted sum of the values gives the output (sketched in the code below)
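A compact NumPy sketch of these four steps; the function name and the max-subtraction trick for numerical stability are additions for illustration, not part of the formula above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) arrays, one row per token.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # steps 1-2: s_ij = q_i · k_j, scaled by √d_k
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability; does not change the result
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # step 3: row-wise softmax
    return weights @ V, weights                      # step 4: weighted sum of the value vectors
```

The returned `weights` matrix is exactly what the heatmap in the demo visualizes: row i holds how much token i attends to every other token.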
Multi-Head Attention
Multiple heads run in parallel with different learned projections, then their outputs are concatenated and projected back.
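A sketch of the whole multi-head computation in NumPy. The shapes, the `split_heads` helper, and the output projection `W_O` are assumptions about how the pieces fit together, not the demo's exact implementation:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    # Assumed shapes: X is (seq_len, d_model), each W_* is (d_model, d_model).
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_Q)
    K = split_heads(X @ W_K)
    V = split_heads(X @ W_V)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # per-head (seq, seq) scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax per head
    heads = weights @ V                                     # (n_heads, seq_len, d_head)

    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate head outputs
    return concat @ W_O                                          # project back to d_model

# Example with the demo's settings: 6 tokens, 4 heads, dimension 64.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 64))
W_Q, W_K, W_V, W_O = (rng.normal(size=(64, 64)) * 0.1 for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=4).shape)  # (6, 64)
```

With 4 heads and dimension 64, each head attends in its own 16-dimensional subspace, which is why different heads can latch onto different relationships.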
Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Here dₖ is the dimension of the key vectors. Scaling keeps the dot products from growing with dₖ; without it, large scores push the softmax toward near one-hot weights and vanishing gradients.
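A quick numerical check of why the scaling matters; the dimensions and random vectors are arbitrary illustrations. For roughly unit-variance embeddings, the dot product qᵢ · kⱼ has variance on the order of dₖ, so unscaled scores tend to saturate the softmax:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.normal(size=d_k)
keys = rng.normal(size=(6, d_k))

raw = keys @ q                # std grows like √d_k (≈ 8 here)
scaled = raw / np.sqrt(d_k)   # back to roughly unit scale

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

print(softmax(raw))     # typically very peaked, close to one-hot
print(softmax(scaled))  # noticeably smoother distribution over the 6 keys
```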
Softmax
softmax(zᵢ) = e^zᵢ / Σⱼ e^zⱼ
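A numerically stable implementation, with the temperature knob from the controls above. Dividing the scores by a temperature T before the softmax is one standard way to implement that control; whether the demo does exactly this is an assumption. T < 1 sharpens the distribution, T > 1 flattens it:

```python
import numpy as np

def softmax(z, temperature=1.0):
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtracting the max avoids overflow in exp()
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax([2.0, 1.0, 0.5]))                   # roughly [0.63, 0.23, 0.14]
print(softmax([2.0, 1.0, 0.5], temperature=0.5))  # sharper
print(softmax([2.0, 1.0, 0.5], temperature=5.0))  # flatter
```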
Positional Encoding
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Sinusoidal positional encodings inject word order information into the model, since attention itself is permutation-invariant.
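A NumPy sketch that builds this sin/cos table; the function name and the choice of `max_len` are illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2) dimension indices
    angles = pos / np.power(10000.0, 2 * i / d_model)  # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

pe = positional_encoding(max_len=6, d_model=64)
print(pe.shape)  # (6, 64); added to the token embeddings before attention
```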
Attention Metrics
| Metric | Value |
| --- | --- |
| Tokens | 6 |
| Heads | 4 |
| Dimension | 64 |
| Max Attention | - |
| Min Attention | - |
| Status | Ready |