Self-Attention
Self-attention lets each token in a sequence look at every token — including itself — and decide how much to focus on each one. This is the core mechanism in Transformer models like BERT and GPT.
Each attention head learns different relationships — one might track syntax, another might link pronouns to their referents.
How to Use
- Navigate heads with the carousel — arrows, dots, or click a card
- Click a token to highlight what it attends to
- Expand blocks by clicking to see internal computations
- Adjust controls — heads, dimensions, temperature
Scaled Dot-Product Attention
- Embed: Look up each token in the embedding matrix, add positional encoding
- Project: Multiply by WQ, WK, WV to get Query, Key, Value
- Score: Compute qᵢ·kⱼ for all token pairs
- Scale: Divide by √dₖ
- Softmax: Normalize each row → attention weights
- Attend: Weighted sum of Values = output
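The pipeline above (project → score → scale → softmax → attend) can be sketched in a few lines of NumPy. The weight matrices and sizes below are random, illustrative placeholders, not the demo's actual parameters (which use d=64):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """One attention head over token embeddings X of shape (seq_len, d_model)."""
    Q = X @ Wq                           # Project: queries
    K = X @ Wk                           # Project: keys
    V = X @ Wv                           # Project: values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # Score + Scale: q_i . k_j / sqrt(d_k)
    weights = softmax(scores, axis=-1)   # Softmax: each row sums to 1
    return weights @ V, weights          # Attend: weighted sum of values

# Toy example: 3 tokens, d_model=8, d_k=d_v=4 (small sizes for readability)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))              # stand-in for embeddings + positions
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(X, Wq, Wk, Wv)
```

Each row of `w` is one token's attention distribution over the whole sequence; the corresponding row of `out` is that token's new representation.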
Multi-Head Attention
Multiple heads run in parallel with different learned WQ, WK, WV. Outputs are concatenated and projected through WO.
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
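A minimal multi-head sketch, reusing the single-head formula above: each head gets its own projections, the head outputs are concatenated, and WO maps the result back to the model dimension. All matrices and sizes here are random, illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) triples, one per head.
    Wo: (num_heads * d_v, d_model) output projection."""
    outputs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        w = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(w @ V)
    # Concatenate per-head outputs along the feature axis, then project with Wo.
    return np.concatenate(outputs, axis=-1) @ Wo

rng = np.random.default_rng(1)
d_model, d_k, num_heads = 8, 4, 2        # small illustrative sizes
X = rng.normal(size=(3, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(num_heads)]
Wo = rng.normal(size=(num_heads * d_k, d_model))
Y = multi_head_attention(X, heads, Wo)
```

Because every head has its own WQ, WK, WV, each can learn a different attention pattern, yet the output Y has the same shape as X, so the block stacks cleanly.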
Live Computation
Select a token and navigate heads to see live computations here.