Scaled Dot-Product Attention
Attention is the core building block of Transformers. Given input tokens, it computes how much each token should attend to every other token by projecting them into Query, Key, and Value spaces.
This visualization animates the internal step-by-step computation, complementing the companion Transformer visualization (which shows the final heatmap directly).
Key Insights
- Q/K/V are learned linear projections of input embeddings
- Scores measure query-key similarity via dot product
- Scaling by √d_k prevents gradient saturation
- Softmax converts scores into a probability distribution
- Output is a weighted mix of Value vectors
How to Use
- Press Run to animate all phases automatically
- Press Step to advance one phase at a time
- Change sentences to see different attention patterns
- Adjust embed dim to see how vector size affects computation
Computation Pipeline
- Embed: Map each token to a d-dimensional vector
- Project Q: Multiply embeddings by WQ to get queries
- Project K: Multiply embeddings by WK to get keys
- Project V: Multiply embeddings by WV to get values
- Score: Compute Q · K^T (dot product for each query-key pair)
- Scale: Divide scores by √d_k
- Softmax: Apply softmax row-by-row to get attention weights
- Aggregate: Multiply weights by V to produce output vectors
Pseudocode
Q = X @ W_Q                  # project queries
K = X @ W_K                  # project keys
V = X @ W_V                  # project values
scores = Q @ K.T             # dot products
scaled = scores / sqrt(d_k)  # scale
weights = softmax(scaled)    # per row
output = weights @ V         # weighted sum
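The pseudocode above can be made concrete with NumPy; the token count, embed dim, and random weights below are illustrative stand-ins for the visualization's inputs, not values from it.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q                        # queries, shape (tokens, d_k)
    K = X @ W_K                        # keys,    shape (tokens, d_k)
    V = X @ W_V                        # values,  shape (tokens, d_v)
    scores = Q @ K.T                   # pairwise query-key dot products
    scaled = scores / np.sqrt(K.shape[-1])
    weights = softmax(scaled)          # each row sums to 1
    return weights @ V, weights        # weighted mix of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))            # 3 tokens, embed dim 4 (illustrative)
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
output, weights = attention(X, W_Q, W_K, W_V)
```

Each row of `weights` is one token's attention distribution over all tokens, matching the heatmap the visualization builds up step by step.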
Attention Formula
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
Projections
Q = X W_Q,  K = X W_K,  V = X W_V
where X is the input embedding matrix and W_Q, W_K, W_V are learned weight matrices.
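As a quick shape check (dimensions chosen to match the visualization's default embed dim of 4; the zero weights are placeholders for learned matrices):

```python
import numpy as np

n, d, d_k = 3, 4, 4                   # 3 tokens, embed dim 4 (illustrative)
X = np.zeros((n, d))                  # input embedding matrix X
W_Q = np.zeros((d, d_k))              # learned in practice; zeros as stand-ins
W_K = np.zeros((d, d_k))
W_V = np.zeros((d, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # each projection maps (n, d) -> (n, d_k)
```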
Dot-Product Score
score(i, j) = q_i · k_j = Σ_d q_{i,d} k_{j,d}
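A worked example of the score formula with d = 2 (the vectors are made up for illustration):

```python
import numpy as np

q_i = np.array([1.0, 2.0])   # query vector for token i (illustrative)
k_j = np.array([3.0, 4.0])   # key vector for token j
score = q_i @ k_j            # 1*3 + 2*4 = 11.0
```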
Softmax
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
Scaling by √d_k prevents dot products from growing too large, which would push softmax into regions with tiny gradients.
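A small numerical sketch of why this matters (scores and d_k chosen for illustration): large unscaled scores make softmax nearly one-hot, so gradients to the other positions vanish, while scaling keeps the distribution soft.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # stable softmax over a 1-D array
    return e / e.sum()

scores = np.array([8.0, 4.0, 0.0])    # raw dot products grow with d_k
sharp = softmax(scores)               # nearly one-hot: first entry dominates
soft = softmax(scores / np.sqrt(64))  # scaled with d_k = 64: much flatter
```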
Attention Metrics
| Metric | Value |
| --- | --- |
| Phase | Idle |
| Tokens | - |
| Embed Dim | 4 |
| Score Matrix | - |
| Max Weight | - |
| Min Weight | - |
| Scale Factor | - |
| Num Heads | 4 |