Attention Mechanism

Controls

Custom sentences use synthetic attention (not real model output)

Scaled Dot-Product Attention

Attention is the core building block of Transformers. Given input tokens, it computes how much each token should attend to every other token by projecting them into Query, Key, and Value spaces.

This visualization animates the internal step-by-step computation, complementing the companion Transformer visualization (which shows the final heatmap directly).

Key Insights

  • Q/K/V are learned linear projections of input embeddings
  • Scores measure query-key similarity via dot product
  • Scaling by √d_k prevents gradient saturation
  • Softmax converts scores into a probability distribution
  • Output is a weighted mix of Value vectors

How to Use

  • Press Run to animate all phases automatically
  • Press Step to advance one phase at a time
  • Change sentences to see different attention patterns
  • Adjust embed dim to see how vector size affects computation

Computation Pipeline

  1. Embed: Map each token to a d-dimensional vector
  2. Project Q: Multiply embeddings by WQ to get queries
  3. Project K: Multiply embeddings by WK to get keys
  4. Project V: Multiply embeddings by WV to get values
  5. Score: Compute Q·Kᵀ (dot product for each query-key pair)
  6. Scale: Divide scores by √dk
  7. Softmax: Apply softmax row-by-row to get attention weights
  8. Aggregate: Multiply weights by V to produce output vectors

Pseudocode

Q = X @ W_Q                   # project queries
K = X @ W_K                   # project keys
V = X @ W_V                   # project values
scores = Q @ K.T              # dot product for every query-key pair
scaled = scores / sqrt(d_k)   # keep magnitudes in softmax's sensitive range
weights = softmax(scaled)     # softmax applied row by row
output = weights @ V          # weighted sum of value vectors
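As a sanity check, the pipeline above runs as-is with NumPy on random data. The shapes below (5 tokens, embed dim 4) mirror the visualization's defaults, but the weight matrices are random for illustration, not learned:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 4, 4          # illustrative sizes

X = rng.normal(size=(n_tokens, d_model))  # token embeddings
W_Q = rng.normal(size=(d_model, d_k))     # learned in a real model;
W_K = rng.normal(size=(d_model, d_k))     # random here for illustration
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)           # scaled dot products
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                      # one output vector per token

print(np.allclose(weights.sum(axis=1), 1.0))  # → True
print(output.shape)                           # → (5, 4)
```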

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Projections

Q = XW_Q,   K = XW_K,   V = XW_V

Where X is the input embedding matrix and W_Q, W_K, W_V are learned weight matrices.

Dot-Product Score

score(i, j) = q_i · k_j = Σ_d q_{i,d} k_{j,d}
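For instance, with a pair of illustrative 3-dimensional vectors (not taken from the visualization):

```python
import numpy as np

q_i = np.array([1.0, 0.0, 2.0])  # query vector for token i
k_j = np.array([0.5, 1.0, 1.0])  # key vector for token j

# score(i, j) = sum over d of q_{i,d} * k_{j,d}
score = float(q_i @ k_j)
print(score)  # → 2.5
```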

Softmax

softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

Scaling by √d_k prevents dot products from growing too large, which would push softmax into regions with tiny gradients.
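A quick numerical illustration of why the scaling matters; the score values are made up, and d_k = 64 follows common Transformer configurations rather than this demo's default:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

d_k = 64
z = np.array([10.0, 12.0, 11.0]) * np.sqrt(d_k)  # large unscaled scores

# Unscaled: softmax saturates to a near one-hot distribution,
# so gradients through the non-max entries are vanishingly small.
print(softmax(z))
# Scaled by sqrt(d_k): a much softer distribution.
print(softmax(z / np.sqrt(d_k)))
```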

Attention Metrics

Phase: Idle
Tokens: –
Embed Dim: 4
Score Matrix: –
Max Weight: –
Min Weight: –
Scale Factor: –
Num Heads: 4