Attention Mechanism

Controls

Custom sentences use synthetic attention (not real model output)

Scaled Dot-Product Attention

Attention is the core building block of Transformers. Given input tokens, it computes how much each token should attend to every other token by projecting them into Query, Key, and Value spaces.

This visualization animates the internal step-by-step computation, complementing the companion Transformer visualization (which shows the final heatmap directly).

Key Insights

  • Q/K/V are learned linear projections of input embeddings
  • Scores measure query-key similarity via dot product
  • Scaling by √d_k prevents gradient saturation
  • Softmax converts scores into a probability distribution
  • Output is a weighted mix of Value vectors

How to Use

  • Press Run to animate all phases automatically
  • Press Step to advance one phase at a time
  • Change sentences to see different attention patterns
  • Adjust embed dim to see how vector size affects computation

Computation Pipeline

  1. Embed: Map each token to a d-dimensional vector
  2. Project Q: Multiply embeddings by WQ to get queries
  3. Project K: Multiply embeddings by WK to get keys
  4. Project V: Multiply embeddings by WV to get values
  5. Score: Compute Q·Kᵀ (dot product for each query-key pair)
  6. Scale: Divide scores by √dk
  7. Softmax: Apply softmax row-by-row to get attention weights
  8. Aggregate: Multiply weights by V to produce output vectors

Pseudocode

Q = X @ W_Q                   # project queries
K = X @ W_K                   # project keys
V = X @ W_V                   # project values
scores = Q @ K.T              # dot product for every query-key pair
scaled = scores / sqrt(d_k)   # keep magnitudes in softmax's sensitive range
weights = softmax(scaled)     # softmax applied row by row
output = weights @ V          # weighted sum of value vectors
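As a sanity check, the pipeline above runs as-is with NumPy on random data. The shapes below (5 tokens, embed dim 4) mirror the visualization's defaults, but the weight matrices are random for illustration, not learned:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 4, 4          # illustrative sizes

X = rng.normal(size=(n_tokens, d_model))  # token embeddings
W_Q = rng.normal(size=(d_model, d_k))     # learned in a real model;
W_K = rng.normal(size=(d_model, d_k))     # random here for illustration
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
scores = Q @ K.T / np.sqrt(d_k)           # scaled dot products
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V                      # one output vector per token

print(np.allclose(weights.sum(axis=1), 1.0))  # → True
print(output.shape)                           # → (5, 4)
```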

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

Projections

Q = XW_Q,   K = XW_K,   V = XW_V

Where X is the input embedding matrix and W_Q, W_K, W_V are learned weight matrices.

Dot-Product Score

score(i, j) = q_i · k_j = Σ_d q_{i,d} k_{j,d}
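For instance, with a pair of illustrative 3-dimensional vectors (not taken from the visualization):

```python
import numpy as np

q_i = np.array([1.0, 0.0, 2.0])  # query vector for token i
k_j = np.array([0.5, 1.0, 1.0])  # key vector for token j

# score(i, j) = sum over d of q_{i,d} * k_{j,d}
score = float(q_i @ k_j)
print(score)  # → 2.5
```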

Softmax

softmax(z_i) = e^{z_i} / Σ_j e^{z_j}

Scaling by √d_k prevents dot products from growing too large, which would push softmax into regions with tiny gradients.
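A quick numerical illustration of why the scaling matters; the score values are made up, and d_k = 64 follows common Transformer configurations rather than this demo's default:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

d_k = 64
z = np.array([10.0, 12.0, 11.0]) * np.sqrt(d_k)  # large unscaled scores

# Unscaled: softmax saturates to a near one-hot distribution,
# so gradients through the non-max entries are vanishingly small.
print(softmax(z))
# Scaled by sqrt(d_k): a much softer distribution.
print(softmax(z / np.sqrt(d_k)))
```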

Attention Metrics

Phase: Idle
Tokens: –
Embed Dim: 4
Score Matrix: –
Max Weight: –
Min Weight: –
Scale Factor: –
Num Heads: 4