Transformers · Week 7 · Session 1

The Transformer & Self-Attention

How every word in a sentence learns to look at every other word, in parallel, with no recurrence at all.

~20 min read Exam weight: heavy on the Final and the attention coding viva Builds on Seq2Seq & Bahdanau attention

An RNN reads a sentence one word at a time and squeezes everything it has seen into a single hidden state. By the time it reaches word 50, word 1 is a faint memory, and you cannot start word 2 until word 1 is done. The Transformer throws out the loop. Every word looks at every other word at once, and the strength of each look is a learned number, $\text{softmax}(QK^\top/\sqrt{d_k})$. That one idea, scaled to billions of parameters, is what powers GPT and BERT.

1.The problem with recurrence

A recurrent network has two costs baked into its shape. First, it is sequential: hidden state $h_t$ depends on $h_{t-1}$, so the words must be processed in order, one after another. You cannot use a GPU's thousands of cores to work on all positions together. Second, information between two distant words has to survive a long chain of updates. The "path length" between word 1 and word 50 is 50 steps, and gradients fade across that path.

Attention removes both costs in one move. Vaswani et al. (2017) showed that a self-attention layer connects any two positions in a single step, so the maximum path length is $O(1)$ instead of $O(n)$, and all positions are computed in parallel. [Attention Is All You Need, Table 1]

Layer type	Complexity per layer	Sequential ops	Max path length
Recurrent	$O(n\cdot d^2)$	$O(n)$	$O(n)$
Self-attention	$O(n^2\cdot d)$	$O(1)$	$O(1)$

Read the trade carefully. Self-attention costs $O(n^2 d)$ because every token compares against every other token, so it is cheaper than recurrence only when the sequence length $n$ is smaller than the model width $d$. For a paragraph that holds; for a whole book it stops holding, which is why long-context models need tricks. But the parallelism and the short path are worth the quadratic cost for the lengths we usually train on.

⊳ First principles

An RNN forces a single channel through time: one hidden vector carries the whole past. Attention replaces that bottleneck with direct wiring. If word 50 needs word 1, it reads word 1 straight from the input, no relay race through 49 intermediate states. The model decides what to read by learning, not by position.

2.Queries, keys, and values

Think of a soft dictionary lookup. You hold a question (a query). Every entry in the dictionary has a label (a key) and some content (a value). A hard lookup returns the one value whose key matches exactly. Attention does a soft version: it compares your query against every key, turns the match scores into weights that sum to 1, and returns a blend of all the values, weighted by how well each key matched. [Lil'Log, Attention]

In self-attention the dictionary is the sentence itself. Each token's embedding $x_i$ is projected three ways by learned matrices:

$$q_i = x_i W^Q,\qquad k_i = x_i W^K,\qquad v_i = x_i W^V$$

Stack them into matrices $Q, K, V$, one row per token. The same input produces all three, which is why it is called self-attention: queries, keys, and values come from one sequence. In the original model the embeddings are 512-dimensional and each of $q, k, v$ lives in 64 dimensions. [The Illustrated Transformer]

◆ Intuition

Why three separate projections and not just the raw embeddings? The roles are different. A query asks "what am I looking for?", a key advertises "here is what I offer", and a value carries "here is what you get if you pick me". Letting the model learn three views means a token can advertise one thing as a key while contributing something else as a value. One matrix could not separate those jobs.

3.Scaled dot-product attention

Now the whole layer in one line. To find how much token $i$ should attend to token $j$, take the dot product of query $i$ with key $j$. Collect every pair into the score matrix $QK^\top$, scale it, softmax each row, and use those weights to average the values:

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The softmax runs along the key axis, so each query's row of weights sums to 1. Row $i$ of the output is a weighted blend of all value vectors, the blend that token $i$ asked for.

The one piece that trips people up is the $\sqrt{d_k}$. Here is the reason, and it is a favourite exam question. Suppose the query and key entries are independent with mean 0 and variance 1. A dot product of two $d_k$-dimensional vectors sums $d_k$ such products, so it has mean 0 and variance $d_k$. The scores therefore grow with dimension. Feed large scores into a softmax and it saturates: almost all the weight piles onto one key and the gradient through the softmax goes to nearly zero, which stalls learning. [Vaswani et al., §3.2.1]

Dividing by $\sqrt{d_k}$ rescales the variance of each score back to 1, no matter how large $d_k$ gets, keeping the softmax in its responsive range. [Dive into Deep Learning, 11.3]

▲ Exam-favourite numbers

For the base Transformer, $d_{\text{model}}=512$ split across $h=8$ heads gives $d_k=d_v=512/8=64$, so the scale factor is $\sqrt{64}=8$. Two unit-variance vectors of length 64 produce a dot product with variance 64; dividing by 8 brings it back to variance 1. Memorise the chain: $d_k=64 \Rightarrow \sqrt{d_k}=8$.

Attention is a soft, differentiable dictionary lookup: scores from $QK^\top$, weights from softmax, output from a value blend. The $\sqrt{d_k}$ keeps the softmax awake.

4.Watch attention happen

The simulator runs scaled dot-product attention on a five-token sentence. The heatmap is the weight matrix: row $i$, column $j$ is how much query token $i$ attends to key token $j$, and every row sums to 1. Click a row label to see that token's full attention distribution. Switch heads and watch the pattern change: one head tracks the neighbour to the left, one links the subject to its verb, one stays diffuse. That is the point of multiple heads, and the next section explains it.

Two things to notice. Turn the temperature down (a smaller scale, like dividing by 1 instead of $\sqrt{d_k}$) and the rows collapse toward a single bright cell: that is softmax saturation, the failure the scaling prevents. Switch heads and the bright cells move to different columns: each head has learned a different notion of "relevant".

5.Multi-head attention

A single attention layer produces one weight pattern per query. But a word relates to its context in several ways at once: grammatically to its verb, by reference to an earlier noun, by topic to a distant phrase. One softmax cannot represent all of those, because each row of weights has to sum to 1 and can only point firmly in a few directions.

The fix is to run $h$ attention computations in parallel, each with its own $W^Q_i, W^K_i, W^V_i$ projecting into a smaller $d_k = d_{\text{model}}/h$ subspace. Each head attends in its own way. Concatenate the $h$ outputs and pass them through one more matrix $W^O$:

$$\text{MultiHead}(Q,K,V) = [\text{head}_1;\dots;\text{head}_h]\,W^O,\qquad \text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i)$$

The shapes work out so multi-head attention costs about the same as one full-width head: eight heads of width 64 concatenate back to 512. [The Illustrated Transformer]

● Worked size

Base model: $d_{\text{model}}=512$, $h=8$. Each head uses $d_k=d_v=64$. After eight heads of 64 you concatenate to $8\times 64 = 512$, then $W^O$ mixes them. So the layer goes 512 in, 512 out, and the per-head 64 is the same number that sets the scale factor $\sqrt{64}=8$.

One more variant matters for the decoder. In a generator like GPT, token $i$ must not peek at tokens that come after it, or training would leak the answer. Masked self-attention sets the scores for future positions to $-\infty$ before the softmax, so their weights become exactly 0. The encoder sees the whole input and needs no mask. [Vaswani et al., §3.2.3]

6.Positional encoding

Attention has a quiet flaw. Shuffle the tokens and the set of scores is unchanged, because a dot product does not care about order. The layer is permutation-invariant, so on its own it would read "dog bites man" and "man bites dog" the same way. We have to inject position by hand.

The original Transformer adds a fixed sinusoidal signal to each input embedding. For position $pos$ and dimension index $i$:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),\qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Each dimension is a sine or cosine of a different frequency. The wavelengths run as a geometric progression from $2\pi$ up to about $10000\cdot 2\pi$, so low dimensions oscillate fast and high dimensions oscillate slowly, like the columns of a binary counter. [Kazemnejad, Positional Encoding]

Why sine and cosine as a pair, rather than just feeding in the integer position? Because for any fixed offset $k$, the encoding of position $pos+k$ is a fixed linear function (a 2D rotation) of the encoding of $pos$. That lets the model reason about relative position with a learned linear map, and because the signal is bounded it extends to sequence lengths never seen in training. The bottom panel of the simulator plots two of these dimensions so you can see the fast and slow waves.

▲ Exam-favourite numbers

Three facts the exam likes to test together: attention is permutation-invariant, so removing positional encoding makes "dog bites man" and "man bites dog" identical to the model; the encoding is added to the embedding, not concatenated; and the base of the geometric progression is $10000$.

7.In NumPy: attention from scratch

The exam's coding section asks for exactly this function. It is short because the formula is short. The only subtlety is doing the softmax in a numerically stable way (subtract the row max) and applying the mask by adding $-\infty$ before the softmax.

attention.pyscaled dot-product attention, with an optional mask

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stability: subtract the row max
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_queries, d_k)  K: (n_keys, d_k)  V: (n_keys, d_v)
    # mask: (n_queries, n_keys) bool, True = mask out (set to -inf)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_queries, n_keys)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    output = weights @ V                      # (n_queries, d_v)
    return output, weights

# Verify on the exam's test sentence
Q = np.array([[1, 0, 1], [0, 1, 0]], float)
K = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]], float)
V = np.array([[1, 2], [3, 0], [0, 1]], float)
out, w = scaled_dot_product_attention(Q, K, V)
print(w.round(3))        # [[0.264 0.264 0.471] [0.39  0.39  0.219]]
print(w.sum(axis=1))     # [1. 1.]  -> rows sum to 1
print(out.round(3))      # [[1.058 1.   ] [1.562 1.   ]]

The shapes are the whole story: scores are $n\times n$, weights are $n\times n$ with rows summing to 1, and the output is $n\times d_v$, one blended value vector per query. If your weights do not sum to 1 along the key axis, your softmax is on the wrong axis.

The self-attention layer in five steps

Project: $Q=XW^Q$, $K=XW^K$, $V=XW^V$ from the same input $X$.
Score: $S = QK^\top$, every query against every key.
Scale: divide by $\sqrt{d_k}$ to hold the variance at 1 and keep softmax responsive.
Weight: softmax each row (mask future positions to $-\infty$ first in a decoder).
Blend: output $= \text{weights}\cdot V$, a weighted average of values per query.

8.Quick check

Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.

Primary source to read next: Vaswani et al. (2017), "Attention Is All You Need", sections 3.1 to 3.3, then Jay Alammar's The Illustrated Transformer for the picture version and Lilian Weng's attention survey for the math. Stuck on why the softmax saturates, or want to grind the $d_k=64\Rightarrow\sqrt{d_k}=8$ variance argument by hand? Ask me and we will work it out from the dot-product variance up.