Seq2Seq & Attention
One sentence will not fit through one vector. Attention is the decoder learning to look back.
An encoder-decoder reads a source sentence with one RNN and writes the target with another. The two are joined by a single context vector, the encoder's last hidden state. That works until the input gets long, and then it cracks: you cannot pour a 40-word sentence through one fixed vector $c \in \mathbb{R}^d$ without losing the early words. Attention removes that funnel. At every output step the decoder builds a fresh context $c_t = \sum_i \alpha_{ti}\, h_i$, a weighted blend of all encoder states, with the weights $\alpha_{ti}$ learned end to end.
1.The problem: turning one sequence into another
Many tasks map a sequence to another sequence of a different length: translate "the cat sat" to "le chat assis", summarize a paragraph into a headline, convert a date string into ISO format. A plain RNN classifier cannot do this, because input and output lengths differ and there is no fixed alignment between positions.
Sutskever et al. (2014) gave the standard recipe: two RNNs. The encoder reads the source one token at a time and ends with a hidden state. The decoder starts from that state and generates the target one token at a time, feeding each output back in as the next input until it emits a stop symbol.
The only wire from encoder to decoder is the final hidden state. Everything the decoder ever knows about the source has to be packed into that one vector. So the design question is simple: how much can a single fixed-size vector remember? The answer is "enough for short inputs, not enough for long ones," and that single fact drives the whole lesson.
2.The bottleneck: a whole sentence in one vector
Call the encoder's final state $c$. Its size $d$ is fixed: 512, 1024, whatever you chose. A 5-word sentence and a 50-word sentence both have to fit in the same $d$ numbers. As the input grows, you are compressing more and more meaning into a constant budget, so detail gets crushed. The handout's image is exact: it is like summarizing a book onto a single sticky note.
There is a second, sharper reason the early words fade. The encoder processes left to right, so word 1 sits dozens of recurrent steps away from $c$. Even with an LSTM, information from the start of a long sentence has to survive a long walk to reach the context vector, and a lot of it does not.
Bahdanau et al. (2015) measured this directly. A fixed-vector encoder-decoder (their RNNencdec) shows BLEU that falls off sharply as sentences get longer, while the attention model (RNNsearch) stays roughly flat. Their RNNsearch-50 reached BLEU 36.15 on English to French, edging past the Moses phrase-based system at 35.63. The exam likes one phrase for this: the information bottleneck of the fixed-length context vector. (arxiv.org/abs/1409.0473)
3.Attention: let the decoder look back
The fix is almost embarrassingly direct. Stop forcing the decoder to rely on one vector. Keep all the encoder hidden states $h_1, \dots, h_n$ around, and at each decoding step let the decoder decide which of them to read. Like an open-book exam instead of a closed-book one.
At decoder step $t$, with the decoder's current state $s_t$, attention runs three steps (this is the answer the viva wants, word for word):
- Score every source position: $e_{ti} = \text{score}(s_t, h_i)$ for $i = 1\dots n$. How relevant is source word $i$ to what I am writing now?
- Normalize the scores into weights with a softmax: $\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{j}\exp(e_{tj})}$. The weights are non-negative and sum to 1.
- Blend: take the weighted sum of the encoder states, $c_t = \sum_{i} \alpha_{ti}\, h_i$. This fresh context feeds the decoder's output for step $t$.
Stack the weight vectors $\alpha_t$ for every output step and you get the alignment heatmap: target words on the rows, source words on the columns, brightness equal to $\alpha_{ti}$. A bright diagonal means the model copies word order; bright off-diagonal cells mean it reordered, which is exactly what happens when English and French (or German verb placement) disagree on order.
The weights are never told to a human-labelled alignment. They are learned end to end from the translation loss alone. That is why we call it soft attention: the context is a smooth weighted average, so it is differentiable, and backprop flows straight through the softmax into the scores. A hard alternative would pick one source word with an argmax, which has a zero gradient almost everywhere and needs reinforcement-learning tricks to train.
Soft attention is a weighted average; hard attention is a single pick. The average is differentiable because nudging a score nudges the weights smoothly, so $\partial L / \partial e_{ti}$ exists everywhere. The argmax in hard attention is flat between jumps, so its gradient is zero almost everywhere. That single property, differentiability, is why every modern model uses soft attention.
4.Scoring: Bahdanau vs Luong
The three steps are settled. The only thing left to choose is the score function in step 1. Two classic answers come up on the exam.
Bahdanau (additive, 2015) runs a tiny one-hidden-layer network over the two states:
$$e_{ti} \;=\; v_a^\top \tanh\!\big(W_a\,[\,s_t;\,h_i\,]\big)$$
It concatenates $s_t$ and $h_i$, mixes them through a learned matrix $W_a$ and a $\tanh$, then projects to a scalar with $v_a$. More parameters, more expressive, and it works even when the decoder and encoder states have different sizes.
Luong (multiplicative, 2015) replaces that network with a dot product:
$$e_{ti} \;=\; s_t^\top h_i \quad(\text{dot})\qquad\text{or}\qquad e_{ti} \;=\; s_t^\top W_a\, h_i \quad(\text{general})$$
The dot form has no parameters at all and is a single matrix multiply over the whole source, so it is fast and scales well to large hidden sizes. The price: $s_t$ and $h_i$ must share a dimension.
5.The query-key-value view
Here is the framing that carries straight into next week's Transformer. Rename the three actors. The decoder state asking the question is the query $q$. Each source state, in its role of advertising "here is what I am about," is a key $k_i$. Each source state, in its role of supplying content to retrieve, is a value $v_i$. In plain seq2seq, keys and values are the same encoder states $h_i$.
Attention is then a soft dictionary lookup. A hard dictionary returns the value at the one key that matches exactly. Soft attention returns a weighted mix of all values, weighted by how well the query matches each key:
$$\text{Attention}(q, K, V) \;=\; \sum_i \text{softmax}_i\big(\text{score}(q, k_i)\big)\, v_i$$
The Transformer picks the dot-product score, scales it, and stacks queries into a matrix so the whole thing is two matrix multiplies and a softmax. We derive that next week. For now the takeaway is that the seq2seq attention you just learned is the Transformer's attention, with the encoder states playing both keys and values.
Translating "the cat sat" word by word: when the decoder writes "chat", its state $s_t$ is the query. It scores against the keys (encoder states for "the", "cat", "sat"), the softmax puts most weight on "cat", and the returned value is mostly "cat"'s encoder state. Query = decoder state, keys = values = encoder states, weights = similarity of query to each key.
6.In NumPy: scaled dot-product attention
This exact function is the recurring coding question on the project viva, so write it until it is muscle memory. It takes a stack of queries, keys, and values, scores every query against every key, scales, softmaxes each row, and returns the weighted values plus the weight matrix.
Why divide by $\sqrt{d_k}$? If query and key entries are independent with variance 1, the dot product $q^\top k$ has variance $d_k$. For large $d_k$ the raw scores get big, the softmax saturates, and its gradient nearly vanishes. Dividing by $\sqrt{d_k}$ pulls the variance back to 1, so the softmax stays in a responsive range. (Vaswani et al. 2017, Section 3.2.1)
import numpy as np
def softmax(x, axis=-1):
x = x - np.max(x, axis=axis, keepdims=True) # stable: subtract row max
e = np.exp(x)
return e / np.sum(e, axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V, mask=None):
# Q: (n_queries, d_k) K: (n_keys, d_k) V: (n_keys, d_v)
# mask: (n_queries, n_keys) bool, True = block this pair
# returns: output (n_queries, d_v), weights (n_queries, n_keys)
d_k = Q.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) # (n_queries, n_keys)
if mask is not None:
scores = np.where(mask, -1e9, scores) # push masked logits to -inf
weights = softmax(scores, axis=-1) # each row sums to 1
output = weights @ V # weighted blend of values
return output, weights
Run the viva's own test: Q=[[1,0,1],[0,1,0]], K=[[1,1,0],[0,1,1],[1,0,1]], V=[[1,2],[3,0],[0,1]]. You get an output of shape (2, 2), weights of shape (2, 3), and every weight row sums to 1. The masking line is the only subtle part: set masked logits to a large negative number before the softmax so their weights come out at essentially 0.
7.Quick check
Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.