Sequences · Week 6 · Session 2

Seq2Seq & Attention

One sentence will not fit through one vector. Attention is the decoder learning to look back.

~20 min read Exam weight: heavy on the RNN viva and the NumPy attention code Builds on LSTMs, GRUs & Gating

An encoder-decoder reads a source sentence with one RNN and writes the target with another. The two are joined by a single context vector, the encoder's last hidden state. That works until the input gets long, and then it cracks: you cannot pour a 40-word sentence through one fixed vector $c \in \mathbb{R}^d$ without losing the early words. Attention removes that funnel. At every output step the decoder builds a fresh context $c_t = \sum_i \alpha_{ti}\, h_i$, a weighted blend of all encoder states, with the weights $\alpha_{ti}$ learned end to end.

1.The problem: turning one sequence into another

Many tasks map a sequence to another sequence of a different length: translate "the cat sat" to "le chat assis", summarize a paragraph into a headline, convert a date string into ISO format. A plain RNN classifier cannot do this, because input and output lengths differ and there is no fixed alignment between positions.

Sutskever et al. (2014) gave the standard recipe: two RNNs. The encoder reads the source one token at a time and ends with a hidden state. The decoder starts from that state and generates the target one token at a time, feeding each output back in as the next input until it emits a stop symbol.

⊳ First principles

The only wire from encoder to decoder is the final hidden state. Everything the decoder ever knows about the source has to be packed into that one vector. So the design question is simple: how much can a single fixed-size vector remember? The answer is "enough for short inputs, not enough for long ones," and that single fact drives the whole lesson.

2.The bottleneck: a whole sentence in one vector

Call the encoder's final state $c$. Its size $d$ is fixed: 512, 1024, whatever you chose. A 5-word sentence and a 50-word sentence both have to fit in the same $d$ numbers. As the input grows, you are compressing more and more meaning into a constant budget, so detail gets crushed. The handout's image is exact: it is like summarizing a book onto a single sticky note.

There is a second, sharper reason the early words fade. The encoder processes left to right, so word 1 sits dozens of recurrent steps away from $c$. Even with an LSTM, information from the start of a long sentence has to survive a long walk to reach the context vector, and a lot of it does not.

▲ Exam-favourite numbers

Bahdanau et al. (2015) measured this directly. A fixed-vector encoder-decoder (their RNNencdec) shows BLEU that falls off sharply as sentences get longer, while the attention model (RNNsearch) stays roughly flat. Their RNNsearch-50 reached BLEU 36.15 on English to French, edging past the Moses phrase-based system at 35.63. The exam likes one phrase for this: the information bottleneck of the fixed-length context vector. (arxiv.org/abs/1409.0473)

3.Attention: let the decoder look back

The fix is almost embarrassingly direct. Stop forcing the decoder to rely on one vector. Keep all the encoder hidden states $h_1, \dots, h_n$ around, and at each decoding step let the decoder decide which of them to read. Like an open-book exam instead of a closed-book one.

At decoder step $t$, with the decoder's current state $s_t$, attention runs three steps (this is the answer the viva wants, word for word):

The three steps of attention at step $t$

Score every source position: $e_{ti} = \text{score}(s_t, h_i)$ for $i = 1\dots n$. How relevant is source word $i$ to what I am writing now?
Normalize the scores into weights with a softmax: $\alpha_{ti} = \dfrac{\exp(e_{ti})}{\sum_{j}\exp(e_{tj})}$. The weights are non-negative and sum to 1.
Blend: take the weighted sum of the encoder states, $c_t = \sum_{i} \alpha_{ti}\, h_i$. This fresh context feeds the decoder's output for step $t$.

Stack the weight vectors $\alpha_t$ for every output step and you get the alignment heatmap: target words on the rows, source words on the columns, brightness equal to $\alpha_{ti}$. A bright diagonal means the model copies word order; bright off-diagonal cells mean it reordered, which is exactly what happens when English and French (or German verb placement) disagree on order.

The weights are never told to a human-labelled alignment. They are learned end to end from the translation loss alone. That is why we call it soft attention: the context is a smooth weighted average, so it is differentiable, and backprop flows straight through the softmax into the scores. A hard alternative would pick one source word with an argmax, which has a zero gradient almost everywhere and needs reinforcement-learning tricks to train.

◆ Intuition

Soft attention is a weighted average; hard attention is a single pick. The average is differentiable because nudging a score nudges the weights smoothly, so $\partial L / \partial e_{ti}$ exists everywhere. The argmax in hard attention is flat between jumps, so its gradient is zero almost everywhere. That single property, differentiability, is why every modern model uses soft attention.

4.Scoring: Bahdanau vs Luong

The three steps are settled. The only thing left to choose is the score function in step 1. Two classic answers come up on the exam.

Bahdanau (additive, 2015) runs a tiny one-hidden-layer network over the two states:

$$e_{ti} \;=\; v_a^\top \tanh\!\big(W_a\,[\,s_t;\,h_i\,]\big)$$

It concatenates $s_t$ and $h_i$, mixes them through a learned matrix $W_a$ and a $\tanh$, then projects to a scalar with $v_a$. More parameters, more expressive, and it works even when the decoder and encoder states have different sizes.

Luong (multiplicative, 2015) replaces that network with a dot product:

$$e_{ti} \;=\; s_t^\top h_i \quad(\text{dot})\qquad\text{or}\qquad e_{ti} \;=\; s_t^\top W_a\, h_i \quad(\text{general})$$

The dot form has no parameters at all and is a single matrix multiply over the whole source, so it is fast and scales well to large hidden sizes. The price: $s_t$ and $h_i$ must share a dimension.

Same three steps, different score in step 1. Bahdanau learns a small network (slow, flexible); Luong uses a dot product (fast, needs matching dims). Prefer additive for small models, dot-product for large ones.

5.The query-key-value view

Here is the framing that carries straight into next week's Transformer. Rename the three actors. The decoder state asking the question is the query $q$. Each source state, in its role of advertising "here is what I am about," is a key $k_i$. Each source state, in its role of supplying content to retrieve, is a value $v_i$. In plain seq2seq, keys and values are the same encoder states $h_i$.

Attention is then a soft dictionary lookup. A hard dictionary returns the value at the one key that matches exactly. Soft attention returns a weighted mix of all values, weighted by how well the query matches each key:

$$\text{Attention}(q, K, V) \;=\; \sum_i \text{softmax}_i\big(\text{score}(q, k_i)\big)\, v_i$$

The Transformer picks the dot-product score, scales it, and stacks queries into a matrix so the whole thing is two matrix multiplies and a softmax. We derive that next week. For now the takeaway is that the seq2seq attention you just learned is the Transformer's attention, with the encoder states playing both keys and values.

● Worked mapping

Translating "the cat sat" word by word: when the decoder writes "chat", its state $s_t$ is the query. It scores against the keys (encoder states for "the", "cat", "sat"), the softmax puts most weight on "cat", and the returned value is mostly "cat"'s encoder state. Query = decoder state, keys = values = encoder states, weights = similarity of query to each key.

6.In NumPy: scaled dot-product attention

This exact function is the recurring coding question on the project viva, so write it until it is muscle memory. It takes a stack of queries, keys, and values, scores every query against every key, scales, softmaxes each row, and returns the weighted values plus the weight matrix.

Why divide by $\sqrt{d_k}$? If query and key entries are independent with variance 1, the dot product $q^\top k$ has variance $d_k$. For large $d_k$ the raw scores get big, the softmax saturates, and its gradient nearly vanishes. Dividing by $\sqrt{d_k}$ pulls the variance back to 1, so the softmax stays in a responsive range. (Vaswani et al. 2017, Section 3.2.1)

attention.pysoftmax(QKᵀ / √dₖ) V, with an optional mask

import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)   # stable: subtract row max
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_queries, d_k)   K: (n_keys, d_k)   V: (n_keys, d_v)
    # mask: (n_queries, n_keys) bool, True = block this pair
    # returns: output (n_queries, d_v), weights (n_queries, n_keys)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_queries, n_keys)
    if mask is not None:
        scores = np.where(mask, -1e9, scores)     # push masked logits to -inf
    weights = softmax(scores, axis=-1)            # each row sums to 1
    output = weights @ V                          # weighted blend of values
    return output, weights

Run the viva's own test: Q=[[1,0,1],[0,1,0]], K=[[1,1,0],[0,1,1],[1,0,1]], V=[[1,2],[3,0],[0,1]]. You get an output of shape (2, 2), weights of shape (2, 3), and every weight row sums to 1. The masking line is the only subtle part: set masked logits to a large negative number before the softmax so their weights come out at essentially 0.

7.Quick check

Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.

Primary source to read next: Bahdanau, Cho & Bengio (2015), Neural Machine Translation by Jointly Learning to Align and Translate, Section 3 is the alignment model. Pair it with Jay Alammar's Visualizing A Neural Machine Translation Model for the picture, and Lilian Weng's Attention? Attention! for every score variant in one place. Stuck on why the three steps are score, softmax, blend, or on the $\sqrt{d_k}$ scaling? Ask me and we will trace the viva's test vectors by hand.