Sequence Modeling: RNNs
A network with a loop: it reads one element at a time and carries a running memory of everything it has seen.
An image is one fixed block of pixels. A sentence is not: "the cat sat" and "the cat that chased the dog sat" are different lengths, and the order carries the meaning. An MLP needs a fixed-size input vector, so it has no clean way to take a stream of words whose length you do not know in advance. A recurrent network solves this by reading one element at a time and keeping a hidden state $h_t$ that summarizes everything seen so far. The same small cell runs at every step, so a 5-word sentence and a 50-word sentence cost the same set of weights.
1.Why an MLP cannot read a sentence
Feed a sentence to a plain MLP and two problems show up at once. First, the input layer has a fixed width, so you must pad or truncate every sentence to one length, which wastes parameters on padding and throws away anything past the cutoff. Second, an MLP treats position 1 and position 7 as unrelated input slots. It has no built-in notion that the word now depends on the words before it.
What we want is a model that reuses the same computation as it walks along the sequence, and that passes a memory forward from one step to the next. That memory is the whole idea.
Order matters and length varies. So instead of one big function over the whole input, build one small function and apply it again and again, threading a state vector through each application. Weight sharing across time is what lets a fixed parameter set handle a sequence of any length, the same way weight sharing across space lets a CNN handle any image size.
2.The vanilla RNN: one cell, reused
At each timestep $t$ the cell takes the current input $x_t$ and the previous hidden state $h_{t-1}$, mixes them, and squashes the result with $\tanh$:
$$h_t = \tanh\!\left(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h\right), \qquad \hat{y}_t = W_{hy}\,h_t + b_y$$
Three weight matrices, and every one of them is shared across all timesteps (Karpathy, 2015):
| Matrix | Connects | Role |
|---|---|---|
| $W_{xh}$ | input → hidden | reads the new element into the state |
| $W_{hh}$ | hidden → hidden | updates the memory (this is the recurrence) |
| $W_{hy}$ | hidden → output | reads a prediction out of the state |
The first hidden state starts from $h_0 = 0$. After that, $h_t$ is a learned summary of inputs $x_1$ through $x_t$. The Stanford CS231n notes give the same recurrence and call $h_t$ the network's running state (CS231n RNN notes).
Think of reading a book one word at a time while keeping a one-line mental summary. Each new word updates the summary; you never re-read the whole page. The summary is $h_t$, the update rule is the cell, and the fact that you use the same reading habit on every word is weight sharing.
3.Unrolling: the loop is a deep network in disguise
The loop is convenient to draw but hard to reason about. So we unroll it: lay out one copy of the cell per timestep, left to right, and draw the hidden state flowing along the top. Now it looks like an ordinary feedforward network that happens to be $T$ layers deep and happens to reuse the same weights at every layer.
Slide the sequence length and watch the hidden state evolve. Then turn on the gradient-flow toggle to see what happens during training.
Two things to notice. Going forward, each bar (the hidden state) is computed from the previous bar plus the new input, so information moves rightward step by step. Going backward, the gradient that trains the early steps has to travel through every cell in between, and you will see it shrink as it goes. That shrinking is the whole reason RNNs struggle with long sequences.
4.Backprop through time is just backprop on the unrolled graph
Once the RNN is unrolled, training needs nothing new. Run the forward pass across all $T$ steps, compute the loss (often summed over timesteps), then sweep backward through the unrolled graph with the chain rule. That sweep has a name, backpropagation through time, but it is the same algorithm from the backprop lesson applied to a graph that is deep in time instead of deep in layers.
The one wrinkle is weight sharing. Because $W_{hh}$ appears at every timestep, its gradient is the sum of the contributions from all of them, exactly the add-at-a-fork rule you already know:
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}$$
To send the gradient from the loss back to an early step, the chain rule walks it through every hidden-to-hidden step in between. That produces a product of Jacobians, one factor per timestep:
$$\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} \operatorname{diag}\!\big(\tanh'(\cdot)\big)\, W_{hh}^{\top}$$
Read that product carefully, because the next section lives or dies on it. The same matrix $W_{hh}$ gets multiplied in once for every step between $k$ and $t$ (Goodfellow, Bengio & Courville, ch. 10).
Each timestep in the unrolled RNN behaves like one layer of a feedforward net. A 50-step sequence is, for backprop, a 50-layer network. The catch the handout flags: it is a 50-layer network that reuses the same weight matrix at every layer, which makes the gradient a power of that one matrix.
5.Why long-range memory breaks
Look again at the product of Jacobians. Two facts about its factors decide everything. The $\tanh'$ term is at most $1$ and is usually well below it, so it can only shrink the signal. And the repeated $W_{hh}$ scales the signal by roughly its largest singular value (its spectral radius) at each step.
So the gradient reaching step $k$ from step $t$ behaves like that scaling factor raised to the power of the gap $t - k$. If the dominant singular value of $W_{hh}$ is below $1$, the product decays exponentially and the gradient vanishes: early inputs get almost no training signal. If it is above $1$, the product blows up and the gradient explodes into NaNs (CS231n RNN notes).
The handout's worked case: if the largest singular value of $W_{hh}$ is $0.9$ and the sequence is $50$ steps long, the gradient that reaches step 1 from step 50 is about $0.9^{50} \approx 0.005$. Only about 0.5% of the signal survives, so the network effectively cannot learn from that early input. In practice a vanilla RNN forgets inputs from more than roughly $20$ steps back. Remember $\tanh' \le 1$ everywhere, which is why repeated application can only make this worse.
Exploding gradients have a cheap fix: clip the gradient norm to a ceiling. Vanishing gradients do not, because you cannot un-multiply by numbers smaller than one. That gap is the entire motivation for the gated cells coming next week, the LSTM and GRU, which add a near-additive path for the gradient to travel along instead of a repeated multiply.
6.Wiring an RNN to a task
The same cell supports several input-output shapes, depending on where you read predictions out (Karpathy, 2015):
| Shape | Read output | Example tasks |
|---|---|---|
| many-to-one | only at the last step | sentiment analysis, document classification |
| many-to-many (aligned) | at every step | part-of-speech tagging, named entity recognition |
| many-to-many (seq2seq) | read sequence after the input ends | machine translation |
| one-to-many | one input, then generate | image captioning |
For many-to-one, you let the RNN consume the whole sentence and use only $h_T$, the final summary, as the feature vector for a classifier. For aligned many-to-many, you attach an output head to every $h_t$ and get one label per token. The cell never changes; only the read-out wiring does.
7.In NumPy: a vanilla RNN forward pass
The forward pass is the recurrence written as a loop. Notice that the same three matrices are used inside the loop at every step, and the hidden state is what carries memory from one iteration to the next.
import numpy as np
def rnn_forward(xs, h0, Wxh, Whh, Why, bh, by):
"""xs: list of T input vectors (each shape (D,)). Returns hidden states and outputs."""
h = h0 # h0 is usually zeros, shape (H,)
hs, ys = [], []
for x in xs: # walk left to right, one element at a time
h = np.tanh(Wxh @ x + Whh @ h + bh) # SAME weights every step
y = Why @ h + by # read a prediction out of the state
hs.append(h); ys.append(y)
return hs, ys # hs[-1] is the final summary h_T
# tiny demo: D=5 inputs, H=16 hidden, sequence length T=20
D, H, T = 5, 16, 20
rng = np.random.default_rng(0)
Wxh = rng.standard_normal((H, D)) * 0.1
Whh = rng.standard_normal((H, H)) * 0.1
Why = rng.standard_normal((1, H)) * 0.1
bh, by = np.zeros(H), np.zeros(1)
xs = [rng.standard_normal(D) for _ in range(T)]
hs, ys = rnn_forward(xs, np.zeros(H), Wxh, Whh, Why, bh, by)
print([round(float(np.linalg.norm(h)), 2) for h in hs]) # hidden-state norm per step
Run it and print the hidden-state norm at each step, exactly the post-lecture exercise. The norm settling to a roughly stable band is the cell finding its working range. Swap $\tanh$ for the identity and scale $W_{hh}$ above $1$, and you will watch the norm blow up instead, which is the forward-pass shadow of the exploding-gradient problem.
8.Quick check
Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.