Sequences · Week 6 · Session 1

LSTMs, GRUs & Gating

A memory that survives a thousand steps, built from three sigmoids and one addition.

~20 min read Exam weight: heavy on the RNN section and the viva Builds on Vanilla RNNs & BPTT

A vanilla RNN updates its memory by multiplying it through a weight matrix at every step. Run that for 200 steps and the gradient gets multiplied by the same matrix 200 times, so it shrinks toward zero (or blows up). The signal from step 1 never reaches step 200. The LSTM's fix is almost embarrassingly simple: stop multiplying the memory, and start adding to it. The cell state update is $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, and that single plus sign is what keeps gradients alive across hundreds of timesteps.

1.The problem: a vanilla RNN forgets

A vanilla RNN carries one hidden state and rewrites it from scratch each step:

$$h_t = \tanh(W_{hh}\,h_{t-1} + W_{xh}\,x_t + b)$$

To learn from a long sequence, backprop through time has to send the gradient from the loss at step $T$ all the way back to step $1$. Following the chain rule, that gradient picks up one factor of $\partial h_{t}/\partial h_{t-1}$ per step, so the full path is a product of about $T$ Jacobians. Each Jacobian contains the recurrent matrix $W_{hh}$ and a $\tanh'$ term that is at most $1$.

Multiply a number a hundred times and only two things happen. If the dominant eigenvalue of $W_{hh}$ is below $1$, the product collapses to zero: the vanishing gradient. If it is above $1$, the product explodes. There is no comfortable middle, so the early steps get a learning signal of essentially zero (Wikipedia, Recurrent neural network).

⊳ First principles

Vanishing gradients are not a bug in RNNs, they are arithmetic. Repeated multiplication of a quantity smaller than $1$ always decays geometrically, the same way $0.9^{100}\approx 0.00003$. If memory survives only by being multiplied through a matrix every step, long memory is impossible. The question becomes: can memory survive without being multiplied each step?

2.The fix: an additive cell state

The LSTM keeps two things that flow forward through time: the usual hidden state $h_t$, and a second vector called the cell state $C_t$. The cell state is the trick. Its update has no matrix in the recurrent path:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

Read it as: keep a fraction $f_t$ of the old memory, then add a fraction $i_t$ of a new candidate $\tilde{C}_t$. The $\odot$ is element-wise multiply, so each coordinate of memory is handled on its own. Colah calls the cell state a conveyor belt: "It runs straight down the entire chain, with only some minor linear interactions. It's very easy for information to just flow along it unchanged" (Olah, Understanding LSTM Networks).

Picture a conveyor belt running across timesteps. The forget gate is a dial on the belt that decides how much of the current cargo continues, and the input gate is an on-ramp that drops new cargo on. If you set the dial to $1$ and the on-ramp to $0$, the cargo rides untouched for as long as you like.

The cell state carries memory by addition, not multiplication. That one design choice is the whole reason an LSTM remembers.

Slide the gates and watch a single number of memory ride the belt across timesteps. Crank the forget gate near $1$ and the original value persists; drop it toward $0$ and the belt erases itself. Compare against the vanilla RNN line, which decays no matter what:

3.Three gates, one job each

A gate is just a sigmoid layer. Sigmoid squashes any input into $(0,1)$, so multiplying by a gate value of $0$ blocks a signal and a value of $1$ passes it through. That is the entire mechanism. The LSTM uses three of them, each reading the same input: the current input $x_t$ and the previous hidden state $h_{t-1}$.

The LSTM cell, in order

Forget gate (what to erase): $f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
Input gate (how much to write): $i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
Candidate (what to write): $\tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$
Cell update (the belt): $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
Output gate (what to expose): $o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
Hidden state (the readout): $h_t = o_t \odot \tanh(C_t)$

The three gates use sigmoid because a gate has to answer "how much, on a scale of 0 to 1". The candidate and the final squash use $\tanh$ because they carry signed content, values in $(-1,1)$ that can push memory up or down (d2l.ai, ch. 10.1). Notice the split of labour: the cell state $C_t$ is the long-term store, while the hidden state $h_t$ is a filtered view of it that the output gate decides to reveal.

◆ Intuition

Why a separate output gate? Sometimes the network wants to hold a fact in memory but not act on it yet. Think of reading "The keys are on the" and keeping "keys" in the cell state, but only releasing it when the sentence later asks "where are the ___?". The cell remembers; the output gate controls when that memory becomes visible to the rest of the network.

4.Why the gradient survives

Now the payoff. During backprop the gradient travels back along the cell state, and at each step it gets multiplied by $\partial C_t / \partial C_{t-1}$. From the additive update, that derivative is just the forget gate:

$$\frac{\partial C_t}{\partial C_{t-1}} = f_t$$

No weight matrix in that path, just an element-wise number between $0$ and $1$ that the network learns to set. Over $T$ steps the gradient scales like the product $f_T \, f_{T-1}\cdots f_1$. If the network learns to keep $f_t \approx 1$ for the steps that matter, the product stays near $1$ and the gradient flows back undamaged. This is the constant error carousel: error circulates through the cell with little attenuation (Wikipedia, Long short-term memory).

Contrast the two products. Vanilla RNN: a product of Jacobians built from $W_{hh}$, which the network cannot easily keep near $1$. LSTM: a product of forget gates, which the network can drive to $1$ by raising the bias $b_f$. The gate is the steering wheel the RNN never had.

▲ Exam-favourite numbers

Forget gate all $1$s every step: $C_t = C_{t-1} + i_t \odot \tilde{C}_t$, memory is preserved perfectly, the LSTM never forgets. Forget gate all $0$s: $C_t = i_t \odot \tilde{C}_t$, the old memory is wiped every step, so it acts like a memoryless network. The key derivative the viva asks for: $\partial C_t / \partial C_{t-1} = f_t$. And yes, an LSTM can still vanish if it learns $f_t \ll 1$ on a span it should have remembered.

5.The GRU: same idea, fewer parts

Cho and colleagues asked whether the LSTM needs all that machinery. The Gated Recurrent Unit drops the separate cell state and uses two gates instead of three. There is no $C_t$, just the hidden state $h_t$, and one update gate $z_t$ handles both forgetting and writing in a single move:

$$h_t = (1 - z_t)\odot h_{t-1} + z_t \odot \tilde{h}_t$$

The update gate is a slider between old and new. When $z_t = 0$ the state passes through untouched, $h_t = h_{t-1}$, a perfect identity mapping that carries the gradient unchanged; when $z_t = 1$ the state is fully replaced by the candidate. The forget and input gates of the LSTM are tied together here: whatever you do not keep, you overwrite. A second gate, the reset gate $r_t$, decides how much past state feeds into the candidate (d2l.ai, ch. 10.2):

The GRU cell

Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
Candidate: $\tilde{h}_t = \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big)$
Blend: $h_t = (1 - z_t)\odot h_{t-1} + z_t \odot \tilde{h}_t$

● Counting the savings

A recurrent gate is a small linear layer over the concatenation $[h_{t-1}, x_t]$. With hidden size $k$ and input size $m$, one gate costs $k(k+m+1)$ parameters. The LSTM has four such blocks (3 gates + the candidate), so $4k(k+m+1)$. The GRU has three (2 gates + the candidate), so $3k(k+m+1)$. That is exactly $3/4$, a 25% cut, with similar accuracy on most tasks (Chung et al. 2014).

One convention warning for the exam: some texts write the blend as $h_t = z_t \odot h_{t-1} + (1-z_t)\odot \tilde{h}_t$, swapping which side $z_t$ sits on. The math is identical, only the name "keep vs replace" flips. Priyansh's handout uses the form above where $z_t = 0$ means keep, so read the question's definition before you answer.

6.Bidirectional and stacked

Two cheap upgrades show up constantly. A bidirectional RNN runs one cell left to right and a second cell right to left, then concatenates their states, so every position sees both past and future context. It helped power the 2-layer bidirectional LSTM that was the NLP workhorse from 2015 to 2017.

The catch: bidirectional needs the whole sequence up front. For a real-time speech-to-text system the audio arrives one chunk at a time, so the backward pass has nothing to read yet. Use a unidirectional model there. Stacking is the other lever: feed one LSTM's outputs into a second LSTM. Lower layers pick up local patterns, higher layers compose them into longer-range structure. Two to four layers is the usual range.

▲ Exam trap

"Should a streaming speech recogniser be bidirectional?" No. Streaming means you do not have the full sequence, and the backward direction requires it. This is a stock viva question, and the answer is the streaming constraint, not accuracy.

7.In NumPy: one LSTM step from scratch

The whole cell is six lines once you have the concatenation trick. Stack $h_{t-1}$ and $x_t$ into one vector, run four linear maps, apply the gates. Here is a single forward step, the thing the coding viva asks you to write:

lstm_step.pythree sigmoids, one tanh, one additive update

import numpy as np

def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    # x: (m,)  h_prev, C_prev: (k,)
    # W: (4k, k+m) stacked [f, i, c, o];  b: (4k,)
    z  = np.concatenate([h_prev, x])      # the [h_{t-1}, x_t] trick
    a  = W @ z + b                        # one matmul for all four blocks
    k  = h_prev.shape[0]
    f  = sigmoid(a[0:k])                  # forget gate
    i  = sigmoid(a[k:2*k])                # input gate
    g  = np.tanh(a[2*k:3*k])              # candidate  C-tilde
    o  = sigmoid(a[3*k:4*k])              # output gate

    C  = f * C_prev + i * g               # the additive conveyor belt
    h  = o * np.tanh(C)                   # filtered readout
    return h, C

The line that matters is C = f * C_prev + i * g. Set f to all ones and the cell state only grows by additions, never decays, so a value written at step 0 is still there at step 1000. That is the gradient highway in one expression. To see it on a long-memory task, set the candidate to the input at step 0 and feed noise afterward; a trained LSTM learns $f \approx 1$ on the channel holding the signal.

8.Quick check

Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.

Primary source to read next: Olah's Understanding LSTM Networks (the diagrams are the clearest anywhere), then the GRU comparison in Chung et al. 2014 and the original Hochreiter & Schmidhuber 1997 for the constant error carousel. Stuck on why $\partial C_t/\partial C_{t-1} = f_t$ kills the vanishing gradient, or on the LSTM-vs-GRU parameter count? Ask me and we will derive both by hand.