Vision · Week 4 · Session 2

CNN Architectures

How the same convolution grew from a 60K-parameter digit reader into a 152-layer network that beats humans on ImageNet.

~20 min read · Exam weight: heavy on the match-the-following and numerical sections · Builds on: Convolution & pooling

In 1998, LeNet-5 read handwritten digits with about 60K parameters. In 2015, an ensemble of ResNets, the deepest with 152 layers, classified a thousand ImageNet categories with a top-5 error of $3.57\%$, below what a careful human reaches. The convolution operation barely changed across those 17 years. What changed was how people stacked it: how deep to go, what filter size to use, and how to keep gradients alive once a network passed 100 layers. This lesson is the story of five designs and the one idea, the skip connection, that broke the depth ceiling.

1. LeNet to AlexNet: scale changes everything

LeNet-5 was the first CNN that worked in production. It had two convolution layers, two subsampling (pooling) layers, and a few fully connected layers on top: around 60K parameters in total, with tanh activations. It was trained to read zip codes and bank cheques from small greyscale digit images, with the 28x28 digits padded to a 32x32 input (LeCun et al., 1998, Gradient-Based Learning Applied to Document Recognition). The recipe was already there: local filters, weight sharing, pooling to shrink the spatial size, dense layers at the end.
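The "around 60K" figure can be checked by hand. The sketch below tallies trainable parameters layer by layer, assuming the layout described in the LeCun et al. paper: 5x5 kernels, a sparsely connected second convolution, subsampling layers with one coefficient and one bias per map, and a layer called C5 that is written as a convolution but acts as a dense layer because its input maps are already 5x5. The layer labels and the helper `conv_params` are illustrative names, not taken from the lesson.

```python
# Trainable-parameter tally for LeNet-5, assuming the layer layout of the 1998 paper.
# The final output layer is left out of the trainable count.

def conv_params(kernel, connections, out_maps):
    """kernel x kernel weights per (input map, output map) connection, plus one bias per output map."""
    return kernel * kernel * connections + out_maps

# C3 is sparsely connected: 6 output maps see 3 input maps, 9 see 4, and 1 sees all 6.
c3_connections = 6 * 3 + 9 * 4 + 1 * 6      # 60 (input map, output map) pairs

layers = {
    "C1": conv_params(5, 1 * 6, 6),          # 1 input map -> 6 maps
    "S2": 6 * 2,                             # one coefficient + one bias per map
    "C3": conv_params(5, c3_connections, 16),
    "S4": 16 * 2,
    "C5": conv_params(5, 16 * 120, 120),     # fully connected to all 16 maps
    "F6": 120 * 84 + 84,                     # dense layer, weights + biases
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name}  {count:6d}")
print("total", total)
```

Most of the weights already sit in the dense-like layers (C5 and F6), a pattern that returns at much larger scale in VGG.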

For 14 years nothing much moved. Then in 2012 AlexNet ran essentially the same recipe at a thousand times the scale: 60M parameters, eight learned layers, trained on 1.2M ImageNet images across two GPUs. It won ILSVRC 2012 with a top-5 error of $15.3\%$ against the runner-up's $26.2\%$, a gap of more than ten points (Krizhevsky et al., 2012, ImageNet Classification with Deep Convolutional Neural Networks). That single result is what kicked off the modern deep learning era.

Three changes made the scale-up trainable, and each one is its own exam line:

ReLU instead of tanh, so gradients did not saturate and training ran several times faster.
Dropout in the fully connected layers, to fight overfitting on 60M parameters.
GPUs, which made a network this size finish training in days rather than months.
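The saturation point in the first item is easy to see numerically. This is a minimal sketch with hand-picked pre-activation values (the array `z` is illustrative): it compares the local derivative of tanh with that of ReLU as the input grows.

```python
import numpy as np

# Hand-picked pre-activations, from strongly negative to strongly positive.
z = np.array([-6.0, -2.0, 0.5, 2.0, 6.0])

tanh_grad = 1.0 - np.tanh(z) ** 2      # d/dz tanh(z): shrinks toward 0 as |z| grows
relu_grad = (z > 0).astype(float)      # d/dz max(0, z): exactly 1 for every positive z

for zi, tg, rg in zip(z, tanh_grad, relu_grad):
    print(f"z = {zi:+.1f}   tanh' = {tg:.5f}   relu' = {rg:.0f}")
```

ReLU still passes zero gradient for negative inputs, but it never shrinks the gradient of an active unit, which is the property that matters when many layers multiply their local derivatives together.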

⊳ First principles

AlexNet did not invent a new operation. It took LeNet's convolution and asked what stops you from making it bigger. The honest answers were saturating activations, overfitting, and compute. ReLU, dropout, and GPUs answered all three. Most architecture progress after this is the same move: find what breaks when you scale, then fix exactly that.

2. Small filters beat big ones

VGG (Simonyan and Zisserman, 2014) asked a sharp question: given a fixed budget, is it better to use one large filter or a stack of small ones? Its answer became the default for most CNNs that followed. Use only $3\times3$ filters, and to see a wider area, stack more of them.

The reason is the receptive field, the patch of the input that one output neuron can see. Each stride-1 $3\times3$ layer widens that patch by 2 pixels in each dimension. So one layer sees $3\times3$, two stacked layers see $5\times5$, three see $7\times7$. In other words, two stacked $3\times3$ convolutions cover the same area as a single $5\times5$, and three cover the same as a single $7\times7$ (CS231n, ConvNet notes).
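The growth rule can be written as a short loop. The helper below is a sketch (the name `receptive_field` and its arguments are illustrative); it assumes square kernels, no dilation, and the same kernel and stride in every layer.

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Side length of the input patch seen by one output unit after num_layers convs."""
    rf = 1      # a single input pixel sees itself
    jump = 1    # distance, in input pixels, between neighbouring units of the current layer
    for _ in range(num_layers):
        rf += (kernel - 1) * jump   # each layer adds (kernel - 1) steps at the current spacing
        jump *= stride              # striding spreads later layers' steps further apart
    return rf

for n in (1, 2, 3):
    side = receptive_field(n)
    print(f"{n} stacked 3x3 layer(s): {side}x{side}")
```

With stride 1 the spacing never changes, so every extra $3\times3$ layer adds exactly 2 to the side length.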

So the receptive fields match. The parameters do not. Count the weights for $C$ input and $C$ output channels:

$$\text{three } 3\times3:\;\; 3\cdot(3\cdot3\cdot C^2) = 27C^2 \qquad\text{one } 7\times7:\;\; 7\cdot7\cdot C^2 = 49C^2$$

That is $27/49 \approx 0.55$, so the stack uses about 45% fewer parameters for the same receptive field. And it is more expressive: the stack passes through three ReLUs instead of one, so it can represent more shapes of function. Fewer weights and a richer function, for free.
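The same count in code, as a quick check. The channel width `C = 256` is an arbitrary example value and `conv_weights` is an illustrative helper; biases are ignored, as in the formula above.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in one conv layer with square kernels (biases ignored)."""
    return kernel * kernel * c_in * c_out

C = 256  # example channel width; the ratio does not depend on it

three_3x3 = 3 * conv_weights(3, C, C)   # 27 * C^2
one_7x7 = conv_weights(7, C, C)         # 49 * C^2
saving_7 = 1 - three_3x3 / one_7x7

two_3x3 = 2 * conv_weights(3, C, C)     # 18 * C^2
one_5x5 = conv_weights(5, C, C)         # 25 * C^2
saving_5 = 1 - two_3x3 / one_5x5

print(f"three 3x3 vs one 7x7: {three_3x3:,} vs {one_7x7:,}  ({saving_7:.0%} fewer)")
print(f"two 3x3 vs one 5x5:   {two_3x3:,} vs {one_5x5:,}  ({saving_5:.0%} fewer)")
```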

▲ Exam-favourite numbers

Three stacked $3\times3$ convs: $27C^2$ params, receptive field $7\times7$, 3 nonlinearities. One $7\times7$ conv: $49C^2$ params, same receptive field, 1 nonlinearity. The "45% fewer parameters" figure and the "two $3\times3$ = one $5\times5$" identity both show up in match-the-following.

VGG-16 has 16 weight layers, all built from this rule, and around 138M parameters. Most of those (about 100M) sit in the first fully connected layer, not the convolutions, which is the hint that the next breakthrough would come from cutting the dense head.
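To see where the "about 100M" comes from, here is a sketch of the dense-head arithmetic. It assumes the standard VGG-16 configuration with a 224x224 input, so the last convolutional block hands a $7\times7\times512$ tensor to dense layers of 4096, 4096, and 1000 units; `dense_params` is an illustrative helper.

```python
# Dense-head tally for VGG-16, assuming a 7 x 7 x 512 tensor entering
# fully connected layers of 4096 -> 4096 -> 1000 units.

def dense_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + biases

fc1 = dense_params(7 * 7 * 512, 4096)    # flattening the feature map is what makes this huge
fc2 = dense_params(4096, 4096)
fc3 = dense_params(4096, 1000)
dense_total = fc1 + fc2 + fc3

approx_total = 138e6                     # the "around 138M" quoted in the text, not recomputed here
print(f"fc1         {fc1:,}")
print(f"dense head  {dense_total:,}")
print(f"fc1 share of ~138M: {fc1 / approx_total:.0%}")
```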

A deep stack of tiny filters sees as wide as one big filter, costs fewer weights, and bends more, because it has more nonlinearities along the way.

3. The 1x1 convolution: a channel bottleneck

A $1\times1$ convolution sounds like it does nothing. It looks at one pixel at a time, so it cannot mix neighbours. What it does mix is channels. At each spatial location it takes the $C_\text{in}$ channel values, runs them through a small fully connected layer, and outputs $C_\text{out}$ values. It is a per-pixel linear recombination of feature maps, plus a ReLU.
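In NumPy terms, a $1\times1$ convolution is one matrix multiply applied at every pixel. The sketch below uses made-up sizes and a channels-last layout (both are assumptions for illustration) and checks the vectorized form against an explicit per-pixel loop.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C_in, C_out = 4, 5, 8, 3               # illustrative sizes
x = rng.standard_normal((H, W, C_in))        # feature map, channels last
weights = rng.standard_normal((C_in, C_out)) # the 1x1 "filters": one column per output channel
bias = rng.standard_normal(C_out)

# Vectorized: matmul broadcasts over the spatial axes, so every pixel
# gets the same C_in -> C_out linear map, followed by a ReLU.
y = np.maximum(0, x @ weights + bias)        # shape (H, W, C_out)

# The same thing spelled out one pixel at a time.
y_loop = np.empty((H, W, C_out))
for i in range(H):
    for j in range(W):
        y_loop[i, j] = np.maximum(0, x[i, j] @ weights + bias)

print(y.shape, np.allclose(y, y_loop))
```

No pixel ever reads its neighbours, so the spatial size is untouched; only the channel axis changes length.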

That makes it a cheap way to change the channel count. GoogLeNet (Szegedy et al., 2014) used it to set up a bottleneck: drop from 256 channels to 64 with a $1\times1$, do the expensive $3\times3$ or $5\times5$ work on the cheap 64-channel tensor, then expand back. The Inception module runs $1\times1$, $3\times3$, and $5\times5$ filters in parallel and concatenates them, with $1\times1$ bottlenecks in front of the costly paths to keep the channel count from exploding when those outputs stack up.
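A rough weight count shows what the bottleneck buys. This sketch uses the 256 and 64 channel widths from the paragraph above with a $3\times3$ as the expensive step; `conv_weights` is an illustrative helper and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in one conv layer with square kernels (biases ignored)."""
    return kernel * kernel * c_in * c_out

wide, narrow = 256, 64

# Option A: run the 3x3 directly on the wide tensor.
direct = conv_weights(3, wide, wide)

# Option B: 1x1 reduce, 3x3 on the narrow tensor, 1x1 expand.
bottleneck = (
    conv_weights(1, wide, narrow)
    + conv_weights(3, narrow, narrow)
    + conv_weights(1, narrow, wide)
)

print(f"direct 3x3:  {direct:,}")
print(f"bottleneck:  {bottleneck:,}")
print(f"ratio:       {direct / bottleneck:.1f}x")
```

If all three layers run at the same spatial resolution, the multiply-adds shrink by the same ratio as the weights.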

Two design choices let GoogLeNet hit a better ImageNet score than VGG with about 5M parameters, roughly 28 times fewer than VGG's 138M (Wikipedia, Inception architecture):

1x1 bottlenecks before the expensive convolutions, cutting the work per Inception module.
Global average pooling at the head instead of large fully connected layers, which deletes the 100M-parameter dense block that VGG carried.
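Global average pooling itself is a one-liner. The sketch below assumes a final feature map of $7\times7\times1024$ and 1000 classes, stored channels last with random values standing in for real features. It compares the classifier weights needed after pooling with what a hypothetical VGG-style 4096-unit dense layer would need on the same flattened tensor.

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, W, C, classes = 2, 7, 7, 1024, 1000      # assumed sizes for illustration
feats = rng.standard_normal((N, H, W, C))       # stand-in for the last conv output

# Global average pooling: one number per channel, averaged over all positions.
pooled = feats.mean(axis=(1, 2))                # shape (N, C)

# A single linear classifier on the pooled vector.
W_cls = 0.01 * rng.standard_normal((C, classes))
logits = pooled @ W_cls                         # shape (N, classes)

gap_head_weights = C * classes                  # weights after pooling
flatten_head_weights = H * W * C * 4096         # hypothetical first dense layer on the flattened map

print(pooled.shape, logits.shape)
print(f"GAP head:     {gap_head_weights:,}")
print(f"flatten head: {flatten_head_weights:,}")
```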

◆ Intuition

Why not just concatenate the $3\times3$ and $5\times5$ outputs directly and skip the $1\times1$? Because channels would compound at every module. A few Inception blocks in and you would be convolving over a thousand channels, where each $5\times5$ filter costs $25 \times C_\text{in}$ weights. The $1\times1$ reduce is the valve that keeps that number from running away.

4. The degradation problem

By 2015 the field had learned that depth helps. So people kept adding layers, and ran into a wall that looked nothing like the textbook failure modes. He et al. trained a 20-layer plain CNN and a 56-layer plain CNN on CIFAR-10. The 56-layer network had higher training error than the 20-layer one (He et al., 2015, Deep Residual Learning).

Read that carefully, because the exam loves to test whether you can name the failure. The deeper network was worse on the training set, the data it was allowed to memorise. That rules out overfitting, which would show low training error and high test error. They also used batch normalization, so the gradients were not vanishing in the classic sense. This is a third thing: degradation, an optimization failure. The deeper network can represent everything the shallow one can (it could set the extra layers to identity and copy the 20-layer answer), but gradient descent cannot find that solution.

▲ Exam-favourite numbers

56-layer plain net has higher training error than the 20-layer one. This is degradation, not overfitting (overfitting is low train error, high test error) and not plain vanishing gradients (they used batch norm). A1 in the handout: the network cannot even learn the identity mapping in its extra layers.

Here is the puzzle stated as the handout poses it. If a 56-layer network should be at least as good as a 20-layer one (the extra 36 layers could just learn identity and pass the signal through untouched), why does it do worse? The answer is that fitting an identity map with a stack of nonlinear conv-ReLU layers is hard for an optimizer. Nothing in plain SGD pushes a layer toward "do nothing".
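The "extra layers could just learn identity" argument can be made concrete. This toy sketch uses small dense ReLU layers instead of convolutions (an assumption for brevity) and shows that a deeper network whose extra layers are set by hand to the identity matrix reproduces the shallow network exactly. It only demonstrates that such a solution exists; the degradation result is that training from a random start does not find it.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def run(x, weights):
    """Plain stack: a = relu(a @ W) for each layer."""
    a = x
    for W in weights:
        a = relu(a @ W)
    return a

rng = np.random.default_rng(0)
width = 8
x = rng.standard_normal((5, width))

# A "trained" shallow net (random weights stand in for learned ones).
shallow_weights = [0.5 * rng.standard_normal((width, width)) for _ in range(3)]
shallow_out = run(x, shallow_weights)

# Deeper net: same first layers, plus 6 extra layers fixed to the identity.
# Activations are already >= 0 after a ReLU, so relu(a @ I) = a.
deep_weights = shallow_weights + [np.eye(width) for _ in range(6)]
deep_out = run(x, deep_weights)

print(np.allclose(shallow_out, deep_out))
```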

5. ResNet: the skip connection

ResNet's fix is one line. Instead of asking a block of layers to compute the target mapping $H(x)$ directly, give it a shortcut that adds the input back:

$$y = F(x) + x$$

where $F(x)$ is the stack of conv-ReLU layers. The block now only has to learn the residual $F(x) = H(x) - x$, the correction on top of the input, rather than the whole function. He et al. argued that if the best thing a block can do is nothing (the identity), then learning $F(x) = 0$ is easy: just drive the weights toward zero, which weight decay already wants to do. Compare that to forcing a stack of nonlinear layers to reproduce $x$ exactly. "If an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers."

The shortcut helps twice. On the forward pass it gives every block a free path to copy its input, so adding layers should not, in principle, make the training error worse. On the backward pass it gives the gradient a direct route home. Differentiate $y = F(x) + x$ and the $+x$ contributes a clean $1$ to $\partial y/\partial x$, so the upstream gradient reaches early layers even when the $\partial F/\partial x$ term has shrunk toward zero. That is the "gradient highway".
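Written out for a loss $\mathcal{L}$, the backward pass through one block is:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(I + \frac{\partial F}{\partial x}\right) = \frac{\partial \mathcal{L}}{\partial y} + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}$$

The same step extends to a stack, under the simplifying assumption that nothing is applied after the addition (the original block puts a ReLU there, so this is an idealisation). The indices $k$ and $n$ and the per-block functions $F_i$ are notation introduced here. If $x_{k+1} = x_k + F_k(x_k)$, then unrolling from block $k$ to block $n$ gives $x_n = x_k + \sum_{i=k}^{n-1} F_i(x_i)$, and so:

$$\frac{\partial \mathcal{L}}{\partial x_k} = \frac{\partial \mathcal{L}}{\partial x_n}\left(I + \frac{\partial}{\partial x_k}\sum_{i=k}^{n-1} F_i(x_i)\right)$$

The first term carries $\partial \mathcal{L}/\partial x_n$ back to block $k$ unchanged, however many blocks sit in between. Only the second term depends on the weights.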

⊳ First principles

The identity shortcut adds no parameters and no extra multiply-adds. It is just an addition. So when you compare a plain net and a ResNet of the same depth, they have the same compute budget, and the only difference is whether the gradient has a shortcut. That clean comparison is exactly why the result was convincing: skips, and nothing else, removed the degradation.

With skips, depth started helping again. ResNet went to 152 layers and won ILSVRC 2015, where an ensemble of residual networks reached a $3.57\%$ top-5 error, below the rough human benchmark on ImageNet. For the deepest variants, each block uses a bottleneck: a $1\times1$ to cut channels, a $3\times3$ to do the spatial work cheaply, and a $1\times1$ to expand back. It is the same channel-bottleneck trick from Inception, reused.

▲ Exam-favourite numbers

ResNet: $y = F(x) + x$; 152 layers; $3.57\%$ top-5 ImageNet error; won ILSVRC 2015. Identity shortcut adds zero parameters and zero compute. Bottleneck block is $1\times1 \to 3\times3 \to 1\times1$. Why learning $F(x)=0$ beats learning $H(x)=x$: pushing weights to zero is trivial; fitting identity with nonlinear layers is not.

6. Simulator: depth, degradation, and skips

The two sims below let you feel the whole arc. Slide the depth up on a plain net and watch its training error bottom out and then climb back (degradation). Flip on the skip connection and watch the same depth keep improving, while the inset shows the gradient surviving the trip back to the first layer. The second panel grows the receptive field as you stack $3\times3$ filters. Below them is the static comparison table you should be able to reproduce in the exam.

The error curves are a stylized model of the He et al. degradation experiment, not measured numbers; the shape (plain rises past a depth, ResNet does not) is what matters. The receptive-field and parameter figures are exact.
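For readers without the interactive panels, here is a rough offline stand-in for the gradient inset. It is a toy under loud assumptions: dense layers instead of convolutions, no batch normalization, and deliberately small random weights (`scale=0.02`) so that the plain stack's gradient shrinks quickly. It therefore illustrates only the arithmetic of the shortcut on the backward pass, not the He et al. degradation experiment, which used batch norm. The function name and its arguments are illustrative.

```python
import numpy as np

def first_layer_grad_norm(depth, width=16, scale=0.02, residual=False, seed=0):
    """Backprop a unit-norm gradient through `depth` layers; return its norm at the input."""
    rng = np.random.default_rng(seed)
    weights = [scale * rng.standard_normal((width, width)) for _ in range(depth)]
    x = rng.standard_normal((1, width))

    # Forward pass, caching the ReLU masks needed on the way back.
    masks = []
    for W in weights:
        z = x @ W
        masks.append(z > 0)
        f = np.maximum(0, z)
        x = x + f if residual else f           # residual: y = F(x) + x, plain: y = F(x)

    # Backward pass from a unit-norm upstream gradient.
    grad = np.ones((1, width)) / np.sqrt(width)
    for W, mask in zip(reversed(weights), reversed(masks)):
        path = (grad * mask) @ W.T             # gradient through F(x) = relu(x @ W)
        grad = path + grad if residual else path
    return float(np.linalg.norm(grad))

for depth in (5, 10, 30):
    plain = first_layer_grad_norm(depth)
    skip = first_layer_grad_norm(depth, residual=True)
    print(f"depth {depth:2d}   plain {plain:.2e}   residual {skip:.2e}")
```

In the plain stack every layer multiplies the gradient by a small matrix, so the norm collapses with depth. In the residual stack the `+ grad` term keeps it near its starting size.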

7. A residual block in NumPy

The whole idea fits in a forward and a backward pass. The forward adds the input back; the backward sends the gradient down both the residual path and the shortcut, then sums them at the fork (the same add-at-a-fork rule from backprop). Watch the + dout: that is the gradient highway in code.

residual_block.py: forward adds x back; backward sends grad down both paths and sums

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

class ResidualBlock:
    """y = F(x) + x, where F = conv2 . relu . conv1 (matmuls here for clarity)."""
    def __init__(self, W1, W2):
        self.W1, self.W2 = W1, W2          # same in/out width so x and F(x) add

    def forward(self, x):
        self.x  = x
        self.z1 = x @ self.W1
        self.a1 = relu(self.z1)
        self.z2 = self.a1 @ self.W2
        return self.z2 + x                 # the skip: add the input back

    def backward(self, dout):
        # dout is dL/dy. y = F(x) + x, so the gradient forks to F and to the shortcut.
        dF = dout                          # grad into the residual path F(x)
        dW2 = self.a1.T @ dF
        da1 = dF @ self.W2.T
        dz1 = da1 * (self.z1 > 0)          # relu grad: 1 where z1 > 0, else 0
        dW1 = self.x.T @ dz1
        dx_path = dz1 @ self.W1.T
        dx = dx_path + dout                # sum at the fork: F-path grad + shortcut grad
        return dx, dW1, dW2

# Sanity check: if W1 and W2 are ~0, F(x) ~ 0, so forward(x) ~ x (identity)
# and backward returns dx ~ dout: the gradient passes straight through.
```

The last comment is the punchline. When the block's weights are near zero, forward(x) returns roughly x and backward(dout) returns roughly dout. The block has become a wire. That is why stacking more residual blocks should not hurt: in the worst case each one can learn to do nothing, and the signal and the gradient both pass through clean.
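Both claims can be checked numerically. The sketch below repeats a compact copy of the block so that it runs on its own, then (1) measures how far a near-zero-weight block is from a wire and (2) compares the analytic `dx` with a central finite-difference estimate. The sizes, the random seed, and the probe loss $\sum y \cdot \text{dout}$ are arbitrary choices for the test, not part of the lesson's listing.

```python
import numpy as np

class ResidualBlock:
    """Compact copy of the block above, repeated so this snippet runs alone."""
    def __init__(self, W1, W2):
        self.W1, self.W2 = W1, W2

    def forward(self, x):
        self.x, self.z1 = x, x @ self.W1
        self.a1 = np.maximum(0, self.z1)
        return self.a1 @ self.W2 + x

    def backward(self, dout):
        dW2 = self.a1.T @ dout
        dz1 = (dout @ self.W2.T) * (self.z1 > 0)
        dW1 = self.x.T @ dz1
        return dz1 @ self.W1.T + dout, dW1, dW2

rng = np.random.default_rng(0)
n, d = 4, 6
x = rng.standard_normal((n, d))
dout = rng.standard_normal((n, d))

# 1) Near-zero weights: the block should behave like a wire in both directions.
tiny = ResidualBlock(1e-4 * rng.standard_normal((d, d)), 1e-4 * rng.standard_normal((d, d)))
y = tiny.forward(x)
dx_tiny, _, _ = tiny.backward(dout)
wire_forward_gap = np.abs(y - x).max()
wire_backward_gap = np.abs(dx_tiny - dout).max()

# 2) Ordinary weights: analytic dx versus central finite differences.
block = ResidualBlock(rng.standard_normal((d, d)), rng.standard_normal((d, d)))
block.forward(x)
dx_analytic, _, _ = block.backward(dout)

eps = 1e-6
dx_numeric = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    x_plus, x_minus = x.copy(), x.copy()
    x_plus[idx] += eps
    x_minus[idx] -= eps
    # Probe loss L = sum(y * dout), chosen so that dL/dy equals dout.
    loss_plus = np.sum(block.forward(x_plus) * dout)
    loss_minus = np.sum(block.forward(x_minus) * dout)
    dx_numeric[idx] = (loss_plus - loss_minus) / (2 * eps)
max_err = np.abs(dx_analytic - dx_numeric).max()

print(f"wire gap forward  {wire_forward_gap:.1e}")
print(f"wire gap backward {wire_backward_gap:.1e}")
print(f"max |analytic - numeric| {max_err:.1e}")
```

The finite-difference check can misfire if some `z1` entry lands within `eps` of the ReLU kink at zero; with random inputs that is unlikely but not impossible.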

8. Quick check

Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.

Primary source to read next: the ResNet paper itself, He et al. 2015, Deep Residual Learning for Image Recognition. Read Section 4 (the degradation experiment) and Figure 1 first; the whole argument is there in two plots. For the rest of the lineage, the CS231n architecture notes walk LeNet through ResNet with the parameter counts. Stuck on why three 3x3 beats one 7x7, or on the residual-vs-identity argument? Ask me and we will count every weight by hand.