Vision · Week 4 · Session 1

Convolutions

How a 9-number filter sees an entire image, no matter how big it gets.

~18 min read · Exam weight: heavy on the Vision block and the Final · Builds on Backpropagation

Feed a tiny $128\times128$ colour photo into one fully-connected layer of 256 neurons and you have already burned 12.6 million weights, before the network learns a single thing. Worse, that layer treats pixel $(0,0)$ and pixel $(64,64)$ as unrelated strangers, even though a cat's ear looks the same wherever it sits in the frame. A convolution fixes both problems with one move: scan the image with a small filter of shared weights. The same 9 numbers, reused everywhere.

1. Images break fully-connected layers

A fully-connected (FC) layer wires every input to every output. That is fine for a 100-feature tabular row. It falls apart on an image, for two separate reasons.

Reason one: the parameter count explodes. Flatten a $128\times128\times3$ image and you get $49{,}152$ inputs. Connect those to a modest layer of 256 neurons and the weight count is

$$128 \times 128 \times 3 \times 256 = 12{,}582{,}912 \text{ weights}.$$

That is one layer. The handout number to remember: a $128\times128\times3$ image into 256 FC neurons gives 12,582,912 weights (biases excluded), enough to overfit a small dataset and to make training expensive. A $224\times224\times3$ image into 1000 neurons would need over 150 million.
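To sanity-check those totals, here is a minimal sketch of the count. The helper name `fc_weight_count` is made up for illustration, and it ignores biases, as the figures above do.

```python
def fc_weight_count(height, width, channels, neurons):
    """Weights in one fully-connected layer fed a flattened image (biases excluded)."""
    inputs = height * width * channels   # length of the flattened image vector
    return inputs * neurons              # one weight per (input, neuron) pair

small = fc_weight_count(128, 128, 3, 256)
large = fc_weight_count(224, 224, 3, 1000)
print(small)   # 12582912
print(large)   # 150528000
```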

Reason two: flattening throws away geometry. Once you flatten, the network has no idea that two pixels were neighbours. It has to relearn, from scratch, that adjacent pixels relate to each other, and it has to relearn that fact separately for every position in the image.

⊳ First principles

Images have two properties an FC layer ignores. Locality: a pixel's meaning comes mostly from its immediate neighbours, not from a pixel 200 rows away. Translation: an edge or a texture means the same thing wherever it appears. A layer built to exploit both will need far fewer weights and will generalise better. That layer is the convolution.

An FC layer asks "how does every pixel relate to every neuron?" A convolution asks a smaller, smarter question: "does this one small pattern appear here? here? here?"

2. The convolution is slide, multiply, sum

Take a small grid of weights, the filter (or kernel), say $3\times3$. Place it over the top-left corner of the image. Multiply each filter weight by the pixel underneath it, add up all 9 products, and write that single number into the output. Then slide the filter one step to the right and repeat. When you reach the edge, drop down a row and start again.

Formally, for filter $K$ over an input patch, each output value is

$$y_{i,j} = \sum_{m}\sum_{n} K_{m,n}\,\cdot\,x_{\,i+m,\,j+n}.$$

That inner double sum is just a dot product between the filter and the patch of image it currently covers. A convolution is a dot product, repeated at every location.
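A small sketch of one output value, computed two ways: the double sum from the formula and a flat dot product. The toy $5\times5$ input and the position $(i,j)=(1,2)$ are arbitrary choices for illustration.

```python
import numpy as np

x = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image" holding 0..24
K = np.array([[1.0, 0.0, -1.0],
              [1.0, 0.0, -1.0],
              [1.0, 0.0, -1.0]])               # left column minus right column

i, j = 1, 2                                    # output position to compute
patch = x[i:i + 3, j:j + 3]                    # the 3x3 window under the filter

# The double sum, written exactly as in the formula.
y_sum = sum(K[m, n] * x[i + m, j + n] for m in range(3) for n in range(3))

# The same number as a dot product of the two flattened 9-vectors.
y_dot = np.dot(K.ravel(), patch.ravel())

print(y_sum, y_dot)   # both are -6.0
```

Each row of the patch contributes $x_{r,2} - x_{r,4} = -2$, and there are three rows, hence $-6$.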

◆ Intuition

Think of the filter as a tiny stencil for one pattern, a vertical edge, a patch of blue, a corner. Where the image matches the stencil, the dot product is large and the output lights up. Where it does not match, the output stays near zero. Sliding the same stencil everywhere is how the layer answers "where does this pattern occur?" across the whole image with one set of 9 weights.

A note on naming. What deep-learning libraries call "convolution" is technically cross-correlation: the real mathematical convolution first flips the kernel. The flip does not matter for learning, because the network just learns the weights in whatever orientation it needs. As Goodfellow's text puts it, learning with cross-correlation instead of true convolution simply gives you a flipped kernel, and the result is identical. (Deep Learning, ch. 9.)
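If you want to see the flip for yourself, the sketch below implements both operations with plain loops and checks that true convolution with a kernel equals cross-correlation with that kernel rotated by 180 degrees. The function names are illustrative, not a library API, and only the "valid" (no padding, stride 1) case is covered.

```python
import numpy as np

def correlate2d_valid(x, K):
    """Cross-correlation: slide K over x without flipping it."""
    Fh, Fw = K.shape
    out = np.zeros((x.shape[0] - Fh + 1, x.shape[1] - Fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + Fh, j:j + Fw] * K)
    return out

def convolve2d_valid(x, K):
    """True convolution: kernel indices run backwards relative to the image."""
    Fh, Fw = K.shape
    out = np.zeros((x.shape[0] - Fh + 1, x.shape[1] - Fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(Fh):
                for n in range(Fw):
                    out[i, j] += K[m, n] * x[i + Fh - 1 - m, j + Fw - 1 - n]
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
K = np.arange(9, dtype=float).reshape(3, 3)    # deliberately asymmetric kernel
K_flipped = K[::-1, ::-1]                      # flip rows and columns

same = np.allclose(convolve2d_valid(x, K), correlate2d_valid(x, K_flipped))
differ = not np.allclose(convolve2d_valid(x, K), correlate2d_valid(x, K))
print(same, differ)   # True True
```

For a symmetric kernel such as a box blur the two operations coincide, which is one more reason the distinction rarely matters in practice.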

Different filters detect different things: the weights decide whether a kernel responds to edges, smooths the image (blur), or exaggerates local contrast (sharpen).
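To see three such filters in numbers, here is a hedged sketch that applies an edge, a blur, and a sharpen kernel to a $7\times7$ input containing one vertical step edge. The kernel values are common textbook choices, not ones given in this handout, and `apply_filter` is an illustrative helper built on NumPy's `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def apply_filter(image, kernel):
    """Valid cross-correlation, stride 1: one dot product per window."""
    windows = sliding_window_view(image, kernel.shape)   # (out_h, out_w, F, F)
    return np.einsum('ijmn,mn->ij', windows, kernel)

# 7x7 input: dark (0) on the left three columns, bright (1) on the right four.
img = np.zeros((7, 7))
img[:, 3:] = 1.0

edge = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
blur = np.ones((3, 3)) / 9.0
sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)

edge_map = apply_filter(img, edge)        # every row: 0, -4, -4, 0, 0
blur_map = apply_filter(img, blur)        # every row: 0, 1/3, 2/3, 1, 1
sharp_map = apply_filter(img, sharpen)    # every row: 0, -1, 2, 1, 1

print(edge_map[0], blur_map[0], sharp_map[0])
```

The edge map is zero in the flat regions and nonzero only where a window straddles the step; the blur map turns the hard step into a ramp; the sharpen map undershoots and overshoots on either side of it.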

3. One formula sets every output size

How big is the output? Count how many positions the filter can occupy as it slides. With an input of width $W$, a filter of size $F$, padding $P$ on each side, and stride $S$:

$$\text{out} = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1.$$

Read it from the pieces. The filter's left edge can start at position 0 and slide until its right edge hits the (padded) boundary, which leaves $W - F + 2P$ pixels of travel. Divide by the stride to count the steps, and add 1 for the starting position. The same formula runs independently for height. (Source: Stanford CS231n.)

▲ Exam-favourite numbers

The handout's worked case: a $13\times13$ input, a $5\times5$ filter, padding $P=2$, stride $S=2$.

$$\frac{13 - 5 + 2(2)}{2} + 1 = \frac{12}{2} + 1 = 7.$$

So the output is $7\times7$. Memorise the three sub-steps: numerator $W - F + 2P = 12$, divide by stride $=6$, add one $=7$. Examiners love handing you $W,F,P,S$ and asking for the output size.

One trap: the numerator must be divisible by the stride, or the filter does not tile the input cleanly. CS231n's example: $W=10,\,F=3,\,P=0,\,S=2$ gives $(10-3)/2 + 1 = 4.5$, which is not a whole number, so that configuration is invalid. The floor in the formula is what frameworks use to round it down, but you should treat a non-integer result as a sign your hyperparameters do not fit.
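The formula and the divisibility check fit in a few lines. This is a sketch, assuming the floor convention described above; `conv_output_size` is a hypothetical helper name, and it returns a flag so you can spot configurations that do not tile cleanly.

```python
def conv_output_size(W, F, P=0, S=1):
    """Return (output size, fits_cleanly) for one spatial dimension."""
    travel = W - F + 2 * P            # how far the filter can slide
    if travel < 0:
        raise ValueError("filter is larger than the padded input")
    fits_cleanly = travel % S == 0    # False means the floor discards a remainder
    return travel // S + 1, fits_cleanly

print(conv_output_size(13, 5, P=2, S=2))   # (7, True): the handout's worked case
print(conv_output_size(10, 3, P=0, S=2))   # (4, False): 4.5 before flooring
print(conv_output_size(7, 3, P=0, S=1))    # (5, True)
```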

4. Padding and stride are your two dials

Without padding, every convolution shrinks the image. A $3\times3$ filter on a $7\times7$ input (stride 1) gives $(7-3)/1 + 1 = 5$, so you lose a ring of pixels. Stack a few layers and the picture vanishes. The fix is zero-padding: ring the input with zeros so the filter can sit over the border pixels too.

The padding that keeps the output the same size as the input is called "same" padding. For stride 1, set

$$P = \frac{F - 1}{2}.$$

For a $3\times3$ filter that is $P=1$; for $5\times5$ it is $P=2$. Check it: $7\times7$ input, $3\times3$ filter, $P=1$, stride 1 gives $(7-3+2)/1 + 1 = 7$. Same size in, same size out. (Source: CS231n.)

Stride is how far the filter jumps each step. Stride 1 looks at every position; stride 2 skips every other one, roughly halving each output dimension and downsampling the feature map. Larger strides mean a smaller, cheaper output but coarser spatial detail.
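Both dials can be checked numerically. The sketch below, with illustrative helper names `out_size` and `same_padding` and an assumed input width of 32, confirms that $P=(F-1)/2$ preserves the size at stride 1 and that stride 2 halves it.

```python
def out_size(W, F, P, S):
    """Output-size formula for one spatial dimension."""
    return (W - F + 2 * P) // S + 1

def same_padding(F):
    """Padding that preserves size at stride 1; only defined for odd F."""
    if F % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (F - 1) // 2

W = 32   # assumed input width, chosen for illustration
rows = []
for F in (3, 5, 7):
    P = same_padding(F)
    rows.append((F, P, out_size(W, F, P, S=1), out_size(W, F, P, S=2)))

for row in rows:
    print(row)
# (3, 1, 32, 16)
# (5, 2, 32, 16)
# (7, 3, 32, 16)
```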

● Worked example: pooling

A pooling layer is a stride-based downsampler with no learned weights. Max pooling with $F=2,\,S=2$ takes each $2\times2$ block and keeps only its maximum, halving each dimension and discarding 75% of the activations (CS231n). The win: the network becomes slightly invariant to small shifts, and the feature maps get cheaper. The cost: you keep "a strong vertical edge is somewhere in this $2\times2$ region" but lose exactly where, which hurts tasks that need pixel precision, such as segmentation.
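A minimal sketch of $2\times2$ max pooling with stride 2, assuming a single-channel input with even height and width. The reshape trick groups each $2\times2$ block onto its own pair of axes; `max_pool_2x2` is an illustrative name.

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with F=2, S=2 on a 2-D array with even side lengths."""
    H, W = x.shape
    if H % 2 or W % 2:
        raise ValueError("this sketch assumes even height and width")
    blocks = x.reshape(H // 2, 2, W // 2, 2)   # axes: block row, row in block, block col, col in block
    return blocks.max(axis=(1, 3))             # keep the maximum of each 2x2 block

x = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(x)
print(pooled)
# [[ 5  7]
#  [13 15]]
print(pooled.size / x.size)   # 0.25, so 75% of the activations are discarded
```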

5. Parameter sharing and the receptive field

Here is the move that makes convolutions cheap. The same filter slides over the whole image, so a $3\times3$ filter has exactly 9 weights (plus 1 bias) no matter whether the image is $32\times32$ or $4000\times4000$. This is parameter sharing: one set of weights, reused at every location.

Why is sharing legitimate for images but not for tabular data? A vertical edge is a vertical edge wherever it sits, so a detector that works in the top-left works in the bottom-right too. Reusing weights across positions is exactly right. In a tabular row, feature 1 might be "age" and feature 5 "income"; those have different meanings, so forcing them to share a weight would be nonsense. Sharing only makes sense when the same pattern can appear in many places, which is the defining property of image data.

Parameter sharing also buys a property called translation equivariance: shift the input, and the feature map shifts the same way. The detector follows the pattern around the image (Goodfellow et al., 2016).
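Equivariance is easy to test numerically. In this sketch the pattern sits in the middle of a $9\times9$ zero image, far enough from the border that the wrap-around of `np.roll` only moves zeros; that margin is an assumption of the demo, not a property of convolution. `correlate2d_valid` is an illustrative helper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate2d_valid(x, K):
    """Valid cross-correlation, stride 1."""
    windows = sliding_window_view(x, K.shape)
    return np.einsum('ijmn,mn->ij', windows, K)

K = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)

x = np.zeros((9, 9))
x[3:6, 3:6] = np.arange(1, 10).reshape(3, 3)        # small pattern in the centre

x_shifted = np.roll(x, shift=(1, 1), axis=(0, 1))   # move it one pixel down and right

y = correlate2d_valid(x, K)
y_shifted = correlate2d_valid(x_shifted, K)

# Filtering the shifted image gives the shifted feature map.
equivariant = np.allclose(y_shifted, np.roll(y, shift=(1, 1), axis=(0, 1)))
print(equivariant)   # True
```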

▲ Exam-favourite numbers

AlexNet's first layer is the classic counting drill. Input $227\times227\times3$, filter $F=11$, stride $S=4$, no padding, 96 filters. Output: $(227-11)/4 + 1 = 55$, so $55\times55\times96 = 290{,}400$ output neurons. Each neuron connects to $11\times11\times3 = 363$ weights. Without sharing that would be $290{,}400\times363 \approx 105$ million weights; with sharing it is just $96\times11\times11\times3 = 34{,}848$ unique weights, plus 96 biases (CS231n). Sharing cuts the weight count by a factor of $55\times55 = 3025$, roughly 3000.
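The whole drill as arithmetic you can rerun, using only the numbers quoted above. Variable names are illustrative.

```python
# AlexNet first layer, figures as quoted in the text
W, F, S, P = 227, 11, 4, 0
in_channels, n_filters = 3, 96

out = (W - F + 2 * P) // S + 1                 # 55
neurons = out * out * n_filters                # 290400 output values
weights_per_neuron = F * F * in_channels       # 363

unshared = neurons * weights_per_neuron        # 105415200 if every neuron had its own weights
shared = n_filters * weights_per_neuron        # 34848 with one filter per output channel
biases = n_filters                             # 96

print(out, neurons, weights_per_neuron)        # 55 290400 363
print(unshared, shared, unshared // shared)    # 105415200 34848 3025
```

The reduction factor is exactly the number of spatial positions, $55\times55$, because each filter's weights are reused once per position.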

The receptive field is the region of the original input that one output value depends on. A single $3\times3$ conv has a $3\times3$ receptive field. Stack two such layers and the second layer's output depends on a $5\times5$ patch of the input; stack three and it is $7\times7$. The handout's formula:

$$\text{RF} = 1 + L\,(F - 1),$$

where $L$ is the number of stacked layers and $F$ the filter size, assuming every layer uses stride 1. This is the trick behind VGG: two stacked $3\times3$ convs cover the same $5\times5$ receptive field as one $5\times5$ conv, but use only $2\times9 = 18$ weights per channel pair instead of 25, a 28% saving, and add an extra nonlinearity in between (the reasoning behind VGG's $3\times3$ design).
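A short sketch that checks the closed form against layer-by-layer growth and reproduces the 28% figure. It assumes stride 1 and equal filter sizes in every layer; the function names are illustrative.

```python
def receptive_field(L, F):
    """Closed form from the text: RF = 1 + L * (F - 1), stride 1 assumed."""
    return 1 + L * (F - 1)

def receptive_field_by_growth(L, F):
    """Same quantity built up one layer at a time: each layer adds F - 1 pixels."""
    rf = 1
    for _ in range(L):
        rf += F - 1
    return rf

for L in (1, 2, 3):
    print(L, receptive_field(L, 3), receptive_field_by_growth(L, 3))
# 1 3 3
# 2 5 5
# 3 7 7

stacked = 2 * 3 * 3          # two 3x3 layers: 18 weights per channel pair
single = 5 * 5               # one 5x5 layer: 25 weights per channel pair
saving = 1 - stacked / single
print(round(saving, 2))      # 0.28
```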

What a convolution layer buys you
  • Local connectivity: each output sees only a small patch, matching how image structure is local.
  • Parameter sharing: one filter, reused everywhere, so parameter count is independent of image size.
  • Translation equivariance: shift the input, the feature map shifts with it.
  • Hierarchy by stacking: early layers catch edges, deeper layers catch objects, as the receptive field grows.

6. In NumPy: convolution from scratch

The whole operation is a few lines. Compute the output size with the formula, then for each output position take the dot product of the filter with the patch underneath it. This matches the post-lecture exercise: implement conv2d with NumPy only and check it against scipy.signal.correlate2d.

conv2d.py · valid convolution (cross-correlation), single channel
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    # zero-pad the border so the filter can reach edge pixels
    if pad:
        image = np.pad(image, pad, mode='constant')
    H, W = image.shape
    F = kernel.shape[0]
    # output size: floor((W - F + 2P)/S) + 1, here pad already applied
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r+F, c:c+F]      # the F x F window
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

# vertical-edge (Sobel-style) filter
K = np.array([[1, 0, -1],
              [2, 0, -2],
              [1, 0, -1]])
img = np.eye(7)                  # a diagonal pattern
print(conv2d(img, K).shape)      # -> (5, 5):  (7 - 3)//1 + 1 = 5

The np.sum(patch * kernel) line is the dot product from section 2. Everything else is bookkeeping for where the window sits. Swap in a different kernel and the same loop detects a different pattern.

Primary source to read next: Stanford's CS231n convolutional-networks notes are the cleanest treatment of the output-size formula, parameter sharing, and the AlexNet counts. For raw visual intuition, watch 3Blue1Brown's "But what is a convolution?" and play with the interactive CNN Explainer. The original idea is in LeCun et al., 1998 (LeNet). Stuck on an output-size calculation or the AlexNet 290,400 figure? Ask me and we will work it pixel by pixel.