Convolutions
How a 9-number filter sees an entire image, no matter how big it gets.
Feed a tiny $128\times128$ colour photo into one fully-connected layer of 256 neurons and you have already burned 12.6 million weights, before the network learns a single thing. Worse, that layer treats pixel $(0,0)$ and pixel $(64,64)$ as unrelated strangers, even though a cat's ear looks the same wherever it sits in the frame. A convolution fixes both problems with one move: scan the image with a small filter of shared weights. The same 9 numbers, reused everywhere.
1. Images break fully-connected layers
A fully-connected (FC) layer wires every input to every output. That is fine for a 100-feature tabular row. It falls apart on an image, for two separate reasons.
Reason one: the parameter count explodes. Flatten a $128\times128\times3$ image and you get $49{,}152$ inputs. Connect those to a modest layer of 256 neurons and the weight count is
$$128 \times 128 \times 3 \times 256 = 12{,}582{,}912 \text{ weights}.$$
That is one layer. The handout number to remember is 12,582,912: a weight count that size overfits small datasets and is expensive to train. A $224\times224\times3$ image into 1000 neurons would need over 150 million weights.
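The two counts above can be reproduced with a few lines of Python. This is a minimal sketch; the helper name `fc_weight_count` is made up for illustration, and biases are left out, as in the text.

```python
def fc_weight_count(height, width, channels, neurons):
    """Weights in one fully-connected layer fed by a flattened image (biases excluded)."""
    inputs = height * width * channels   # flattening turns the image into one long vector
    return inputs * neurons              # every input connects to every neuron

print(fc_weight_count(128, 128, 3, 256))    # 12582912
print(fc_weight_count(224, 224, 3, 1000))   # 150528000
```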
Reason two: flattening throws away geometry. Once you flatten, the network has no idea that two pixels were neighbours. It has to relearn, from scratch, that adjacent pixels relate to each other, and it has to relearn that fact separately for every position in the image.
Images have two properties an FC layer ignores. Locality: a pixel's meaning comes mostly from its immediate neighbours, not from a pixel 200 rows away. Translation invariance: an edge or a texture means the same thing wherever it appears. A layer built to exploit both will need far fewer weights and will generalise better. That layer is the convolution.
2. The convolution is slide, multiply, sum
Take a small grid of weights, the filter (or kernel), say $3\times3$. Place it over the top-left corner of the image. Multiply each filter weight by the pixel underneath it, add up all 9 products, and write that single number into the output. Then slide the filter one step to the right and repeat. When you reach the edge, drop down a row and start again.
Formally, for filter $K$ over an input patch, each output value is
$$y_{i,j} = \sum_{m}\sum_{n} K_{m,n}\,\cdot\,x_{\,i+m,\,j+n}.$$
That inner double sum is just a dot product between the filter and the patch of image it currently covers. A convolution is a dot product, repeated at every location.
Think of the filter as a tiny stencil for one pattern: a vertical edge, a patch of blue, a corner. Where the image matches the stencil, the dot product is large and the output lights up. Where it does not match, the output stays near zero. Sliding the same stencil everywhere is how the layer answers "where does this pattern occur?" across the whole image with one set of 9 weights.
A note on naming. What deep-learning libraries call "convolution" is technically cross-correlation: the real mathematical convolution first flips the kernel. The flip does not matter for learning, because the network just learns the weights in whatever orientation it needs. As Goodfellow's text puts it, learning with cross-correlation instead of true convolution simply gives you a flipped kernel, and the result is identical. (Deep Learning, ch. 9.)
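The flip is easy to see numerically. The sketch below is illustrative only: `cross_correlate` and `true_convolve` are hypothetical helper names, both cover every fully overlapping position with no padding and stride 1, and the kernel is deliberately asymmetric so the flip changes the result.

```python
import numpy as np

def cross_correlate(x, k):
    """Slide k over x without flipping: what deep-learning libraries call convolution."""
    f = k.shape[0]
    out_h, out_w = x.shape[0] - f + 1, x.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + f, j:j + f] * k)
    return out

def true_convolve(x, k):
    """Mathematical convolution: kernel indices run backwards relative to the image."""
    f = k.shape[0]
    out_h, out_w = x.shape[0] - f + 1, x.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            for m in range(f):
                for n in range(f):
                    out[i, j] += k[m, n] * x[i + f - 1 - m, j + f - 1 - n]
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
k = np.array([[1., 2., 0.],
              [0., 1., 0.],
              [0., 0., 3.]])   # asymmetric on purpose

# Same kernel, two operations: the outputs differ.
print(np.allclose(cross_correlate(x, k), true_convolve(x, k)))            # False
# Flip the kernel on both axes and cross-correlation matches convolution.
print(np.allclose(cross_correlate(x, np.flip(k)), true_convolve(x, k)))   # True
```

Because the weights are learned, a network trained with either operation ends up with the same function; only the stored orientation of the kernel differs.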
Different filters detect different things: edge, blur, and sharpen kernels each pick out a different pattern.
3. One formula sets every output size
How big is the output? Count how many positions the filter can occupy as it slides. With an input of width $W$, a filter of size $F$, padding $P$ on each side, and stride $S$:
$$\text{out} = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1.$$
Read it from the pieces. The filter's left edge can start at position 0 and slide until its right edge hits the (padded) boundary, which leaves $W - F + 2P$ pixels of travel. Divide by the stride to count the steps, and add 1 for the starting position. The same formula runs independently for height. (Source: Stanford CS231n.)
The handout's worked case: a $13\times13$ input, a $5\times5$ filter, padding $P=2$, stride $S=2$.
$$\frac{13 - 5 + 2(2)}{2} + 1 = \frac{12}{2} + 1 = 7.$$
So the output is $7\times7$. Memorise the three sub-steps: numerator $W - F + 2P = 12$, divide by stride $=6$, add one $=7$. Examiners love handing you $W,F,P,S$ and asking for the output size.
One trap: the numerator must be divisible by the stride, or the filter does not tile the input cleanly. CS231n's example: $W=10,\,F=3,\,P=0,\,S=2$ gives $(10-3)/2 + 1 = 4.5$, which is not a whole number, so that configuration is invalid. The floor in the formula is what frameworks use to round it down, but you should treat a non-integer result as a sign your hyperparameters do not fit.
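As a quick way to practise the formula, here is a small sketch that returns both the floored output size and a flag for whether the stride divides the travel exactly. The function name `conv_output_size` and the returned tuple are choices made for this example, not part of any library.

```python
def conv_output_size(w, f, p=0, s=1):
    """Return (size, fits) for input width w, filter f, padding p, stride s.

    size is floor((w - f + 2p) / s) + 1.
    fits is True when the stride divides the travel with no remainder.
    """
    travel = w - f + 2 * p
    if travel < 0:
        raise ValueError("filter is larger than the padded input")
    return travel // s + 1, travel % s == 0

print(conv_output_size(13, 5, p=2, s=2))   # (7, True): the worked case
print(conv_output_size(10, 3, p=0, s=2))   # (4, False): 4.5 floored to 4
print(conv_output_size(7, 3))              # (5, True): no padding, stride 1
```

A `False` flag is the signal to revisit $W, F, P, S$ rather than rely on the silent round-down.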
4. Padding and stride are your two dials
Without padding, every convolution shrinks the image. A $3\times3$ filter on a $7\times7$ input (stride 1) gives $(7-3)/1 + 1 = 5$, so you lose a ring of pixels. Stack a few layers and the picture vanishes. The fix is zero-padding: ring the input with zeros so the filter can sit over the border pixels too.
The padding that keeps the output the same size as the input is called "same" padding. For stride 1 and an odd filter size $F$, set
$$P = \frac{F - 1}{2}.$$
For a $3\times3$ filter that is $P=1$; for $5\times5$ it is $P=2$. Check it: $7\times7$ input, $3\times3$ filter, $P=1$, stride 1 gives $(7-3+2)/1 + 1 = 7$. Same size in, same size out. (Source: CS231n.)
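The same check can be run for several filter sizes at once. This is a minimal sketch with hypothetical helper names (`same_padding`, `output_size`); it rejects even filter sizes because $(F-1)/2$ is not a whole number for them.

```python
def same_padding(f):
    """Padding per side that preserves the size at stride 1 (odd f only)."""
    if f % 2 == 0:
        raise ValueError("(f - 1) / 2 is not a whole number for even f")
    return (f - 1) // 2

def output_size(w, f, p, s=1):
    """floor((w - f + 2p) / s) + 1, the output-size formula."""
    return (w - f + 2 * p) // s + 1

for f in (3, 5, 7):
    p = same_padding(f)
    # the output width stays 7 for every odd filter size
    print(f, p, output_size(7, f, p))
```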
Stride is how far the filter jumps each step. Stride 1 looks at every position; stride 2 skips every other one, roughly halving each output dimension and downsampling the feature map. Larger strides mean a smaller, cheaper output but coarser spatial detail.
A pooling layer is a stride-based downsampler with no learned weights. Max pooling with $F=2,\,S=2$ takes each $2\times2$ block and keeps only its maximum, downsampling each dimension by 2 and discarding 75% of the activations (CS231n). The win: the network becomes a little invariant to small shifts, and the feature maps get cheaper. The cost: you keep "a strong vertical edge is somewhere in this $2\times2$ region" but lose exactly where, which hurts tasks needing pixel precision like segmentation.
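Max pooling is short enough to write out. The sketch below assumes a single-channel 2D array and uses a made-up function name, `max_pool2d`; it is an illustration of the idea, not any framework's implementation.

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Keep the maximum of each f x f window, moving s pixels per step."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.zeros((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * s:i * s + f, j * s:j * s + f].max()
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 0, 5, 6],
              [1, 2, 7, 8]])

pooled = max_pool2d(x)
print(pooled.tolist())               # [[4, 2], [2, 8]]
print(1 - pooled.size / x.size)      # 0.75 of the activations are discarded
```

There are no weights anywhere in the function, which is why pooling adds nothing to the parameter count.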
6. In NumPy: convolution from scratch
The whole operation is a few lines. Compute the output size with the formula, then for each output position take the dot product of the filter with the patch underneath it. This matches the post-lecture exercise: implement conv2d with NumPy only and check it against scipy.signal.correlate2d (use mode='valid' so the shapes agree when pad is 0 and stride is 1).
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    # zero-pad the border so the filter can reach edge pixels
    if pad:
        image = np.pad(image, pad, mode='constant')
    H, W = image.shape
    F = kernel.shape[0]
    # output size: floor((W - F + 2P)/S) + 1, here pad already applied
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r+F, c:c+F]          # the F x F window
            out[i, j] = np.sum(patch * kernel)   # element-wise multiply, then sum
    return out

# vertical-edge (Sobel-style) filter
K = np.array([[1, 0, -1],
              [2, 0, -2],
              [1, 0, -1]])
img = np.eye(7)               # a diagonal pattern
print(conv2d(img, K).shape)   # -> (5, 5): (7 - 3)//1 + 1 = 5
The np.sum(patch * kernel) line is the dot product from section 2. Everything else is bookkeeping for where the window sits. Swap in a different kernel and the same loop detects a different pattern.
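To see a kernel swap change what gets detected, here is a small self-contained sketch. It uses a stripped-down helper, `conv2d_valid` (a hypothetical name; no padding, stride 1), and a made-up test image that is bright on the left and dark on the right, so it contains one vertical edge and no horizontal ones.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide, multiply, sum: no padding, stride 1, square kernel assumed."""
    f = kernel.shape[0]
    out_h, out_w = image.shape[0] - f + 1, image.shape[1] - f + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + f, j:j + f] * kernel)
    return out

# 5 x 6 image: columns 0-2 are bright (1), columns 3-5 are dark (0)
img = np.zeros((5, 6))
img[:, :3] = 1.0

vertical = np.array([[1, 0, -1],
                     [2, 0, -2],
                     [1, 0, -1]])   # responds to left-right changes
horizontal = vertical.T              # responds to top-bottom changes

v_out = conv2d_valid(img, vertical)
h_out = conv2d_valid(img, horizontal)

print(v_out[0].tolist())   # [0.0, 4.0, 4.0, 0.0]: lights up where the window straddles the edge
print(h_out.any())         # False: no horizontal edge anywhere, so every output is zero
```

The loop is identical in both calls; only the 9 weights changed.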