Regularization
A network with more knobs than data points will memorize the answers. Here is how we make it learn the pattern instead.
A modern network can have eleven million weights and see only fifty thousand training images. That is about $220$ free parameters for every single example, more than enough room to store the answer to each one by heart. Left alone, the network does exactly that: it drives training loss to zero by memorizing, including the noise, and then fails on anything new. Regularization is the set of tricks that take that capacity away on purpose, so the network is forced to find the rule rather than the lookup table.
1. Overfitting is the default, not the accident
Start with a number from the handout. An ImageNet-scale model with $11\text{M}$ parameters trained on only $50\text{K}$ images (the size of the CIFAR-10 training set) has $11{,}000{,}000 / 50{,}000 = 220$ parameters per sample. With that many degrees of freedom, the model can fit any labelling of the training set, even a random one.
That is not a thought experiment. Zhang and colleagues trained standard image networks on CIFAR-10 after shuffling the labels to pure noise, and the networks still reached near-zero training error. They had simply memorized which random label went with which image (Zhang et al., 2017). The same paper showed that turning regularization on or off barely changed this: explicit regularizers are not the only thing standing between you and memorization (Lil'Log, "Are Deep Neural Networks Dramatically Overfitted?").
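A toy version of the same effect, not the experiment from the paper: give a model exactly as many free coefficients as there are training points and it can reproduce any labels, including coin flips. The sketch below uses a Lagrange interpolating polynomial as a stand-in for an over-parameterized network; the eight inputs, the seed, and the function name are assumptions chosen for illustration.

```python
import random


def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through the points (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


rng = random.Random(0)                      # fixed seed, illustrative only
xs = [i / 7 for i in range(8)]              # 8 training inputs
ys = [rng.choice([-1.0, 1.0]) for _ in xs]  # labels are pure noise

# 8 coefficients for 8 points: every random label is reproduced.
train_preds = [lagrange_predict(xs, ys, x) for x in xs]
train_error = max(abs(p - y) for p, y in zip(train_preds, ys))
print(train_error)  # zero up to rounding
```

Zero training error here says nothing about the rule behind the data, because there is no rule: the labels were random. That is the sense in which a perfect training score is not evidence of learning.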
Training minimizes loss on the data you have. Nothing in that objective says "stay simple". So a model with spare capacity will use it to drive training loss down, fitting signal and noise alike. The gap that opens between low training error and high validation error is overfitting. Every regularizer is one answer to the same question: how do we add a preference for simpler explanations without being told which explanation is right?
The picture to hold in your head is two curves against model complexity. Training error falls forever. Validation error falls, bottoms out, then climbs back up. That bottom is the model we want. Everything in this lesson is a way to find it or move it lower.
2. L2 weight decay: give the weights a budget
The simplest pressure toward simplicity is to make large weights cost something. Add a penalty on the squared size of every weight to the loss:
$$\tilde{L}(w) = L(w) + \frac{\lambda}{2}\sum_i w_i^2$$
The $\tfrac12$ is there so the derivative comes out clean. Differentiating the penalty gives $\lambda w$, so the gradient of the full objective is $\nabla L + \lambda w$ (CS231n, Neural Nets II). Plug that into one gradient descent step and the update separates into two parts:
$$w \;\leftarrow\; w - \eta(\nabla L + \lambda w) \;=\; \underbrace{(1 - \eta\lambda)\,w}_{\text{shrink}} \;-\; \underbrace{\eta\,\nabla L}_{\text{usual step}}$$
Read the two pieces. Before the network even looks at the gradient, every weight is multiplied by $(1 - \eta\lambda)$, a number slightly below 1. That is the "decay" in weight decay: each step pulls the weight vector a little closer to the origin. The data still gets a vote through $\nabla L$, but it now has to keep paying rent to hold a weight away from zero.
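To get a feel for how gentle that shrink is, switch the data gradient off and apply the factor repeatedly. After $k$ steps the weight is $(1 - \eta\lambda)^k$ times its starting value. The sketch below assumes $\eta = 0.1$ and $\lambda = 10^{-4}$, which are illustrative values rather than a recommendation.

```python
import math

lr, lam = 0.1, 1e-4        # assumed learning rate and decay strength
shrink = 1.0 - lr * lam    # per-step factor (1 - eta * lambda)

# With a zero data gradient, k steps give w_k = shrink**k * w_0.
w0 = 2.0
w_after_1000 = w0 * shrink ** 1000

# Number of steps for decay alone to halve a weight.
half_life = math.log(0.5) / math.log(shrink)

print(w_after_1000)  # about 1.98: a 1% shrink after 1000 steps
print(half_life)     # roughly 69,000 steps
```

The pull is weak on any single step, so a weight the data keeps voting for stays put, while a weight the data stops caring about drifts to zero over tens of thousands of steps.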
Geometrically, the penalty is a circular budget centered at the origin. The optimizer settles where the data's pull and the budget's pull balance, which favors small, spread-out weights over a few spiky ones. CS231n puts it as preferring "diffuse" weight vectors that use many inputs a little over "peaky" ones that lean hard on a few. L1 regularization, which penalizes $\lambda\sum_i |w_i|$ instead, pushes many weights to exactly zero and gives you a sparse model; L2 keeps everything small but nonzero.
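The difference between "small but nonzero" and "exactly zero" shows up in a single update. The sketch below compares the L2 shrink with a soft-threshold step for L1. The soft-threshold (proximal) form is an assumption made here because it is a common way to optimize an L1 penalty; a plain subgradient step would hover around zero rather than land on it. The step size and penalty are deliberately large so the effect is visible.

```python
def l2_shrink(w, lr, lam):
    # L2: multiply every weight by (1 - lr * lam); nothing reaches zero.
    return [(1.0 - lr * lam) * wi for wi in w]


def l1_soft_threshold(w, lr, lam):
    # L1 (proximal step): move each weight toward zero by lr * lam,
    # and clamp to exactly zero if it would cross over.
    t = lr * lam
    out = []
    for wi in w:
        if abs(wi) <= t:
            out.append(0.0)
        else:
            out.append(wi - t if wi > 0 else wi + t)
    return out


w = [0.8, -0.03, 0.01, -1.5]
lr, lam = 0.1, 0.5   # illustrative values, so the threshold lr * lam is 0.05

after_l2 = l2_shrink(w, lr, lam)
after_l1 = l1_soft_threshold(w, lr, lam)
print(after_l2)  # all four weights scaled by 0.95, none zero
print(after_l1)  # the two small weights are now exactly 0.0
```

L2 removes a fixed fraction of each weight, so small weights lose almost nothing. L1 removes a fixed amount, so small weights are wiped out first.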
Memorize the SGD-with-weight-decay update: $w \leftarrow (1 - \eta\lambda)\,w - \eta\,\nabla L$. The $(1-\eta\lambda)$ factor shrinks weights toward the origin each step. A typical strength is $\lambda = 10^{-4}$. One catch the exam loves: in Adam, plain L2 and true weight decay are not the same, because Adam's per-parameter scaling distorts the penalty. The fix is AdamW, which applies the decay directly to the weights, decoupled from the adaptive step (Loshchilov & Hutter, 2019).
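The Adam catch is easier to see with numbers. Below is a scalar sketch of one Adam step with L2 folded into the gradient, next to one step with decoupled decay in the style of AdamW. The function names, the hyperparameter defaults, and the choice to scale the decay by the learning rate are assumptions for illustration; library implementations differ in details.

```python
import math


def adam_direction(grad, m, v, t, b1=0.9, b2=0.999, eps=1e-8):
    """Update Adam's moment estimates and return the normalized step direction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)   # bias correction
    v_hat = v / (1 - b2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, grad, m, v, t, lr=1e-3, lam=1e-4):
    # L2 in the gradient: the penalty passes through Adam's rescaling.
    direction, m, v = adam_direction(grad + lam * w, m, v, t)
    return w - lr * direction, m, v


def adamw_step(w, grad, m, v, t, lr=1e-3, lam=1e-4):
    # Decoupled decay: Adam sees only the data gradient,
    # and the decay term acts on the weight directly.
    direction, m, v = adam_direction(grad, m, v, t)
    return w - lr * (direction + lam * w), m, v


# First step (t=1) with zero data gradient, for a large and a small weight.
for w0 in (2.0, 0.02):
    w_l2, _, _ = adam_l2_step(w0, 0.0, 0.0, 0.0, t=1)
    w_dec, _, _ = adamw_step(w0, 0.0, 0.0, 0.0, t=1)
    print(w0, w0 - w_l2, w0 - w_dec)
```

With L2 inside Adam, both weights move by about the learning rate, $10^{-3}$, because the normalization cancels the size of the penalty gradient. The weight of $0.02$ loses about 5% of its value while the weight of $2.0$ loses about 0.05%. With decoupled decay the shrink is proportional again: each weight loses the same fraction, $\eta\lambda$.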
3. Dropout: train an ensemble inside one network
Dropout attacks overfitting from a different angle. On each training step, pick a random fraction of the neurons in a layer and set their outputs to zero for that step. A neuron that might be dropped at any moment cannot rely on one specific partner being present, so the layer is pushed to spread its computation across many neurons instead of building one fragile, co-adapted path.
Here is the framing the exam asks about. Each random mask defines a different "thinned" sub-network. A layer of $n$ neurons has $2^n$ possible on/off patterns, so over training you are training exponentially many sub-networks that all share weights (Srivastava et al., 2014). At test time you want the average of all those sub-networks, which is the ensemble prediction, but running $2^n$ networks is impossible.
Think of a group project where, at every meeting, a few random members are absent. Nobody can be the single point of failure, so everyone ends up learning the whole task. The finished team, with everyone present, is stronger and more redundant than one where each person owned exactly one job.
The trick that makes test time cheap is a scaling correction. If you keep each neuron with probability $p$ during training, then on average a neuron's output is only $p$ times what it would be with everyone present. So at test time, with all neurons on, the layer's signal is too big by a factor of $1/p$. Inverted dropout fixes this during training: divide the kept activations by $p$ as you mask, so their expected value already matches the full network. Then test time needs no change at all (CS231n).
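For a tiny layer you can check the scaling claim exactly by enumerating every mask instead of sampling. The sketch below averages the inverted-dropout output over all $2^n$ masks, each weighted by its probability, and compares the result with the unmasked activations. The three activation values and the helper names are made up for illustration.

```python
from itertools import product


def inverted_dropout(x, mask, p):
    # Zero the dropped units and divide the kept ones by the keep-probability p.
    return [xi * mi / p for xi, mi in zip(x, mask)]


def expected_output(x, p):
    """Exact average of inverted dropout over all 2**n masks."""
    n = len(x)
    avg = [0.0] * n
    for mask in product([0, 1], repeat=n):
        kept = sum(mask)
        prob = p ** kept * (1 - p) ** (n - kept)   # probability of this mask
        out = inverted_dropout(x, mask, p)
        avg = [a + prob * o for a, o in zip(avg, out)]
    return avg


x = [1.0, -2.0, 0.5]
print(expected_output(x, p=0.5))  # [1.0, -2.0, 0.5]
```

The average over all eight thinned versions equals the full activations, which is why the test pass can use every neuron with no rescaling. For a single layer's activations this match is exact; through the nonlinear layers that follow, the full network only approximates the ensemble average.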
With keep-probability $p=0.5$, a layer of $n$ neurons spans $2^n$ sub-networks, and inference with the full network approximates averaging them. Inverted dropout scales kept activations by $1/p$ at train time; the test pass is unscaled. The classic catch: if you forget model.eval(), dropout stays on at inference and every forward pass returns a different noisy answer.
4. Early stopping and data augmentation
The cheapest regularizer asks for no penalty and no extra code in the model. Just watch the validation loss while you train. It falls with the training loss for a while, reaches a minimum, then starts to rise even as training loss keeps dropping. That turning point is where the network stops learning the pattern and starts memorizing the training set. Early stopping means: keep the checkpoint from the validation minimum and throw away everything after it.
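In a training loop this comes down to remembering the best validation loss so far and giving up after it has failed to improve for a while. The sketch below runs that logic over a list of validation losses. The patience parameter, the function name, and the loss values are assumptions for illustration; a real loop would save model weights at the point marked in the comment.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, best_loss, epochs_run) for a sequence of validation losses."""
    best_epoch, best_loss = None, float("inf")
    epochs_run = 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # New validation minimum: a real loop would save a checkpoint here.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training.
            break
    return best_epoch, best_loss, epochs_run


val = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
print(early_stopping(val))  # (3, 0.5, 7)
```

Training runs for seven epochs, but the model you deploy is the checkpoint from epoch 3, the validation minimum. The patience window exists because validation loss is noisy, and one bad epoch is not yet a trend.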
Why does stopping early act like a penalty on weights? Because weights start small near initialization and grow as training proceeds. Cutting training short caps how far they can travel from their starting point, which has much the same effect as L2 pulling them back toward zero. Less training time means a smaller effective weight budget.
The second panel of the simulator below makes this concrete: drag the marker to the validation minimum and watch the deploy point land before the validation curve turns up.
Data augmentation takes the opposite route. Instead of shrinking the model, it grows the data. For images you apply label-preserving transforms (a horizontal flip, a small random crop, a color shift), so one labelled cat becomes dozens of slightly different cats that all still read as "cat". The model sees more variety than you collected, so it learns features that survive those transforms rather than pixel-exact templates. The handout's recipe is RandomHorizontalFlip plus RandomCrop(32, padding=4) plus ColorJitter on CIFAR-10.
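To see what two of those transforms do to the pixels, here is a NumPy sketch of a random horizontal flip and a pad-then-crop. It mimics the behaviour the handout's recipe names but is not the library implementation: the function names, the zero padding, and the height-width-channel array layout are assumptions, and the color shift is left out.

```python
import numpy as np


def random_horizontal_flip(img, rng, prob=0.5):
    # img has shape (H, W, C); reversing the width axis mirrors the image.
    return img[:, ::-1, :] if rng.random() < prob else img


def random_crop(img, rng, size=32, padding=4):
    # Pad the borders with zeros, then cut a size x size window at a random offset.
    padded = np.pad(img, ((padding, padding), (padding, padding), (0, 0)))
    top = rng.integers(0, padded.shape[0] - size + 1)
    left = rng.integers(0, padded.shape[1] - size + 1)
    return padded[top:top + size, left:left + size, :]


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # stand-in for one CIFAR-10 image
augmented = random_crop(random_horizontal_flip(image, rng), rng)
print(augmented.shape)  # (32, 32, 3)
```

The output has the same shape and the same label as the input, but the object may be mirrored and shifted by up to four pixels in each direction. Applying fresh random draws every epoch means the network rarely sees the exact same pixels twice.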
5. Which regularizer, in what order
When you see 99% training accuracy and 72% validation accuracy, that gap is the symptom and you have a fixed playbook. The handout's order is worth memorizing because the exam asks for it directly.
- Data augmentation first: free training data, label-preserving transforms, almost always helps.
- Weight decay next: AdamW with $\lambda \approx 10^{-4}$ as a sane default.
- Early stopping: monitor validation loss, keep the best checkpoint.
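- Dropout with $p$ in the $0.3$ to $0.5$ range, typically after the fully connected layers. Check which convention your library uses: PyTorch's nn.Dropout takes the probability of dropping a unit, whereas $p$ in section 3 is the keep-probability.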
- Still overfitting? Get more data or use a smaller model.
Reach for the simulator now. With $\lambda = 0$ the fitted curve whips through every point, noise included: that is overfitting. Turn $\lambda$ up and the curve smooths toward the true trend; train error rises steadily while validation error traces out the U you read about in section 1. Push $\lambda$ too far and the curve flattens into a line that misses the trend entirely: that is underfitting. The sweet spot is the bottom of the validation U.
Notice the asymmetry the simulator shows. Train error keeps falling as you relax the penalty, because a wigglier curve always fits the given points better. Validation error does not: it bottoms out and rises. The whole job of regularization is to ignore what train error wants and chase the bottom of the validation curve.
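If you want to reproduce that sweep outside the simulator, a polynomial fit with an L2 penalty is enough. The sketch below is not the simulator's code. It assumes a degree-9 polynomial, a noisy sine as the true trend, 15 training points, and a closed-form ridge solution, all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + 0.3 * rng.normal(size=n)   # assumed trend plus noise
    return x, y


def features(x, degree=9):
    # Columns are 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)


def ridge_fit(X, y, lam):
    # Minimizes 0.5 * ||Xw - y||^2 + (lam / 2) * ||w||^2 in closed form.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))


x_tr, y_tr = make_data(15)
x_va, y_va = make_data(200)
X_tr, X_va = features(x_tr), features(x_va)

lams = [1e-6, 1e-4, 1e-2, 1.0, 100.0]
train_err, val_err = [], []
for lam in lams:
    w = ridge_fit(X_tr, y_tr, lam)
    train_err.append(mse(X_tr, y_tr, w))
    val_err.append(mse(X_va, y_va, w))
    print(f"lam={lam:g}  train={train_err[-1]:.4f}  val={val_err[-1]:.4f}")
```

Train error can only go up as $\lambda$ grows, since the penalty pulls the fit away from the training points. Validation error is the column to watch: on data like this it typically dips at an intermediate $\lambda$ and rises on both sides, though the exact location depends on the noise and the sample.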
6. In NumPy: weight decay and inverted dropout
Both core regularizers fit in a few lines. Weight decay adds $\lambda w$ to the gradient before the step; inverted dropout masks and rescales during training and does nothing at test time.
```python
import numpy as np


def sgd_step(w, grad, lr=0.1, lam=1e-4):
    # L2 / weight decay: d/dw of (lam/2) * w^2 is lam * w
    total_grad = grad + lam * w
    # Algebraically equal to (1 - lr*lam) * w - lr * grad
    return w - lr * total_grad


def dropout_forward(x, p=0.5, train=True):
    # p = keep-probability. Inverted dropout: scale at TRAIN time.
    if not train:
        return x  # test time: no change, no scaling
    # Keep each unit with probability p, then divide the survivors by p
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask  # expected value matches x


# Check the math: with lr=0.1, lam=1e-4, the decay factor is
# (1 - lr*lam) = 1 - 1e-5, so each step shrinks w toward 0 by that factor.
w = np.array([2.0, -3.0])
print(sgd_step(w, grad=np.zeros_like(w)))  # [ 1.99998 -2.99997]: pure decay
```
The final print is the whole idea of weight decay in one line: with a zero data gradient, the only thing acting on the weights is the shrink factor $(1 - \eta\lambda)$, so they creep toward the origin. The dropout mask divides by $p$ as it zeros neurons, which is why the test-time pass needs no rescaling.
7. Quick check
Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.