Foundations · Week 3 · Session 1

Regularization

A network with more knobs than data points will memorize the answers. Here is how we make it learn the pattern instead.

~17 min read · Exam weight: heavy on OA-1 Week 3 and the match-the-following viva · Builds on Backpropagation

A modern network can have eleven million weights and see only fifty thousand training images. That is about $220$ free parameters for every single example, more than enough room to store the answer to each one by heart. Left alone, the network does exactly that: it drives training loss to zero by memorizing, including the noise, and then fails on anything new. Regularization is the set of tricks that take that capacity away on purpose, so the network is forced to find the rule rather than the lookup table.

1. Overfitting is the default, not the accident

Start with a number from the handout. An ImageNet-scale model with $11\text{M}$ parameters trained on $50\text{K}$ images has roughly $220$ parameters per sample. With that many degrees of freedom, the model can fit any labelling of the training set, even a random one.
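The ratio is quick to check by hand or in code. A minimal sketch, using only the two counts quoted above (the variable names are illustrative):

```python
n_params = 11_000_000    # weights, the handout's 11M figure
n_samples = 50_000       # training images, the handout's 50K figure

params_per_sample = n_params / n_samples
print(params_per_sample)   # 220.0
```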

That is not a thought experiment. Zhang and colleagues trained standard image networks on CIFAR-10 after shuffling the labels to pure noise, and the networks still reached near-zero training error. They had simply memorized which random label went with which image (Zhang et al., 2017). The same paper showed that turning regularization on or off barely changed this: explicit regularizers alone do not stop a large network from memorizing (Lil'Log, "Are Deep Neural Networks Dramatically Overfitted?").
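You can see the same effect at toy scale. The sketch below is not the Zhang et al. experiment: it swaps the image network for a plain linear model with more weights than examples, and the sizes, seed, and variable names are all assumptions made for illustration. The point carries over, though: with spare parameters, even labels that are pure noise can be fitted exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 50                     # more parameters than examples
X = rng.normal(size=(n_samples, n_features))       # random stand-in "images"
y = rng.choice([-1.0, 1.0], size=n_samples)        # labels that are pure noise

# Minimum-norm least-squares fit: one weight per feature, no regularizer.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

train_error = np.mean(np.sign(X @ w) != y)         # fraction misclassified
print(train_error)                                 # 0.0: every noisy label memorized

# Fresh noise-labelled data: there is no rule to learn, so expect about chance.
X_new = rng.normal(size=(1000, n_features))
y_new = rng.choice([-1.0, 1.0], size=1000)
print(np.mean(np.sign(X_new @ w) != y_new))
```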

⊳ First principles

Training minimizes loss on the data you have. Nothing in that objective says "stay simple". So a model with spare capacity will use it to drive training loss down, fitting signal and noise alike. The gap that opens between low training error and high validation error is overfitting. Every regularizer is one answer to the same question: how do we add a preference for simpler explanations without being told which explanation is right?

The picture to hold in your head is two curves against model complexity. Training error falls forever. Validation error falls, bottoms out, then climbs back up. That bottom is the model we want. Everything in this lesson is a way to find it or move it lower.
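To see the two curves as numbers, here is a rough sketch that uses polynomial degree as the complexity knob. The sine-plus-noise data, the sample sizes, and the degree range are all made up for illustration. Training error cannot rise as the degree grows, because each polynomial contains every lower-degree one; where the validation minimum lands depends on the noise and the seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)   # trend plus noise
    return x, y

x_train, y_train = make_data(15)
x_val, y_val = make_data(200)

degrees = list(range(1, 10))
train_mse, val_mse = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)          # least-squares polynomial
    train_mse.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_mse.append(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

best_degree = degrees[int(np.argmin(val_mse))]
for d, tr, va in zip(degrees, train_mse, val_mse):
    print(f"degree {d}: train {tr:.4f}  val {va:.4f}")
print("lowest validation error at degree", best_degree)
```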

2. L2 weight decay: give the weights a budget

The simplest pressure toward simplicity is to make large weights cost something. Add a penalty on the squared size of every weight to the loss:

$$\tilde{L}(w) = L(w) + \frac{\lambda}{2}\sum_i w_i^2$$

The $\tfrac12$ is there so the derivative comes out clean. Differentiating the penalty gives $\lambda w$, so the gradient of the full objective is $\nabla L + \lambda w$ (CS231n, Neural Nets II). Plug that into one gradient descent step and the update separates into two parts:

$$w \;\leftarrow\; w - \eta(\nabla L + \lambda w) \;=\; \underbrace{(1 - \eta\lambda)\,w}_{\text{shrink}} \;-\; \underbrace{\eta\,\nabla L}_{\text{usual step}}$$

Read the two pieces. Before the network even looks at the gradient, every weight is multiplied by $(1 - \eta\lambda)$, a number slightly below 1. That is the "decay" in weight decay: each step pulls the weight vector a little closer to the origin. The data still gets a vote through $\nabla L$, but it now has to keep paying rent to hold a weight away from zero.
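A quick numerical check of that split, as a sketch: the values of $\eta$, $\lambda$, the weights, and the data gradient below are made up. It confirms that the two forms of the update agree, and that with no data gradient the shrink compounds as $(1-\eta\lambda)^k$ over $k$ steps.

```python
import numpy as np

lr, lam = 0.1, 1e-4                        # illustrative eta and lambda
w = np.array([2.0, -3.0])
grad = np.array([0.5, -0.2])               # made-up data gradient

combined = w - lr * (grad + lam * w)       # penalty folded into the gradient
split = (1 - lr * lam) * w - lr * grad     # shrink, then the usual step
print(np.allclose(combined, split))        # True

# With a zero data gradient, k steps shrink w by (1 - lr*lam)**k.
k = 1000
w_k = w.copy()
for _ in range(k):
    w_k = (1 - lr * lam) * w_k
print(np.allclose(w_k, (1 - lr * lam) ** k * w))   # True

# Steps of pure decay needed to halve a weight at these settings.
steps_to_halve = np.log(0.5) / np.log(1 - lr * lam)
print(round(steps_to_halve))               # roughly 69,000
```

The last number shows how gentle the rent is at these settings: the shrink only matters for weights the data gradient is not actively defending.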

L2 does not forbid large weights. It charges for them, and only the weights the data really needs end up paying.

Geometrically, the penalty is a circular budget centered at the origin. The optimizer settles where the data's pull and the budget's pull balance, which favors small, spread-out weights over a few spiky ones. CS231n puts it as preferring "diffuse" weight vectors that use many inputs a little over "peaky" ones that lean hard on a few. L1 regularization, which penalizes $\lambda\sum_i |w_i|$ instead, pushes many weights to exactly zero and gives you a sparse model; L2 keeps everything small but nonzero.
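The sparse-versus-small contrast shows up in a single update. The lesson does not say how the L1 penalty is optimized, so this sketch assumes one common choice, the soft-thresholding (proximal) step; the function names, weights, and the deliberately large $\lambda$ are illustrative.

```python
import numpy as np

def l2_shrink(w, lr, lam):
    # Gradient step on (lam/2)*w^2: multiply by a factor just below 1.
    return (1 - lr * lam) * w

def l1_soft_threshold(w, lr, lam):
    # Proximal step on lam*|w|: move toward 0 by lr*lam, stopping at exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w = np.array([2.0, -0.03, 0.5, 0.01])
lr, lam = 0.1, 0.5                          # large lam so one step shows the effect

w_l2 = l2_shrink(w, lr, lam)
w_l1 = l1_soft_threshold(w, lr, lam)
print(w_l2)   # every entry smaller, none exactly zero
print(w_l1)   # entries with |w| <= lr*lam are now exactly zero
```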

▲ Exam-favourite numbers

Memorize the SGD-with-weight-decay update: $w \leftarrow (1 - \eta\lambda)\,w - \eta\,\nabla L$. The $(1-\eta\lambda)$ factor shrinks weights toward the origin each step. A typical strength is $\lambda = 10^{-4}$. One catch the exam loves: in Adam, plain L2 and true weight decay are not the same, because Adam's per-parameter scaling distorts the penalty. The fix is AdamW, which applies the decay directly to the weights, decoupled from the adaptive step (Loshchilov & Hutter, 2019).
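The Adam catch is easiest to see with the data gradient set to zero, so only the penalty acts. The sketch below is a hand-rolled single step of each variant with the usual default moment constants, not any library's implementation (frameworks order the operations slightly differently), and the function names are illustrative.

```python
import numpy as np

def adam_direction(g, m, v, t, b1=0.9, b2=0.999, eps=1e-8):
    # Adam moment estimates with bias correction.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v

def adam_l2_step(w, grad, m, v, t, lr, lam):
    # L2 folded into the gradient, so the penalty gets rescaled along with it.
    step, m, v = adam_direction(grad + lam * w, m, v, t)
    return w - lr * step, m, v

def adamw_step(w, grad, m, v, t, lr, lam):
    # Decoupled decay: Adam sees only the data gradient; decay acts on w directly.
    step, m, v = adam_direction(grad, m, v, t)
    return w - lr * step - lr * lam * w, m, v

w = np.array([2.0, -3.0])
zero_grad = np.zeros_like(w)
lr, lam = 0.1, 1e-4
m0, v0 = np.zeros_like(w), np.zeros_like(w)

w_l2, _, _ = adam_l2_step(w, zero_grad, m0, v0, t=1, lr=lr, lam=lam)
w_decoupled, _, _ = adamw_step(w, zero_grad, m0, v0, t=1, lr=lr, lam=lam)
print(w - w_l2)          # both weights move by about lr, whatever their size
print(w - w_decoupled)   # each weight moves by lr*lam*w, proportional to its size
```

In the first variant Adam's normalization cancels both $\lambda$ and the weight's magnitude, so the "penalty" is no longer proportional to the weight. In the second, the $(1-\eta\lambda)$ shrink survives intact.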

3. Dropout: train an ensemble inside one network

Dropout attacks overfitting from a different angle. On each training step, pick a random fraction of the neurons in a layer and set their outputs to zero for that step. A neuron that might be dropped at any moment cannot rely on one specific partner being present, so the layer is pushed to spread its computation across many neurons instead of building one fragile, co-adapted path.

Here is the framing the exam asks about. Each random mask defines a different "thinned" sub-network. A layer of $n$ neurons has $2^n$ possible on/off patterns, so over training you are training exponentially many sub-networks that all share weights (Srivastava et al., 2014). At test time you want the average of all those sub-networks, which is the ensemble prediction, but running $2^n$ networks is impossible.
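For a layer small enough to enumerate, you can actually run every sub-network. A sketch with $n = 4$ and a single linear unit on top; the activations and weights are made up. For a linear unit the average over all masks comes out exactly; once nonlinearities sit between layers, the full network only approximates the ensemble.

```python
import itertools
import numpy as np

n = 4
x = np.array([1.0, -2.0, 0.5, 3.0])        # activations feeding one linear unit
w = np.array([0.2, 0.4, -0.6, 0.1])        # made-up weights shared by every sub-network

masks = [np.array(bits, dtype=float) for bits in itertools.product([0, 1], repeat=n)]
print(len(masks))                           # 16, i.e. 2**n

# With keep-probability 0.5 every mask is equally likely, so the ensemble
# prediction is the plain mean over all thinned sub-networks.
ensemble_mean = np.mean([(mask * x) @ w for mask in masks])
full_output = x @ w
print(ensemble_mean, 0.5 * full_output)     # the two agree for a linear unit
```

So one pass through the full network, scaled by the keep-probability, stands in for all sixteen sub-networks here.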

◆ Intuition

Think of a group project where, at every meeting, a few random members are absent. Nobody can be the single point of failure, so everyone ends up learning the whole task. The finished team, with everyone present, is stronger and more redundant than one where each person owned exactly one job.

The trick that makes test time cheap is a scaling correction. If you keep each neuron with probability $p$ during training, then on average a neuron's output is only $p$ times what it would be with everyone present. So at test time, with all neurons on, the layer's signal is too big by a factor of $1/p$. Inverted dropout fixes this during training: divide the kept activations by $p$ as you mask, so their expected value already matches the full network. Then test time needs no change at all (CS231n).
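A sampling check of that argument, as a sketch: $p = 0.8$, the activations, the seed, and the trial count are arbitrary choices. Averaged over many random masks, plain masking gives about $p$ times the activation, while masking and dividing by $p$ gives back the activation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                                        # keep-probability
x = np.array([1.0, -2.0, 3.0, 0.5])
n_trials = 200_000

keep = rng.random((n_trials, x.size)) < p      # one random mask per trial
plain_mean = (keep * x).mean(axis=0)           # mask only, no rescaling
inverted_mean = (keep * x / p).mean(axis=0)    # mask and divide by p

print(plain_mean)      # close to p * x
print(inverted_mean)   # close to x
```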

▲ Exam-favourite numbers

With keep-probability $p=0.5$, a layer of $n$ neurons spans $2^n$ sub-networks, and inference with the full network approximates averaging them. Inverted dropout scales kept activations by $1/p$ at train time; the test pass is unscaled. The classic catch: if you forget model.eval(), dropout stays on at inference and every forward pass returns a different noisy answer.
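The forgotten-eval bug is easy to reproduce without a framework. `TinyDropoutLayer` below is a hypothetical NumPy stand-in for a layer with a train/eval switch, written only to mimic the behaviour described above; it is not PyTorch's API.

```python
import numpy as np

class TinyDropoutLayer:
    """Hypothetical stand-in for a framework layer with a train/eval switch."""

    def __init__(self, p_keep=0.5, seed=0):
        self.p_keep = p_keep
        self.training = True                   # starts in training mode
        self.rng = np.random.default_rng(seed)

    def eval(self):
        self.training = False

    def __call__(self, x):
        if not self.training:
            return x                           # inference: identity
        mask = self.rng.random(x.shape) < self.p_keep
        return x * mask / self.p_keep          # inverted dropout

x = np.ones(100)
layer = TinyDropoutLayer()

noisy_a, noisy_b = layer(x), layer(x)          # eval() forgotten: two different answers
layer.eval()
clean_a, clean_b = layer(x), layer(x)          # after eval(): deterministic

print(np.array_equal(noisy_a, noisy_b))        # False
print(np.array_equal(clean_a, clean_b))        # True
```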

4. Early stopping and data augmentation

The cheapest regularizer asks for no penalty and no extra code in the model. Just watch the validation loss while you train. It falls with the training loss for a while, reaches a minimum, then starts to rise even as training loss keeps dropping. That turning point is where the network stops learning the pattern and starts memorizing the training set. Early stopping means: keep the checkpoint from the validation minimum and throw away everything after it.
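In code the rule is a running minimum plus a stopping condition. The sketch below adds a `patience` parameter, which the lesson does not mention: it is a common way to decide that the minimum has really passed. The helper name and the loss values are made up.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Hypothetical helper: stop once `patience` epochs pass without a new minimum.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss   # a real loop would save a checkpoint here
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch              # give up; deploy the best_epoch checkpoint
    return best_epoch, len(val_losses) - 1        # ran out of epochs

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.68]
best_epoch, stop_epoch = early_stopping(val_losses, patience=3)
print(best_epoch, stop_epoch)   # 3 6
```

Training halts at epoch 6, but the model you keep is the one from epoch 3, the validation minimum.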

Why does stopping early act like a penalty on weights? Because weights start small near initialization and grow as training proceeds. Cutting training short caps how far they can travel from zero, which is much the same effect L2 produces by pulling them back. Less training time means a smaller effective weight budget.
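The growth of the weights is easy to watch in the simplest setting. This sketch assumes linear least squares trained by gradient descent from an all-zero start, with random data; a real network starts from small random weights and behaves less cleanly, so treat it as an illustration of the tendency.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = rng.normal(size=30)

# A step size below 1 / (largest singular value)^2 keeps gradient descent stable.
lr = 0.5 / np.linalg.norm(X, ord=2) ** 2

w = np.zeros(10)                       # start at the origin
norms = []
for step in range(200):
    grad = X.T @ (X @ w - y)           # gradient of 0.5 * ||Xw - y||^2
    w = w - lr * grad
    norms.append(np.linalg.norm(w))

print(norms[0], norms[9], norms[199])  # the norm grows as training continues
```

Stopping at step 10 instead of step 200 leaves you with a shorter weight vector, which is the sense in which training time acts as a budget.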

The second panel of the simulator below makes this concrete: drag the marker to the validation minimum and watch the deploy point land before the validation curve turns up.

Data augmentation takes the opposite route. Instead of shrinking the model, it grows the data. For images you apply label-preserving transforms, a horizontal flip, a small random crop, a color shift, so one labelled cat becomes dozens of slightly different cats that all still read as "cat". The model sees more variety than you collected, so it learns features that survive those transforms rather than pixel-exact templates. The handout's recipe is RandomHorizontalFlip plus RandomCrop(32, padding=4) plus ColorJitter on CIFAR-10.
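To make the transforms concrete, here is a NumPy stand-in for that recipe. These are not the torchvision implementations: the function names are illustrative, the crop pads with zeros, and `brightness_jitter` is a crude substitute for a full color jitter.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_horizontal_flip(img, prob=0.5):
    # Mirror left-right with probability `prob`; img is (height, width, channels).
    return img[:, ::-1, :] if rng.random() < prob else img

def random_crop(img, size=32, padding=4):
    # Zero-pad each spatial border, then cut a random size x size window.
    padded = np.pad(img, ((padding, padding), (padding, padding), (0, 0)))
    top = rng.integers(0, padded.shape[0] - size + 1)
    left = rng.integers(0, padded.shape[1] - size + 1)
    return padded[top:top + size, left:left + size, :]

def brightness_jitter(img, max_delta=0.2):
    # Crude stand-in for a color shift: scale all pixels by one random factor.
    factor = 1.0 + rng.uniform(-max_delta, max_delta)
    return np.clip(img * factor, 0.0, 1.0)

def augment(img):
    return brightness_jitter(random_crop(random_horizontal_flip(img)))

image = rng.random((32, 32, 3))                 # fake CIFAR-sized image in [0, 1]
views = [augment(image) for _ in range(8)]      # eight variants, one label
print(views[0].shape)                           # (32, 32, 3)
```

Every view keeps the original shape and label, so the training loop needs no other change.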

L2 and dropout constrain the model, data augmentation enlarges the dataset, and early stopping limits how long the model has to overfit. Different mechanisms, same target: close the gap between training and validation error.

5. Which regularizer, in what order

When you see 99% training accuracy and 72% validation accuracy, that gap is the symptom and you have a fixed playbook. The handout's order is worth memorizing because the exam asks for it directly.

The anti-overfitting playbook
  1. Data augmentation first: free training data, label-preserving transforms, almost always helps.
  2. Weight decay next: AdamW with $\lambda \approx 10^{-4}$ as a sane default.
  3. Early stopping: monitor validation loss, keep the best checkpoint.
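  4. Dropout with $p$ in the $0.3$ to $0.5$ range, typically after the fully connected layers. Check which convention your framework uses: in this lesson $p$ is the keep-probability, whereas PyTorch's nn.Dropout(p) takes the probability of dropping a unit.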
  5. Still overfitting? Get more data or use a smaller model.

Reach for the simulator now. With $\lambda = 0$ the fitted curve whips through every point, noise included: that is overfitting. Turn $\lambda$ up and the curve smooths toward the true trend, and the validation error traces out the U you read about in section 1 while the train error climbs. Push $\lambda$ too far and the curve flattens into a line that misses the trend entirely: that is underfitting. The sweet spot is the bottom of the validation U.

Notice the asymmetry the simulator shows. Train error keeps falling as you relax the penalty, because a wigglier curve always fits the given points better. Validation error does not: it bottoms out and rises. The whole job of regularization is to ignore what train error wants and chase the bottom of the validation curve.
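If you do not have the simulator to hand, a few lines of NumPy give a rough stand-in. Everything here is an assumption for illustration: the sine-plus-noise data, the degree-9 polynomial, the $\lambda$ grid, and the closed-form ridge fit, which scales the penalty as $\lambda\lVert w\rVert^2$ and not $\tfrac{\lambda}{2}\lVert w\rVert^2$. Train error can only rise as $\lambda$ grows; where validation error bottoms out depends on the data.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.3, size=n)   # made-up trend plus noise
    return x, y

def features(x, degree=9):
    return np.vander(x, degree + 1, increasing=True)      # columns 1, x, ..., x^degree

def ridge_fit(Phi, y, lam):
    # Minimizes ||Phi w - y||^2 + lam * ||w||^2 (the bias term is penalized too).
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

x_train, y_train = make_data(15)
x_val, y_val = make_data(200)
Phi_train, Phi_val = features(x_train), features(x_val)

lams = [1e-6, 1e-4, 1e-2, 1.0, 100.0]
train_mse, val_mse = [], []
for lam in lams:
    w = ridge_fit(Phi_train, y_train, lam)
    train_mse.append(np.mean((Phi_train @ w - y_train) ** 2))
    val_mse.append(np.mean((Phi_val @ w - y_val) ** 2))
    print(f"lambda {lam:g}: train {train_mse[-1]:.4f}  val {val_mse[-1]:.4f}")
```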

6. In NumPy: weight decay and inverted dropout

Both core regularizers fit in a few lines. Weight decay adds $\lambda w$ to the gradient before the step; inverted dropout masks and rescales during training and does nothing at test time.

regularizers.py · L2 adds to the gradient; dropout scales by 1/p at train time
import numpy as np

def sgd_step(w, grad, lr=0.1, lam=1e-4):
    # L2 / weight decay: penalty gradient is lam * w
    total_grad = grad + lam * w           # d/dw of (lam/2)*w^2 is lam*w
    return w - lr * total_grad            # == (1 - lr*lam)*w - lr*grad

def dropout_forward(x, p=0.5, train=True):
    # p = keep-probability. Inverted dropout: scale at TRAIN time.
    if not train:
        return x                          # test time: no change, no scaling
    mask = (np.random.rand(*x.shape) < p) / p   # keep w.p. p, then divide by p
    return x * mask                       # expected value stays the same

# check the math: with lr=0.1, lam=1e-4, the decay factor is
# (1 - lr*lam) = 1 - 1e-5, so each step shrinks w toward 0 by that factor
w = np.array([2.0, -3.0])
print(sgd_step(w, grad=np.zeros_like(w)))   # [1.99998 -2.99997]: pure decay

The print is the whole idea of weight decay in one line: with a zero data gradient, the only thing acting on the weights is the shrink factor $(1 - \eta\lambda)$, so they creep toward the origin. The dropout mask divides by $p$ as it zeros neurons, which is why the test-time pass needs no rescaling.

Primary source to read next: Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting" (read sections 1 to 4 for the ensemble argument, section 7 for practical settings), and the CS231n regularization notes for L2, L1, max-norm, and inverted dropout in code. For the memorization result, Zhang et al., 2017. Want to grind through the $220$ params-per-sample count, or why AdamW differs from Adam plus L2? Ask me and we will work both by hand.