Foundations · Week 3 · Session 2

Initialization, Normalization & Debugging

A network can have the right architecture and still never learn. Here is why, and the three fixes that make deep training work.

~20 min read · Exam weight: heavy on OA-1 Week 3 and the match-the-following viva · Builds on Backpropagation

Backprop sends the gradient backward as a product of per-layer factors. Multiply many numbers below 1 and the product races to zero, so the early layers stop learning. Multiply many numbers above 1 and it blows up to $\infty$. The same thing happens to the activations on the way forward. This lesson is about keeping that running product near 1: pick the starting weights so the variance holds steady (initialization), and keep re-centering activations as training drifts (normalization). When it still breaks, you debug it on purpose instead of by guessing.

1. The problem: signals vanish or explode with depth

Think of a deep network as a chain of relays. Each layer multiplies its input by a weight matrix and squashes it. The signal that reaches layer 30 is the input passed through 30 multiplications in a row. Whether it arrives intact depends entirely on the size of those multiplications.

Strip the network down to a single scalar weight $w$ per layer, repeated across $L$ layers. The output is $w^{L}$ times the input. With $L = 30$, that one expression tells the whole story:

$$\hat{y} \;\approx\; w^{L}\,x \qquad\Rightarrow\qquad \begin{cases} w=0.5: & 0.5^{30}\approx 9\times10^{-10} \;\;(\text{vanishes}) \\ w=1.5: & 1.5^{30}\approx 1.9\times10^{5} \;\;(\text{explodes}) \end{cases}$$

The deeplearning.ai initialization notes make this exact point: a constant factor of $0.5$ repeated across layers shrinks the signal as $0.5^{L-1}$, while a factor of $1.5$ grows it as $1.5^{L-1}$ (deeplearning.ai, AI Notes on Initialization). There is no safe constant other than something very close to 1, and even that only works by luck.
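The arithmetic is easy to reproduce. Here is a minimal sketch, assuming a 30-layer chain with one scalar weight per layer and no activation; the helper name `signal_after` is made up for this illustration.

```python
def signal_after(depth, w, x=1.0):
    """Pass x through `depth` scalar layers that each multiply by w."""
    for _ in range(depth):
        x = w * x
    return x

# Same depth, three different per-layer factors.
for w in (0.5, 1.0, 1.5):
    print(f"w = {w}: signal after 30 layers = {signal_after(30, w):.3e}")
```

Only the factor of exactly 1 leaves the signal where it started; the other two land roughly nine orders of magnitude below and five above it.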

⊳ First principles

The gradient and the forward activation are both running products along the depth of the network. A product stays stable only when its factors hover around 1. So the real design target has nothing to do with picking "small weights" or "large weights". It is to keep the variance of the activations roughly constant from one layer to the next. Initialization sets that variance correctly at step 0. Normalization re-enforces it at every step after.

This is the same multiplication that produced vanishing gradients in the backprop lesson. There, sigmoid's derivative capped at $0.25$ and $0.25^{20}\approx 10^{-12}$ killed the signal. Here the culprit is the weights rather than the activation slope, but the mechanism is identical: a product of many factors, each off from 1, decays or detonates.

2. Why you cannot start at zero (or at 0.01)

The obvious first guess is to set every weight to 0. It fails for a reason that has nothing to do with variance.

If every weight in a layer is identical, every neuron in that layer computes the same output. On the backward pass they all receive the same gradient, so they all get the same update, and they stay identical forever. CS231n states it plainly: "there is no source of asymmetry between neurons if their weights are initialized to be the same" (CS231n, Neural Networks 2). A 256-neuron layer collapses into one neuron repeated 256 times. You need random weights just to break that symmetry.
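The symmetry argument can be checked in a few lines. This is a sketch under stated assumptions: a toy two-layer tanh network, squared-error loss (gradients up to a constant factor), and a constant init of 0.5 rather than 0. With exact zeros and tanh the gradients are all zero and nothing moves at all, so a nonzero constant shows the "identical forever" behaviour more clearly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))      # 16 samples, 4 input features
y = rng.standard_normal((16, 1))      # regression targets

# Every weight starts at the same constant, so the 3 hidden units are identical.
W1 = np.full((4, 3), 0.5)
W2 = np.full((3, 1), 0.5)

for _ in range(100):                  # plain gradient descent
    h = np.tanh(x @ W1)               # hidden activations, shape (16, 3)
    err = h @ W2 - y                  # prediction error, shape (16, 1)
    grad_W2 = h.T @ err / len(x)
    grad_h = err @ W2.T               # gradient flowing back into the hidden layer
    grad_W1 = x.T @ (grad_h * (1 - h ** 2)) / len(x)
    W1 -= 0.1 * grad_W1
    W2 -= 0.1 * grad_W2

# The weights have moved, but the columns of W1 (one per hidden unit) are still copies.
print("weights changed:      ", not np.allclose(W1, 0.5))
print("units still identical:", np.allclose(W1[:, 0], W1[:, 1]) and np.allclose(W1[:, 1], W1[:, 2]))
```

Training happens, yet the three hidden units never diverge from each other, which is the collapse described above in miniature.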

So you reach for the next guess: small random numbers, W = 0.01 * np.random.randn(n_in, n_out). This breaks symmetry, and for a shallow net it is fine. For a deep one it quietly reproduces the vanishing problem. With weights this small, each layer shrinks the activation variance, so by layer 8 or 10 the activations are essentially zero. CS231n again: a layer "that has very small weights will during backpropagation compute very small gradients on its data, since this gradient is proportional to the value of the weights" (CS231n). The signal flowing backward dies just as surely as with sigmoid.

◆ Intuition

Zero init is broken because of symmetry. Tiny init (0.01) is broken because of variance decay. The fix has to do two jobs at once: be random (break symmetry) and have the right scale (hold variance steady). That second requirement is the whole content of Xavier and He.

3. Xavier and He: setting the variance on purpose

Take one neuron computing $s = \sum_{i=1}^{n} w_i x_i$ with $n$ inputs. Assume the $w_i$ and $x_i$ are independent, zero mean. Then the variance of the output is

$$\mathrm{Var}(s) \;=\; \sum_{i=1}^{n}\mathrm{Var}(w_i x_i) \;=\; n\,\mathrm{Var}(w)\,\mathrm{Var}(x).$$

For the output variance to equal the input variance, that prefactor $n\,\mathrm{Var}(w)$ has to be 1. So we need

$$\boxed{\;\mathrm{Var}(w) \;=\; \frac{1}{n}\;}$$

which you get with w = np.random.randn(n) / sqrt(n) (CS231n, variance derivation). This is the core of Xavier (Glorot) initialization. Glorot and Bengio wanted variance held steady on both the forward pass (which cares about $n_{\text{in}}$) and the backward pass (which cares about $n_{\text{out}}$), so they took the compromise

$$\mathrm{Var}(w) \;=\; \frac{2}{n_{\text{in}} + n_{\text{out}}}.$$

That is the formula the viva match-the-following asks for: Xavier is variance $2/(n_{\text{in}}+n_{\text{out}})$, built for linear, tanh, and sigmoid activations (Glorot & Bengio, 2010).
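The variance rule $\mathrm{Var}(s) = n\,\mathrm{Var}(w)\,\mathrm{Var}(x)$ can be checked numerically. The sketch below assumes unit-variance Gaussian inputs and Gaussian weights, and draws fresh weights for every trial; `preactivation_variance` is an illustrative helper, not something from the handout.

```python
import numpy as np

def preactivation_variance(std_w, n=256, trials=5000, seed=0):
    """Empirical Var(s) for s = sum_i w_i * x_i, with Var(x) = 1 and Var(w) = std_w**2."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((trials, n))
    w = rng.standard_normal((trials, n)) * std_w
    return (w * x).sum(axis=1).var()

n = 256
schemes = [
    ("tiny 0.01", 0.01),
    ("1/sqrt(n)", 1 / np.sqrt(n)),
    ("Xavier, n_in = n_out = n", np.sqrt(2 / (n + n))),
]
for name, std_w in schemes:
    measured = preactivation_variance(std_w, n=n)
    print(f"{name:>26}: measured {measured:.4f}, predicted n*Var(w) = {n * std_w ** 2:.4f}")
```

When $n_{\text{in}} = n_{\text{out}}$ the Xavier formula reduces to $1/n$, so the last two rows should agree; the tiny init comes out far below 1, which is the per-layer shrinkage from section 2.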

The ReLU correction: why He doubles it

Xavier assumes the activation is roughly linear near 0, so it passes variance through unchanged. ReLU does not. ReLU sets every negative input to 0, which throws away about half the distribution and so cuts the variance of the output roughly in half at every layer. Stack that halving over many layers and the activations collapse again, even with Xavier.

He et al. fixed it by putting the factor of 2 back. If ReLU halves the variance, double the weight variance to cancel it:

$$\boxed{\;\mathrm{Var}(w) \;=\; \frac{2}{n_{\text{in}}}\;}\qquad\Longrightarrow\qquad \texttt{w = np.random.randn(n) * sqrt(2.0/n)}$$

That is He (Kaiming) initialization, the standard for ReLU and its variants (He et al., 2015). The post-lecture handout puts it in one line: ReLU "kills half the values, halving the variance each layer; He init doubles the variance ($2/n_{\text{in}}$ vs $1/n_{\text{in}}$) to compensate."
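A quick numerical check of the halving argument, as a sketch assuming Gaussian pre-activations. Strictly, what ReLU halves is the second moment $E[a^2]$, which is the quantity that feeds the next layer's pre-activation variance when the weights are zero mean. The helper `next_layer_variance` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)        # pre-activations with variance 1
a = np.maximum(z, 0.0)                    # ReLU zeroes the negative half

halving_ratio = np.mean(a ** 2) / np.mean(z ** 2)
print(f"E[relu(z)^2] / E[z^2] = {halving_ratio:.3f}")

def next_layer_variance(std_w, n=512, batch=2000, seed=0):
    """Variance of the next layer's pre-activations when its inputs are ReLU outputs."""
    r = np.random.default_rng(seed)
    relu_out = np.maximum(r.standard_normal((batch, n)), 0.0)
    W = r.standard_normal((n, n)) * std_w
    return (relu_out @ W).var()

n = 512
print(f"Var(w) = 1/n: next-layer variance = {next_layer_variance(np.sqrt(1.0 / n)):.3f}")
print(f"Var(w) = 2/n: next-layer variance = {next_layer_variance(np.sqrt(2.0 / n)):.3f}")
```

With $\mathrm{Var}(w) = 1/n$ the next layer sees about half the variance it started with; with $2/n$ it is restored to about 1.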

▲ Exam-favourite numbers

Memorize the pairing. Xavier/Glorot: $\mathrm{Var}(w)=\dfrac{2}{n_{\text{in}}+n_{\text{out}}}$, for tanh and sigmoid. He/Kaiming: $\mathrm{Var}(w)=\dfrac{2}{n_{\text{in}}}$, for ReLU. The trap on the multi-correct MCQ: Xavier with ReLU still collapses because ReLU's halving is unaccounted for. The constant 0.5 vs 1.5 over depth ($0.5^{30}\approx 9\times10^{-10}$, $1.5^{30}\approx 1.9\times10^{5}$) is the worked example for "why constants do not work."

4. See it: activations across 8 layers

This is the live demo from the lecture and Exercise 1 from the handout: build a deep stack, push a random batch through it once with no training, and watch what the per-layer activation spread does. Pick an init scheme and an activation, and read the standard deviation at each layer.

Three patterns to look for. Zeros and tiny (0.01) collapse: each bar is shorter than the last until the activations are flat zero and nothing can learn. Too-large saturates or explodes: tanh pins to $\pm 1$, ReLU grows without bound. Xavier with tanh and He with ReLU hold a steady standard deviation all the way down, which is exactly the "stays flat" curve the handout asks you to find. Switch He onto tanh, or Xavier onto ReLU, and you can watch the mismatch slowly drift.
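If you do not have the demo in front of you, the experiment is short enough to rerun yourself. The sketch below is one way to set it up, not the handout's reference solution: the width of 256, the batch of 512, the "large" scale of 1.0, and the function name `layer_stds` are all assumptions chosen for illustration.

```python
import numpy as np

def layer_stds(init, activation, depth=8, width=256, batch=512, seed=0):
    """Push one random batch through `depth` layers and record each layer's activation std."""
    rng = np.random.default_rng(seed)
    scales = {
        "zeros": 0.0,
        "tiny": 0.01,
        "large": 1.0,
        "xavier": np.sqrt(2.0 / (width + width)),   # n_in = n_out = width
        "he": np.sqrt(2.0 / width),
    }
    act = {"tanh": np.tanh, "relu": lambda z: np.maximum(z, 0.0)}[activation]
    a = rng.standard_normal((batch, width))
    stds = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scales[init]
        a = act(a @ W)                              # forward only, no training
        stds.append(float(a.std()))
    return stds

pairs = [("tiny", "tanh"), ("xavier", "tanh"), ("he", "relu"), ("xavier", "relu"), ("large", "relu")]
for init, activation in pairs:
    row = " ".join(f"{s:9.3g}" for s in layer_stds(init, activation))
    print(f"{init:>6} + {activation}: {row}")
```

Read each row left to right as layers 1 to 8. The matched pairs should stay within the same order of magnitude, while the tiny and too-large rows move by orders of magnitude.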

5. Normalization: fix it during training too

Good initialization makes the variance correct at step 0. But the weights move during training, and after a few thousand updates the activation distribution at layer 10 can drift far from where it started. The early layers keep changing what they output, so every later layer is chasing a moving target. Normalization fixes the distribution back in place at every forward pass.

The operation is the z-score from statistics. For a pre-activation $z$, subtract the mean and divide by the standard deviation, then let the network learn a scale and shift back:

$$\hat{z} = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y = \gamma\,\hat{z} + \beta.$$

The $\gamma$ and $\beta$ are learnable parameters: normalization forces zero mean and unit variance, then $\gamma,\beta$ let the network undo that if it wants to. The whole question of BatchNorm vs LayerNorm is just: over which axis do you compute $\mu$ and $\sigma$?

Property	Batch Normalization	Layer Normalization
Normalizes over	the batch, per feature	the features, per sample
Statistics depend on	other samples in the batch	only this one sample
Needs batch size > 1	yes (breaks at batch size 1)	no
Train vs eval differ?	yes (running averages at test)	no
Home turf	CNNs, large batches, vision	Transformers, RNNs, variable length

BatchNorm normalizes each feature across the samples in the mini-batch, so its statistics borrow from the whole batch (Ioffe & Szegedy, 2015). LayerNorm normalizes each sample across its own features, so it is self-contained and does not care about batch size (Ba, Kiros & Hinton, 2016). That difference is why Transformers use LayerNorm: they run on variable-length sequences and often a batch size of 1 at inference, where BatchNorm's batch statistics would be meaningless.

▲ Exam-favourite numbers

The single most-tested BatchNorm fact: at inference, BatchNorm does not use the current batch. It uses the running mean and variance accumulated during training, switched on by model.eval(). Forget model.eval() and BatchNorm uses the test batch's own statistics, so your prediction for one image changes depending on which other images sit in the batch. Run the same input through 20 different batches and you get 20 slightly different answers. This is the bonus question in the handout's challenge problem.
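The train/eval difference can be reproduced without any framework. This is a simplified sketch, not how nn.BatchNorm1d is implemented: it fixes $\gamma = 1$, $\beta = 0$, tracks the biased batch variance, and uses a plain `training` flag in place of model.eval(). The class name `BatchNorm1D`, the momentum of 0.1, and the helper `probe_output` are assumptions made for this illustration.

```python
import numpy as np

class BatchNorm1D:
    """Sketch of BatchNorm with running statistics (gamma = 1, beta = 0 for brevity)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, x):
        if self.training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum                       # exponential moving average
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mu, var = self.running_mean, self.running_var
        return (x - mu) / np.sqrt(var + self.eps)

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=4)
for _ in range(200):                                # "training": accumulate running stats
    bn(rng.standard_normal((32, 4)) * 5 + 3)

probe = rng.standard_normal((1, 4)) * 5 + 3         # the one input we care about

def probe_output(layer, seed):
    """Place the probe in a batch with 31 other random samples; return the probe's output."""
    others = np.random.default_rng(seed).standard_normal((31, 4)) * 5 + 3
    return layer(np.vstack([probe, others]))[0]

bn.training = False                                 # eval mode: running statistics
eval_a, eval_b = probe_output(bn, 1), probe_output(bn, 2)
bn.training = True                                  # forgot to switch: batch statistics
train_a, train_b = probe_output(bn, 1), probe_output(bn, 2)

print("eval mode, same answer: ", np.allclose(eval_a, eval_b))
print("train mode, same answer:", np.allclose(train_a, train_b))
```

In eval mode the probe's output ignores its batch-mates; in training mode it shifts with them, which is exactly the symptom described above.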

6. Debugging a training run

When training fails, the worst thing you can do is change five things at once and rerun. The handout gives a fixed order, and it matches Karpathy's recipe almost line for line. Work it top to bottom.

The debugging checklist

Overfit one batch. Take a handful of examples, as few as two, and drive the loss to near zero. If you cannot memorize even those, the bug is in the model or the training loop, not the data (Karpathy, A Recipe for Training Neural Networks).
Check shapes. Print tensor shapes through one forward pass. A silent broadcast is a classic source of garbage gradients.
Verify the loss at init. A $k$-class softmax that guesses uniformly has loss $-\log(1/k)=\ln k$. For 10 classes that is $\ln 10 \approx 2.30$. If your initial loss is far from that, something is wrong before training even starts.
Gradient check. Compare the analytic gradient from backprop against a finite-difference estimate $\dfrac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}$. A relative difference around $10^{-7}$ means backprop is correct.
Train on a small subset, confirm it learns, then scale up.
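Two of the checks above, the loss at init and the gradient check, fit in a short NumPy sketch. It assumes a bare softmax cross-entropy on raw logits with no network behind it, and all function names are illustrative; in a real model you would run the same comparison on the parameters.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy for integer labels, computed from raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def analytic_grad(logits, labels):
    """Gradient of the mean loss with respect to the logits: (softmax - one_hot) / N."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    probs[np.arange(len(labels)), labels] -= 1.0
    return probs / len(labels)

def numeric_grad(f, x, eps=1e-5):
    """Centered finite differences, one coordinate at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        x[idx] = old                                # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
k = 10
labels = rng.integers(0, k, size=8)

# Loss at init: all-zero logits are a uniform guess, so the loss should equal ln k.
init_loss = softmax_cross_entropy(np.zeros((8, k)), labels)
print(f"initial loss {init_loss:.4f} vs ln k = {np.log(k):.4f}")

# Gradient check: analytic gradient against the finite-difference estimate.
logits = rng.standard_normal((8, k))
g_analytic = analytic_grad(logits, labels)
g_numeric = numeric_grad(lambda z: softmax_cross_entropy(z, labels), logits)
rel_err = np.abs(g_analytic - g_numeric).max() / np.maximum(np.abs(g_analytic), np.abs(g_numeric)).max()
print(f"max relative error: {rel_err:.2e}")
```

The finite-difference loop is slow by design, so run it on a tiny input only.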

Two failure signatures worth memorizing, because they map straight to a cause:

● Worked example: the 10-class classifier with initial loss 4.5

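Expected loss at init for 10 classes is $\ln 10 \approx 2.30$. You see $4.5$, almost exactly double. Unless the loss is being summed or scaled differently than you think, a value above $\ln k$ means the network is already confidently wrong, so its initial logits are not near zero. The usual culprits: the wrong loss function or reduction, labels that do not line up with the output indices, or an output layer initialized at too large a scale. Applying softmax twice (so cross-entropy runs on already-normalized inputs) is a related architecture bug, but it tends to pin the loss near $\ln k$ rather than raise it. The number tells you to look before the first gradient step, not at the learning rate.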

Loss goes to NaN. Three common causes and their fixes. (1) Exploding gradients: lower the learning rate or add gradient clipping. (2) Bad initialization: switch to He init. (3) Numerical overflow such as $\log(0)$ in the loss: add an $\epsilon$ or use a numerically stable cross-entropy. The order of the checklist is what saves the hours: each step rules out a whole class of bugs before you touch the next.
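Fixes (1) and (3) are each a few lines. The sketch below assumes a single example with raw logits and clipping by global norm; the function names are illustrative, and frameworks ship their own versions of both.

```python
import numpy as np

def naive_cross_entropy(logits, label):
    """Softmax then log: overflows as soon as a logit is large."""
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])

def stable_cross_entropy(logits, label):
    """Same quantity via log-sum-exp after subtracting the max logit."""
    shifted = logits - logits.max()                 # largest entry becomes 0
    return np.log(np.exp(shifted).sum()) - shifted[label]

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

logits = np.array([1000.0, 0.0, -1000.0])
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = naive_cross_entropy(logits, 0)          # exp(1000) is inf, inf / inf is nan
stable = stable_cross_entropy(logits, 0)
print("naive: ", naive)
print("stable:", stable)

grads = [np.full(3, 100.0), np.full(2, 100.0)]      # deliberately huge gradients
clipped = clip_by_global_norm(grads, max_norm=1.0)
clipped_norm = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print("norm after clipping:", round(clipped_norm, 6))
```

Clipping changes the length of the update but not its direction, which is why it is usually the first thing to try when the loss spikes.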

7. In NumPy: BatchNorm from scratch

Exercise 2 from the handout. Write the forward pass with no nn.BatchNorm1d, so the z-score and the learnable $\gamma,\beta$ are explicit. The only subtlety is the axis: for a (batch, features) tensor, the mean is taken over axis=0 (down the batch, per feature).

batchnorm.py: mean over the batch axis; gamma and beta are learned

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D)  N samples, D features. Normalize each feature over the batch.
    mu  = x.mean(axis=0)                 # (D,) one mean per feature
    var = x.var(axis=0)                  # (D,) one variance per feature
    x_hat = (x - mu) / np.sqrt(var + eps)  # z-score, zero mean & unit var
    out = gamma * x_hat + beta           # learnable scale and shift
    return out, (x_hat, mu, var)

def he_init(n_in, n_out):
    # variance 2 / n_in  ->  std = sqrt(2 / n_in). For ReLU layers.
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)

def xavier_init(n_in, n_out):
    # variance 2 / (n_in + n_out). For tanh / sigmoid layers.
    return np.random.randn(n_in, n_out) * np.sqrt(2.0 / (n_in + n_out))

# sanity check: a normalized feature has mean 0, std 1 (before gamma, beta)
x = np.random.randn(32, 64) * 5 + 3      # wide, off-center batch
out, _ = batchnorm_forward(x, gamma=1.0, beta=0.0)
print(out.mean(axis=0)[:3].round(4))     # ~ [0. 0. 0.]
print(out.std(axis=0)[:3].round(4))      # ~ [1. 1. 1.]

Swap axis=0 for axis=1 in both the mean and the variance, and add keepdims=True so the per-sample statistics broadcast back across the features, and you have written LayerNorm instead: same z-score, now per sample across features. That change of axis is the entire difference the exam keeps asking about.
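For comparison, here is the LayerNorm counterpart as a minimal sketch. It mirrors the BatchNorm function above; the name `layernorm_forward` and the test batch are illustrative. The keepdims=True matters: without it the per-sample statistics have shape (N,) and will not broadcast against an (N, D) input.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D)  N samples, D features. Normalize each sample over its own features.
    mu = x.mean(axis=1, keepdims=True)       # (N, 1) one mean per sample
    var = x.var(axis=1, keepdims=True)       # (N, 1) one variance per sample
    x_hat = (x - mu) / np.sqrt(var + eps)    # z-score along the feature axis
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 64)) * 5 + 3    # wide, off-center batch
out = layernorm_forward(x, gamma=1.0, beta=0.0)

# Each row (sample) is now normalized, regardless of the rest of the batch.
alone = layernorm_forward(x[:1], gamma=1.0, beta=0.0)
print("row means ~ 0:       ", np.allclose(out.mean(axis=1), 0.0, atol=1e-6))
print("row stds ~ 1:        ", np.allclose(out.std(axis=1), 1.0, atol=1e-3))
print("batch size 1 matches:", np.allclose(alone[0], out[0]))
```

The last line is the property that makes LayerNorm safe at batch size 1: a sample's output does not depend on what else is in the batch.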


Primary sources to read next: the two papers that made deep nets trainable, Glorot & Bengio (2010) for Xavier and He et al. (2015) for the ReLU correction, both short. For normalization read Sections 1 to 3 of Ioffe & Szegedy (2015). The practical companion is Karpathy's "A Recipe for Training Neural Networks" and the CS231n init notes. Stuck on why Xavier dies with ReLU, or on the axis=0 vs axis=1 split? Ask me and we will trace the variance through a few layers by hand.