Loss Functions & Optimization
Two questions finish the training loop: what number do we push down, and how do we use the gradient once we have it?
Backpropagation handed you a gradient $\partial L/\partial w$ for every weight. That is only half a training loop. You still have to choose what loss $L$ you are differentiating, and you still have to decide what step to take once the gradient points the way. Pick the wrong loss and the gradient vanishes exactly when the model is most wrong. Take the gradient at face value and you zig-zag across a narrow valley for a thousand steps. This lesson fixes both: the right loss for the job, and the optimizers (SGD, momentum, Adam) that turn a gradient into a smart step.
1. Backprop gave us gradients. Now what?
The update rule from last session was one line: $w \leftarrow w - \eta\,\partial L/\partial w$. Subtract a bit of the gradient, repeat. So why a whole lecture?
Because that one line hides two choices that decide whether training works at all.
- The loss $L$ shapes the gradient itself. A regression loss and a classification loss produce completely different gradients for the same prediction, and one of them can flatline to zero while the model is still badly wrong.
- The step is more than "go downhill". On a stretched-out loss surface the raw gradient points mostly across the valley, not along it. Take it literally and you oscillate. The optimizer is the part that fixes the direction and scales the step per parameter.
Picture the loss as a surface of hills and valleys, with the parameters as your position on it. CS231n puts it as hiking downhill blindfolded: you feel the slope under your feet and step the way that drops fastest. The gradient is that slope. Gradient descent rests on the fact that the steepest downhill direction is exactly $-\nabla L$ (CS231n, Optimization). Everything in this lesson is a better answer to "I felt the slope, now what step do I actually take?"
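To make the one-line update rule concrete, here is a minimal sketch on a made-up one-dimensional loss, $L(w) = (w-3)^2$. The loss, the starting point and the learning rate are illustrative assumptions, not values from the lesson.

```python
# Plain gradient descent on a hypothetical 1-D loss L(w) = (w - 3)^2.
def loss_grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

def gradient_descent(w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - lr * loss_grad(w)  # w <- w - eta * dL/dw
    return w

w_final = gradient_descent(w0=0.0)
print(w_final)  # very close to the minimum at w = 3
```

Each step shrinks the distance to the minimum by a factor of $1 - 2\eta = 0.8$, which is why a hundred steps are more than enough here.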
2. The loss is the grading rubric
A loss function turns "how wrong was that prediction?" into a single number you can minimize. The choice of rubric decides how harshly different mistakes get punished, and that is the whole game, because the gradient of the rubric is what trains the network.
For regression, where the target is a real number, the default is mean squared error:
$$L_{\text{MSE}} = \frac{1}{n}\sum_{i}(\hat{y}_i - y_i)^2$$
Squaring does two things. It makes every error positive so they cannot cancel, and it punishes big misses far more than small ones (an error of 4 costs 16, an error of 2 costs 4). Its gradient with respect to a single prediction is clean: $\partial L/\partial \hat{y}_i = \tfrac{2}{n}(\hat{y}_i - y_i)$, which for a single example ($n=1$) is just $2(\hat{y} - y)$, proportional to the error. Far off, large push. Close, small push.
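A small NumPy sketch of the MSE formula and its per-prediction gradient. The function names and the three sample values are assumptions chosen to reproduce the "error of 4 costs 16, error of 2 costs 4" arithmetic.

```python
import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def mse_grad(y_hat, y):
    # d/d y_hat_i of (1/n) * sum (y_hat - y)^2  =  (2/n) * (y_hat_i - y_i)
    return 2.0 * (y_hat - y) / y_hat.size

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([3.0, 2.0, -1.0])   # errors: 2, 0, -4

loss = mse(y_hat, y)                 # (4 + 0 + 16) / 3
g = mse_grad(y_hat, y)               # proportional to each error
print(loss)
print(g)
```

The prediction that is off by 4 receives twice the gradient of the one that is off by 2, and the exact prediction receives none.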
For classification, where the target is a class and the model outputs probabilities, the default is cross-entropy. With one true class and predicted probabilities $\hat{y}$ from a softmax, the loss for one example is just the negative log of the probability the model gave the correct class (CS231n, Linear classification):
$$L_{\text{CE}} = -\sum_{k} y_k \log \hat{y}_k \;=\; -\log \hat{y}_{\text{correct}}, \qquad \hat{y}_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$$
The softmax on the right squashes raw scores $z_k$ (logits) into probabilities that sum to 1. The $-\log$ in front is the punishment curve. Get the right class confidently right ($\hat{y} \to 1$) and the loss goes to $-\log 1 = 0$. Get it confidently wrong ($\hat{y} \to 0$) and the loss shoots toward infinity.
True class is index 1, one-hot $[0,1,0]$, model says $[0.2, 0.7, 0.1]$. The loss only looks at the correct class: $L = -\log(0.7) \approx 0.36$. The other two probabilities never enter the sum because their $y_k$ is 0. This is the exact OA-1 question, and the trap answers are $-\log(0.2)$ and $-\log(0.1)$, which read off the wrong slot.
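The worked example above can be checked in a few lines. This is a sketch with assumed helper names (`softmax`, `cross_entropy`); the max-shift inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift logits for stability; probabilities are unchanged
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, true_index):
    # only the probability of the correct class enters the loss
    return -np.log(probs[true_index])

probs = np.array([0.2, 0.7, 0.1])
loss = cross_entropy(probs, true_index=1)
print(round(loss, 2))                        # 0.36

# the two trap answers read off the wrong slot
print(round(cross_entropy(probs, 0), 2))     # 1.61
print(round(cross_entropy(probs, 2), 2))     # 2.3

# softmax maps logits back to the same probabilities
recovered = softmax(np.log(probs))
```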
3. Why cross-entropy beats MSE for classification
You could, in principle, train a classifier with MSE on top of a sigmoid. People did. It trains slowly and sometimes stalls, and the reason is a short piece of calculus worth memorizing.
Take binary classification: a sigmoid output $\hat{y} = \sigma(z)$ and a label $y \in \{0,1\}$. The pre-activation is $z$. We care about $\partial L/\partial z$, because that is what flows back into the weights. Recall the sigmoid derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$, which peaks at $0.25$ and collapses to near zero when the neuron saturates.
MSE path. With $L = (\hat{y}-y)^2$, the chain rule gives
$$\frac{\partial L}{\partial z} = 2(\hat{y}-y)\,\sigma'(z).$$
That $\sigma'(z)$ factor is the problem. Suppose the network is confidently wrong: $y=1$ but $\hat{y}=0.02$. The error $(\hat{y}-y)$ is almost $-1$, big. But $\sigma'(z) = 0.02 \times 0.98 \approx 0.0196$, tiny. Their product, $2 \times (-0.98) \times 0.0196 \approx -0.038$, is tiny too. The gradient nearly vanishes precisely when the model is most wrong, so it barely corrects.
Cross-entropy path. With binary cross-entropy $L = -y\log\hat{y} - (1-y)\log(1-\hat{y})$, the same chain rule produces an exact cancellation (cross-entropy derivation):
$$\frac{\partial L}{\partial z} = \hat{y} - y.$$
The $\sigma'(z)$ cancels. The gradient is just the error. Confidently wrong now means a gradient near $\pm 1$, a strong correction. The same cancellation holds for softmax with categorical cross-entropy: the gradient at the logits is $\hat{y} - y$ (Goodfellow et al., Ch. 6.2).
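For the binary case the cancellation takes two lines. A sketch of the steps, using only the binary cross-entropy and the sigmoid derivative $\sigma'(z) = \hat{y}(1-\hat{y})$ already defined above:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}$$

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}}\,\sigma'(z) = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}\cdot \hat{y}(1-\hat{y}) = \hat{y}-y$$

The denominator produced by the logarithm is exactly the sigmoid derivative, so the factor that starves the MSE gradient divides out.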
Model predicts $\hat{y}=0.02$ for a positive example ($y=1$). Cross-entropy loss $L = -\log(0.02) \approx 3.9$. MSE loss $(1-0.02)^2 = 0.96$. Cross-entropy assigns a far bigger penalty to a confident mistake. Remember the log values too: $-\log(0.01) \approx 4.6$, $-\log(0.99) \approx 0.01$. And the sigmoid cap: $\sigma'(z) \le 0.25$.
As the prediction gets more confidently wrong, the two gradients diverge. The cross-entropy gradient grows toward magnitude 1; the MSE gradient gets crushed by $\sigma'$.
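The divergence can be tabulated directly from the two formulas for $\partial L/\partial z$. This is a minimal sketch; the function names and the chosen values of $z$ are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_mse(z, y):
    # dL/dz for L = (y_hat - y)^2 with y_hat = sigmoid(z)
    y_hat = sigmoid(z)
    return 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat)

def grad_bce(z, y):
    # dL/dz for binary cross-entropy: the sigmoid derivative has cancelled
    return sigmoid(z) - y

y = 1.0  # positive example; more negative z means more confidently wrong
for z in [0.0, -2.0, -4.0, -6.0]:
    print(f"z={z:5.1f}  y_hat={sigmoid(z):.4f}  "
          f"dMSE/dz={grad_mse(z, y):+.4f}  dBCE/dz={grad_bce(z, y):+.4f}")
```

At $z=-6$ the cross-entropy gradient is about $-0.998$ while the MSE gradient is about $-0.005$, roughly two hundred times weaker.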
4. Batch, stochastic, mini-batch
Now the second choice: how to use the gradient. The honest gradient is the average over the whole training set. Computing it is batch gradient descent: one update per full pass. Accurate, but for a dataset of millions you wait forever between steps.
The other extreme is stochastic gradient descent (SGD): estimate the gradient from a single random example and update immediately. Cheap and fast, but each estimate is noisy, so the path jitters.
Mini-batch is the practical middle. Average the gradient over a small batch (32 to 256 examples) and update. Each estimate is a good approximation of the full gradient because training data has redundancy, and a batch of a few hundred is close to the full average while being vastly cheaper (CS231n, Optimization). Batch sizes are usually powers of 2 (32, 64, 128, 256) so the hardware stays busy.
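A sketch of the batch versus mini-batch trade-off on synthetic data. The dataset ($y = 3x$ plus noise), the one-parameter model, the batch size of 64 and the learning rate are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: y = 3x + noise, model y_hat = w * x
X = rng.normal(size=1024)
Y = 3.0 * X + 0.1 * rng.normal(size=1024)

def grad_w(w, x, y):
    # gradient of mean((w*x - y)^2) with respect to w
    return np.mean(2.0 * (w * x - y) * x)

w = 0.0
full = grad_w(w, X, Y)                        # batch gradient: all 1024 examples
batches = rng.permutation(X.size).reshape(-1, 64)   # 16 disjoint mini-batches of 64
mini = np.array([grad_w(w, X[b], Y[b]) for b in batches])

print(full)          # the full-dataset gradient
print(mini[:4])      # noisy estimates scattered around it
print(mini.mean())   # equal-size disjoint batches average back to the full gradient

# One epoch of mini-batch SGD: 16 cheap updates instead of 1 expensive one
w_sgd, lr = 0.0, 0.1
for b in batches:
    w_sgd -= lr * grad_w(w_sgd, X[b], Y[b])
print(w_sgd)         # already near the true slope of 3 after a single pass
```

Batch gradient descent would have made exactly one update in the same pass over the data.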
The noise in mini-batch SGD is not purely a cost. A noisy gradient can knock the optimizer out of a bad sharp minimum and toward a flatter one that tends to generalize better. That is part of why very large batches, which give an almost noise-free gradient, can actually hurt test accuracy even though each epoch runs faster.
5. Momentum: roll, don't step
Here is the failure that motivates everything after plain SGD. Picture a loss surface shaped like a long narrow valley: steep walls in one direction, a gentle slope along the floor. The exam draws this as elongated concentric ellipses. The negative gradient mostly points across the valley toward the nearest wall, not along it toward the minimum. So plain SGD bounces wall to wall, making only tiny net progress down the floor. It zig-zags.
Momentum fixes this by giving the optimizer memory. Instead of stepping by the current gradient, accumulate a velocity that is an exponentially decaying sum of past gradients (an exponential moving average up to a constant factor), and step by the velocity (Ruder, 2016):
$$v_t = \beta\,v_{t-1} + \eta\,\nabla L, \qquad w \leftarrow w - v_t, \qquad \beta = 0.9.$$
Along the valley floor the gradient keeps pointing the same way, so those contributions add up and the velocity builds, like a ball rolling downhill picking up speed. Across the valley the gradient flips sign every step, so consecutive contributions cancel in the average and the side-to-side bouncing gets damped. The oscillation shrinks; the forward roll grows.
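The build-up and the cancellation can both be seen by feeding the velocity rule two synthetic gradient streams. This is a sketch; the unit-size gradients and $\eta = 0.1$ are assumptions picked to keep the arithmetic readable.

```python
def final_velocity(grads, lr=0.1, beta=0.9):
    v = 0.0
    for g in grads:
        v = beta * v + lr * g   # v_t = beta * v_{t-1} + eta * grad
    return v

along = [1.0] * 100                          # same sign every step (valley floor)
across = [(-1.0) ** t for t in range(100)]   # sign flips every step (valley walls)

v_along = final_velocity(along)
v_across = abs(final_velocity(across))

print(v_along)    # approaches lr / (1 - beta) = 1.0, ten times a single raw step of 0.1
print(v_across)   # approaches lr / (1 + beta), about 0.053, roughly half a raw step
```

With $\beta = 0.9$ the consistent direction is amplified about tenfold and the alternating direction is roughly halved, a ratio of about 19 to 1 in favour of the valley floor.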
With decay $\beta = 0.9$, the velocity remembers roughly the last $1/(1-\beta) = 10$ gradients. That same effective-window math is the engine inside both momentum and the squared-gradient average in RMSProp. If you remember one number for exponential moving averages, remember that $\beta=0.9$ means a window of about 10.
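The "window of about 10" comes from the geometric weights $\beta^k$ that a gradient from $k$ steps ago carries. A quick sketch to check the numbers (the cutoff of 200 terms is an arbitrary assumption, large enough for the tail to be negligible):

```python
beta = 0.9

# Weight carried by the gradient from k steps ago: beta**k
weights = [beta ** k for k in range(200)]

total = sum(weights)                  # approaches 1 / (1 - beta) = 10
recent = sum(weights[:10]) / total    # share of the total held by the latest 10 gradients

print(total)
print(recent)   # about 0.65
```

So the window is soft rather than sharp: the latest 10 gradients hold about 65% of the total weight, and older ones fade out geometrically.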
6. Adam and adaptive rates
Momentum fixes direction. It still uses one global learning rate for every parameter, which is like one shoe size for every foot. Some parameters see large gradients and want small steps; others see tiny gradients and want large steps.
AdaGrad and RMSProp handle this by scaling each parameter's step by its own gradient history. Track the squared gradient over time (a cumulative sum in AdaGrad, a running average in RMSProp), then divide the step by its square root. A parameter with consistently large gradients gets a smaller effective step; a quiet parameter gets a larger one (Ruder, 2016).
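A minimal RMSProp-style step, to show the per-parameter scaling at work. The function name, the hyperparameters (`lr=0.01`, `beta=0.9`, `eps=1e-8`) and the two gradient values are assumptions for illustration, not settings from the lesson.

```python
import numpy as np

def rmsprop_step(p, s, g, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * g * g       # running average of the squared gradient
    p = p - lr * g / (np.sqrt(s) + eps)     # divide each coordinate by its own RMS
    return p, s

p = np.array([1.0, 1.0])
s = np.zeros(2)
g = np.array([100.0, 0.01])    # hypothetical: one loud parameter, one quiet one

p_new, s = rmsprop_step(p, s, g)
step = p - p_new
print(step)   # both coordinates move by nearly the same amount
```

The gradients differ by a factor of $10^4$, yet the two steps come out almost identical, because each one is divided by the size of its own gradient.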
Adam combines both ideas plus a correction. It keeps a first moment $m_t$ (momentum) and a second moment $v_t$ (RMSProp's squared-gradient average):
$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$
Both start at zero, so early in training they are biased toward zero. Adam corrects that bias, then steps:
$$\hat{m}_t = \frac{m_t}{1-\beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{\,t}}, \qquad w \leftarrow w - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$
Read the last update as momentum in the numerator and per-parameter scaling in the denominator. The three pieces to name on the exam are momentum, RMSProp, and bias correction (Kingma & Ba, 2015).
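What bias correction buys is easiest to see at $t=1$, where both moments have just left zero. The sketch below computes the first Adam step with and without the correction; the helper name and the gradient values are assumptions.

```python
import numpy as np

def adam_first_step(g, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    t = 1
    m = (1 - b1) * g            # m_1, starting from m_0 = 0
    v = (1 - b2) * g * g        # v_1, starting from v_0 = 0
    mhat = m / (1 - b1 ** t)    # recovers g
    vhat = v / (1 - b2 ** t)    # recovers g^2
    biased = lr * m / (np.sqrt(v) + eps)           # step without bias correction
    corrected = lr * mhat / (np.sqrt(vhat) + eps)  # Adam's actual first step
    return biased, corrected

g = np.array([100.0, 0.01])     # hypothetical gradients of very different size
biased, corrected = adam_first_step(g)
print(corrected)   # about lr in every coordinate, whatever the gradient scale
print(biased)      # about 3.16 * lr: the uncorrected first step overshoots
```

With the correction, the first step has magnitude close to $\eta$ in every coordinate. Without it, the ratio $(1-\beta_1)/\sqrt{1-\beta_2} \approx 3.16$ inflates the early steps.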
Adam defaults: $\beta_1 = 0.9$ (momentum), $\beta_2 = 0.999$ (squared gradient), $\epsilon = 10^{-8}$, and $\eta = 0.001$. The instructor's rule: start with Adam at $\eta = 0.001$; it works for about 90% of problems. Adam = momentum + RMSProp + bias correction.
On the same elongated valley the three optimizers behave differently: SGD zig-zags, momentum smooths the path, and Adam adapts its step per direction.
7. Learning rate schedules: start large, end small
A fixed learning rate forces one compromise for all of training. Early on you want big steps to cover ground quickly. Late on you want small steps to settle into the minimum without bouncing past it. A schedule changes $\eta$ over time so you get both.
Cosine annealing decays the rate smoothly from $\eta_{\max}$ to $\eta_{\min}$ along half a cosine curve, with no extra hyperparameter for when to drop it (Loshchilov & Hutter, 2017):
$$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\!\frac{\pi t}{T}\right)$$
At $t=0$ this gives $\eta_{\max}$; at $t=T$ it gives $\eta_{\min}$. The flat tail near the end lets the optimizer settle precisely. Warmup does the opposite at the start: ramp the rate up from near zero over the first 1 to 5% of training, so the first few random-gradient steps do not blow up. Warmup matters most for Transformers and large batches.
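The schedule formula translates almost directly into code. This is a sketch under assumptions: the function name `lr_at` is made up, and the linear shape of the warmup ramp is one common choice, since the text only says the rate ramps up from near zero.

```python
import math

def lr_at(t, total_steps, eta_max=1e-3, eta_min=0.0, warmup_steps=0):
    """Learning rate at step t: optional linear warmup, then cosine annealing."""
    if t < warmup_steps:
        # linear ramp (assumed shape): reaches eta_max on the last warmup step
        return eta_max * (t + 1) / warmup_steps
    # cosine phase, with T counted from the end of warmup
    T = total_steps - warmup_steps
    progress = (t - warmup_steps) / T
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))

# pure cosine annealing: eta_max at t = 0, halfway at t = T/2, eta_min at t = T
print(lr_at(0, 1000), lr_at(500, 1000), lr_at(1000, 1000))

# with 5% warmup the rate starts small, peaks at step 50, then decays
print(lr_at(0, 1000, warmup_steps=50), lr_at(50, 1000, warmup_steps=50))
```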
A colleague will tell you "Adam has momentum and adapts per parameter, so I don't need a schedule." That misses what a schedule does. Adam's second moment rescales the step per parameter; it never shrinks the global step over time. A schedule does exactly that: explore broadly early, settle precisely late. The two solve different problems, which is why Adam plus cosine annealing is a common pairing.
8. In NumPy: SGD, momentum, Adam in one place
The whole optimizer story is a few lines of arithmetic on the gradient buffer. Here is each rule on the handout's test problem, the quadratic $f(x,y) = x^2 + 10y^2$, whose stretched bowl makes plain SGD zig-zag as soon as the learning rate is too large for the steep $y$ direction (here, $\eta > 0.05$).
```python
import numpy as np

# f(x, y) = x^2 + 10 y^2  ->  grad = [2x, 20y]. The 20 vs 2 makes the bowl elongated.
def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

def sgd(p, lr=0.05):
    # Raw step. Each step scales y by (1 - 20*lr), so y flips sign (zig-zags)
    # once lr > 0.05; at exactly lr = 0.05 it lands on y = 0 in one step.
    return p - lr * grad(p)

def momentum(p, v, lr=0.05, beta=0.9):
    v = beta * v + lr * grad(p)          # velocity = decaying sum of past gradients
    return p - v, v                      # consistent direction builds up

def adam(p, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    # t is the 1-based step count; t = 0 would divide by zero in the bias correction
    g = grad(p)
    m = b1 * m + (1 - b1) * g            # 1st moment (momentum)
    v = b2 * v + (1 - b2) * g * g        # 2nd moment (squared grad)
    mhat = m / (1 - b1**t)               # bias correction
    vhat = v / (1 - b2**t)
    return p - lr * mhat / (np.sqrt(vhat) + eps), m, v

# run momentum from the handout's start (5, 3)
p, v = np.array([5.0, 3.0]), np.zeros(2)
for _ in range(50):
    p, v = momentum(p, v)
print(p)  # within about 0.5 of the minimum at [0, 0]; the remaining oscillation keeps shrinking
```
The only difference between the three functions is how they turn grad(p) into a step. SGD uses it raw. Momentum accumulates it over time. Adam averages it and divides by the square root of a running average of its square (the uncentered second moment, not a true variance), per coordinate.
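To compare the three rules side by side, the sketch below runs each one for 50 steps from $(5, 3)$ and counts how often the $y$ coordinate changes sign, a simple proxy for zig-zagging. The `run_*` helpers, the sign-flip counter and the extra learning rate of 0.09 are assumptions added for the comparison. On this quadratic each plain SGD step multiplies $y$ by $1 - 20\eta$, so it only flips sign when $\eta > 0.05$.

```python
import numpy as np

def f(p):
    return p[0] ** 2 + 10 * p[1] ** 2

def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

def run_sgd(p, lr, steps):
    path = [p]
    for _ in range(steps):
        p = p - lr * grad(p)
        path.append(p)
    return np.array(path)

def run_momentum(p, lr, steps, beta=0.9):
    v = np.zeros_like(p)
    path = [p]
    for _ in range(steps):
        v = beta * v + lr * grad(p)
        p = p - v
        path.append(p)
    return np.array(path)

def run_adam(p, lr, steps, b1=0.9, b2=0.999, eps=1e-8):
    m, v = np.zeros_like(p), np.zeros_like(p)
    path = [p]
    for t in range(1, steps + 1):          # t starts at 1 for the bias correction
        g = grad(p)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        p = p - lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        path.append(p)
    return np.array(path)

def y_sign_flips(path):
    # number of consecutive steps where y crosses zero
    y = path[:, 1]
    return int(np.sum(y[:-1] * y[1:] < 0))

start = np.array([5.0, 3.0])
runs = {
    "sgd lr=0.05": run_sgd(start, 0.05, 50),
    "sgd lr=0.09": run_sgd(start, 0.09, 50),
    "momentum lr=0.05": run_momentum(start, 0.05, 50),
    "adam lr=0.05": run_adam(start, 0.05, 50),
}
for name, path in runs.items():
    print(f"{name:18s} final f = {f(path[-1]):.4g}  y sign flips = {y_sign_flips(path)}")
```

At $\eta = 0.09$ plain SGD flips the sign of $y$ on every step, while at $\eta = 0.05$ it does not zig-zag at all. How the optimizers rank depends on the learning rate and the step budget, so treat the printed numbers as one data point rather than a verdict.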