Backpropagation & Computational Graphs
How a network with a million knobs figures out which way to turn each one.
A modern network is just a giant function with millions of tunable numbers (weights). Training means nudging every one of them to make the loss a little smaller. The only question is: for each weight, which direction is "smaller", and by how much? That's a partial derivative, $\partial L / \partial w$. Backpropagation is the trick that computes all million of them in a single backward sweep, not one at a time.
1.The problem: a million knobs
Suppose you tried the obvious thing. To find how the loss changes when you wiggle one weight, nudge it by a tiny $\epsilon$, re-run the whole network, and see how much the loss moved:
$$\frac{\partial L}{\partial w_i} \approx \frac{L(w_i + \epsilon) - L(w_i)}{\epsilon}$$
This works, but it costs one full forward pass per weight. With 10 million weights, that's 10 million forward passes for a single training step. Useless.
The network computed its output by composing simple operations: multiply, add, squash, repeat. The chain rule says the derivative of a composition is the product of the derivatives of its pieces. So if we already know each piece's local derivative, we never need to re-run anything. We just multiply along the chain. Backprop is the chain rule, organized.
2.The chain rule turns one hard problem into many easy ones
Say the loss $L$ depends on a weight $w$ only through a sequence of intermediate quantities ($w \to z \to h \to \hat{y} \to L$). The chain rule decomposes the gradient into a product of local gradients:
$$\frac{\partial L}{\partial w} \;=\; \underbrace{\frac{\partial L}{\partial \hat{y}}}_{\text{loss}}\cdot \underbrace{\frac{\partial \hat{y}}{\partial h}}_{\text{activation}}\cdot \underbrace{\frac{\partial h}{\partial z}}_{\text{activation}}\cdot \underbrace{\frac{\partial z}{\partial w}}_{\text{linear}}$$
Every factor on the right is something you can write down in one line: the derivative of a single operation. The fearsome $\partial L/\partial w$ is just their product.
3.Computational graphs make it mechanical
Write any computation as a directed acyclic graph (DAG): nodes are operations, edges carry values. The forward pass evaluates left to right and stores every intermediate. The backward pass walks right to left; at each node it multiplies the gradient arriving from downstream (the upstream gradient) by that node's local gradient, and passes the result further back.
Two rules are all of backprop:
- Through a node: upstream gradient × local gradient.
- At a fork (a value used in two places): add the gradients coming back from both paths.
Let's make it concrete with the example from the handout, $f = (x+y)(x-y)$. Set $p = x+y$ and $q = x-y$, so $f = p\cdot q$. Drag the inputs and flip to the backward view:
Notice what the simulator shows at $x=3,\,y=1$: the backward pass gives $\partial f/\partial x = 6$ and $\partial f/\partial y = -2$. Both $x$ and $y$ fork into two paths (into $p$ and into $q$), so their gradients are the sum of what flows back along each path. That single "add at a fork" rule is exactly how a weight that influences the output through many neurons gets its gradient.
Why must we store the forward values? Look at a multiply node $f = a\cdot b$: its local gradients are $\partial f/\partial a = b$ and $\partial f/\partial b = a$: the other input. To send gradients back through a multiply, you need the values that flowed forward through it. No stored forward pass, no backward pass.
4.The whole algorithm is four steps
Every network, from a 2-neuron toy to GPT, trains with the same loop:
- Forward pass: run inputs through the network to get the prediction $\hat{y}$, storing intermediates.
- Loss: compute $L(\hat{y}, y)$, a single number measuring "how wrong".
- Backward pass: sweep right to left, producing $\partial L/\partial w$ for every weight.
- Update: step each weight downhill, $w \leftarrow w - \eta\,\dfrac{\partial L}{\partial w}$.
The handout's worked example is a 2-2-1 MLP with $\hat{y}=0.594$ against a target $y=1.0$, and every gradient came out negative. Here's the intuition test: why does that make sense? The prediction is below the target, so the network is "too low". A negative gradient $\partial L/\partial w < 0$ means increasing $w$ decreases the loss, so gradient descent will raise the weights, pushing $\hat{y}$ up toward $1.0$. The signs are telling the network which way to climb.
5.Gradient flow decides what can learn
Because backprop multiplies local gradients along the chain, the depth of a network is a double-edged sword. If each layer's local gradient is a bit less than 1, the product shrinks toward zero as it travels back. The early layers receive almost no signal and effectively stop learning. That's the vanishing gradient problem.
The classic culprit is the sigmoid, whose derivative maxes out at just $0.25$. The hero is ReLU, whose derivative is exactly $1$ for positive inputs, so gradients pass through unscaled. Slide the depth up and compare:
Sigmoid: $0.25^{10}\approx 10^{-6}$; $0.25^{20}\approx 10^{-12}$, essentially zero: the first layer never moves off its random init. ReLU: $1^{10}=1$, so gradient flows unchanged. This one fact is why ReLU made very deep networks trainable.
6.In NumPy: a tiny autograd node
Every "do I really get it?" check comes down to writing the two rules (multiply-through and add-at-a-fork) as code. Here is the heart of a micrograd-style Value, with the multiply node's backward closure made explicit:
import numpy as np
class Value:
def __init__(self, data, _children=()):
self.data = data # forward value (we MUST store it)
self.grad = 0.0 # dL/d(this), accumulates
self._backward = lambda: None
self._prev = set(_children)
def __mul__(self, other):
out = Value(self.data * other.data, (self, other))
def _backward():
# local grad of a*b is the OTHER input; chain with upstream out.grad
self.grad += other.data * out.grad # += handles forks
other.grad += self.data * out.grad
out._backward = _backward
return out
def __add__(self, other):
out = Value(self.data + other.data, (self, other))
def _backward():
self.grad += 1.0 * out.grad # local grad of a+b is 1
other.grad += 1.0 * out.grad
out._backward = _backward
return out
def backward(self):
topo, seen = [], set()
def build(v):
if v not in seen:
seen.add(v); [build(c) for c in v._prev]; topo.append(v)
build(self)
self.grad = 1.0 # dL/dL = 1
for v in reversed(topo): # right-to-left sweep
v._backward()
Check it against the simulator: with x=Value(3), y=Value(1), and f=(x+y)*(x-y), calling f.backward() gives x.grad == 6 and y.grad == -2. The += is doing the fork-summation for you.
7.Quick check
Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.