Foundations · Week 2 · Session 3

Backpropagation & Computational Graphs

How a network with a million knobs figures out which way to turn each one.

~18 min read Exam weight: heavy on OA-1 and the Final Builds on Perceptron → MLP

A modern network is just a giant function with millions of tunable numbers (weights). Training means nudging every one of them to make the loss a little smaller. The only question is: for each weight, which direction is "smaller", and by how much? That's a partial derivative, $\partial L / \partial w$. Backpropagation is the trick that computes all million of them in a single backward sweep, not one at a time.

1.The problem: a million knobs

Suppose you tried the obvious thing. To find how the loss changes when you wiggle one weight, nudge it by a tiny $\epsilon$, re-run the whole network, and see how much the loss moved:

$$\frac{\partial L}{\partial w_i} \approx \frac{L(w_i + \epsilon) - L(w_i)}{\epsilon}$$

This works, but it costs one full forward pass per weight. With 10 million weights, that's 10 million forward passes for a single training step. Useless.

⊳ First principles

The network computed its output by composing simple operations: multiply, add, squash, repeat. The chain rule says the derivative of a composition is the product of the derivatives of its pieces. So if we already know each piece's local derivative, we never need to re-run anything. We just multiply along the chain. Backprop is the chain rule, organized.

2.The chain rule turns one hard problem into many easy ones

Say the loss $L$ depends on a weight $w$ only through a sequence of intermediate quantities ($w \to z \to h \to \hat{y} \to L$). The chain rule decomposes the gradient into a product of local gradients:

$$\frac{\partial L}{\partial w} \;=\; \underbrace{\frac{\partial L}{\partial \hat{y}}}_{\text{loss}}\cdot \underbrace{\frac{\partial \hat{y}}{\partial h}}_{\text{activation}}\cdot \underbrace{\frac{\partial h}{\partial z}}_{\text{activation}}\cdot \underbrace{\frac{\partial z}{\partial w}}_{\text{linear}}$$

Every factor on the right is something you can write down in one line: the derivative of a single operation. The fearsome $\partial L/\partial w$ is just their product.

The hard global question ("how does this deep-buried weight affect the final loss?") becomes a chain of easy local questions, each about a single operation.

3.Computational graphs make it mechanical

Write any computation as a directed acyclic graph (DAG): nodes are operations, edges carry values. The forward pass evaluates left to right and stores every intermediate. The backward pass walks right to left; at each node it multiplies the gradient arriving from downstream (the upstream gradient) by that node's local gradient, and passes the result further back.

Two rules are all of backprop:

Through a node: upstream gradient × local gradient.
At a fork (a value used in two places): add the gradients coming back from both paths.

Let's make it concrete with the example from the handout, $f = (x+y)(x-y)$. Set $p = x+y$ and $q = x-y$, so $f = p\cdot q$. Drag the inputs and flip to the backward view:

Notice what the simulator shows at $x=3,\,y=1$: the backward pass gives $\partial f/\partial x = 6$ and $\partial f/\partial y = -2$. Both $x$ and $y$ fork into two paths (into $p$ and into $q$), so their gradients are the sum of what flows back along each path. That single "add at a fork" rule is exactly how a weight that influences the output through many neurons gets its gradient.

◆ Intuition

Why must we store the forward values? Look at a multiply node $f = a\cdot b$: its local gradients are $\partial f/\partial a = b$ and $\partial f/\partial b = a$: the other input. To send gradients back through a multiply, you need the values that flowed forward through it. No stored forward pass, no backward pass.

4.The whole algorithm is four steps

Every network, from a 2-neuron toy to GPT, trains with the same loop:

The backprop loop

Forward pass: run inputs through the network to get the prediction $\hat{y}$, storing intermediates.
Loss: compute $L(\hat{y}, y)$, a single number measuring "how wrong".
Backward pass: sweep right to left, producing $\partial L/\partial w$ for every weight.
Update: step each weight downhill, $w \leftarrow w - \eta\,\dfrac{\partial L}{\partial w}$.

The handout's worked example is a 2-2-1 MLP with $\hat{y}=0.594$ against a target $y=1.0$, and every gradient came out negative. Here's the intuition test: why does that make sense? The prediction is below the target, so the network is "too low". A negative gradient $\partial L/\partial w < 0$ means increasing $w$ decreases the loss, so gradient descent will raise the weights, pushing $\hat{y}$ up toward $1.0$. The signs are telling the network which way to climb.

5.Gradient flow decides what can learn

Because backprop multiplies local gradients along the chain, the depth of a network is a double-edged sword. If each layer's local gradient is a bit less than 1, the product shrinks toward zero as it travels back. The early layers receive almost no signal and effectively stop learning. That's the vanishing gradient problem.

The classic culprit is the sigmoid, whose derivative maxes out at just $0.25$. The hero is ReLU, whose derivative is exactly $1$ for positive inputs, so gradients pass through unscaled. Slide the depth up and compare:

▲ Exam-favourite numbers

Sigmoid: $0.25^{10}\approx 10^{-6}$; $0.25^{20}\approx 10^{-12}$, essentially zero: the first layer never moves off its random init. ReLU: $1^{10}=1$, so gradient flows unchanged. This one fact is why ReLU made very deep networks trainable.

6.In NumPy: a tiny autograd node

Every "do I really get it?" check comes down to writing the two rules (multiply-through and add-at-a-fork) as code. Here is the heart of a micrograd-style Value, with the multiply node's backward closure made explicit:

micrograd_core.pyforward stores values; backward multiplies & accumulates

import numpy as np

class Value:
    def __init__(self, data, _children=()):
        self.data = data            # forward value (we MUST store it)
        self.grad = 0.0             # dL/d(this), accumulates
        self._backward = lambda: None
        self._prev = set(_children)

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # local grad of a*b is the OTHER input; chain with upstream out.grad
            self.grad  += other.data * out.grad   # += handles forks
            other.grad += self.data  * out.grad
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad  += 1.0 * out.grad          # local grad of a+b is 1
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v); [build(c) for c in v._prev]; topo.append(v)
        build(self)
        self.grad = 1.0                            # dL/dL = 1
        for v in reversed(topo):                   # right-to-left sweep
            v._backward()

Check it against the simulator: with x=Value(3), y=Value(1), and f=(x+y)*(x-y), calling f.backward() gives x.grad == 6 and y.grad == -2. The += is doing the fork-summation for you.

7.Quick check

Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.

Primary source to read next: Karpathy's micrograd (a 100-line autograd engine; you just saw its core) and the CS231n backprop notes. For visual intuition, 3Blue1Brown's Backpropagation calculus. Stuck on the fork-summation or the 2-2-1 numbers? Ask me and we'll grind through every gradient by hand.