Perceptron to MLP
How one straight line grows into a network that can carve any shape.
A single neuron draws one straight line and sorts the world into two halves. That is the whole model, and it is exactly logistic regression: a weighted sum $z = \mathbf{w}\cdot\mathbf{x} + b$ squashed by an activation. The trouble is that some patterns cannot be split by any straight line. The smallest such pattern, XOR, sank the field for seventeen years. The escape is almost embarrassingly small: add one hidden layer, and a line becomes a region.
1.One neuron is logistic regression with a new name
A neuron takes a vector of inputs, multiplies each by a weight, adds a bias, and passes the result through an activation. Two lines of math:
$$z = \mathbf{w}\cdot\mathbf{x} + b = \sum_i w_i x_i + b, \qquad a = \phi(z)$$
If $\phi$ is the sigmoid, $a$ is a probability and you have built logistic regression. The weights say how much each feature pushes toward class 1, and the bias shifts the threshold. Nothing here is new from Classical ML. The new name (neuron) only matters once you start stacking these units.
The geometry is the part to hold onto. The set of inputs where $z = 0$ is a flat boundary: a line in 2D, a plane in 3D, a hyperplane in general. On one side the neuron fires high, on the other side low. One neuron, one straight cut [CS231n].
Why a weighted sum at all? Because a dot product $\mathbf{w}\cdot\mathbf{x}$ measures alignment: it is largest when the input points the same way as the weight vector. So a neuron is asking one question, "how much does this input look like my weight pattern?", and the bias sets how strong that match has to be before the neuron commits to "yes".
2.Activations and why they exist
The activation $\phi$ does two jobs. It can squash an unbounded sum into a useful range, and it bends the function so a network can be more than a line. Four show up on every exam.
| Activation | Formula | Range | Note |
|---|---|---|---|
| Step (Heaviside) | $1$ if $z>0$ else $0$ | $\{0,1\}$ | Rosenblatt's 1958 perceptron used this; not differentiable, so no gradient. |
| Sigmoid | $\sigma(z)=\dfrac{1}{1+e^{-z}}$ | $(0,1)$ | Reads as a probability. Saturates at both ends. |
| Tanh | $\tanh(z)=2\sigma(2z)-1$ | $(-1,1)$ | Zero-centered, which helps the next layer. Still saturates. |
| ReLU | $\max(0,z)$ | $[0,\infty)$ | The default for hidden layers. Cheap, and its gradient does not vanish for $z>0$. |
Sigmoid and tanh share a flaw that matters once you go deep. Out at the tails, the curve goes flat, so its slope is near zero. The sigmoid's slope tops out at just $0.25$, right at $z=0$, and falls off fast on either side [CS231n]. A flat slope means a small gradient, which is the seed of the vanishing-gradient problem you meet in lesson 3.
ReLU sidesteps that. For any positive input its slope is exactly $1$, so the gradient passes through untouched. Krizhevsky and colleagues measured a roughly 6x speedup in convergence over tanh on ImageNet, and ReLU has been the hidden-layer default since [CS231n]. The cost is the dying-ReLU failure: if a neuron's input stays negative for every example, its output and its gradient are both stuck at zero, and it never recovers. With a careless learning rate, as much as 40% of a network can go dead this way [CS231n].
Pick the activation by where it sits. Output layer of a binary classifier: sigmoid, because you want a probability in $(0,1)$. Hidden layers: ReLU, because you want gradients to flow and you do not care that the value is unbounded. The handout's one-line rule is exactly this.
3.Without a non-linearity, depth buys nothing
Here is the result that explains why activations are not optional. Stack two linear layers with no activation between them. Layer one computes $\mathbf{h}=W_1\mathbf{x}$, layer two computes $\mathbf{y}=W_2\mathbf{h}$. Substitute:
$$\mathbf{y} = W_2(W_1\mathbf{x}) = (W_2 W_1)\,\mathbf{x} = W'\mathbf{x}$$
The two matrices collapse into one matrix $W'$ [CS231n]. A hundred linear layers is still one linear map, still one straight boundary. Matrix multiplication is associative, so you can always fold the product down to a single matrix. Depth without non-linearity is a very expensive way to do logistic regression.
Drop a non-linear $\phi$ between the layers and the collapse breaks. $W_2\,\phi(W_1\mathbf{x})$ cannot be rewritten as one matrix times $\mathbf{x}$, so the second layer now operates on a bent version of the input. That bend is what lets the network represent boundaries no single line could.
4.The forward pass: multiply, add, activate, repeat
An MLP is layers of neurons, where each layer's output feeds the next. Running an input through to a prediction is the forward pass, and it is the same three steps at every layer:
$$\mathbf{z}^{(\ell)} = W^{(\ell)}\mathbf{a}^{(\ell-1)} + \mathbf{b}^{(\ell)}, \qquad \mathbf{a}^{(\ell)} = \phi\!\left(\mathbf{z}^{(\ell)}\right)$$
with $\mathbf{a}^{(0)} = \mathbf{x}$. Multiply by the weight matrix, add the bias vector, apply the activation, hand the result to the next layer. That loop, repeated layer by layer, is all a network does to predict.
The exam likes to test the bookkeeping. A fully connected layer from $n_{\text{in}}$ inputs to $n_{\text{out}}$ outputs has a weight matrix of shape $n_{\text{out}}\times n_{\text{in}}$, so $n_{\text{in}}\cdot n_{\text{out}}$ weights, plus one bias per output, giving $n_{\text{out}}$ biases.
OA-1 asks these straight. A layer with 512 inputs and 256 outputs has $512\times 256 = 131072$ weights and $256$ biases. A 256-64-10 MLP has $256\times 64 + 64\times 10 = 17024$ weights (biases excluded). For a 3-4-2 net counting biases too: $(3\cdot 4 + 4\cdot 2)$ weights $+ (4+2)$ biases $= 20 + 6 = 26$ parameters. ReLU of a negative number is $0$; Leaky ReLU with $\alpha=0.01$ at $z=-4$ is $-0.04$.
5.XOR: the four points one line cannot split
XOR outputs 1 when exactly one input is on. Four points, two classes:
| $x_1$ | $x_2$ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Plot them. The two 1-class points $(0,1)$ and $(1,0)$ sit on one diagonal; the two 0-class points $(0,0)$ and $(1,1)$ sit on the other. No single straight line can put the two diagonals on opposite sides, because each diagonal threads between the other's points. XOR is not linearly separable, so a single perceptron, which can only draw a line, cannot solve it [Wikipedia].
In 1969, Minsky and Papert proved exactly this in their book Perceptrons. The proof was narrow (it was about a single layer) but the field read it as a verdict on neural networks in general, funding dried up, and the workable fix, hidden layers trained by backpropagation, did not arrive until 1986. A seventeen-year gap over a problem a child solves by inspection.
Try to beat it yourself in the simulator below. On Single perceptron you get three sliders for one line. You will always strand at least one point on the wrong side.
Flip the simulator to the MLP. It uses Goodfellow's exact two-ReLU solution: hidden weights $W=\begin{bmatrix}1&1\\1&1\end{bmatrix}$, hidden bias $\mathbf{c}=[0,-1]$, output weights $\mathbf{w}=[1,-2]$, output bias $b=0$, so $f(\mathbf{x}) = \mathbf{w}^\top\max\{0,\,W^\top\mathbf{x}+\mathbf{c}\}$. The four points map to outputs $0,1,1,0$: exactly XOR [Goodfellow 6.1].
6.One hidden layer turns the line into a region
The hidden layer does not classify XOR. It reshapes the input space so that a later line can. Each hidden neuron draws its own line; the output neuron combines them.
Trace Goodfellow's construction by hand. Hidden unit $h_1$ computes $\max(0,\,x_1+x_2)$ and unit $h_2$ computes $\max(0,\,x_1+x_2-1)$. Run the four points through:
| $(x_1,x_2)$ | $h_1=\max(0,x_1{+}x_2)$ | $h_2=\max(0,x_1{+}x_2{-}1)$ | $y=h_1-2h_2$ |
|---|---|---|---|
| $(0,0)$ | $0$ | $0$ | $0$ |
| $(0,1)$ | $1$ | $0$ | $1$ |
| $(1,0)$ | $1$ | $0$ | $1$ |
| $(1,1)$ | $2$ | $1$ | $0$ |
Watch what happened in hidden space. The two middle points $(0,1)$ and $(1,0)$ both land at the same hidden coordinate $(1,0)$. The ReLU folded the input so the troublesome points stack on top of each other, and now a single line through hidden space separates the classes [Goodfellow 6.1]. The output neuron just reads off that line.
This generalizes. The Universal Approximation Theorem (Cybenko 1989, Hornik 1991) says one hidden layer with enough units and a non-linear activation can approximate any continuous function on a closed, bounded region to any accuracy [Wikipedia]. Two cautions the exam loves. The theorem promises a network exists, not that gradient descent will find it. And "enough units" can mean an exponentially wide layer, which is the practical reason we add depth instead of width.
- Neuron: $a=\phi(\mathbf{w}\cdot\mathbf{x}+b)$. One straight boundary. Sigmoid version is logistic regression.
- Activation: the non-linearity. Without it, stacked layers collapse to one matrix and depth is wasted.
- Forward pass: multiply by $W$, add $\mathbf{b}$, apply $\phi$, repeat per layer.
- XOR: not linearly separable, so one perceptron fails. One hidden layer of 2 ReLU units reshapes the space and wins.
- Why deep: depth is how a network builds shapes out of lines without an impossibly wide single layer.
7.In NumPy: a neuron, then the XOR MLP
The check on understanding is writing it from scratch. First a single neuron, with the handout's exact numbers so you can verify by hand, then the two-ReLU XOR network whose outputs you just traced.
import numpy as np
def relu(z): return np.maximum(0.0, z)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))
# ---- one neuron: handout Exercise 1 -----------------------------------
# 3 inputs, 3 weights, a bias, ReLU. Verify the number by hand.
x = np.array([2.0, -1.0, 0.5])
w = np.array([0.5, -0.3, 0.8])
b = 0.1
z = w @ x + b # 1.0 + 0.3 + 0.4 + 0.1 = 1.8
print(relu(z)) # -> 1.8
# ---- the XOR MLP: Goodfellow's exact 2-ReLU solution ------------------
# f(x) = w2 . relu(W1 @ x + c) + b2, hidden layer = 2 ReLU units.
W1 = np.array([[1.0, 1.0], # row i = weights of hidden unit i
[1.0, 1.0]])
c = np.array([0.0, -1.0]) # hidden biases
w2 = np.array([1.0, -2.0]) # output weights
b2 = 0.0
def xor_net(x):
h = relu(W1 @ x + c) # 2 hidden activations
return w2 @ h + b2 # scalar output
X = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=float)
print([xor_net(p) for p in X]) # -> [0.0, 1.0, 1.0, 0.0] == XOR
The two middle rows of $X$ both produce hidden activation $[1,0]$, which is the fold that makes the classes separable. Strip out the ReLU (use the raw $W_1\mathbf{x}+c$) and the outputs become $0,-1,-1,-2$, a single linear map that cannot match XOR. The non-linearity is doing the work.
8.Quick check
Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.