Foundations · Week 1 · Session 1

Why deep learning?

When you stop designing features by hand and let the network grow its own.

~16 min read · Exam weight: opens OA-1 Week 1 and the match-the-following bank · Prereq: Classical ML, feature engineering

For decades, the hard part of machine learning was not the model. It was deciding what to feed it. An engineer would stare at images and hand-write code to measure edges, corners, and colour histograms, then hand those numbers to a classifier. That step is called feature engineering, and on raw images, audio, and text it hits a wall. Deep learning removes the step entirely: the network learns its own features straight from pixels. That one swap, from designing the representation to learning it, is what this whole field is built on.

1. The wall: hand-crafted features

A classical vision pipeline has two stages. First a fixed, human-written function turns the raw input into a vector of numbers: SIFT for keypoints, HOG for gradient histograms, bag-of-words for text. Then a classifier (an SVM, a random forest) learns to separate those vectors. The classifier trains; the feature extractor does not. It is frozen the moment the engineer ships it.

On a spreadsheet this works fine. Columns like age, income, and blood pressure already are meaningful features, so a tree or a linear model has everything it needs. The trouble starts with raw perception. A 224x224 colour image is about 150,000 raw numbers, and no single pixel means anything on its own. The signal lives in how pixels combine into edges, edges into shapes, shapes into objects. A human writing one fixed formula cannot anticipate every way a cat can sit, turn, or hide in shadow.
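The "about 150,000" figure is just height times width times colour channels. A quick check, assuming a standard three-channel (RGB) image:

```python
# Assumption: a colour image stored as 3 channels (RGB), one number per channel per pixel.
height, width, channels = 224, 224, 3

n_values = height * width * channels
print(n_values)  # 150528, i.e. "about 150,000"
```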

⊳ First principles

A hand-crafted feature is a guess, frozen in code, about what matters in the data. If the guess misses something, no amount of extra data can fix it: the classifier only ever sees the numbers the extractor chose to emit. The features become a ceiling. The model can climb up to them and no higher.

2. Why classical ML plateaus and deep nets do not

Here is the pattern Andrew Ng drew on a whiteboard so often it has a name, the scaling plot. Put accuracy on the y-axis and the size of the training set on the x-axis. A classical model on fixed features rises quickly with a little data, then flattens. A deep network starts lower (it needs more data to find its footing) but keeps rising as you pour in more examples, and eventually crosses over the classical curve.

The reason is the ceiling from section 1. Once the classifier has learned everything the frozen features can express, extra data tells it nothing new. The deep network has no such ceiling: its features are parameters, so more data improves the representation itself, not just the decision boundary. Drag the slider in the simulator and watch the crossover appear.

The crossover point is the whole argument for deep learning in one picture. Below it, with little data and good hand features, classical ML can win and is cheaper. Above it, on large messy datasets, the learned representation pulls ahead and stays ahead.
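To make the crossover concrete, here is a toy sketch. The curve shape and every constant in it (the two ceilings, the two `n_half` data scales) are invented for illustration and are not measured from any real model. Only the qualitative picture, a fast-rising low ceiling against a slow-rising high one, comes from the text.

```python
import numpy as np

def classical_acc(n, ceiling=0.80, n_half=300.0):
    # Hypothetical curve: learns fast, then saturates at the feature ceiling.
    return 0.5 + (ceiling - 0.5) * n / (n + n_half)

def deep_acc(n, ceiling=0.98, n_half=30000.0):
    # Hypothetical curve: needs far more data to get going, but its ceiling is higher.
    return 0.5 + (ceiling - 0.5) * n / (n + n_half)

sizes = np.logspace(2, 7, 51)          # training-set sizes from 100 to 10 million
c = classical_acc(sizes)
d = deep_acc(sizes)

# First grid point where the deep curve is ahead (about 5e4 with these made-up constants).
crossover = sizes[np.argmax(d > c)]
print(f"small data : classical={c[0]:.3f}  deep={d[0]:.3f}")
print(f"large data : classical={c[-1]:.3f}  deep={d[-1]:.3f}")
print(f"crossover near n={crossover:,.0f}")
```

Change the constants and the crossover moves, but with a low fast ceiling against a high slow one it always appears.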

◆ Intuition

Think of fixed features as a fixed-size cup. You can fill it, but you cannot make it bigger by adding water. A deep network is a cup that grows as you pour. The first few litres look wasteful, then it overtakes the small cup and never looks back.

3. Representation learning: features that build on features

What does a deep network actually learn inside? Stack layers, and each one transforms the output of the layer below. Train the whole stack end to end on a classification loss, and a hierarchy emerges on its own. Nobody programs it. The standard picture, confirmed by visualizing real trained networks, is a ladder of abstraction:

Layer 1 learns oriented edges and colour blobs, Gabor-like filters that resemble the receptive fields of simple cells in the primary visual cortex.
Middle layers combine edges into textures and simple motifs (corners, stripes, repeated patterns).
Later layers assemble motifs into object parts (an eye, a wheel, a door handle).
The top combines parts into whole objects (a face, a car, a dog).

This is why the slogan for deep learning is representation learning: the network discovers, at multiple levels of abstraction, the features a human used to write by hand. Step through the ladder in the second panel below to see each rung named.
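To see what a first-rung "oriented edge" filter computes, here is a minimal sketch. The kernel is a hand-written Sobel filter standing in for a learned one, and the 8x8 image and the helper `correlate2d_valid` are made up for this example. A trained network would arrive at kernels of this kind by gradient descent rather than having them typed in.

```python
import numpy as np

# Toy 8x8 grayscale image: dark left half, bright right half (one vertical edge).
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Hand-written vertical-edge kernel (Sobel), a stand-in for a learned layer-1 filter.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

def correlate2d_valid(image, k):
    # Slide the kernel over every position where it fits entirely inside the image.
    kh, kw = k.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

response = correlate2d_valid(img, kernel)
print(response[0])  # [0. 0. 4. 4. 0. 0.]: fires only where the window straddles the edge
```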

Two consequences fall out of this. First, the features transfer. Edge and texture detectors learned on everyday photos are reusable for medical scans, because low-level vision is shared across domains. Second, the network bends to the task end to end, so the representation is tuned for exactly the labels you care about, not a generic guess. The visualization work behind this picture is Zeiler and Fergus, 2013 (arxiv.org/abs/1311.2901).

Deep learning is feature engineering done by gradient descent instead of by a person.

4. The theory: a network can approximate anything

Why believe a stack of simple layers can express something as rich as "is this a cat"? The Universal Approximation Theorem gives a guarantee. A feedforward network with a single hidden layer, enough neurons, and a non-linear (non-polynomial) activation can approximate any continuous function on a compact set to any accuracy you want (Universal approximation theorem, Wikipedia).

Formally, for any continuous target $f$ on a closed bounded region and any tolerance $\varepsilon > 0$, there is a one-hidden-layer network $g$ with

$$g(x) \;=\; \sum_{i=1}^{N} c_i\,\sigma\!\left(w_i^\top x + b_i\right), \qquad \sup_x \,\lvert f(x) - g(x)\rvert < \varepsilon.$$

The non-linearity is the load-bearing part. Without $\sigma$, stacking layers just multiplies matrices, and a product of matrices is one matrix: the whole network collapses to a single linear map (affine, once the biases are included) and can only draw straight decision boundaries.
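The collapse is easy to verify numerically. A small sketch, with arbitrary layer sizes and random weights chosen only for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)   # layer 1: 3 inputs -> 5 units
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)   # layer 2: 5 units  -> 4 outputs
x = rng.normal(size=(10, 3))                           # 10 example inputs

# Two stacked layers with NO activation in between.
two_layer = (x @ W1.T + b1) @ W2.T + b2

# The single layer they are equivalent to.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = x @ W.T + b
print(np.allclose(two_layer, one_layer))   # True: depth bought nothing

# Put a ReLU between the layers and the equivalence breaks.
with_relu = np.maximum(x @ W1.T + b1, 0.0) @ W2.T + b2
print(np.allclose(with_relu, one_layer))   # False
```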

▲ Read the small print

The UAT is an existence result. It promises a good network exists. It says nothing about how many neurons you need (could be astronomically many), nothing about whether gradient descent will find those weights, and nothing about how much data you need to train it. A common exam trap is "the UAT proves neural nets beat SVMs." It proves no such thing. It only says a solution exists.
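A hedged numerical illustration of the one-hidden-layer form, not a proof. The assumptions are mine: the target is $\sin(3\pi x)$ on $[-1, 1]$, the hidden weights $w_i, b_i$ are drawn at random and left frozen, and only the output coefficients $c_i$ are fitted, by least squares. The error is measured on a finite grid, so it only stands in for the true supremum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 400)      # grid on the compact interval [-1, 1]
f = np.sin(3 * np.pi * x)            # continuous target (an arbitrary choice)

def sup_error(n_hidden):
    # Random, frozen hidden layer: an assumption made to keep the sketch short.
    w = rng.normal(scale=10.0, size=n_hidden)
    b = rng.uniform(-10.0, 10.0, size=n_hidden)
    H = np.tanh(np.outer(x, w) + b)              # sigma(w_i * x + b_i), shape (400, N)
    c, *_ = np.linalg.lstsq(H, f, rcond=None)    # best output coefficients c_i
    return np.max(np.abs(H @ c - f))             # worst-case error on the grid

for n in [2, 10, 50, 200]:
    print(f"N={n:>3}  max error on grid = {sup_error(n):.2e}")
```

The error should shrink as N grows, which is the flavour of the theorem. Notice what the sketch does not tell you: how large N must be for a harder target, or whether gradient descent would have found these coefficients.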

So if one hidden layer is already universal, why go deep? Because width and depth are not equal in cost. Many useful functions are compositional: an object is parts, parts are motifs, motifs are edges. A deep network mirrors that structure layer by layer and can represent such functions with far fewer neurons than a shallow one would need (Goodfellow, Bengio and Courville, Deep Learning, Ch. 1). Depth buys efficiency, not new theoretical power.

5. The moment it became real: AlexNet, 2012

The ideas above were known for years. What changed in 2012 was the proof on a hard, public benchmark. AlexNet, a deep convolutional network from Krizhevsky, Sutskever, and Hinton, entered the ImageNet competition (1000 classes, over a million labelled images) and won by a stunning margin.

▲ Exam-favourite numbers

AlexNet's top-5 error was 15.3% versus the runner-up's 26.2%, a gap of about 10.9 points, unheard of in that contest. The network had 8 layers (5 convolutional + 3 fully connected) and about 60 million parameters, trained on two GPUs (AlexNet, Wikipedia; Krizhevsky et al., 2012).

The exam-favourite follow-up is: why was this surprising, given that no single idea was new? The answer is the combination. Convolutional networks dated to the 1980s, ReLU and dropout had already been described, and GPUs existed. AlexNet put them together at the right scale: a deep architecture, ReLU to keep gradients from vanishing, dropout to fight overfitting, GPU training to make it tractable, and ImageNet to supply the data. No single part was the breakthrough. Integrating all of them was.
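Two of those ingredients fit in a few lines of NumPy. This is a sketch, not AlexNet's code. The inverted-dropout form (rescaling during training) is an assumption, chosen because it is the common modern variant, and the function names and sample inputs are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6.0, -2.0, 0.5, 2.0, 6.0])      # sample pre-activations

# Sigmoid's slope is at most 0.25 and shrinks toward 0 for large |z|,
# so multiplying many of these together across layers starves early layers of gradient.
sigmoid_grad = sigmoid(z) * (1.0 - sigmoid(z))

# ReLU's slope is exactly 1 wherever the unit is active, 0 otherwise.
relu_grad = (z > 0).astype(float)

def dropout(a, p_drop, rng, train=True):
    # Inverted dropout (assumed variant): zero each unit with probability p_drop
    # during training and rescale the survivors so the expected activation is unchanged.
    if not train:
        return a
    mask = rng.random(a.shape) >= p_drop
    return a * mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
print(sigmoid_grad.round(4))
print(relu_grad)
print(dropout(np.ones(8), p_drop=0.5, rng=rng))   # entries are either 0.0 or 2.0
```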

That win flipped the field. Within a few years almost every vision benchmark was led by a deep network, and the same recipe spread to speech and language.

6. When to use deep learning, and when not to

Deep learning is not the answer to everything, and the exam tests whether you know the boundary. The cleanest mental model is a 2x2 over two axes: how much data you have, and what type it is.

Data type	Small data	Large data
Tabular / structured	Classical ML (trees, linear) wins. Cheap and interpretable.	Gradient-boosted trees are often still the top choice.
Raw images / audio / text	Transfer learning: fine-tune a pre-trained net. Training from scratch overfits.	Deep learning territory. Train a deep net end to end.

The handout's worked case: you have 500 labelled chest X-rays. That is raw image data but tiny. Training a deep net from scratch would overfit hard, so the right move is classical ML on good features, or transfer learning by fine-tuning a network already trained on a large image set. Reach for deep learning when the data is both large and unstructured, or when you can borrow a pre-trained representation. Stay classical when data is small and tabular, when you need interpretability, or when compute is tight.
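The 2x2 can be written as a small lookup. The function name `recommend` and the `large_threshold` cut-off are hypothetical: the text does not give a number for where "small" ends and "large" begins, so treat 100,000 as a placeholder to replace with your own judgement.

```python
def recommend(data_type: str, n_labelled: int, large_threshold: int = 100_000) -> str:
    """Return the 2x2's suggestion. data_type is 'tabular' or 'raw' (images, audio, text)."""
    # large_threshold is a made-up placeholder, not a figure from the course.
    large = n_labelled >= large_threshold
    if data_type == "tabular":
        return "gradient-boosted trees" if large else "classical ML (trees, linear)"
    if data_type == "raw":
        if large:
            return "train a deep net end to end"
        return "transfer learning (fine-tune a pre-trained net)"
    raise ValueError("data_type must be 'tabular' or 'raw'")

# The worked case: 500 labelled chest X-rays.
print(recommend("raw", 500))   # transfer learning (fine-tune a pre-trained net)
```

The lookup leaves out the other two considerations named above, interpretability and compute budget, which can push you back to classical methods even in the deep learning quadrant.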

7. In NumPy: watch the feature ceiling appear

The plateau is easy to feel in code. Freeze a crude feature extractor, fit a linear classifier on top, and the accuracy stops climbing once the features run out of information, no matter how much data you add. (To stay short, the script scores the model on the same points it was fitted to.) A learnable model has no such hard cap.

feature_ceiling.py: fixed features cap accuracy; learned features do not

import numpy as np

# Toy 2D data: the real signal is the PRODUCT of the two inputs (an XOR-like rule).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # label depends on x0*x1

def fixed_features(X):
    # A frozen, hand-picked extractor that only keeps the raw coordinates.
    # It threw away the interaction term, so it can NEVER represent x0*x1.
    return X                                       # this is the ceiling

def fit_linear(F, y):
    # least-squares linear classifier on whatever features it is handed
    Fb = np.c_[F, np.ones(len(F))]                 # add a bias column
    w, *_ = np.linalg.lstsq(Fb, 2 * y - 1, rcond=None)
    return (Fb @ w > 0).astype(int)

for n in [200, 2000, 20000]:
    acc = (fit_linear(fixed_features(X[:n]), y[:n]) == y[:n]).mean()
    print(f"n={n:>6}  acc={acc:.3f}")             # stuck near 0.50: features are the cap

# Now hand the SAME model one extra feature, the interaction x0*x1,
# as a stand-in for a feature a network would learn on its own.
better = np.c_[X, X[:, 0] * X[:, 1]]
acc = (fit_linear(better, y) == y).mean()
print(f"with interaction feature  acc={acc:.3f}")  # jumps high: the ceiling moved

The first loop hovers near chance (about 0.50) and adding data does not help, because the frozen features physically cannot encode the rule. The moment the right feature is present, accuracy jumps. A deep network finds that interaction feature for you, by training, instead of waiting for an engineer to add it by hand.
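The script above still adds the interaction by hand. As a sketch of the last sentence, here is a tiny one-hidden-layer network trained by plain gradient descent on the same kind of data, with only the raw coordinates as input. The hidden size, learning rate, and step count are illustrative guesses, not tuned values, and accuracy is again measured on the training points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # same XOR-like rule as above

# One hidden layer of 16 tanh units (sizes and step size are assumptions).
W1 = rng.normal(size=(2, 16))
b1 = rng.normal(size=16)
w2 = rng.normal(size=16)
b2 = 0.0
lr = 0.3

for step in range(8000):
    H = np.tanh(X @ W1 + b1)                     # learned features, shape (2000, 16)
    p = 1.0 / (1.0 + np.exp(-(H @ w2 + b2)))     # predicted P(y = 1)
    d_out = (p - y) / len(y)                     # gradient of mean cross-entropy w.r.t. the logits
    d_H = np.outer(d_out, w2) * (1.0 - H ** 2)   # backpropagate through tanh
    w2 -= lr * (H.T @ d_out)
    b2 -= lr * d_out.sum()
    W1 -= lr * (X.T @ d_H)
    b1 -= lr * d_H.sum(axis=0)

acc = ((p > 0.5) == (y > 0.5)).mean()
print(f"learned features  acc={acc:.3f}")       # should land far above the 0.50 of the frozen features
```

Nobody told this network about x0*x1. The hidden layer moved its own weights until a useful feature appeared, which is the swap this whole session is about.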

8. Quick check

Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.

Primary source to read next: Goodfellow, Bengio and Courville, Deep Learning, Chapter 1 (the definitive intro to representation learning), and the original AlexNet paper (Krizhevsky et al., 2012). For the feature-ladder pictures, Zeiler and Fergus, Visualizing and Understanding Convolutional Networks. Want to grind the scaling-crossover or the AlexNet numbers until they stick? Ask me and we will walk the whole plot together.