Vision · Week 5 · Session 1

Transfer Learning & Applications

You have 300 photos and a model someone trained on 1.2 million. Here is how to borrow what it already knows.

~17 min read · Exam weight: match-the-following and the param-count derivation · Builds on CNN architectures & ResNet

A modern CNN like ResNet has tens of millions of weights, and the ImageNet model you can download for free was trained on about 1.2 million labelled images over weeks of GPU time. Your actual task has 300 flower photos. Training those millions of weights on 300 images would memorise the noise and generalise to nothing. Transfer learning is the fix: keep the pretrained weights, since they already encode edges, textures, and shapes, and only learn the small part that is specific to your task. The whole skill is deciding which layers to keep frozen, written as $\text{requires\_grad} = \text{False}$, and which to let move.

1. The problem: too many knobs, too few examples

Here is the trap, in numbers. ResNet-18 has about 11.7 million parameters. If you train all of them from scratch on a few hundred images, the model has far more free knobs than facts to constrain them. It fits the training set perfectly and the test set badly. That is textbook overfitting, and no amount of dropout fully saves you when the parameter-to-example ratio is that lopsided.
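To put a number on that ratio, here is a quick back-of-the-envelope check. It uses only the figures quoted above (about 11.7 million parameters, 300 images); the variable names are illustrative.

```python
# Back-of-the-envelope: free parameters per training example.
resnet18_params = 11_700_000   # approximate ResNet-18 size quoted above
n_images = 300                 # the small flower dataset from the intro

params_per_image = resnet18_params / n_images
print(f"{params_per_image:,.0f} parameters per training image")
```

That works out to roughly 39,000 adjustable weights for every photo you own, which is why the fit to the training set tells you so little.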

But notice something. The reason ImageNet models took weeks to train was not the classifier on top. It was learning good features: the convolutional filters that turn raw pixels into useful descriptions. Those filters are not specific to ImageNet's 1000 classes. An edge is an edge whether it borders a cat or a tumour.

⊳ First principles

A trained CNN is two things stacked: a feature extractor (the convolutional body that maps pixels to a vector) and a classifier head (the final layer that maps that vector to class scores). The expensive, reusable knowledge lives in the body. The head is cheap and task-specific. Transfer learning keeps the body and swaps the head. You inherit weeks of GPU compute for the price of training one new layer.
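The body-and-head split can be sketched in a few lines of NumPy. This is a toy stand-in, not a real network: the random arrays play the role of downloaded weights, and `pretrained`, `swap_head` and the 64-input body are hypothetical names and sizes chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
F = 512   # feature size; 512 matches the feature vector used later in this lesson

# Stand-in for a downloaded model: random numbers play the role of pretrained weights.
pretrained = {
    "body": rng.standard_normal((64, F)) * 0.05,    # toy body: 64 inputs -> F features
    "head": rng.standard_normal((F, 1000)) * 0.01,  # old head: F features -> 1000 classes
}

def swap_head(model, n_classes):
    """Keep the body as-is, drop the old head, attach a fresh zero-initialised one."""
    return {
        "body": model["body"],   # reused, neither copied nor re-initialised
        "head": np.zeros((model["body"].shape[1], n_classes)),
    }

flower_model = swap_head(pretrained, n_classes=5)
print(flower_model["head"].shape)   # (512, 5)
```

The new model shares the body array with the original and owns only a small new head, which is the whole idea in miniature.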

2. Early layers are generic, late layers are specific

Why can the body transfer at all? Because of what the layers learn, in order. Yosinski and colleagues studied this directly in 2014. The first convolutional layer of almost any vision network, no matter the dataset, converges to the same kind of filters: oriented edge detectors and colour blobs, close cousins of Gabor filters, the standard model of edge-selective cells in the visual cortex. These are general. Move deeper and the filters combine edges into textures, textures into parts, parts into object-like patterns. By the last layers, the features are tuned to separate the exact classes the model was trained on. Those are specific.

So transferability falls off with depth. An edge detector helps any vision task. A "this looks like an ImageNet terrier" detector helps only tasks that care about terriers. The same paper found a second, less obvious effect: layers in the middle are co-adapted, meaning a layer's features depend on the precise features below it. Cut the network in the middle and the halves no longer fit, which is its own source of trouble when you freeze part-way. The clean takeaway holds: freeze the generic early layers, and reconsider the specific late ones for your task.

Generality is a gradient, not a switch: layer 1 transfers to almost anything, the final layer transfers to almost nothing, and the useful decisions happen in between.

◆ Intuition

Think of a chef trained in French cooking who moves to an Italian kitchen. The knife skills, the heat control, the sense for when something is done, all transfer untouched. Only the recipes change. Early CNN layers are the knife skills. The classifier head is the recipe. You do not retrain the chef's hands to chop an onion.

3. Two strategies: feature extraction and fine-tuning

That split gives you exactly two moves, plus a dial between them.

Feature extraction. Freeze the entire pretrained body. Forward-pass your images through it to get feature vectors, then train only a fresh classifier head on top. In code this is one line of intent: set $\text{requires\_grad} = \text{False}$ on every backbone parameter so no gradients are computed for them in the backward pass. The body still runs forward and still produces features. It just never updates. Because the backward pass skips most of the network, this also trains faster: the PyTorch tutorial reports roughly half the wall-clock time on CPU.

Fine-tuning. Unfreeze some of the later layers and let them keep training, alongside the new head. You are not starting those layers from random weights; you are nudging already-good weights toward your task. This adapts the specific features when your task differs from ImageNet, but it needs more data to be safe, because you have unlocked more parameters.
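Partial fine-tuning can be sketched on a toy two-layer body. This is a minimal illustration under loud assumptions: the "pretrained" weights are random stand-ins, the data and labels are synthetic, and the sizes, the `trainable` flags and the learning rate are hypothetical choices. The early layer `W1` stays frozen while the late layer `W2` and the head train.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, F, C = 60, 20, 16, 8, 3   # toy sizes, chosen only for illustration

# Stand-ins for pretrained weights (random here; a real backbone would be loaded).
params = {
    "W1": rng.standard_normal((D, H)) * 0.3,   # early, generic layer
    "W2": rng.standard_normal((H, F)) * 0.3,   # late, more specific layer
    "W_head": np.zeros((F, C)),                # fresh classifier head
}
trainable = {"W1": False, "W2": True, "W_head": True}   # the freeze/unfreeze decision

X = rng.standard_normal((N, D))        # synthetic inputs
y = rng.integers(0, C, size=N)         # synthetic labels
Y = np.eye(C)[y]                       # one-hot labels

def forward(X):
    h1 = np.maximum(0.0, X @ params["W1"])
    h2 = np.maximum(0.0, h1 @ params["W2"])
    z = h2 @ params["W_head"]
    z = z - z.max(axis=1, keepdims=True)   # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    return h1, h2, p

def loss(p):
    return -np.log(p[np.arange(N), y]).mean()

W1_before = params["W1"].copy()
W2_before = params["W2"].copy()
loss_before = loss(forward(X)[2])

lr = 0.05
for _ in range(200):
    h1, h2, p = forward(X)
    dz = (p - Y) / N                               # softmax + cross-entropy gradient
    grads = {"W_head": h2.T @ dz}
    dh2 = (dz @ params["W_head"].T) * (h2 > 0)     # back through the late ReLU
    grads["W2"] = h1.T @ dh2
    # Backprop stops here: W1 is frozen, so its gradient is never computed.
    for name, g in grads.items():
        if trainable[name]:
            params[name] -= lr * g

loss_after = loss(forward(X)[2])
print(loss_before, loss_after)
```

After training, `W1` is bit-for-bit what it was, while `W2` has drifted from its starting values. Flipping `trainable["W2"]` to `False` turns the same loop back into plain feature extraction.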

▲ Exam-favourite numbers

Fine-tuning uses a much smaller learning rate on pretrained layers, typically 10x to 100x smaller than for the freshly initialised head. The PyTorch tutorial fine-tunes with SGD at lr $=0.001$, momentum $0.9$. The reason: a large step on pretrained weights wipes out the ImageNet knowledge in the first few updates. That failure has a name, catastrophic forgetting. A small step adapts the features gently instead of erasing them. This "differential learning rate" idea, smaller LR for earlier and more generic layers, is a standard exam match-the-following item.
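A differential learning rate is just one SGD rule applied with a different step size per parameter group. The sketch below assumes two groups named `backbone` and `head`, made-up gradients of equal size, and a 10x ratio; none of these values come from the tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real weights: random "pretrained" layer, zero-initialised head.
params = {
    "backbone": rng.standard_normal((8, 8)),
    "head": np.zeros((8, 5)),
}
# Hypothetical gradients from one mini-batch, identical in scale for both groups
# so that any difference in step size comes from the learning rate alone.
grads = {name: np.ones_like(w) for name, w in params.items()}

head_lr = 0.01
lrs = {"backbone": head_lr / 10, "head": head_lr}   # pretrained weights step 10x smaller

before = {name: w.copy() for name, w in params.items()}
for name in params:
    params[name] -= lrs[name] * grads[name]         # plain SGD, one rate per group

# Largest change to any single weight in each group.
steps = {name: float(np.abs(params[name] - before[name]).max()) for name in params}
print(steps)
```

With identical gradients, the backbone weights move a tenth as far as the head weights in this step, which is the gentle nudge that protects the pretrained features.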

4. The 2x2 rule: dataset size by domain similarity

How do you choose? CS231n's guidance reduces to two questions. Is your dataset small or large? Is your domain similar to or different from ImageNet's natural photos? That makes four cells.

| | Similar domain | Different domain |
| --- | --- | --- |
| Small data | Feature extraction. Freeze all, train a head. Fine-tuning would overfit. | Hard case. Freeze, but train the head on features from an earlier layer (less specific). |
| Large data | Fine-tune the whole network. Enough data to move the weights safely. | Fine-tune broadly, or even consider training from scratch since the gap is wide. |

Start with the diagonal. Small and similar, the handout's 300 flowers for 5 species, calls for feature extraction: 300 images are too few to fine-tune 11.7 million weights, and flowers are close enough to ImageNet that the frozen features already work. Large and different lets you fine-tune aggressively because the data can support it.

The off-diagonal "small and different" cell is the genuinely awkward one: think of an X-ray model built from a photo backbone. Frozen photo features are not ideal for X-rays, but you lack the data to fine-tune safely. The trick is to pull features from an earlier, more generic layer, where photos and X-rays still share edges, and accept that a frozen pretrained model on a far domain may simply underperform. When an ImageNet-pretrained model does poorly on X-rays, the phenomenon is called domain shift, and the simplest fix is to fine-tune on even a small labelled set from the target domain.
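The four cells can be written down as a lookup table, which is a handy way to drill them. This is a sketch only: `choose_strategy` is a hypothetical helper, and the `small_threshold` of 1000 images is an assumed cut-off for illustration, since the rule itself does not fix a number for "small".

```python
# The 2x2 rule as a lookup table. Keys: (small_dataset, similar_domain).
STRATEGY = {
    (True, True): "feature extraction: freeze the body, train a new head",
    (True, False): "freeze the body, train the head on features from an earlier layer",
    (False, True): "fine-tune the whole network",
    (False, False): "fine-tune broadly, or consider training from scratch",
}

def choose_strategy(n_images, similar_domain, small_threshold=1000):
    """Return the table's recommendation; small_threshold is an assumed cut-off."""
    return STRATEGY[(n_images < small_threshold, similar_domain)]

print(choose_strategy(300, similar_domain=True))       # the flower example
print(choose_strategy(300, similar_domain=False))      # the X-ray example
print(choose_strategy(500_000, similar_domain=False))  # large and different
```

In practice the boundary between "small" and "large" is a judgment call that depends on how many parameters you plan to unlock, so treat the threshold as a knob, not a fact.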

5. See the trade-off: freeze or fine-tune

The simulator below is a pretrained backbone drawn as a vertical stack, generic early layers at the bottom, task-specific layers and the new head at the top. Click a layer to lock or unlock it. Locked layers are frozen, unlocked layers train. Set the dataset size and the domain similarity, then read the recommendation, the count of trainable parameters, and the overfitting-risk note. Try to make the strategy panel agree with the 2x2 table above.

Notice the tension the readout makes concrete. Unlocking more layers adds millions of trainable parameters. With a small dataset that pushes the overfitting risk straight to high, which is why the recommendation flips to feature extraction. The numbers below the canvas are the whole argument of this lesson in one panel.
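The trainable-parameter count can also be derived by hand. The sketch below assumes the standard ResNet-18 layout (a 7x7 stem, four stages of two basic blocks each, bias-free convolutions followed by batch norm) and a 5-class head on 512 features. The helper names are hypothetical, and the simulator's own numbers may be rounded differently.

```python
def conv(c_in, c_out, k):
    return c_in * c_out * k * k   # ResNet convolutions carry no bias term

def bn(c):
    return 2 * c                  # one scale and one shift per channel

def basic_block(c_in, c_out, downsample):
    n = conv(c_in, c_out, 3) + bn(c_out) + conv(c_out, c_out, 3) + bn(c_out)
    if downsample:                # 1x1 projection on the skip path
        n += conv(c_in, c_out, 1) + bn(c_out)
    return n

# Parameter count per stage, listed from the generic bottom to the specific top.
stages = {
    "stem": conv(3, 64, 7) + bn(64),
    "layer1": basic_block(64, 64, False) + basic_block(64, 64, False),
    "layer2": basic_block(64, 128, True) + basic_block(128, 128, False),
    "layer3": basic_block(128, 256, True) + basic_block(256, 256, False),
    "layer4": basic_block(256, 512, True) + basic_block(512, 512, False),
}
new_head = 512 * 5 + 5            # 5 flower classes on a 512-dim feature vector

def trainable_params(unfrozen):
    """Head is always trained; add every stage named in `unfrozen`."""
    return new_head + sum(stages[name] for name in unfrozen)

print(f"head only:        {trainable_params([]):>12,}")
print(f"head + layer4:    {trainable_params(['layer4']):>12,}")
print(f"everything:       {trainable_params(list(stages)):>12,}")
```

Unlocking just the last stage takes the trainable count from 2,565 to roughly 8.4 million, because most of ResNet-18's weights sit in its final, widest stage. That is the jump the overfitting-risk note is reacting to.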

6. In NumPy: a frozen backbone and a trained head

You will not implement ResNet by hand, but the mechanics of freezing are short enough to see exactly. The point of "freeze" is that frozen layers run forward and produce features, but their weights get no gradient and never update. Only the head learns. Here is the loop, stripped to its bones.

transfer.py · frozen backbone, trained head
import numpy as np

# Illustrative sizes (assumed, not from a real model):
# a flattened 8x8 greyscale patch, 512 features, 5 classes.
D, F, C = 64, 512, 5

# A "pretrained" backbone: fixed weights we will NOT update.
# Maps a flattened image x (D-dim) to a feature vector (F-dim).
rng = np.random.default_rng(0)
W_backbone = rng.standard_normal((D, F)) * 0.05   # frozen: requires_grad = False

def backbone(x):
    # forward pass through frozen body; ReLU features
    return np.maximum(0.0, x @ W_backbone)        # shape (N, F)

# A fresh classifier head we DO train (F features -> C classes).
W_head = np.zeros((F, C))
b_head = np.zeros(C)

def softmax(z):
    # subtract the row max before exponentiating, for numerical stability
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(X, y_onehot, lr=0.1, epochs=200):
    global W_head, b_head
    feats = backbone(X)                            # backbone is FROZEN: computed once
    N = X.shape[0]
    for _ in range(epochs):
        probs = softmax(feats @ W_head + b_head)   # forward through head
        dz = (probs - y_onehot) / N                # softmax + cross-entropy gradient
        # gradients ONLY for the head; the backbone gets none
        W_head -= lr * (feats.T @ dz)
        b_head -= lr * dz.sum(axis=0)
    return W_head, b_head

Two things to read off the code. First, because the backbone is frozen, feats is computed once outside the loop, never recomputed, never updated: that is the speedup feature extraction buys you. Second, only W_head and b_head receive a gradient. For a 512-dimensional feature vector and 5 flower classes, the head has $512 \times 5 + 5 = 2565$ trainable parameters, against 11.7 million if you fine-tuned everything. That gap is why 300 images is plenty for the head and nowhere near enough for the full network.

7. Quick check

Four questions, rising difficulty. Answer, hit Check, read the why. The full bank lives in the Exam Center.

Primary source to read next: the CS231n transfer-learning notes for the 2x2 decision rule, and Yosinski et al. 2014, "How transferable are features in deep neural networks?" for the generic-to-specific layer evidence. For hands-on code, the PyTorch transfer-learning tutorial shows freezing with requires_grad = False. Want to grind the parameter-budget derivation or the 2x2 cells by hand? Ask me and we will work each case.