Reference

Glossary

Every term in the course, defined the way the exam expects, linked to the lesson where it lives.

Activation function: A non-linear function applied to a neuron's weighted sum. Without it, stacking layers collapses to a single linear map. Common choices: sigmoid, tanh, ReLU. See Perceptron to MLP.
Adam: An optimizer that combines momentum (a running average of gradients, the first moment) with RMSProp (a running average of squared gradients, the second moment) to give each parameter its own adaptive step size. See Loss & Optimization.
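As a rough illustration of the two running averages, here is a minimal sketch of one Adam update for a single scalar parameter. The function name `adam_step`, the toy loss $(w-3)^2$, and the hyperparameter values are assumptions chosen for this example; the bias-correction terms follow the commonly published form of the algorithm and are not taken from the lesson.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero-initialised averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # step size adapts to the gradient scale
    return w, m, v

# Toy problem (an assumption for illustration): minimise (w - 3)^2, gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t, lr=0.01)
```

On the first step the corrected moments cancel, so the parameter moves by roughly `lr` whatever the gradient's magnitude, which is the adaptive step size at work.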
AlexNet: The 2012 CNN that won the ImageNet challenge (ILSVRC) and started the modern deep learning era. Its win came from combining a deep architecture, ReLU, GPU training, dropout, and a large dataset, none of which was new on its own.
Attention: A mechanism that lets a model focus on the most relevant parts of its input by computing a weighted average over them. It removed the fixed-context bottleneck of plain seq2seq. See Seq2Seq & Attention.
Automatic differentiation (autograd): Computing exact gradients by recording the forward operations as a graph and applying the chain rule backward through it. PyTorch's loss.backward() does this. See Backpropagation.
Backpropagation: The algorithm that computes the loss gradient for every weight in one backward sweep of the computational graph, multiplying upstream by local gradients and summing at forks. See Backpropagation.
Batch normalization: Normalizes each layer's pre-activations across the mini-batch to roughly zero mean and unit variance, then rescales with learnable parameters. It stabilizes training and allows higher learning rates. See Initialization & Normalization.
BERT: A Transformer encoder pre-trained with masked language modelling, so it reads context bidirectionally. Good for understanding tasks. See Transformers in Practice.
Bidirectional RNN: Two RNNs reading a sequence in opposite directions, whose hidden states are combined, so each position sees both past and future context.
Byte-Pair Encoding (BPE): A subword tokenizer that starts from characters and repeatedly merges the most frequent adjacent pair, so common words become single tokens and rare ones split into pieces. See Transformers in Practice.
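The merge loop can be sketched on a toy vocabulary. This is an illustrative sketch only: the helper names `most_frequent_pair` and `merge_pair` and the two-word corpus are hypothetical, and real tokenizers add details (end-of-word markers, byte-level fallback) that are left out here.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: "low" seen 5 times, "lot" seen 3 times.
toy_words = {("l", "o", "w"): 5, ("l", "o", "t"): 3}
best = most_frequent_pair(toy_words)      # ("l", "o"), counted 8 times
toy_words = merge_pair(toy_words, best)   # "lo" is now a single symbol
```

Repeating these two calls until a target vocabulary size is reached gives the full training loop.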
Backpropagation through time (BPTT): Backprop applied to an RNN unrolled across timesteps. Because the same weights recur, gradients are products over many steps, which is why long sequences cause vanishing or exploding gradients. See RNNs.
Chain rule: The calculus rule that the derivative of a composition is the product of the derivatives of its parts. It is the engine of backprop.
Computational graph: A directed acyclic graph whose nodes are operations and edges carry values. Forward pass evaluates left to right; backward pass propagates gradients right to left.
Convolution: Sliding a small learnable filter across an input and computing dot products, which gives local connectivity and weight sharing. Output size is $\lfloor (W - F + 2P)/S \rfloor + 1$. See Convolutions.
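The output-size formula is easy to check with a few lines of code. The function name `conv_output_size` is made up for this sketch; the arguments follow the symbols in the formula (input width $W$, filter size $F$, padding $P$, stride $S$).

```python
def conv_output_size(w, f, p, s):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

no_padding = conv_output_size(32, 5, 0, 1)    # 28: a 5x5 filter trims 4 pixels
same_padding = conv_output_size(32, 3, 1, 1)  # 32: padding 1 preserves size for 3x3
strided = conv_output_size(224, 7, 3, 2)      # 112: stride 2 halves the size
```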
Cross-entropy loss: The standard classification loss, $-\sum_i y_i \log \hat{y}_i$. Paired with softmax it gives clean, strong gradients when the prediction is wrong. See Loss & Optimization.
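A small sketch of why the softmax pairing gives clean gradients: for a one-hot target, the gradient of the loss with respect to the raw scores reduces to $\hat{y} - y$. The function names below are assumptions for illustration, and the max-subtraction in `softmax` is a standard numerical-stability trick rather than something the definition requires.

```python
import math

def softmax(scores):
    shift = max(scores)                       # subtract the max to avoid overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """-sum_i y_i log(y_hat_i) for a one-hot target reduces to one term."""
    return -math.log(probs[target_index])

def grad_wrt_scores(probs, target_index):
    """Gradient of softmax + cross-entropy w.r.t. the scores: y_hat - y."""
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]

probs = softmax([2.0, 1.0, 0.1])
loss_if_right = cross_entropy(probs, 0)   # target is the top-scoring class
loss_if_wrong = cross_entropy(probs, 2)   # target is the lowest-scoring class
grad_if_wrong = grad_wrt_scores(probs, 2)
```

The gradient on the wrongly scored target class is close to $-1$ when its predicted probability is near zero, which is the strong signal the definition refers to.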
Decoder: The half of a seq2seq or Transformer model that generates the output sequence one token at a time, conditioned on the encoder's representation.
Dropout: Randomly zeroing a fraction $p$ of activations during training so the network cannot rely on any single unit. With inverted dropout, the surviving activations are scaled by $1/(1-p)$ during training, so no change is needed at test time. See Regularization.
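A minimal sketch of inverted dropout on a plain list of activations, assuming drop probability $p < 1$. The function name `inverted_dropout` and the `rng` argument are illustrative choices, not a library API. The scaling happens inside the training branch, and the evaluation branch returns the activations untouched.

```python
import random

def inverted_dropout(activations, p, training, rng=random):
    """Zero each activation with probability p; scale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)          # evaluation: no masking, no scaling
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                    # seeded only to make the demo repeatable
train_out = inverted_dropout([1.0] * 10000, 0.5, training=True, rng=rng)
eval_out = inverted_dropout([1.0, 2.0], 0.5, training=False)
```

Because survivors are scaled up, the expected value of each activation is the same in training and evaluation.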
Early stopping: Watching the validation loss during training and keeping the checkpoint from just before it starts rising, which prevents overfitting. See Regularization.
Encoder: The half of a model that reads the input sequence and turns it into a representation the decoder can use.
Exploding gradient: Gradients that grow without bound as they propagate back through many layers or timesteps, usually from repeated multiplication by large values. Gradient clipping is the common fix.
Fine-tuning: Taking a pre-trained model and continuing to train its weights on a new task, usually only the later layers. Best when you have enough data. Contrast with feature extraction (freeze the backbone, train only a new head). See Transfer Learning.
Forward pass: Running inputs through the network to produce a prediction, storing the intermediate values that the backward pass will need.
Feature map: The output of applying one convolutional filter across an input. A conv layer produces one feature map per filter.
GRU: A gated recurrent unit, a lighter alternative to the LSTM with two gates (update and reset) and no separate cell state. See LSTMs & GRUs.
Gradient checking: Verifying an analytic gradient against a numerical central-difference estimate, $[f(x+\epsilon) - f(x-\epsilon)]/(2\epsilon)$. A relative difference at or below roughly $10^{-7}$ means the backprop code is likely correct. See Debugging.
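The check can be sketched for a one-dimensional function. The test function $x^3 + 2x$, the step `eps=1e-5`, and the helper names are assumptions for this example; the relative-error formula is one common choice among several.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central finite difference: [f(x + eps) - f(x - eps)] / (2 * eps)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def relative_error(a, b):
    return abs(a - b) / max(abs(a), abs(b), 1e-12)

def f(x):
    return x ** 3 + 2 * x

def correct_grad(x):
    return 3 * x ** 2 + 2

def buggy_grad(x):
    return 3 * x ** 2            # forgets the derivative of the 2x term

x0 = 1.5
err_correct = relative_error(numerical_gradient(f, x0), correct_grad(x0))
err_buggy = relative_error(numerical_gradient(f, x0), buggy_grad(x0))
```

The correct derivative passes comfortably under the $10^{-7}$ threshold, while the buggy one is off by many orders of magnitude.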
Gradient descent: The update rule $w \leftarrow w - \eta \, \partial L/\partial w$ that steps each weight downhill on the loss.
He initialization: Weight init with variance $2/\text{fan\_in}$, designed for ReLU networks so activation variance stays stable across depth. See Initialization.
Hidden state: An RNN's running memory $h_t$, updated at each timestep from the previous state and the current input.
Inception (GoogLeNet): A CNN that runs several filter sizes in parallel within a block and uses $1\times1$ convolutions to reduce channel depth cheaply. See CNN Architectures.
Kernel (filter): The small learnable weight grid in a convolution. Its weights are shared across every spatial position, so the parameter count is independent of input size.
LeNet: Yann LeCun's late-1990s CNN for digit recognition, the template of conv plus pooling plus fully connected layers.
Learning rate: The step size $\eta$ in gradient descent. Too large and training diverges; too small and it crawls. Schedules like ReduceLROnPlateau lower it over time.
LSTM: A recurrent cell with a protected cell state and three gates (forget, input, output). The cell state acts as a gradient highway, which is why LSTMs handle long sequences far better than vanilla RNNs. See LSTMs & GRUs.
Loss function: A single number measuring how wrong a prediction is. Training minimizes it. Cross-entropy for classification, MSE for regression.
Momentum: Adding a velocity term that accumulates past gradients, which dampens oscillation across steep directions and speeds movement along consistent ones. See Optimization.
Multi-layer perceptron (MLP): A stack of fully connected layers with non-linear activations. One hidden layer is enough to separate XOR. See Perceptron to MLP.
Multi-head attention: Running several attention operations in parallel on different learned projections, so the model attends to different kinds of relationships at once. See The Transformer.
Mean squared error (MSE): The regression loss, the average of squared differences between prediction and target.
Normalization: Rescaling values to a standard range or distribution. Input normalization, batch norm, and layer norm all stabilize and speed up training.
Optimizer: The rule that turns gradients into weight updates: SGD, SGD with momentum, RMSProp, Adam.
Overfitting: When a model memorizes the training set, so training loss keeps dropping while validation loss rises. Regularization fights it. See Regularization.
Padding: Adding border pixels (often zeros) around a conv input so the output keeps a chosen size. "Same" padding preserves spatial size.
Parameter: A learnable number in the model: a weight or a bias. A dense layer from $m$ to $n$ units has $m \times n$ weights and $n$ biases.
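The count can be tallied mechanically. The helper names and the 784-128-10 layer sizes below are illustrative assumptions, not taken from a lesson.

```python
def dense_param_count(m, n):
    """A dense layer from m to n units: m * n weights plus n biases."""
    return m * n + n

def mlp_param_count(layer_sizes):
    """Total parameters of a stack of dense layers, e.g. [784, 128, 10]."""
    return sum(dense_param_count(m, n)
               for m, n in zip(layer_sizes, layer_sizes[1:]))

first_layer = dense_param_count(784, 128)   # 100352 weights + 128 biases
total = mlp_param_count([784, 128, 10])     # adds 1280 weights + 10 biases
```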
Perceptron: A single neuron with a step activation. It can only draw one straight decision boundary, so it cannot solve XOR. See Perceptron to MLP.
Pooling: Downsampling a feature map by taking the max or average over small windows, which shrinks spatial size and adds a little translation tolerance.
Positional encoding: Information about token order added to Transformer inputs, since self-attention itself is order-agnostic. The original uses fixed sinusoids. See The Transformer.
Pre-training: Training a model on a large, generic task first, so it can later be fine-tuned on a smaller specific one.
Query, Key, Value (Q/K/V): Three learned projections of the input in attention. A query is matched against all keys (dot product, scaled, softmaxed) to weight the values. See The Transformer.
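The matching step can be sketched in plain Python. This sketch assumes the queries, keys, and values have already been produced by the learned projections, and it takes them as lists of vectors; the function names are illustrative.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot product of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Identical keys give equal weights, so the output is the mean of the values.
uniform = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]])
# A query aligned with the first key puts almost all weight on the first value.
peaked = attention([[10.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], [[1.0, 0.0], [0.0, 1.0]])
```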
Receptive field: The region of the input that influences one output unit. Stacking two $3\times3$ convs gives an effective $5\times5$ field with fewer parameters than one $5\times5$.
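Both halves of the claim can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions with no dilation and counts weights only (no biases), with the same number of input and output channels in every layer; the function names are made up for this example.

```python
def stacked_receptive_field(kernel_sizes):
    """Effective receptive field of stacked stride-1 convs: each adds k - 1."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def conv_weight_count(kernel_sizes, channels):
    """Weights in a stack of convs with `channels` in and out at every layer."""
    return sum(k * k * channels * channels for k in kernel_sizes)

rf_two_3x3 = stacked_receptive_field([3, 3])     # 5, same as one 5x5
weights_two_3x3 = conv_weight_count([3, 3], 64)  # 2 * 9 * 64 * 64
weights_one_5x5 = conv_weight_count([5], 64)     # 25 * 64 * 64
```

Per channel pair that is 18 weights against 25, and the stacked version also gets an extra non-linearity between the two layers.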
Regularization: Any technique that reduces overfitting: L2 weight decay, dropout, early stopping, data augmentation, batch norm. See Regularization.
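ReLU: The rectified linear unit, $\max(0, x)$. Its gradient is exactly 1 for positive inputs, so it does not shrink as it passes back through active units, which helped make very deep networks trainable. Its gradient is 0 for negative inputs, so a unit can stop learning; Leaky ReLU keeps a small slope for negatives to avoid dead units.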
Representation learning: Letting a model discover useful features from raw data instead of hand-crafting them. Deep nets learn a hierarchy: edges, then textures, then parts, then objects. See Why Deep Learning.
ResNet (skip connection): A network where each block learns a residual $F(x)$ and adds back the input, $H(x) = F(x) + x$. The shortcut fixes the degradation problem and lets gradients reach early layers. See CNN Architectures.
RNN: A recurrent network that processes a sequence step by step, carrying a hidden state. Plain RNNs struggle with long-range dependencies because of vanishing gradients. See RNNs.
Self-attention: Attention where queries, keys, and values all come from the same sequence, so every token can look at every other token in one step. The core of the Transformer. See The Transformer.
Seq2seq: An encoder-decoder model that maps an input sequence to an output sequence, as in translation. The plain version crushes the input into one fixed vector, which is the bottleneck that attention removes. See Seq2Seq & Attention.
Stochastic gradient descent (SGD): Gradient descent that estimates the gradient from a small random mini-batch each step, which is faster and adds helpful noise.
Sigmoid: The squashing function $1/(1+e^{-x})$, output in $(0,1)$. Its gradient maxes at 0.25 and saturates, so deep sigmoid stacks suffer vanishing gradients.
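The 0.25 ceiling and its effect on depth can be checked numerically. The ten-layer figure is an illustrative assumption; it multiplies only the activation derivatives and ignores the weights.

```python
import math

def sigmoid(x):
    # Branch on the sign so math.exp never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)         # derivative: sigma(x) * (1 - sigma(x))

peak = sigmoid_grad(0.0)         # 0.25, the maximum, reached at x = 0
saturated = sigmoid_grad(10.0)   # tiny: the unit is saturated
best_case_10_layers = peak ** 10 # upper bound on the product over 10 layers
```

Even in the best case, ten stacked sigmoids scale the gradient by less than one millionth, before any weights are taken into account.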
Softmax: Turns a vector of scores into a probability distribution by exponentiating and normalizing. Used for multi-class outputs and attention weights.
Stride: How far a conv filter moves between positions. A larger stride downsamples the output.
Tanh: The hyperbolic tangent, output in $(-1,1)$, zero-centred. It still saturates, so its gradients also vanish in very deep nets, but less severely than sigmoid's.
Tokenization: Splitting text into the units a model reads. Subword schemes (BPE, WordPiece) balance vocabulary size against handling rare words. See Transformers in Practice.
Transfer learning: Reusing a model trained on one task as the starting point for another. Feature extraction freezes the backbone; fine-tuning updates it. See Transfer Learning.
Transformer: An architecture built entirely from self-attention and feed-forward layers, no recurrence. It parallelizes over a sequence and handles long-range dependencies directly. See The Transformer.
Universal Approximation Theorem: A network with one hidden layer and enough non-linear units can approximate any continuous function on a compact set. It guarantees a solution exists, not that training will find it or that it needs few neurons. See Why Deep Learning.
Vanishing gradient: Gradients that shrink toward zero as they propagate back through many layers, so early layers barely learn. Caused by saturating activations like sigmoid; eased by ReLU, normalization, and skip connections.
VGG: A CNN that showed depth plus small $3\times3$ filters works well, at the cost of many parameters. See CNN Architectures.
Weight decay (L2): Adding a penalty proportional to the sum of squared weights to the loss, which keeps weights small and reduces overfitting. See Regularization.
WordPiece: A subword tokenizer used by BERT, similar to BPE but choosing merges that most increase the training-data likelihood.
Xavier (Glorot) initialization: Weight init with variance $1/\text{fan\_in}$ or $2/(\text{fan\_in}+\text{fan\_out})$, designed for tanh and sigmoid so signal variance stays stable across layers. See Initialization.
XOR problem: The classic example a single perceptron cannot solve, because the two classes are not linearly separable. One hidden layer makes them separable. See Perceptron to MLP.
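One way to see that a hidden layer is enough is to set the weights by hand. The sketch below is an illustrative construction with hand-picked weights, not a trained network: two step-activation hidden units compute OR and AND, and the output unit fires when OR is on and AND is off.

```python
def step(z):
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    """Two hidden step units (OR and AND) feeding one output step unit."""
    h_or = step(x1 + x2 - 0.5)        # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)       # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)   # OR and not AND

truth_table = [(a, b, xor_mlp(a, b)) for a in (0, 1) for b in (0, 1)]
```

Each hidden unit draws one straight line through the input plane; the output unit combines the two half-planes, which a single perceptron cannot do.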
