Transformers · Week 7 · Session 2

Transformers in practice: BERT, GPT & tokenization

Last session you built the Transformer. Today you meet the two models that ate NLP, and they differ by one mask.

~20 min read · Exam relevance: encoder vs decoder, attention masks, BPE/WordPiece, pre-train + fine-tune · Builds on: Seq2Seq & Attention

You spent last session assembling a Transformer block: multi-head self-attention, then a feed-forward layer, wrapped in residuals and layer norm. Here is the surprise. The two models that defined modern NLP, Google's BERT and OpenAI's GPT, are both built from that same block. The thing that splits them is a single boolean matrix inside attention, the mask that decides which tokens each token is allowed to look at. Call that mask $M$: change it from a full grid to a lower triangle and an "understanding" model becomes a "generating" one.

1. The problem: one architecture, two jobs

Natural language asks for two different skills. Sometimes you want to understand a sentence that already exists: classify its sentiment, tag each word, answer a question about it. For that, every word should see every other word, including the ones after it. Sometimes you want to generate the next sentence: write a completion, translate, answer a chat prompt. For that, the model must produce one token at a time, and while writing token 5 it cannot peek at token 6, because token 6 does not exist yet.

Both skills run on self-attention, where each position builds a query and compares it against the keys of every position. The only question is which positions are allowed to answer. That gate is the attention mask.

⊳ First principles

Self-attention computes a score for every ordered pair of positions, an $n \times n$ grid. A mask is an $n \times n$ matrix of allow/block decisions added to those scores before the softmax: allowed entries add $0$ and leave the score untouched, while blocked entries add $-\infty$, so their attention weight becomes zero. BERT allows the whole grid. GPT blocks everything above the diagonal. Same attention code, one line of difference.

2. The attention mask is the fork in the road

Recall the attention you wrote last week:

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$

The raw scores $QK^\top/\sqrt{d_k}$ form an $n \times n$ matrix: row $i$, column $j$ is how much query $i$ wants key $j$. The mask $M$ is added in. Two choices of $M$ give the two model families:

Mask	Shape of allowed entries	Each token attends to	Family
Bidirectional (none)	Full $n\times n$ grid	every token, left and right	BERT (encoder)
Causal	Lower triangular, $M_{ij}=-\infty$ for $j>i$	itself and tokens before it	GPT (decoder)

Why does the lower triangle equal "left to right"? Because setting $M_{ij}=-\infty$ for every future position $j>i$ drives those softmax weights to exactly zero. Position $i$ can place attention only on positions $1\ldots i$. Stack that mask in every layer and information never flows backward in time, which is what lets GPT generate one token at a time without cheating.
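The "information never flows backward" claim can be checked numerically. The sketch below is a simplified illustration, not the full model: it assumes a single head with no learned projections (so $Q = K = V = X$) and uses random vectors as stand-in token embeddings. The helper names `causal_mask` and `attend` are made up for this example. It changes only the last token and asks whether the earlier positions notice.

```python
import numpy as np

def causal_mask(n):
    # Additive mask M: 0 where j <= i (allowed), -inf where j > i (future).
    i, j = np.indices((n, n))
    return np.where(j > i, -np.inf, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # stable; every row keeps a finite max
    e = np.exp(z)                            # exp(-inf) is exactly 0.0
    return e / e.sum(axis=-1, keepdims=True)

def attend(X, M):
    # Simplifying assumption: Q = K = V = X, one head, no learned weights.
    scores = X @ X.T / np.sqrt(X.shape[1])
    return softmax(scores + M) @ X

n, d = 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
X_edit = X.copy()
X_edit[-1] = rng.normal(size=d)              # change only the last token

M_causal = causal_mask(n)
M_full = np.zeros((n, n))

# Do the first n-1 outputs change when the last token changes?
same_causal = np.allclose(attend(X, M_causal)[:-1], attend(X_edit, M_causal)[:-1])
same_full = np.allclose(attend(X, M_full)[:-1], attend(X_edit, M_full)[:-1])
print(same_causal, same_full)                          # True False
print(int((M_causal == 0).sum()), n * (n + 1) // 2)    # 15 15
```

With the causal mask the earlier outputs are untouched by the edit; with the full mask they all shift, because every position attends to the changed token.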

BERT and GPT share the Transformer block. The architectural difference is the attention mask: a full grid versus a lower triangle.

You can flip between the two masks in the simulator below and watch the heatmap change from a solid block to a staircase.

3. BERT: read both ways, fill in the blank

BERT keeps only the Transformer encoder stack and uses the full bidirectional mask. That raises an awkward question for training. A plain language model is trained to predict the next word, but if every token can already see every other token, predicting the next word is trivial: it is sitting right there in the input. The model would learn nothing.

BERT's fix is masked language modeling (MLM). Hide a random slice of the input tokens, then ask the model to reconstruct them from both sides of context. Devlin et al. mask 15% of tokens, and within that 15% they use an 80/10/10 recipe: 80% are replaced with a special [MASK] token, 10% are swapped for a random word, and 10% are left unchanged.^[1] The 10% random and 10% unchanged keep the model honest, because at fine-tuning time there is no [MASK] token, so it cannot just learn "only predict where you see [MASK]".
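The 15% selection and the 80/10/10 split can be sketched in a few lines of plain Python. This is an illustrative sketch rather than BERT's actual data pipeline: the function name `mlm_corrupt`, the use of `None` for "not scored", and the dummy token IDs are all assumptions, and real implementations typically also avoid selecting special tokens.

```python
import random

def mlm_corrupt(token_ids, vocab_size, mask_id, rng, select_prob=0.15):
    """Return (corrupted inputs, labels); labels are None where nothing is predicted."""
    inputs = list(token_ids)
    labels = [None] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue                              # ~85%: untouched and not scored
        labels[i] = tok                           # the model must recover the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                   # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size) # 10%: replace with a random token
        # remaining 10%: keep the original token, but still score it
    return inputs, labels

rng = random.Random(0)
tokens = [7] * 20000                 # dummy IDs; a long run makes the ratios visible
inputs, labels = mlm_corrupt(tokens, vocab_size=1000, mask_id=0, rng=rng)

selected = [i for i, lab in enumerate(labels) if lab is not None]
masked = sum(inputs[i] == 0 for i in selected)
print(len(selected) / len(tokens))   # close to 0.15
print(masked / len(selected))        # close to 0.80
```

Note that the loss is computed on every selected position, including the ones left unchanged, which is what stops the model from keying only on [MASK].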

◆ Intuition

MLM is a fill-in-the-blank exam. "The [MASK] sat on the mat" is easy because both neighbours help: an animal, probably "cat". To fill a blank well you have to understand the whole sentence, so the gradient pushes BERT to build rich, context-aware representations of every position. That is exactly the representation you want for classification and tagging.

One token is special. BERT prepends a [CLS] token to every input; after the encoder runs, the vector sitting at that [CLS] position is used as a summary of the whole sequence and fed to a classifier head.^[1] BERT also trains on next-sentence prediction (does sentence B follow sentence A?), though later work found that part adds little.
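In code, "use the [CLS] vector" just means indexing row 0 of the encoder output and passing it to a small classifier. The sketch below uses random numbers as a stand-in for the encoder output and a single linear layer as a hypothetical head; the three-class setup and the weight scale are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, hidden, n_classes = 6, 768, 3      # 768 matches BERT-base; 3 classes is arbitrary

# Stand-in for the encoder's final-layer output, one row per token; row 0 is [CLS].
H = rng.normal(size=(n_tokens, hidden))

# Hypothetical classifier head: one linear layer applied to the [CLS] vector only.
W = rng.normal(scale=0.02, size=(hidden, n_classes))
b = np.zeros(n_classes)

cls_vector = H[0]                            # summary of the whole sequence
logits = cls_vector @ W + b                  # shape (n_classes,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                         # softmax over classes
print(logits.shape, round(float(probs.sum()), 6))
```

The other rows of `H` are ignored by this head; a tagging task would instead apply a head to every row.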

Because BERT reads both directions, it cannot generate text in order. Its prediction for position 5 is trained with tokens 6 and beyond in view, and in left-to-right writing those tokens do not exist yet. BERT understands; it does not write.

4. GPT: read left to right, predict the next token

GPT keeps only the Transformer decoder stack with the causal mask. Its training objective is the oldest one in the book: predict the next token given everything before it. For a sequence $x_1,\ldots,x_n$ it maximizes

$$\log p(x) = \sum_{i=1}^{n} \log p(x_i \mid x_1, \ldots, x_{i-1})$$

The causal mask is what makes this objective trainable on a whole sequence at once. With future positions blocked, the prediction at position $i$ physically cannot use $x_{i+1}$, so a single forward pass scores every next-token prediction in parallel while still respecting the left-to-right rule. At generation time you sample a token, append it, and run again.^[2]
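The sum-of-log-probabilities objective and the sample-append-repeat loop can both be seen in a toy setting. The sketch below swaps in a hand-written lookup table for the trained network: `BIGRAM` and its probabilities are invented for illustration, and it conditions only on the last token, whereas a real GPT conditions on the whole prefix. Decoding here is greedy (pick the most likely token) rather than sampled, to keep the output deterministic.

```python
import math

# Hypothetical stand-in for a trained model: p(next token | prefix).
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.7, "mat": 0.3},
    "a":   {"cat": 0.5, "mat": 0.5},
    "cat": {"sat": 0.9, "</s>": 0.1},
    "sat": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def next_token_probs(prefix):
    return BIGRAM[prefix[-1]]            # toy model: looks at the last token only

def sequence_log_prob(tokens):
    # log p(x) = sum_i log p(x_i | x_1..x_{i-1}); tokens[0] is the start symbol
    return sum(math.log(next_token_probs(tokens[:i])[tokens[i]])
               for i in range(1, len(tokens)))

def generate(prefix, max_new_tokens=10):
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        nxt = max(probs, key=probs.get)  # greedy; sampling would draw from probs instead
        tokens.append(nxt)               # append, then run the model again
        if nxt == "</s>":
            break
    return tokens

seq = generate(["<s>"])
print(seq)                                          # ['<s>', 'the', 'cat', 'sat', '</s>']
print(round(math.exp(sequence_log_prob(seq)), 3))   # 0.378 = 0.6 * 0.7 * 0.9 * 1.0
```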

The story of GPT is scale. GPT-1 had 117M parameters, GPT-2 reached 1.5B, and GPT-3 jumped to 175B.^[2] At GPT-3 scale a new ability showed up that nobody coded in: in-context (few-shot) learning. Show the model two or three examples of a task inside the prompt and it performs the task on a new input, with no weight updates at all.^[3] Translation, for instance, can be triggered by a few example pairs in the prompt rather than by training on a translation dataset.

◆ Intuition

Few-shot prompting is surprising because there is no learning in the usual sense. No gradient step, no fine-tuning. The pattern in the prompt examples is recognized and continued purely by the forward pass. This is why people say scale "unlocked" abilities: the same next-token objective, run on a big enough model and enough data, produced behaviour the training loss never directly rewarded.
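Mechanically, a few-shot prompt is just a string: an instruction, a handful of solved examples, and one unsolved example left open for the model to continue. The helper below is a hypothetical sketch (the name `few_shot_prompt` and the `=>` separator are choices made here, in the style of the translation prompts in Brown et al.); it builds the text only and calls no model.

```python
def few_shot_prompt(instruction, examples, query):
    lines = [instruction]
    lines += [f"{src} => {tgt}" for src, tgt in examples]   # solved demonstrations
    lines.append(f"{query} =>")                             # left open for the model
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French:",
    [("sea otter", "loutre de mer"), ("plush giraffe", "girafe peluche")],
    "cheese",
)
print(prompt)
```

Nothing about the model changes between tasks; swapping the demonstrations swaps the task.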

5. Tokenization: how text becomes numbers

A Transformer never sees letters. It sees integer IDs, one per token, looked up in an embedding table. So something has to chop text into tokens first, and that choice quietly decides what the model can even represent.

The naive options both fail. One token per word gives a huge, brittle vocabulary that breaks on any word it has not seen ("antidisestablishmentarianism" becomes an unknown). One token per character keeps the vocabulary tiny but makes sequences very long and forces the model to relearn spelling. Subword tokenization splits the difference: common words stay whole, rare words break into frequent pieces.

Byte-Pair Encoding (BPE), used by GPT, learns its vocabulary by greedy merging.^[4] Start with single characters. Count every adjacent pair across the corpus, merge the most frequent pair into a new token, and repeat until you hit a target vocabulary size. Worked example from the Hugging Face course, with word counts hug (10), pug (5), pun (12), bun (4), hugs (5): the pair ("u","g") appears 20 times, so it merges first into ug; then ("u","n") at 16 merges into un; then ("h","ug") at 15 merges into hug.^[4]
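The merge loop is short enough to run by hand or in code. The sketch below is a minimal character-level version for the five-word example above; the function names `merge_pair` and `learn_bpe` are illustrative, and production tokenizers add details this omits (byte-level alphabets, pre-tokenization, tie-breaking rules).

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Replace every adjacent occurrence of `pair` with the concatenated symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_counts, num_merges):
    # Each word starts as a tuple of single characters.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count        # weight each pair by word frequency
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)   # most frequent adjacent pair
        merges.append(best)
        corpus = {merge_pair(symbols, best): count for symbols, count in corpus.items()}
    return merges

word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges = learn_bpe(word_counts, num_merges=3)
print(merges)   # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

The three merges match the counts in the worked example: 20, then 16, then 15.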

WordPiece, used by BERT, is close but differs in two ways.^[5] It marks continuation pieces with a ## prefix, so "tokenization" becomes token + ##ization and you can always tell which pieces attach to the piece before them and which start a new word. And instead of merging the most frequent pair, it merges the pair with the highest score:

$$\text{score}(a,b) = \frac{\text{freq}(ab)}{\text{freq}(a)\cdot\text{freq}(b)}$$

That denominator means WordPiece prefers merging pieces that are rare on their own but common together, which avoids gluing two already-common pieces just because both are everywhere. BERT's WordPiece vocabulary holds about 30,000 tokens.^[1]
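To see the score change the outcome, the sketch below applies it to the same five-word toy corpus used for BPE (hug 10, pug 5, pun 12, bun 4, hugs 5) and compares the highest-scoring pair with the most frequent pair. It covers only the first merge decision and is an illustration of the formula, not a full WordPiece trainer; the helper names are made up.

```python
from collections import Counter

word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

def wordpiece_symbols(word):
    # The first character stays bare; the rest carry the ## continuation prefix.
    return [word[0]] + ["##" + ch for ch in word[1:]]

symbol_freq, pair_freq = Counter(), Counter()
for word, count in word_counts.items():
    symbols = wordpiece_symbols(word)
    for s in symbols:
        symbol_freq[s] += count
    for pair in zip(symbols, symbols[1:]):
        pair_freq[pair] += count

def score(pair):
    a, b = pair
    return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

most_frequent = max(pair_freq, key=pair_freq.get)   # what a frequency-only rule picks
best_scored = max(pair_freq, key=score)             # what the WordPiece score picks
print(most_frequent, pair_freq[most_frequent])      # ('##u', '##g') 20
print(best_scored, round(score(best_scored), 3))    # ('##g', '##s') 0.05
```

The pair (##u, ##g) is the most frequent, but ##u appears in every word, so its score is diluted to 1/36. The pair (##g, ##s) occurs only 5 times, yet ##s never appears without ##g, so it scores 1/20 and wins.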

▲ Why LLMs are bad at arithmetic

The tokenizer chops numbers inconsistently. "15213" might split into ["152","13"], and "427" into ["42","7"]. The model never sees the digits 1-5-2-1-3 as separate symbols; it sees two arbitrary chunks. Since the chunks for "152" and "153" are unrelated tokens, the model has no direct way to do digit-by-digit carrying the way you do on paper. Inconsistent number splitting is one direct contributor to arithmetic errors.
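A toy splitter makes the inconsistency concrete. The sketch below uses greedy longest-match over a small invented vocabulary that contains every single digit plus a few multi-digit chunks; real GPT tokenizers apply learned BPE merges instead, so treat this as a simplified stand-in that reproduces the same kind of effect.

```python
def greedy_split(text, vocab):
    # Repeatedly take the longest prefix of the remaining text that is in the vocabulary.
    pieces, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot tokenize {text[start:]!r}")
    return pieces

# Hypothetical vocabulary: all single digits plus a few chunks that happened to be frequent.
vocab = set("0123456789") | {"152", "13", "42"}

print(greedy_split("15213", vocab))   # ['152', '13']
print(greedy_split("427", vocab))     # ['42', '7']
print(greedy_split("15214", vocab))   # ['152', '1', '4']
```

"15213" and "15214" differ by one in the last digit, yet one becomes two tokens and the other three, so the place value of each digit is not something the model can read off the token positions.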

6. Pre-train once, fine-tune cheaply

Both families share one workflow, and it is the reason they took over. Pre-training runs the self-supervised objective (MLM for BERT, next-token for GPT) on a giant pile of unlabeled text. This is expensive and happens once. Fine-tuning then takes those pre-trained weights, adds a small task head, and trains on a small labeled dataset for your specific task with a low learning rate and a few epochs.

◆ Intuition

Think of it as medical school then residency. Pre-training is the long, costly general education where the model learns how language works. Fine-tuning is the short specialization where it learns your task. You pay for medical school once; each residency is cheap.

This is why, given 500 labeled support emails, you fine-tune BERT rather than train a classifier from scratch. 500 labels is far too few to learn language from zero, but BERT already knows language; you only teach it your two or three categories. Training from random initialization would need on the order of 100 times more data and would overfit badly on 500 examples.

▲ Exam-favourite numbers

BERT-base: 12 layers, hidden size 768, 12 heads, about 110M parameters. BERT-large: 24 layers, hidden 1024, 16 heads, about 340M parameters.^[1] MLM masks 15% of tokens, split 80% [MASK] / 10% random / 10% unchanged. WordPiece vocab about 30,000. GPT scale: 117M (GPT-1), 1.5B (GPT-2), 175B (GPT-3). Self-attention is $O(n^2)$ in sequence length, so doubling $n$ quadruples the attention memory: the $QK^\top$ matrix is $n\times n$, and $(2n)^2 = 4n^2$.
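Two of these numbers can be sanity-checked with arithmetic. The sketch below rebuilds the BERT parameter counts from the layer shapes and confirms the quadratic scaling of the score matrix. Several inputs are assumptions not stated in this lesson: a vocabulary of 30,522, 512 positions, 2 segment types, a feed-forward width of 4 times the hidden size, and a pooling layer on [CLS]; the pre-training output heads are left out. The function names are illustrative.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    ffn = 4 * hidden                                   # assumed feed-forward width
    layer_norm = 2 * hidden                            # scale + shift
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)         # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden                  # dense layer on [CLS]
    return embeddings + layers * per_layer + pooler

base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"{base / 1e6:.1f}M, {large / 1e6:.1f}M")        # 109.5M, 335.1M

def attention_score_entries(n, heads=1):
    return heads * n * n                               # one n x n score matrix per head

print(attention_score_entries(1024) / attention_score_entries(512))   # 4.0
```

Under these assumptions the counts land near the quoted "about 110M" and "about 340M". The number of heads does not appear in the parameter count, because the heads split the same hidden dimension rather than adding to it.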

7. Simulator: mask and tokenize

Two things to feel here. First, toggle the attention mask between BERT (bidirectional) and GPT (causal) and watch the allowed-attention grid flip between a full block and a staircase. Second, pick a word and see how WordPiece splits it into subwords using a small demo vocabulary. The flow diagram underneath traces pre-training to fine-tuning.

The causal grid is exactly the lower triangle: row $i$ lights up only for columns $j \le i$. Count the lit cells and you get $n(n+1)/2$ instead of $n^2$: still quadratic growth, with roughly half the cells.

8. In NumPy: the causal mask and a tiny tokenizer

The whole BERT-versus-GPT distinction is a few lines. Build the additive mask, add it to the scores, softmax, done. Here is a self-attention step with a switch for the causal mask, plus a greedy WordPiece tokenizer over a fixed demo vocabulary.

mask_and_tokenize.py: one mask flips BERT into GPT; greedy longest-match splits words

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V, causal=False):
    # Q,K,V: (n, d). causal=False -> BERT (full), causal=True -> GPT (triangle)
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)          # (n, n) raw affinities
    if causal:
        future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True ABOVE diagonal = future
        scores = scores + np.where(future, -np.inf, 0.0)     # additive mask M: 0 allowed, -inf blocked
    weights = softmax(scores, axis=-1)     # rows sum to 1
    return weights @ V, weights

def wordpiece(word, vocab):
    # greedy longest-match from the left; continuation pieces carry "##"
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            cand = word[start:end]
            sub = cand if start == 0 else "##" + cand
            if sub in vocab:
                piece = sub
                break
            end -= 1                        # shrink window until a piece matches
        if piece is None:
            return ["[UNK]"]                # no split found
        tokens.append(piece)
        start = end
    return tokens

# BERT sees every token; GPT at position i sees only 0..i
n = 4
Q = K = V = np.eye(n)
_, w_bert = self_attention(Q, K, V, causal=False)
_, w_gpt  = self_attention(Q, K, V, causal=True)
print("GPT row 0 attends to:", (w_gpt[0] > 0).sum(), "token(s)")   # 1
print("GPT row 3 attends to:", (w_gpt[3] > 0).sum(), "token(s)")   # 4

vocab = {"token", "##ization", "##s", "un", "##happi", "##ness"}
print(wordpiece("tokenization", vocab))   # ['token', '##ization']
print(wordpiece("unhappiness", vocab))    # ['un', '##happi', '##ness']

Notice the causal branch is the entire difference between the two model families. Everything else, the embeddings, the heads, the feed-forward layers, is shared.

Primary sources to read next: Devlin et al. (2019), BERT: Pre-training of Deep Bidirectional Transformers (read Sec. 3 on pre-training and Sec. 4 on fine-tuning), Brown et al. (2020), Language Models are Few-Shot Learners (the GPT-3 in-context learning results), and Jay Alammar's visual Illustrated BERT. For tokenizers, the Hugging Face BPE and WordPiece chapters are short and worked end to end. Stuck on why the lower-triangular mask equals left-to-right, or on a WordPiece split? Ask me and we will trace a small example cell by cell.

[1] BERT (language model), Wikipedia: en.wikipedia.org/wiki/BERT_(language_model)

[2] Lilian Weng, Generalized Language Models: lilianweng.github.io/posts/2019-01-31-lm

[3] Brown et al. (2020), Language Models are Few-Shot Learners: arxiv.org/abs/2005.14165

[4] Hugging Face NLP Course, Byte-Pair Encoding: huggingface.co/learn/nlp-course/chapter6/5

[5] Hugging Face NLP Course, WordPiece: huggingface.co/learn/nlp-course/chapter6/6