Join algorithms and costs · Field Guide to the Database Engine

Notation. M, N = pages of outer R, inner S. m, n = tuples in R, S. B = buffer frames available. C = cost of one index probe (constant). Build on the smaller relation, probe with the larger. Costs are IO (page reads + writes); they ignore the cost of writing the final output, which is the same for every algorithm.

The five joins at a glance

Equi-join unless noted. CMU 15-445 example (M=1000, m=100000, N=500, B=100, 0.1 ms/IO) timings shown for scale.
Algorithm	IO cost	Memory	Output sorted?	Any predicate?	Example time
Naive nested loop	M + (m · N)	3 frames (outer page, inner page, output)	no	yes (any θ)	~1.4 hours
Block nested loop	M + (⌈M/(B-2)⌉ · N)	B: B-2 for outer block, 1 inner, 1 output	no	yes (any θ)	~6.5 s
Index nested loop	M + (m · C)	small; needs an index on S join key	no (unless index gives order)	equi / range that index supports	fast when selective
Sort-merge	sort(R) + sort(S) + (M + N)	B for the sorts; merge needs few frames	yes (on join key)	equi and band/range joins	~0.75 s
Grace / hash	3 · (M + N) (in-memory: M + N)	build side must fit in B (else partition)	no	equi-join on full key only	~0.45 s

Keep this one thing

The whole point of join algorithms is to avoid the Cartesian product (|R| × |S| comparisons). Naive nested loop pays it (quadratic m · N reread of S). Sortedness, a hash table, or an index each turn that quadratic into roughly linear work.

Picking a join

Figure 1. Walk top to bottom. The first "no" at the top (not an equi-join on the full key) forces sort-merge or nested loop, because no hash turns R.a < S.b into bucket equality. If you need the result ordered on the join key, sort-merge can be worth choosing even when hash is cheaper, since it hands the sort order to the next operator for free.

Each join in detail

Nested loop (naive / block / index)

Variant	IO cost	When to use
Naive	M + (m · N)	Almost never. Rereads all of S once per outer tuple; effectively quadratic. The killer is the m · N term.
Block	M + (⌈M/(B-2)⌉ · N)	Small inner table, no index, no equi-join (any θ predicate works). Rereads S once per outer block, so more buffers cut the cost directly.
Index	M + (m · C)	S has an index on the join key and R is small or the join is selective. PostgreSQL flags nested loop as good "if the right relation can be scanned with an index scan".

Memory: naive needs only 3 frames; block uses all B (B-2 hold the outer block). Output sorted: no.

Sort-merge

Sort both inputs on the join key, then walk two cursors forward in lockstep emitting matches, like the merge step of merge sort.

Cost: sort(R) + sort(S) + (M + N). Per-input sort cost 2K · (1 + ⌈log_B-1(K/B)⌉) for K pages.
Worst case: if every tuple shares one join value the merge degrades to M · N (each outer page rescans all of S for that value).
When to use: an input already arrives sorted on the key (clustered index, prior sort or ORDER BY), so its sort cost is zero and only the M + N merge remains; or the query needs output sorted on the join key. Also handles band/range joins, which hash cannot.
Memory: B frames for the external sorts. Output sorted: yes, on the join key, handed free to the next operator.

Hash (build / probe, grace, hybrid)

Build: scan the smaller input, insert each tuple into a hash table keyed on the join attribute. Probe: scan the larger input, hash each tuple, look up matches, and recheck actual values (buckets collide).

Form	IO cost	Note
In-memory	M + N	Build side fits in B. Fastest of the lot on the CMU example.
Grace / partitioned	3 · (M + N)	Partition phase 2(M+N) (read+write) plus probe phase M+N. Recursively repartition with a new hash h2 if a partition still overflows.
Hybrid	between the two	PostgreSQL's form: keep one partition resident and probe it directly instead of spilling it.

Hard limits: equi-join on the full key only; build on the smaller input so the table fits.
Memory: the build-side hash table (PostgreSQL grants work_mem × hash_mem_multiplier). Output sorted: no.
Skew trap: if one key has so many matches they overflow memory, no repartition helps (they hash identically); fall back to block nested loop for that key.

Exam traps

Grace hash is 3(M+N), linear, not "the same as nested loop because both spill" (naive is quadratic). You build on the smaller relation, not the larger. Hash join cannot do range or inequality joins. Sort-merge does not always sort both inputs if one already arrives sorted. Block NL rereads S once per block, not per tuple.

External merge sort (the cost behind sort-merge)

When data to sort exceeds memory you cannot sort in place. Two parts:

Run generation (pass 0): read B pages, sort in memory, write a sorted run. Produces ⌈N/B⌉ runs.
Multi-way merge: merge B-1 runs at a time (one input buffer per run, one output buffer), repeatedly, until one run remains. Each pass cuts the run count by a factor of B-1.

Pass count and total IO

passes = 1 + ⌈log_B-1(⌈N/B⌉)⌉ and total IO = 2N · passes (every pass reads and writes all N pages).

The "1 +" is run generation; the log term is the merge passes. More buffers cut passes logarithmically (bigger merge fan-in). One pass only when the data already fits in B pages. Two passes is the common case and works whenever the initial run count is at most B-1.

Hash versus sort aggregation

To compute GROUP BY or DISTINCT you must bring equal keys together. Two ways:

Both are pipeline breakers: the last input tuple may belong to the first group.
	Sort-based	Hash-based
How	Sort on the group key, scan once, break into a new group when the key changes.	Hash table keyed on the group; each tuple updates its group's running aggregate in place.
State	One running group at a time.	One entry per distinct group.
Output order	Sorted on the group key (free if the next operator wants it).	No particular order.
Cost driver	Dominated by the external sort.	Cheap when few distinct groups; spills (per-partition files, extra passes) when many.
Wins when	Input already sorted on the key, or output needs to be sorted.	Number of distinct groups is small relative to the input.

In real systems

PostgreSQL exposes GroupAggregate (sort-based, input pre-sorted) and HashAggregate (hash table); since v13 hash aggregation spills to disk when groups exceed the hash memory limit [work_mem docs]. SQLite has neither hash nor merge join: it implements joins only as nested loops and makes them fast with indexes and join ordering, and it does sort-based grouping for free when an index already supplies order [SQLite optimizer overview].

Primary sources

CMU 15-445 Lecture 12 (Join Algorithms) and Graefe, "Volcano" (IEEE TKDE 6(1), 1994)

CMU 15-445 is the source of the exact IO cost formulas and the worked timing example. Volcano is the origin of the open/next/close iterator model in which join is the one-to-one match operator with hybrid-hash and sort-merge variants, where setting both overflow thresholds to zero gives Grace-style behavior.