Field Guide / Reference / Query execution
Query layer

Join algorithms and costs

One-page cheat sheet for the three join families, external merge sort, and aggregation: when each wins, its IO cost, how much memory it needs, and whether its output comes out sorted.

Notation. M, N = pages of outer R, inner S. m, n = tuples in R, S. B = buffer frames available. C = cost of one index probe (constant). Build on the smaller relation, probe with the larger. Costs are IO (page reads + writes); they ignore the cost of writing the final output, which is the same for every algorithm.

The five joins at a glance

Equi-join unless noted. CMU 15-445 example (M=1000, m=100000, N=500, B=100, 0.1 ms/IO) timings shown for scale.
AlgorithmIO costMemoryOutput sorted?Any predicate?Example time
Naive nested loop M + (m · N) 3 frames (outer page, inner page, output) no yes (any θ) ~1.4 hours
Block nested loop M + (⌈M/(B-2)⌉ · N) B: B-2 for outer block, 1 inner, 1 output no yes (any θ) ~6.5 s
Index nested loop M + (m · C) small; needs an index on S join key no (unless index gives order) equi / range that index supports fast when selective
Sort-merge sort(R) + sort(S) + (M + N) B for the sorts; merge needs few frames yes (on join key) equi and band/range joins ~0.75 s
Grace / hash 3 · (M + N) (in-memory: M + N) build side must fit in B (else partition) no equi-join on full key only ~0.45 s
Keep this one thing

The whole point of join algorithms is to avoid the Cartesian product (|R| × |S| comparisons). Naive nested loop pays it (quadratic m · N reread of S). Sortedness, a hash table, or an index each turn that quadratic into roughly linear work.

Picking a join

join R and S equi-join on full key? no sort-merge or block NL yes index on S key, join selective? yes index nested loop no build side fits in B? yes in-memory hash join no grace / hybrid hash join need output sorted on key? then prefer sort-merge
Figure 1. Walk top to bottom. The first "no" at the top (not an equi-join on the full key) forces sort-merge or nested loop, because no hash turns R.a < S.b into bucket equality. If you need the result ordered on the join key, sort-merge can be worth choosing even when hash is cheaper, since it hands the sort order to the next operator for free.

Each join in detail

Nested loop (naive / block / index)

VariantIO costWhen to use
Naive M + (m · N) Almost never. Rereads all of S once per outer tuple; effectively quadratic. The killer is the m · N term.
Block M + (⌈M/(B-2)⌉ · N) Small inner table, no index, no equi-join (any θ predicate works). Rereads S once per outer block, so more buffers cut the cost directly.
Index M + (m · C) S has an index on the join key and R is small or the join is selective. PostgreSQL flags nested loop as good "if the right relation can be scanned with an index scan".

Memory: naive needs only 3 frames; block uses all B (B-2 hold the outer block). Output sorted: no.

Sort-merge

Sort both inputs on the join key, then walk two cursors forward in lockstep emitting matches, like the merge step of merge sort.

Hash (build / probe, grace, hybrid)

Build: scan the smaller input, insert each tuple into a hash table keyed on the join attribute. Probe: scan the larger input, hash each tuple, look up matches, and recheck actual values (buckets collide).

FormIO costNote
In-memoryM + NBuild side fits in B. Fastest of the lot on the CMU example.
Grace / partitioned3 · (M + N)Partition phase 2(M+N) (read+write) plus probe phase M+N. Recursively repartition with a new hash h2 if a partition still overflows.
Hybridbetween the twoPostgreSQL's form: keep one partition resident and probe it directly instead of spilling it.
Exam traps

Grace hash is 3(M+N), linear, not "the same as nested loop because both spill" (naive is quadratic). You build on the smaller relation, not the larger. Hash join cannot do range or inequality joins. Sort-merge does not always sort both inputs if one already arrives sorted. Block NL rereads S once per block, not per tuple.

External merge sort (the cost behind sort-merge)

When data to sort exceeds memory you cannot sort in place. Two parts:

  1. Run generation (pass 0): read B pages, sort in memory, write a sorted run. Produces ⌈N/B⌉ runs.
  2. Multi-way merge: merge B-1 runs at a time (one input buffer per run, one output buffer), repeatedly, until one run remains. Each pass cuts the run count by a factor of B-1.
Pass count and total IO

passes = 1 + ⌈logB-1(⌈N/B⌉)⌉   and   total IO = 2N · passes (every pass reads and writes all N pages).

The "1 +" is run generation; the log term is the merge passes. More buffers cut passes logarithmically (bigger merge fan-in). One pass only when the data already fits in B pages. Two passes is the common case and works whenever the initial run count is at most B-1.

Hash versus sort aggregation

To compute GROUP BY or DISTINCT you must bring equal keys together. Two ways:

Both are pipeline breakers: the last input tuple may belong to the first group.
Sort-basedHash-based
HowSort on the group key, scan once, break into a new group when the key changes.Hash table keyed on the group; each tuple updates its group's running aggregate in place.
StateOne running group at a time.One entry per distinct group.
Output orderSorted on the group key (free if the next operator wants it).No particular order.
Cost driverDominated by the external sort.Cheap when few distinct groups; spills (per-partition files, extra passes) when many.
Wins whenInput already sorted on the key, or output needs to be sorted.Number of distinct groups is small relative to the input.
In real systems

PostgreSQL exposes GroupAggregate (sort-based, input pre-sorted) and HashAggregate (hash table); since v13 hash aggregation spills to disk when groups exceed the hash memory limit [work_mem docs]. SQLite has neither hash nor merge join: it implements joins only as nested loops and makes them fast with indexes and join ordering, and it does sort-based grouping for free when an index already supplies order [SQLite optimizer overview].

Primary sources
CMU 15-445 Lecture 12 (Join Algorithms) and Graefe, "Volcano" (IEEE TKDE 6(1), 1994)

CMU 15-445 is the source of the exact IO cost formulas and the worked timing example. Volcano is the origin of the open/next/close iterator model in which join is the one-to-one match operator with hybrid-hash and sort-merge variants, where setting both overflow thresholds to zero gives Grace-style behavior.