Join algorithms and costs
One-page cheat sheet for the three join families, external merge sort, and aggregation: when each wins, its IO cost, how much memory it needs, and whether its output comes out sorted.
M, N = pages of outer R, inner S.
m, n = tuples in R, S.
B = buffer frames available.
C = cost of one index probe (constant).
Build on the smaller relation, probe with the larger. Costs are IO (page reads + writes); they
ignore the cost of writing the final output, which is the same for every algorithm.
The five joins at a glance
| Algorithm | IO cost | Memory | Output sorted? | Any predicate? | Example time |
|---|---|---|---|---|---|
| Naive nested loop | M + (m · N) | 3 frames (outer page, inner page, output) | no | yes (any θ) | ~1.4 hours |
| Block nested loop | M + (⌈M/(B-2)⌉ · N) | B: B-2 for outer block, 1 inner, 1 output | no | yes (any θ) | ~6.5 s |
| Index nested loop | M + (m · C) | small; needs an index on S join key | no (unless index gives order) | equi / range that index supports | fast when selective |
| Sort-merge | sort(R) + sort(S) + (M + N) | B for the sorts; merge needs few frames | yes (on join key) | equi and band/range joins | ~0.75 s |
| Grace / hash | 3 · (M + N) (in-memory: M + N) | build side must fit in B (else partition) | no | equi-join on full key only | ~0.45 s |
The whole point of join algorithms is to avoid the Cartesian product (|R| × |S| comparisons). Naive nested loop pays it (quadratic m · N reread of S). Sortedness, a hash table, or an index each turn that quadratic into roughly linear work.
Picking a join
Each join in detail
Nested loop (naive / block / index)
| Variant | IO cost | When to use |
|---|---|---|
| Naive | M + (m · N) | Almost never. Rereads all of S once per outer tuple; effectively quadratic. The killer is the m · N term. |
| Block | M + (⌈M/(B-2)⌉ · N) | Small inner table, no index, no equi-join (any θ predicate works). Rereads S once per outer block, so more buffers cut the cost directly. |
| Index | M + (m · C) | S has an index on the join key and R is small or the join is selective. PostgreSQL flags nested loop as good "if the right relation can be scanned with an index scan". |
Memory: naive needs only 3 frames; block uses all B (B-2 hold the outer block). Output sorted: no.
Sort-merge
Sort both inputs on the join key, then walk two cursors forward in lockstep emitting matches, like the merge step of merge sort.
- Cost: sort(R) + sort(S) + (M + N). Per-input sort cost 2K · (1 + ⌈logB-1(K/B)⌉) for K pages.
- Worst case: if every tuple shares one join value the merge degrades to M · N (each outer page rescans all of S for that value).
- When to use: an input already arrives sorted on the key (clustered index, prior sort or ORDER BY), so its sort cost is zero and only the M + N merge remains; or the query needs output sorted on the join key. Also handles band/range joins, which hash cannot.
- Memory: B frames for the external sorts. Output sorted: yes, on the join key, handed free to the next operator.
Hash (build / probe, grace, hybrid)
Build: scan the smaller input, insert each tuple into a hash table keyed on the join attribute. Probe: scan the larger input, hash each tuple, look up matches, and recheck actual values (buckets collide).
| Form | IO cost | Note |
|---|---|---|
| In-memory | M + N | Build side fits in B. Fastest of the lot on the CMU example. |
| Grace / partitioned | 3 · (M + N) | Partition phase 2(M+N) (read+write) plus probe phase M+N. Recursively repartition with a new hash h2 if a partition still overflows. |
| Hybrid | between the two | PostgreSQL's form: keep one partition resident and probe it directly instead of spilling it. |
- Hard limits: equi-join on the full key only; build on the smaller input so the table fits.
- Memory: the build-side hash table (PostgreSQL grants work_mem × hash_mem_multiplier). Output sorted: no.
- Skew trap: if one key has so many matches they overflow memory, no repartition helps (they hash identically); fall back to block nested loop for that key.
Grace hash is 3(M+N), linear, not "the same as nested loop because both spill" (naive is quadratic). You build on the smaller relation, not the larger. Hash join cannot do range or inequality joins. Sort-merge does not always sort both inputs if one already arrives sorted. Block NL rereads S once per block, not per tuple.
External merge sort (the cost behind sort-merge)
When data to sort exceeds memory you cannot sort in place. Two parts:
- Run generation (pass 0): read B pages, sort in memory, write a sorted run. Produces ⌈N/B⌉ runs.
- Multi-way merge: merge B-1 runs at a time (one input buffer per run, one output buffer), repeatedly, until one run remains. Each pass cuts the run count by a factor of B-1.
passes = 1 + ⌈logB-1(⌈N/B⌉)⌉ and total IO = 2N · passes (every pass reads and writes all N pages).
The "1 +" is run generation; the log term is the merge passes. More buffers cut passes logarithmically (bigger merge fan-in). One pass only when the data already fits in B pages. Two passes is the common case and works whenever the initial run count is at most B-1.
Hash versus sort aggregation
To compute GROUP BY or DISTINCT you must bring equal keys together. Two ways:
| Sort-based | Hash-based | |
|---|---|---|
| How | Sort on the group key, scan once, break into a new group when the key changes. | Hash table keyed on the group; each tuple updates its group's running aggregate in place. |
| State | One running group at a time. | One entry per distinct group. |
| Output order | Sorted on the group key (free if the next operator wants it). | No particular order. |
| Cost driver | Dominated by the external sort. | Cheap when few distinct groups; spills (per-partition files, extra passes) when many. |
| Wins when | Input already sorted on the key, or output needs to be sorted. | Number of distinct groups is small relative to the input. |
PostgreSQL exposes GroupAggregate (sort-based, input pre-sorted) and HashAggregate (hash table);
since v13 hash aggregation spills to disk when groups exceed the hash memory limit
[work_mem docs].
SQLite has neither hash nor merge join: it implements joins only as nested loops and makes them fast with indexes and join
ordering, and it does sort-based grouping for free when an index already supplies order
[SQLite optimizer overview].
CMU 15-445 is the source of the exact IO cost formulas and the worked timing example. Volcano is the origin of the open/next/close iterator model in which join is the one-to-one match operator with hybrid-hash and sort-merge variants, where setting both overflow thresholds to zero gives Grace-style behavior.