Field Guide / Part V · Query processing / Week 10
Query execution

Joins, external sort, and aggregation

A join is a Cartesian product with a predicate stapled on. Done literally, joining two million-row tables is 1012 comparisons. The whole craft of join algorithms is dodging that product using sortedness, a hash table, or an index. This lesson is where the iterator tree from last week meets the operators that actually cost money.

By the end you can

  • Explain why a literal join is quadratic and name the three escape routes (sort, hash, index).
  • Write the I/O cost of naive, block, and index nested loop, sort-merge, and grace hash join, and say which buffer or selectivity knob moves each one.
  • Derive the external merge sort pass count and explain why two passes is the usual case.
  • Tell sort-based from hash-based aggregation, and identify which operators are pipeline breakers.
  • Answer the classic traps: which side you build the hash table on, whether hash join does inequalities, and whether an index scan always wins.

Where this sits in the engine

Hold the running query in mind. Most of the course follows SELECT name FROM emp WHERE salary > 75000 ORDER BY name, which has no join. So widen it for one lesson. Imagine the real reporting query is SELECT e.name, d.dname FROM emp e JOIN dept d ON e.dept_id = d.id WHERE e.salary > 75000 ORDER BY e.name. Now there are two tables to combine and a final sort, and the executor has to decide how.

The layer above is the optimizer (week 11). It already chose the physical operators: it picked a join algorithm and a scan type, costed them, and handed down a physical plan tree. This lesson is the menu the optimizer was choosing from, plus the exact cost formulas it used to choose. The layer below is the access method and buffer pool: every operator here reads and writes pages through the buffer pool (week 4), which is why buffer count B shows up in every cost formula on this page. An index nested loop join leans on the B+tree from week 6.

And the operators themselves are the iterators from week 9. Each join, sort, and aggregate is an iterator with open / next / close. The new wrinkle this week is that some of them cannot stream: a sort or a hash build has to swallow its entire input before it can hand out the first tuple. Those are the pipeline breakers, and they are exactly where memory runs out and the operator spills to disk.

Intuition

A join asks: for each row on one side, which rows on the other side match? The dumb way is to ask the question fresh for every single row, rescanning the other table each time. The smart ways arrange the data so the matches are easy to find: put both sides in the same order (sort-merge), or build a lookup table keyed on the join column (hash), or keep a pre-built index so each lookup is a few page reads (index). All three trade some setup cost for not paying the quadratic rescan.

The problem: the product is quadratic

A join of R and S is the Cartesian product R × S with the matching rows kept. There are |R| × |S| pairs to consider. For two tables of a million rows that is 1012 comparisons, which is not a number you want in a per-query budget. Every join algorithm is a scheme to look at far fewer than all the pairs.

Why three algorithms and not one

The product is unavoidable only if you know nothing about the data. The moment you can sort it, both inputs in the same order let you sweep two cursors forward once and never look back. The moment you can afford a hash table, one side becomes an O(1) lookup so the other side just probes. The moment an index already exists on the join key, each lookup is a short tree descent. Each algorithm exploits a different pre-condition, so the optimizer picks based on what is true about this query: are the inputs already sorted, does the smaller side fit in memory, is there a usable index, is the predicate even an equality?

nested loop R S rescan S per R row sort-merge R↑ S↑ two cursors, one pass hash S small R big build probe
Figure 1. Nested loop rescans the inner side; sort-merge brings both sides into the same order and sweeps once; hash builds a table on the smaller side and probes it with the larger. Notice that only nested loop pays for repeated rescans.

Nested loop and why buffers matter

The naive nested loop is two for-loops: for each tuple of the outer relation R, scan all of the inner relation S. With M pages in R holding m tuples and N pages in S, the cost is

# naive: reread all of S once per outer TUPLE
cost = M + (m * N)

The m * N term is the killer. On the standard CMU worked example (M=1000, m=100000, N=500, B=100 buffers, 0.1 ms per I/O) naive nested loop is about 1.4 hours [CMU 15-445 L12]. The block variant fixes the obvious waste: reread S once per outer block, not per tuple. Give the outer scan B − 2 frames (one frame holds an inner page, one holds output):

# block: reread all of S once per outer BLOCK
cost = M + (ceil(M / (B - 2)) * N)

On the same example this drops to about 6.5 seconds. This is the first place buffer count earns its keep: more frames mean a bigger outer block, fewer rereads of S, lower cost. The index variant replaces the inner scan with an index probe at constant cost C per outer tuple:

# index: probe an index on S's join key per outer tuple
cost = M + (m * C)

Index nested loop wins when S has an index on the join column and either R is small or the join is selective. PostgreSQL flags this case directly: nested loop "can be a good strategy" specifically "if the right relation can be scanned with an index scan" [PG planner].

Sort-merge join

Sort both inputs on the join key, then walk two cursors forward in lockstep and emit matches, exactly like the merge step of merge sort. Because most keys are mostly unique, the merge is roughly M + N: one pass over each side. The total is the cost of the two sorts plus the merge.

Keep this one thing

Sort-merge does not always sort. If an input already arrives sorted on the join key (from a clustered index scan, or an earlier sort, or an upstream ORDER BY) that sort cost is zero and only the M + N merge remains. Sort-merge is also free for the next operator if the query wants the output sorted on the join key anyway, because the merge hands that order on.

The worst case is pathological: if every tuple shares one join value, the merge degrades to M * N because each outer page must rescan the whole inner side for that single value. That is the same quadratic blow-up as nested loop, hiding inside a merge.

Hash join, in memory and on disk

In-memory hash join has two phases. Build: scan the smaller (build) input and insert each tuple into a hash table keyed on the join attribute. Probe: scan the larger (probe) input, hash each tuple, look up matches, and re-check the actual join values because buckets collide. When the build side fits in memory the cost is about M + N, the cheapest of the lot (about 0.45 s on the CMU example).

Exam trap

Two facts that the "which is false" questions live on. First, you build on the smaller relation, not the larger, so the hash table is more likely to fit in memory. Second, hash join works only for equi-joins on the full join key. There is no hash that turns R.a < S.b into bucket equality, so range and inequality joins fall back to nested loop or merge.

When the build side does not fit in memory, in-memory hashing breaks. Grace (partitioned) hash join fixes it. Phase 1, partition: hash both inputs with one hash function h1 into the same set of partitions written to disk, so that R partition i can only match S partition i. Phase 2, probe: for each partition pair, build a hash table on the smaller side and probe with the other. If a single partition still does not fit, recursively repartition it with a different hash function h2.

# grace: read once + write once to partition, then read once to join
partition_cost = 2 * (M + N)     # read both, write both
probe_cost     = (M + N)         # read both partitioned files once
total          = 3 * (M + N)

That is linear in the input size, which is why grace hash at 3*(M+N) crushes naive nested loop at M + m*N: 0.45 seconds versus 1.4 hours on the same tables. PostgreSQL implements the hybrid variant, which keeps one partition resident in memory and probes it immediately instead of spilling it [nodeHashjoin.c].

phase 1: partition both inputs by h1 R S R0..Rk and S0..Sk on disk phase 2: join pair by pair Ri Si build Si, probe Ri
Figure 2. Grace hash join. Phase 1 routes both inputs through h1 so matching keys land in matching partition numbers. Phase 2 only ever joins Ri against Si, never across partitions, so the in-memory join is small even when the whole tables are not.

Simulator: the join algorithm cost playground

Set the table sizes, the buffer frames, and the join selectivity, then read the exact I/O cost of all five algorithms at once. Experiment: drag B up and watch block nested loop and the sort passes fall; make the join selective and watch index nested loop take the lead; push the build side over the memory limit and watch in-memory hash flip to grace and gain the partition pass. The bars are log-scaled wall time so a quadratic loser and a linear winner fit on one chart.

Join cost playground exact I/O formulas from CMU 15-445 L12
Run the CMU example (M=1000, m=100000, N=500, B=100, 0.1 ms/IO) and confirm naive nested loop dwarfs hash by orders of magnitude: about 1.4 hours versus 0.45 s. Then drop B to 3 and watch block nested loop collapse toward naive.

External merge sort: sorting more than fits in memory

Sort-merge join and sort-based aggregation both need to sort, and the input is often larger than memory. You cannot sort in place, so you sort in two parts. Run generation (pass 0): read the input B pages at a time, sort each chunk in memory, write it out as a sorted run. That produces ceil(N / B) sorted runs. Multi-way merge: merge runs B − 1 at a time (one input buffer per run, one output buffer), repeatedly, until one run remains. Each pass cuts the run count by a factor of B − 1.

passes  = 1 + ceil(log_{B-1}(ceil(N / B)))   # the "1+" is run generation
total_IO = 2 * N * passes                    # every pass reads and writes all N pages
Exam trap

External merge sort is not one pass. One pass happens only when the data already fits in B pages. The realistic common case is two passes, which works whenever the number of initial runs is at most B − 1. More buffers cut passes logarithmically because the merge fan-in grows.

External merge sort pass counter watch the formula emerge
Hold N fixed and raise B. The number of merge passes drops in steps, not smoothly, because the fan-in B−1 changes the logarithm base.

Aggregation two ways

A GROUP BY (or DISTINCT) has to bring equal keys together. There are two ways, and they are duals of the two join strategies.

Sort-based versus hash-based aggregation.
PropertySort-basedHash-based
Methodsort on the group key, then break into a new group whenever the key changesone hash-table entry per distinct group, updated in place
State heldone running group at a timeone entry per distinct group
Wins wheninput already sorted on the key, or output must be sortedfew distinct groups relative to input size
Output ordersorted on the group keyno particular order
Spillsvia the external sortoverflow groups to per-partition files (grace-style)
Exam trap

The one people invert: sort-based aggregation output is sorted; hash-based output is in no particular order. If a question says "sort-based aggregation output is unordered" it is describing hash aggregation, and it is false.

Pipeline breakers: where memory bites

Last week the iterator model let tuples stream through filter and projection one at a time. Some operators cannot stream. A sort cannot know the first sorted tuple until it has seen the last input tuple. A hash join's build side must be fully built before any probe can match. Aggregation and DISTINCT cannot emit the first group until the input ends, because the last tuple might belong to that group. These are the pipeline breakers, and they are exactly where a whole input has to live in memory, so they set the query's memory budget and are where spilling to disk begins.

In real systems

PostgreSQL governs that budget with work_mem, "the base maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files," default 4 MB and applied per operator per session, so one query with several sorts and hashes can use many multiples of it [PG resource config]. Hash operations get work_mem * hash_mem_multiplier (default 2.0) because they are "more sensitive to memory availability than equivalent sort-based operations." SQLite goes the other way: it implements joins as nested loops only, with no hash join and no merge join, and makes them fast with indexes and good join ordering [SQLite optimizer overview].

The cost cheat sheet

I/O cost of each join algorithm. M,N are pages; m is tuples in R; B is buffer frames; C is the per-probe index cost.
AlgorithmI/O costWins when
Naive nested loopM + m·Ntiny inner, or no other option
Block nested loopM + ceil(M/(B−2))·Nmore buffers available
Index nested loopM + m·Cindex on inner key, selective join
Sort-mergesort(M) + sort(N) + (M+N)inputs pre-sorted, or output sorted
Grace hash3·(M+N)equi-join, build side over memory
Deeper: why one general operator handles join, aggregation, and DISTINCT

The Volcano paper folds join, semi-join, all three outer joins, anti-join, intersection, union, difference, aggregation, and duplicate elimination into one physical operator it calls one-to-one match (Section III.B.2). The unifying step is the same for all of them: separate the matching and non-matching subsets of two inputs, then emit the subsets you want, possibly transformed. A join emits the matching pairs; a semi-join emits matching R rows once; aggregation treats one input as empty and groups the other. Both the hash and the sort-merge implementations live inside that one module, which is why the same overflow machinery (the packing and spilling thresholds) serves a hash join and a hash aggregation. Graefe even added a "flush" phase to the hash version specifically so aggregation could emit its groups after the build, and the build / probe / flush phases map cleanly onto open (build) and next (probe then flush).

Active recall

Which side do you build the hash table on, and why?
The smaller relation, so the hash table is more likely to fit in memory. You then probe it with the larger relation. Cost is about M + N when the build side fits.
Why is grace hash 3*(M+N) and not 2*(M+N)?
Partitioning reads both inputs and writes both back to disk: 2*(M+N). The probe phase then reads both partitioned files once more: +(M+N). Total 3*(M+N).
When does external merge sort take exactly one pass?
Only when the whole input fits in B pages, so run generation produces a single sorted run and no merge is needed. Otherwise passes = 1 + ceil(log_{B-1}(ceil(N/B))).

Paper viva

This week shares its assigned paper with week 9: Graefe's Volcano. The examiner will probe how the abstract iterator model from last week meets the concrete join and overflow machinery of this one.

Viva prompt 1

Volcano implements join with two algorithm families, hybrid hash and sort-based merge. For the hash version, how do its build, probe, and flush phases map onto the open / next / close iterator interface?

Model answer

In the hash one-to-one match, open performs the build: it consumes the entire first (build) input and inserts it into the in-memory hash table. The probe and flush phases both run inside next. Successive next calls probe the second input against the hash table and emit matches; when the second input is exhausted, next switches automatically to the flush phase, which is Graefe's addition so the same operator can serve aggregation (emit the accumulated groups). Because the entire build input is consumed in open before any output, the build side is a pipeline breaker. This is from Volcano Section III.B.2. The point to make is that even a stateful, memory-hungry operator hides behind the same three-procedure interface, so the operators above it never learn it is a hash join rather than a scan.

Viva prompt 2

Textbooks suggest passing intermediate results between operators through temporary files. Volcano refuses to. Where do intermediate results live instead, and what is the cost argument?

Model answer

Volcano keeps intermediate streams on virtual devices whose pages exist only in the buffer pool and vanish when unpinned (Section III.A, III.B). The cost argument is that routing data through real temporary files carries a substantial penalty and is used in neither real systems nor Volcano. Each record handed between operators is pinned in the buffer and owned by exactly one operator at a time; an operator may hold it, unfix it on a predicate failure, or pass it on. To keep buffer-manager call counts low the interface was redesigned to need two buffer calls per cluster on the producer side and one per cluster on the consumer side, independent of how many records a cluster holds. The connection to this week: this is exactly why a pipelined chain (scan to filter to projection to probe) never touches disk, while a pipeline breaker that overflows memory is the one place data does spill, through the same buffer and file machinery.

Viva prompt 3

Volcano controls hash overflow with two state-record parameters, a packing threshold and a spilling threshold. What do they do, and what does setting both to zero give you?

Model answer

The hash table points directly into buffer-resident records, with no copying. When the item count reaches the packing threshold, items are packed densely into overflow files but not yet written to disk. When it reaches the spilling threshold, the first partition file is unfixed and written to disk and the count is reduced; the cycle repeats and partitioning recurses with adjusted thresholds (Section III.B.2). Setting both thresholds to zero gives Grace-style overflow avoidance: nothing is kept resident, everything is partitioned to disk up front, which is the textbook grace hash join. This is Volcano's "mechanism, not policy" principle in miniature. The operator provides the spilling mechanism, and an optimizer sets the thresholds from estimated input-size distributions, choosing anywhere on the spectrum from fully in-memory hybrid hash to pure grace.

Check yourself

Which statement about hash join is false?

Hash join is equi-join only, on the full join key. There is no hash that turns an inequality into bucket equality, so range and inequality joins fall back to nested loop or merge. The other three are correct.

Block nested loop join costs M + ceil(M/(B−2)) times N. Why does adding buffer frames help?

The outer side gets B−2 frames, so a bigger B means a bigger outer block and ceil(M/(B−2)) fewer rereads of the inner table. The inner is reread once per outer block, not per outer tuple (that is the naive form).

Which statement about external merge sort is false?

A single pass happens only when the data already fits in B pages. In general passes = 1 + ceil(log_{B-1}(ceil(N/B))), and two passes is the common realistic case. The other three are correct descriptions of the algorithm.

Which statement about aggregation is false?

Inverted on purpose. Sort-based aggregation comes out sorted on the group key; it is hash-based aggregation whose output has no order. The other three are accurate.

Which statement about sort-merge join is false?

Sort-merge is an equi-join method in practice; the lockstep merge matches equal keys. Sortedness does not make an inequality join cheap. The pre-sorted input saving, the M + N merge, and the single-value worst case are all real.

A sort node and a hash join's build side are both pipeline breakers. What does that mean?

A pipeline breaker cannot emit until it has seen its entire input: the sort needs the last tuple to know the first sorted one, and a hash build must finish before any probe can match. That is why breakers set the query's memory budget and are where spilling begins.

Primary source

The exact I/O cost formulas behind the playground (naive, block, and index nested loop; sort-merge; grace hash at 3 times (M+N)) and the worked timing example that ranks the algorithms from 1.4 hours down to 0.45 seconds. Focus on the Cost Analysis sections under each join.

Ask your teacher

Want me to walk the grace hash recursion when even a single partition overflows, derive the two-pass sort condition from the formula, or quiz you on which join the optimizer picks for the widened emp JOIN dept query? Ask and I will go deeper or harder. I am your teacher for this course, not just a document.