Query optimization: cost, selectivity, and join ordering
A query has thousands of equivalent plans, and the slowest is often millions of times slower than the fastest. They all return the same rows. This lesson is about how the engine looks at that pile and reaches for a good one without ever running them.
By the end you can
- Explain why a declarative query forces the engine to search an exponential plan space, and why a fixed rule list is not enough.
- Read and apply the System R cost model and selectivity factors, and say where the uniformity and independence assumptions break.
- Walk the Selinger dynamic program over relation subsets, the left-deep restriction, and why interesting orders are kept.
- Tell a clustered scan from a non-clustered one by cost, and find the selectivity crossover where a full scan beats an index.
- Answer the "which is false" traps about optimality, index preference, and ANDed selectivities.
Stay on the one query the course follows down the stack:
SELECT name FROM emp WHERE salary > 75000 ORDER BY name;
The parser and rewriter (week 8) already handed this layer a
logical plan: a projection of name over a selection of salary > 75000 over a
scan of emp, with a sort for the ORDER BY. A logical plan says what, not
how. It does not say whether to read emp with a sequential scan or an index on
salary, and it does not say whether the sort can be skipped because some access path already
returns rows in name order. This week owns those decisions. The optimizer turns the logical
plan into a physical plan, a tree of concrete operators, and hands it to the
Volcano executor (week 9), which runs it as a tree of
open / next / close iterators. When the optimizer chooses an index scan, it leans on the
B+tree (week 6); when it costs a join, it picks among the
algorithms from week 10. Every page-count term in its cost
model is a request that will eventually go through the buffer pool
(week 4) to storage (week 2).
So the optimizer sits in the middle of journey 1. Above it: a logical algebra tree. Below it: an executor that will faithfully run whatever physical tree it is handed, fast plan or catastrophe alike. The executor never second-guesses. That is exactly why this layer carries the whole weight of the decision.
Think of the optimizer as a travel planner with no way to test-drive any route. It cannot run a plan to see how slow it is, because running it is the expensive thing you are trying to avoid. So it estimates: it guesses how many rows survive each filter, multiplies those guesses by how many disk pages and CPU calls each operator will cost, adds it all up into one number, and compares numbers. The estimates are often wrong in absolute terms. The bet is that they are right enough in relative terms to rank a good plan above a disastrous one.
Why search at all
Three choices sit inside a query, and each is independent of the others. For every table you pick an
access path: a full scan, or a scan of one of the available indexes. For every join you pick a method:
nested loop, sort-merge, or hash. And you pick the order in which to join the tables. The choices
multiply. For n tables in the FROM list there are already n! join
orders, and Selinger names exactly this, "n factorial permutations of relation join orders," in the 1979
System R paper [Selinger 1979, §5].
Multiply by access paths per table and methods per join, and the plan space is far too large to walk.
The plans are not close together in cost. They differ by orders of magnitude. A plan that scans an
index and merges might touch a few hundred pages; the same query run as full scans feeding a Cartesian
product can touch billions, for the same answer. The single worst mistake is to evaluate a cross product
first and filter afterward, so that an operator builds a huge intermediate result that the
WHERE clause then throws almost all of away.
The naive design is a rule-based optimizer: a fixed priority list of transformations applied without
looking at the data. The classic rule is "prefer an index over a full scan, always." It is cheap and
predictable, and it is wrong whenever the predicate matches most of the table. Suppose
salary > 75000 selects 90 percent of emp. A non-clustered index scan then
chases roughly one heap page per matching row, while a full scan reads each page once. The index loses
badly, but the rule cannot see that because it never asked how many rows survive. The data, not the
query shape, decides. That is the case for a cost-based optimizer that estimates and compares,
which is what System R, PostgreSQL, and SQLite all are at their core
[PG query planning].
The cost model: one number from I/O and CPU
To compare plans you need a single comparable quantity. System R blends disk work and CPU work into one weighted sum:
# System R cost (Selinger 1979, section 4)
COST = PAGE_FETCHES + W * (RSI_CALLS)
PAGE_FETCHES is the I/O term, the number of pages read. RSI_CALLS is the
number of tuples returned across the storage interface, standing in for CPU because "most of System R's
CPU time is spent in the RSS"
[Selinger 1979, §4].
W is an adjustable weight between I/O and CPU. The unit is artificial. Nobody claims the
number is seconds. Only the relative ordering of two plans' costs is meaningful, which is the whole point:
the optimizer needs to rank, not to predict wall-clock time.
"Cost is measured in seconds" is false, and so is "the optimizer finds the optimal plan." Cost is a unitless weighted number. And Selinger's own validation found the predicted-optimal plan was "often not accurate in absolute value," yet the ordering of estimated costs usually matched the ordering of measured costs [Selinger 1979, §7]. The optimizer finds a good plan under estimates and reliably dodges disasters. It does not prove optimality.
Single-relation access costs and why clustering matters
The selectivity factor F is the fraction of rows a predicate passes. The access-cost
formulas (Selinger Table 2) show how an index changes the I/O term. TCARD is the number
of data pages holding the relation, NCARD is its tuple count, NINDX is the
number of pages in the index, and P is the fraction of data pages that actually hold this
relation's tuples.
| Access path | Page-fetch cost | Reads to scale with |
|---|---|---|
| Clustered index, matching predicate | F * (NINDX + TCARD) | data pages (TCARD) |
| Non-clustered index, matching predicate | F * (NINDX + NCARD) | matching tuples (NCARD) |
| Full (segment) scan | TCARD / P | every data page once |
The structural point is the TCARD-versus-NCARD swap. A clustered index stores
rows in roughly the index order, so a range scan touches the selected fraction of the data pages
in sequence: cost tracks TCARD. A non-clustered index has no control over physical placement,
so in the worst case every matching tuple sits on its own page, and the cost tracks the matching
tuple count NCARD. Because NCARD is usually much larger than
TCARD (many tuples per page), a non-clustered index gets expensive fast as F
grows, and at some selectivity the full scan TCARD/P wins. That crossover is the entire
reason clustering matters, and it is the death of the "index is always faster" rule.
F but
climbs steeply, because cost tracks the matching tuple count. Past the crossover (red), the flat full
scan is cheaper. A rule that always picks the index lives on the wrong side of that line for any selective
query that still matches a lot of rows.Estimating selectivity: where the guesses come from
Every cost depends on F, so the optimizer must estimate it per predicate. System R's
default selectivity factors (Selinger Table 1) are the canonical constants every later system started
from:
| Predicate | Selectivity F | Fallback when no stats |
|---|---|---|
col = value | 1 / distinct_keys | 1/10 |
col > value (range) | linear interpolation in the key range | 1/3 |
col BETWEEN a AND b | (b - a) / (high - low) | 1/4 |
p1 AND p2 | F(p1) * F(p2) | (multiply) |
p1 OR p2 | F(p1) + F(p2) - F(p1)*F(p2) | (inclusion-exclusion) |
NOT p | 1 - F(p) | (complement) |
Two assumptions are baked in, and the paper names both. Uniformity: F = 1/distinct_keys
for equality assumes "an even distribution of tuples among the index key values." Independence: the
AND rule multiplies selectivities, which Selinger flags directly with "Note that this assumes
that column values are independent"
[Selinger 1979, §4].
Both fail on real data. Uniformity fails on skew (a few values dominate). Independence fails on
correlation, where a second predicate adds almost no filtering once the first holds, so the product is a
large underestimate.
"Under independence, ANDed selectivities are added" is false. They multiply:
F(p1 AND p2) = F(p1) * F(p2). It is the OR case that adds (and then subtracts
the overlap). Mixing these up is the most common selectivity mistake.
Histograms attack the uniformity failure
To stop pretending the distribution is flat, modern systems store a histogram of the column. Two layouts, and the difference is the whole lesson:
- Equi-width: buckets cover equal value ranges; their row counts vary. A skewed column dumps most rows into one bucket, so the estimate inside that dense bucket is exactly where it is worst.
- Equi-depth (equi-height): boundaries are chosen so each bucket holds roughly equal row counts; value ranges vary. Dense regions get many narrow buckets, sparse regions get few wide ones. Resolution lands where the data is.
PostgreSQL uses an equi-depth histogram for the smooth tail plus a separate most-common-values (MCV)
list for the spikes, and the histogram is built only from values not already captured as MCVs, so the two
structures partition the column
[PG planner stats]. The docs work
a range estimate for unique1 < 1000 over a 10000-row table with histogram bounds
{0,993,1997,3050,...}:
selectivity = (1 + (1000 - 993)/(1997 - 993)) / 10 = 0.100697
rows = 10000 * 0.100697 = 1007
One full bucket below the value, plus the interpolated fraction of the straddled bucket, divided by the bucket count. The first simulator below lets you skew a column and watch equi-width go wrong while equi-depth stays close.
x < 30. Equi-width misses by a wide margin in the dense
region; equi-depth tracks the true count. Then flip the AND toggle and watch the independence assumption
underestimate the correlated pair.PostgreSQL keeps per-column stats in pg_statistic, readable through pg_stats:
null_frac, n_distinct, most_common_vals,
histogram_bounds, and correlation. They are filled by ANALYZE,
which samples rather than reading the whole table, so they are "always approximate even when freshly
updated"
[PG planner stats]. The encoding
of n_distinct is a trap of its own: a positive value is a literal count, a negative value is
a fraction of table size, and n_distinct = -1 means every value is distinct (so the estimate
survives table growth). To fight the independence failure, CREATE STATISTICS stores
multivariate stats so correlated predicates are not naively multiplied.
Join ordering: the Selinger dynamic program
Join order is where the plan space explodes, and it is where System R's central idea lives. First, a restriction on the shape of the tree.
The engine of the search is dynamic programming over subsets of relations. It rests on a substructure property the paper states plainly: "once the first k relations are joined, the method to join the composite to the k+1-st relation is independent of the order of joining the first k" [Selinger 1979, §5]. So the cheapest way to produce a given set of relations can be computed once and reused as a building block. Build bottom up: cheapest access path for each single relation, then cheapest join for each pair built from singletons, then each triple built from pairs, and so on to the full set.
- For each single relation, find the cheapest access path (one per interesting order, plus the cheapest unordered path).
- For each pair, find the cheapest join, reusing the single-relation solutions.
- For each three-relation set, extend the best two-relation plans by joining one more relation.
- At the top, pick the cheapest full-set plan that delivers any required output order.
The table holds at most 2^n subsets times the number of interesting orders, so the size is
exponential in n but the constant is small. Selinger reports eight-table joins optimized in a
few seconds on 1970s hardware. Two pruning rules shrink it further. Cartesian products are pushed
late: the search only extends a composite with a relation that has a join predicate to something
already in it, so for predicates on T1-T2 and T2-T3 the order T1-T3-T2 is never considered, since it would
force a T1×T3 cross product. Dominated plans are discarded: keep only the cheapest plan per
subset, except that you also keep the cheapest plan per interesting order.
Interesting orders: paying now to save a sort later
A sort order is "interesting" if a later operator could exploit it: the query's ORDER BY or
GROUP BY columns, and every join column. The DP normally keeps one cheapest
plan per subset, but it must also keep the cheapest plan for each interesting order, even when that plan is
not the globally cheapest. Why pay for a pricier scan now? Because a plan that already produces tuples
sorted on a join column lets a downstream sort-merge join skip its own sort, which can more than repay the
cost. This bookkeeping is what makes System R more than a naive least-cost search, and it is the
contribution the paper itself highlights. Recall from week 10
that sort-merge join is cheap when its inputs arrive sorted and expensive when it has to sort them, so an
interesting order is exactly the lever that flips which join algorithm wins.
The optimizer never runs a plan to score it. It estimates how many rows each operator emits (selectivity), turns that into a single comparable cost number, and uses dynamic programming to build the cheapest join order out of cheapest sub-plans. It picks a good plan under guesses, not a provably optimal one.
Heuristic rewrites sit above the search
Some transformations are almost always good, so the engine applies them unconditionally as rewrites
rather than costing them as alternatives. Predicate pushdown moves a WHERE filter as
close to the base scan as possible, so rows are discarded before they enter a join. This is the highest-value
rewrite, because it shrinks every input downstream. Projection pushdown drops columns that nothing
downstream needs, so intermediate tuples are narrower and more fit per page. Join elimination removes
a join that cannot change the result, such as a foreign-key lookup whose columns are unused and whose key
is guaranteed present. The week 8 rewriter already did some of this; the rest happens here, and it interacts
with cost only through the smaller inputs it produces.
"Predicate pushdown is a cost-based decision" is false. Pushdown, projection pushdown, and join elimination are heuristic rewrites applied unconditionally because they are essentially always beneficial. They are not costed against an alternative. The cost-based machinery is for access paths, join methods, and join order.
How the two real engines diverge
| Aspect | PostgreSQL | SQLite |
|---|---|---|
| Join search | near-exhaustive DP below geqo_threshold | polynomial graph planner (NGQP) always |
| Large joins | genetic optimizer (GEQO) at 12+ FROM items | same polynomial planner, 50+ way joins in microseconds |
| Join methods | nested loop, sort-merge, hash | nested loop only, left-deep |
| Statistics | pg_statistic via ANALYZE (MCV + histogram) | sqlite_stat1, sqlite_stat4 histograms |
| Inspect a plan | EXPLAIN, EXPLAIN ANALYZE | EXPLAIN QUERY PLAN |
PostgreSQL runs the Selinger-style dynamic program only while the FROM list has fewer than
geqo_threshold items (default 12); at or above that, the genetic query optimizer takes over,
encoding a join order as an integer-string chromosome, framing the search as a Traveling-Salesman-style
problem, and breeding better orders with the standard cost model as fitness
[PG GEQO]. SQLite never does
exponential search at all: its Next Generation Query Planner is "an efficient polynomial-time graph
algorithm" that plans 50- to 60-way joins in microseconds, accepting that it may miss the optimal order
[SQLite optimizer overview]. The fastest way to
catch a bad estimate in PostgreSQL is EXPLAIN ANALYZE: a large gap between estimated and
actual rows is the tell that a selectivity guess went wrong.
Deeper: PostgreSQL's worked estimate for two ANDed predicates
Over the 10000-row tenk1 table, the docs estimate
unique1 < 1000 AND stringu1 = 'xxx'. The range part comes from the equi-depth
histogram, selectivity 0.100697 (the worked example earlier). The equality part is a
non-MCV value, so its probability is the leftover mass spread over the leftover distinct values:
(1 - sum(mcv_freqs)) / (num_distinct - num_mcv) = (1 - 0.03033)/(676 - 10) = 0.0014559.
Then independence multiplies: 0.100697 * 0.0014559 = 0.0001466, so
rows = 10000 * 0.0001466 = 1
[PG row estimation
examples]. If unique1 and stringu1 were correlated, that product would be
a large underestimate, which is exactly the case CREATE STATISTICS exists to fix.
Check yourself
Which statement about the optimizer's goal is correct?
It never runs candidate plans (running is the cost it avoids), and it cannot prove optimality under guessed statistics. Selinger's validation found absolute predictions inaccurate but the ranking usually right. Cost is a unitless weighted number, not seconds.
Which of these is false?
A non-clustered index chases roughly one page per matching tuple, so for low
selectivity (many rows matched) its cost F*(NINDX+NCARD) overtakes the full scan
TCARD/P. The optimizer compares costs and switches at the crossover; the always-index rule
is exactly the design that gets this wrong.
Under the independence assumption, how is the selectivity of p1 AND p2 computed?
ANDed selectivities multiply under independence. The inclusion-exclusion form
F1 + F2 - F1*F2 is for OR, not AND. On correlated columns the
product becomes a large underestimate, which is the failure that motivates extended statistics.
Which statement about the System R join search is false?
System R excludes bushy trees; it searches only left-deep plans to keep one pipelined composite in flight. The other three statements are exactly what the dynamic program does. Some modern systems add bushy plans, but the classic optimizer does not.
Why does the optimizer keep a plan for every interesting order, even a non-cheapest one?
An interesting order (ORDER BY, GROUP BY, or any join column) can be reused downstream. Carrying a slightly pricier scan that delivers sorted output can repay itself when a sort-merge join then avoids sorting. Pushdown is a separate, unconditional rewrite, not the reason here.
Which statement about histograms and statistics is false?
n_distinct = -1 means the column is fully distinct: one distinct value
per row. The negative encoding expresses the count as a fraction of table size so it stays valid as the
table grows. The other three describe equi-depth, equi-width, and the MCV-plus-histogram split
correctly.
The System R optimizer, and the template every cost-based optimizer still follows. Focus
on Section 4 (the COST = PAGE FETCHES + W * RSI CALLS model and the Table 1 selectivity
constants 1/10, 1/3, 1/4) and Section 5 (the dynamic-programming
join search, the left-deep restriction, and interesting orders).
Want to see the cost crossover worked out with real TCARD and NCARD numbers,
or trace the DP on a five-table query by hand? Ask me to push harder on selectivity arithmetic, to draw
the lattice differently, or to quiz you on the "which is false" traps until they are automatic. I am your
teacher for this course, not just a document.