Lesson 11 · Query optimization: cost, selectivity, and join ordering

By the end you can

Explain why a declarative query forces the engine to search an exponential plan space, and why a fixed rule list is not enough.
Read and apply the System R cost model and selectivity factors, and say where the uniformity and independence assumptions break.
Walk the Selinger dynamic program over relation subsets, the left-deep restriction, and why interesting orders are kept.
Tell a clustered scan from a non-clustered one by cost, and find the selectivity crossover where a full scan beats an index.
Answer the "which is false" traps about optimality, index preference, and ANDed selectivities.

Stay on the one query the course follows down the stack:

SELECT name FROM emp WHERE salary > 75000 ORDER BY name;

The parser and rewriter (week 8) already handed this layer a logical plan: a projection of name over a selection of salary > 75000 over a scan of emp, with a sort for the ORDER BY. A logical plan says what, not how. It does not say whether to read emp with a sequential scan or an index on salary, and it does not say whether the sort can be skipped because some access path already returns rows in name order. This week owns those decisions. The optimizer turns the logical plan into a physical plan, a tree of concrete operators, and hands it to the Volcano executor (week 9), which runs it as a tree of open / next / close iterators. When the optimizer chooses an index scan, it leans on the B+tree (week 6); when it costs a join, it picks among the algorithms from week 10. Every page-count term in its cost model is a request that will eventually go through the buffer pool (week 4) to storage (week 2).

So the optimizer sits in the middle of journey 1. Above it: a logical algebra tree. Below it: an executor that will faithfully run whatever physical tree it is handed, fast plan or catastrophe alike. The executor never second-guesses. That is exactly why this layer carries the whole weight of the decision.

Intuition

Think of the optimizer as a travel planner with no way to test-drive any route. It cannot run a plan to see how slow it is, because running it is the expensive thing you are trying to avoid. So it estimates: it guesses how many rows survive each filter, multiplies those guesses by how many disk pages and CPU calls each operator will cost, adds it all up into one number, and compares numbers. The estimates are often wrong in absolute terms. The bet is that they are right enough in relative terms to rank a good plan above a disastrous one.

Why search at all

Three choices sit inside a query, and each is independent of the others. For every table you pick an access path: a full scan, or a scan of one of the available indexes. For every join you pick a method: nested loop, sort-merge, or hash. And you pick the order in which to join the tables. The choices multiply. For n tables in the FROM list there are already n! join orders, and Selinger names exactly this, "n factorial permutations of relation join orders," in the 1979 System R paper [Selinger 1979, §5]. Multiply by access paths per table and methods per join, and the plan space is far too large to walk.

The plans are not close together in cost. They differ by orders of magnitude. A plan that scans an index and merges might touch a few hundred pages; the same query run as full scans feeding a Cartesian product can touch billions, for the same answer. The single worst mistake is to evaluate a cross product first and filter afterward, so that an operator builds a huge intermediate result that the WHERE clause then throws almost all of away.

Why a fixed rule list is not enough

The naive design is a rule-based optimizer: a fixed priority list of transformations applied without looking at the data. The classic rule is "prefer an index over a full scan, always." It is cheap and predictable, and it is wrong whenever the predicate matches most of the table. Suppose salary > 75000 selects 90 percent of emp. A non-clustered index scan then chases roughly one heap page per matching row, while a full scan reads each page once. The index loses badly, but the rule cannot see that because it never asked how many rows survive. The data, not the query shape, decides. That is the case for a cost-based optimizer that estimates and compares, which is what System R, PostgreSQL, and SQLite all are at their core [PG query planning].

The cost model: one number from I/O and CPU

To compare plans you need a single comparable quantity. System R blends disk work and CPU work into one weighted sum:

# System R cost (Selinger 1979, section 4)
COST = PAGE_FETCHES + W * (RSI_CALLS)

PAGE_FETCHES is the I/O term, the number of pages read. RSI_CALLS is the number of tuples returned across the storage interface, standing in for CPU because "most of System R's CPU time is spent in the RSS" [Selinger 1979, §4]. W is an adjustable weight between I/O and CPU. The unit is artificial. Nobody claims the number is seconds. Only the relative ordering of two plans' costs is meaningful, which is the whole point: the optimizer needs to rank, not to predict wall-clock time.

Exam trap

"Cost is measured in seconds" is false, and so is "the optimizer finds the optimal plan." Cost is a unitless weighted number. And Selinger's own validation found the predicted-optimal plan was "often not accurate in absolute value," yet the ordering of estimated costs usually matched the ordering of measured costs [Selinger 1979, §7]. The optimizer finds a good plan under estimates and reliably dodges disasters. It does not prove optimality.

Single-relation access costs and why clustering matters

The selectivity factor F is the fraction of rows a predicate passes. The access-cost formulas (Selinger Table 2) show how an index changes the I/O term. TCARD is the number of data pages holding the relation, NCARD is its tuple count, NINDX is the number of pages in the index, and P is the fraction of data pages that actually hold this relation's tuples.

How the access path changes the page-fetch term (Selinger Table 2, CPU term omitted for clarity).
Access path	Page-fetch cost	Reads to scale with
Clustered index, matching predicate	`F * (NINDX + TCARD)`	data pages (TCARD)
Non-clustered index, matching predicate	`F * (NINDX + NCARD)`	matching tuples (NCARD)
Full (segment) scan	`TCARD / P`	every data page once

The structural point is the TCARD-versus-NCARD swap. A clustered index stores rows in roughly the index order, so a range scan touches the selected fraction of the data pages in sequence: cost tracks TCARD. A non-clustered index has no control over physical placement, so in the worst case every matching tuple sits on its own page, and the cost tracks the matching tuple count NCARD. Because NCARD is usually much larger than TCARD (many tuples per page), a non-clustered index gets expensive fast as F grows, and at some selectivity the full scan TCARD/P wins. That crossover is the entire reason clustering matters, and it is the death of the "index is always faster" rule.

Figure 1. The non-clustered index (indigo) starts cheap for tiny F but climbs steeply, because cost tracks the matching tuple count. Past the crossover (red), the flat full scan is cheaper. A rule that always picks the index lives on the wrong side of that line for any selective query that still matches a lot of rows.

Estimating selectivity: where the guesses come from

Every cost depends on F, so the optimizer must estimate it per predicate. System R's default selectivity factors (Selinger Table 1) are the canonical constants every later system started from:

System R default selectivity factors (Selinger Table 1).
Predicate	Selectivity F	Fallback when no stats
`col = value`	`1 / distinct_keys`	`1/10`
`col > value` (range)	linear interpolation in the key range	`1/3`
`col BETWEEN a AND b`	`(b - a) / (high - low)`	`1/4`
`p1 AND p2`	`F(p1) * F(p2)`	(multiply)
`p1 OR p2`	`F(p1) + F(p2) - F(p1)*F(p2)`	(inclusion-exclusion)
`NOT p`	`1 - F(p)`	(complement)

Two assumptions are baked in, and the paper names both. Uniformity: F = 1/distinct_keys for equality assumes "an even distribution of tuples among the index key values." Independence: the AND rule multiplies selectivities, which Selinger flags directly with "Note that this assumes that column values are independent" [Selinger 1979, §4]. Both fail on real data. Uniformity fails on skew (a few values dominate). Independence fails on correlation, where a second predicate adds almost no filtering once the first holds, so the product is a large underestimate.

Exam trap

"Under independence, ANDed selectivities are added" is false. They multiply: F(p1 AND p2) = F(p1) * F(p2). It is the OR case that adds (and then subtracts the overlap). Mixing these up is the most common selectivity mistake.

Histograms attack the uniformity failure

To stop pretending the distribution is flat, modern systems store a histogram of the column. Two layouts, and the difference is the whole lesson:

Equi-width: buckets cover equal value ranges; their row counts vary. A skewed column dumps most rows into one bucket, so the estimate inside that dense bucket is exactly where it is worst.
Equi-depth (equi-height): boundaries are chosen so each bucket holds roughly equal row counts; value ranges vary. Dense regions get many narrow buckets, sparse regions get few wide ones. Resolution lands where the data is.

PostgreSQL uses an equi-depth histogram for the smooth tail plus a separate most-common-values (MCV) list for the spikes, and the histogram is built only from values not already captured as MCVs, so the two structures partition the column [PG planner stats]. The docs work a range estimate for unique1 < 1000 over a 10000-row table with histogram bounds {0,993,1997,3050,...}:

selectivity = (1 + (1000 - 993)/(1997 - 993)) / 10 = 0.100697
rows        = 10000 * 0.100697 = 1007

One full bucket below the value, plus the interpolated fraction of the straddled bucket, divided by the bucket count. The first simulator below lets you skew a column and watch equi-width go wrong while equi-depth stays close.

Selectivity and histogram playground skew the column, type a predicate, watch the two histograms estimate

Experiment: drag the skew slider to the right so a few low values dominate, then estimate x < 30. Equi-width misses by a wide margin in the dense region; equi-depth tracks the true count. Then flip the AND toggle and watch the independence assumption underestimate the correlated pair.

In real systems

PostgreSQL keeps per-column stats in pg_statistic, readable through pg_stats: null_frac, n_distinct, most_common_vals, histogram_bounds, and correlation. They are filled by ANALYZE, which samples rather than reading the whole table, so they are "always approximate even when freshly updated" [PG planner stats]. The encoding of n_distinct is a trap of its own: a positive value is a literal count, a negative value is a fraction of table size, and n_distinct = -1 means every value is distinct (so the estimate survives table growth). To fight the independence failure, CREATE STATISTICS stores multivariate stats so correlated predicates are not naively multiplied.

Join ordering: the Selinger dynamic program

Join order is where the plan space explodes, and it is where System R's central idea lives. First, a restriction on the shape of the tree.

Figure 2. In a left-deep tree every join's inner (right) input is a base table, so one growing composite stays in flight and can be pipelined. A bushy tree lets a join feed off two intermediate results, exposing more (sometimes cheaper) plans at the cost of a larger search space and materialized intermediates. System R searches only left-deep.

The engine of the search is dynamic programming over subsets of relations. It rests on a substructure property the paper states plainly: "once the first k relations are joined, the method to join the composite to the k+1-st relation is independent of the order of joining the first k" [Selinger 1979, §5]. So the cheapest way to produce a given set of relations can be computed once and reused as a building block. Build bottom up: cheapest access path for each single relation, then cheapest join for each pair built from singletons, then each triple built from pairs, and so on to the full set.

For each single relation, find the cheapest access path (one per interesting order, plus the cheapest unordered path).
For each pair, find the cheapest join, reusing the single-relation solutions.
For each three-relation set, extend the best two-relation plans by joining one more relation.
At the top, pick the cheapest full-set plan that delivers any required output order.

The table holds at most 2^n subsets times the number of interesting orders, so the size is exponential in n but the constant is small. Selinger reports eight-table joins optimized in a few seconds on 1970s hardware. Two pruning rules shrink it further. Cartesian products are pushed late: the search only extends a composite with a relation that has a join predicate to something already in it, so for predicates on T1-T2 and T2-T3 the order T1-T3-T2 is never considered, since it would force a T1×T3 cross product. Dominated plans are discarded: keep only the cheapest plan per subset, except that you also keep the cheapest plan per interesting order.

Interesting orders: paying now to save a sort later

A sort order is "interesting" if a later operator could exploit it: the query's ORDER BY or GROUP BY columns, and every join column. The DP normally keeps one cheapest plan per subset, but it must also keep the cheapest plan for each interesting order, even when that plan is not the globally cheapest. Why pay for a pricier scan now? Because a plan that already produces tuples sorted on a join column lets a downstream sort-merge join skip its own sort, which can more than repay the cost. This bookkeeping is what makes System R more than a naive least-cost search, and it is the contribution the paper itself highlights. Recall from week 10 that sort-merge join is cheap when its inputs arrive sorted and expensive when it has to sort them, so an interesting order is exactly the lever that flips which join algorithm wins.

Selinger DP join-order visualizer fill the subset table level by level, prune cross products, toggle interesting orders

Experiment: step through the levels and watch which pairs are struck out because they have no join predicate (a forced Cartesian product). Then toggle interesting orders and watch the final chosen plan change even though the cardinalities did not, because a pre-sorted input now lets the last join skip a sort.

Keep this one thing

The optimizer never runs a plan to score it. It estimates how many rows each operator emits (selectivity), turns that into a single comparable cost number, and uses dynamic programming to build the cheapest join order out of cheapest sub-plans. It picks a good plan under guesses, not a provably optimal one.

Heuristic rewrites sit above the search

Some transformations are almost always good, so the engine applies them unconditionally as rewrites rather than costing them as alternatives. Predicate pushdown moves a WHERE filter as close to the base scan as possible, so rows are discarded before they enter a join. This is the highest-value rewrite, because it shrinks every input downstream. Projection pushdown drops columns that nothing downstream needs, so intermediate tuples are narrower and more fit per page. Join elimination removes a join that cannot change the result, such as a foreign-key lookup whose columns are unused and whose key is guaranteed present. The week 8 rewriter already did some of this; the rest happens here, and it interacts with cost only through the smaller inputs it produces.

Exam trap

"Predicate pushdown is a cost-based decision" is false. Pushdown, projection pushdown, and join elimination are heuristic rewrites applied unconditionally because they are essentially always beneficial. They are not costed against an alternative. The cost-based machinery is for access paths, join methods, and join order.

How the two real engines diverge

Same cost-based core, very different join-search strategies.
Aspect	PostgreSQL	SQLite
Join search	near-exhaustive DP below `geqo_threshold`	polynomial graph planner (NGQP) always
Large joins	genetic optimizer (GEQO) at 12+ FROM items	same polynomial planner, 50+ way joins in microseconds
Join methods	nested loop, sort-merge, hash	nested loop only, left-deep
Statistics	`pg_statistic` via `ANALYZE` (MCV + histogram)	`sqlite_stat1`, `sqlite_stat4` histograms
Inspect a plan	`EXPLAIN`, `EXPLAIN ANALYZE`	`EXPLAIN QUERY PLAN`

In real systems

PostgreSQL runs the Selinger-style dynamic program only while the FROM list has fewer than geqo_threshold items (default 12); at or above that, the genetic query optimizer takes over, encoding a join order as an integer-string chromosome, framing the search as a Traveling-Salesman-style problem, and breeding better orders with the standard cost model as fitness [PG GEQO]. SQLite never does exponential search at all: its Next Generation Query Planner is "an efficient polynomial-time graph algorithm" that plans 50- to 60-way joins in microseconds, accepting that it may miss the optimal order [SQLite optimizer overview]. The fastest way to catch a bad estimate in PostgreSQL is EXPLAIN ANALYZE: a large gap between estimated and actual rows is the tell that a selectivity guess went wrong.

Deeper: PostgreSQL's worked estimate for two ANDed predicates

Over the 10000-row tenk1 table, the docs estimate unique1 < 1000 AND stringu1 = 'xxx'. The range part comes from the equi-depth histogram, selectivity 0.100697 (the worked example earlier). The equality part is a non-MCV value, so its probability is the leftover mass spread over the leftover distinct values: (1 - sum(mcv_freqs)) / (num_distinct - num_mcv) = (1 - 0.03033)/(676 - 10) = 0.0014559. Then independence multiplies: 0.100697 * 0.0014559 = 0.0001466, so rows = 10000 * 0.0001466 = 1 [PG row estimation examples]. If unique1 and stringu1 were correlated, that product would be a large underestimate, which is exactly the case CREATE STATISTICS exists to fix.

Check yourself

Which statement about the optimizer's goal is correct?

It never runs candidate plans (running is the cost it avoids), and it cannot prove optimality under guessed statistics. Selinger's validation found absolute predictions inaccurate but the ranking usually right. Cost is a unitless weighted number, not seconds.

Which of these is false?

A non-clustered index chases roughly one page per matching tuple, so for low selectivity (many rows matched) its cost F*(NINDX+NCARD) overtakes the full scan TCARD/P. The optimizer compares costs and switches at the crossover; the always-index rule is exactly the design that gets this wrong.

Under the independence assumption, how is the selectivity of p1 AND p2 computed?

ANDed selectivities multiply under independence. The inclusion-exclusion form F1 + F2 - F1*F2 is for OR, not AND. On correlated columns the product becomes a large underestimate, which is the failure that motivates extended statistics.

Which statement about the System R join search is false?

System R excludes bushy trees; it searches only left-deep plans to keep one pipelined composite in flight. The other three statements are exactly what the dynamic program does. Some modern systems add bushy plans, but the classic optimizer does not.

Why does the optimizer keep a plan for every interesting order, even a non-cheapest one?

An interesting order (ORDER BY, GROUP BY, or any join column) can be reused downstream. Carrying a slightly pricier scan that delivers sorted output can repay itself when a sort-merge join then avoids sorting. Pushdown is a separate, unconditional rewrite, not the reason here.

Which statement about histograms and statistics is false?

n_distinct = -1 means the column is fully distinct: one distinct value per row. The negative encoding expresses the count as a fraction of table size so it stays valid as the table grows. The other three describe equi-depth, equi-width, and the MCV-plus-histogram split correctly.

Why does a non-clustered index get expensive faster than a clustered one as selectivity grows?

Its cost tracks the matching tuple count (NCARD), because matching rows can be one per page. A clustered index tracks data pages (TCARD), since rows sit in index order. NCARD is usually much larger than TCARD, so the non-clustered cost climbs steeply and eventually loses to the full scan.

State the substructure property that makes the Selinger join search a dynamic program.

Once the first k relations are joined, the cheapest way to add the (k+1)-th relation is independent of the order in which the first k were joined. So the best plan for each relation subset can be computed once and reused as a building block for larger subsets.

What two assumptions in System R's default selectivity factors fail on real data, and how?

Uniformity (equality selectivity = 1/distinct) fails on skew, where a few values dominate. Independence (ANDed selectivities multiply) fails on correlation, where the product badly underestimates because the second predicate barely filters once the first holds.

Primary source

Read this next: Selinger et al., Access Path Selection in a Relational Database Management System (SIGMOD 1979)

The System R optimizer, and the template every cost-based optimizer still follows. Focus on Section 4 (the COST = PAGE FETCHES + W * RSI CALLS model and the Table 1 selectivity constants 1/10, 1/3, 1/4) and Section 5 (the dynamic-programming join search, the left-deep restriction, and interesting orders).

Ask your teacher

Want to see the cost crossover worked out with real TCARD and NCARD numbers, or trace the DP on a five-table query by hand? Ask me to push harder on selectivity arithmetic, to draw the lattice differently, or to quiz you on the "which is false" traps until they are automatic. I am your teacher for this course, not just a document.