Query execution: the Volcano iterator model
The optimizer just handed you a tree of operators. How does that tree turn into rows on the wire without writing a temporary table at every step, and without each operator knowing what feeds it?
By the end you can
- Explain why materializing each operator's full output is correct but wasteful, and what the iterator model trades it for.
- Walk the open / next / close protocol on a real plan tree and say who pulls whom.
- State why one uniform interface buys composability, bounded memory, and pipelining, and what the per-tuple call costs.
- Compute the I/O cost of naive, block, and index nested loop joins and say when an index scan loses to a sequential scan.
- Answer the Volcano paper's viva questions on state records, support functions, and anonymous streams.
Keep the course's running query in view:
SELECT name FROM emp WHERE salary > 75000 ORDER BY name;
By the time it reaches this lesson the string has already been turned into a tree. Week 8 lexed, parsed, and bound it into a typed query tree and a logical plan: projection(name) over selection(salary > 75000) over scan(emp), with a sort for the ORDER BY. Week 11 will turn that logical plan into a physical plan, choosing a sequential scan or an index scan on salary and deciding whether the sort can be skipped. This week sits between them in the pipeline but is taught first, because you cannot reason about which plan is cheapest until you know how a plan actually runs.
So the layer above (the optimizer) hands this layer a finished physical plan: a tree of concrete operators. This layer hands down a stream of page requests. When the scan at the bottom of the tree needs data, it asks the access method (weeks 6 and 7) for tuples, which asks the buffer pool (weeks 4 and 5) for pages by id, which asks the storage manager (weeks 2 and 3) to read bytes off disk. The whole stack is the spine from week 1. Our job here is the middle of journey 1: the executor.
Picture a bucket brigade, but reversed. Instead of the bottom of the line passing buckets up whether the top wants them or not, the person at the top says "next" and that request travels all the way down to the well. One bucket of water comes back up the line, hand to hand. Nobody fills a reservoir. Nobody moves until asked. That single-bucket, on-demand discipline is the iterator model.
The problem: do not build a table you were never asked for
The executor receives a tree of physical operators and must produce tuples. The most obvious way to run a tree is to evaluate it bottom up. Run the scan, write all of its output to a temporary table. Run the filter over that table, write its output to another temporary table. Run the projection, write a third. Hand the last table to the client. This is the materialization model, and it is perfectly correct.
It has two costs that sink it. The first is memory and I/O. Each operator writes its whole output and the next operator rereads it. A five-operator plan over a billion-row table writes and rereads a billion rows four times to produce results nobody asked to keep. The second cost is uniformity. If the join operator hard-codes the fact that its input is a temporary file on disk, you cannot feed it the output of an index scan or another join without rewriting it. Operators stop composing.
The constraint: the executor must compose arbitrary operators and must not pay for intermediate results the query never requested. The naive fix (materialize each step) breaks both. So invert the control flow. Instead of each operator producing its whole result and pushing it, let each operator expose one tiny interface that produces exactly one tuple when asked, and pulls from its own input one tuple at a time. Then a tuple can travel from scan to filter to projection without ever touching disk, and any operator can sit on top of any other because they all speak the same three-call protocol. That protocol is open / next / close, the Volcano iterator interface [Graefe 1994].
The mechanism: open, next, close
In the iterator model every operator, leaf or internal, implements the same three procedures. Graefe frames them as the four parts of a loop over a result set: initialization, the increment, the termination test, and the final housekeeping [Graefe 1994, Section III.B].
open(): set up state, recursively open the inputs, allocate any buffers or hash tables this operator needs.next(): return the next output tuple, or an end-of-stream marker when there are no more. Callnext()on an input only when more input is needed.close(): release state and recursively close the inputs.
Execution is demand-driven and top-down by control, bottom-up by data. The consumer at the root
calls next(). That call recurses down to the leaves. One tuple bubbles back up per call.
The PostgreSQL executor states the same contract: each node, when called, produces the next tuple in
its output sequence or NULL if none remain, and a non-leaf node calls its child nodes in turn to obtain
input [PostgreSQL
executor README]. Parents pull. Children never push.
next() at the root pulls exactly one tuple from the leaf. Notice the
two arrows are the same edges traversed in opposite senses: the recursion descends, then unwinds with
a tuple.Two design choices in that picture deserve a name. First, an operator never learns what kind of
operator feeds it. It just calls next() on an input pointer. Graefe calls these
anonymous inputs or streams, and they are the reason any operator nests inside any
other [Graefe 1994, Section III.B]. Second, all of an
iterator's per-instance data lives in a state record, with no static variables. That invariant
is what lets the same operator code (say, sort) appear several times in one plan: each occurrence gets
its own state record [Graefe 1994, Section III.B].
The executor is a tree of iterators that all speak open / next / close. The root pulls one tuple at a time and the request recurses to the leaves. Pull, not push. One tuple, not a table.
Watch one tuple bubble up
The simulator below runs our query's plan: SeqScan(emp) feeding Filter(salary > 75000) feeding
Projection(name). Click Step (or Play) to issue one next() at the root and watch it travel
down and a tuple travel back up. Each operator shows the tuple it currently holds, and a counter tallies
how many next() calls each operator has fielded. Filter and Projection never hold more than
one tuple, which is the whole point.
It is tempting to say the Volcano model pushes tuples from producers up to consumers. It does not.
Volcano is demand-driven: the consumer calls next() and the producer responds. Push is
the opposite control flow and is the alternative model (the basis of query compilation), not Volcano
[Graefe 1994].
One interface, many operators
The reason a handful of operators cover so much SQL is that Volcano splits each operator into an algorithm shell and its support functions. The shell is the open / next / close loop. The support functions do all the item-level work: evaluate a predicate, hash a key, compare two records, compute a projected column. The shell is empty without them; a support function is just a function pointer plus a typeless argument, and because all interpretation of data items is imported this way, Volcano needs no type system for instances and is data-model independent [Graefe 1994, Sections III.A, IV].
That split is why three of our operators are really one. Volcano's filter operator carries
three optional support functions: a predicate (selection), a transform (projection), and an apply
function called once per record for side effects (updates, printing). So Filter and
Project in the course are the same physical operator with a different support function
plugged in [Graefe 1994, Section III.B.1]. The
SeqScan is the file-scan iterator. And the entire join family (join, semi-join, outer
joins, intersection, union, difference, aggregation, duplicate elimination) is one operator,
one-to-one match, because they all reduce to the same step: separate the matching from the
non-matching subsets of two inputs and emit a selected, possibly transformed subset
[Graefe 1994, Section III.B.2].
# the iterator contract, as pseudocode. Every operator looks like this.
class Filter(Iterator):
def open(self):
self.child.open() # recurse down
def next(self):
while True:
row = self.child.next() # pull one tuple from the anonymous input
if row is EOF:
return EOF
if self.predicate(row): # support function, imported
return row # pass it up; O(1) state, one row held
# else drop it and loop: pull again
def close(self):
self.child.close() # recurse down
Three properties fall out of pull, and they are the reason every textbook executor uses it.
Composability: a join's next() just calls its children's next(), so it does
not care what the child is. Bounded memory: Filter, Projection, and the probe side of a join hold one
tuple at a time, not the whole input, so a streaming plan runs in tiny memory. Pipelining: a tuple can
travel scan to filter to projection without ever being written to disk
[PG planner]. The cost
is one function call per operator per tuple. For a billion tuples that is billions of virtual calls,
which is exactly why modern analytic engines move to vectorized or compiled push execution. You will
meet those in week 15.
Where the pipeline has to stop
Not every operator can stream. A pipeline breaker must consume its entire input before it
can emit a single output tuple. Our running query has one: the ORDER BY name sort. Sort
cannot know which tuple is smallest until it has seen the last input tuple, so it pulls every tuple from
below, buffers them (spilling to disk via external merge sort if they exceed memory), and only then
starts returning them in order. The build side of a hash join is the other classic breaker: the hash
table must be fully built before any probe can match
[CMU 15-445 joins].
Breakers are where a query's memory budget is set, because that is where a whole input (or a hash
table, or sorted runs) must live. In PostgreSQL that budget is work_mem, the per-operation
memory before a sort or hash spills to temporary disk files
[PG resource config].
External sort, the hash build, and grace partitioning are the subject of
week 10; here just hold the distinction: Filter and
Projection are streaming with O(1) state, Sort and the hash build are breakers.
| Operator | State held | Streams or breaks |
|---|---|---|
| SeqScan, IndexScan | one cursor | streams |
| Filter, Projection | one tuple | streams |
| Index nested loop, merge step | one tuple per side | streams |
| Sort (ORDER BY) | whole input or sorted runs | breaks |
| Hash join build side | whole hash table | breaks |
| Aggregation, DISTINCT | one entry per group, or whole sort | breaks |
The cost of the simplest join
Our running query has no join, but most queries do, and the join is where the iterator model first meets real I/O cost. A join of R and S is conceptually the Cartesian product filtered by a predicate. Done literally that is |R| times |S| comparisons. The nested loop join is the most direct way to run it as iterators, and its three variants show how much the same logical join can cost depending on the physical choice.
Let R (the outer) have M pages holding m tuples, and S (the inner) have N pages. The naive nested loop rereads all of S once per outer tuple:
naive nested loop: M + (m × N)
The m × N term is the killer. The block nested loop rereads S
once per outer block instead of per tuple, giving the outer scan B - 2 buffer frames (one for an inner
page, one for output):
block nested loop: M + (ceil(M / (B - 2)) × N)
The index nested loop replaces the inner scan with an index probe at roughly constant cost C per outer tuple:
index nested loop: M + (m × C)
On the standard CMU example (M = 1000, m = 100000, N = 500, B = 100, 0.1 ms per I/O) the naive form is about 1.4 hours, the block form about 6.5 seconds, and the right index turns it into a few hundred milliseconds [CMU 15-445 joins]. Same logical join, four orders of magnitude apart. The simulator below lets you feel the block term move as you change the buffer count.
PostgreSQL is a textbook pull-based Volcano executor. A plan is a tree of Plan nodes;
at execution each becomes a PlanState. ExecInitNode recursively initializes
the tree (the open step), ExecProcNode pulls the next tuple from a node (the next step),
and ExecEndNode tears it down (the close step). Tuples travel between operators in a
TupleTableSlot, one per call, exactly the iterator contract
[PostgreSQL
executor README]. SQLite is different: it compiles each statement to bytecode for a virtual
machine (the VDBE) and the executor is that interpreter, not a tree of operator objects. SQLite also
implements joins as nested loops only, with no hash or merge join, and leans on indexes and a
polynomial-time join-ordering algorithm to make them fast
[SQLite query optimizer overview].
An index scan is not always faster than a sequential scan. For a low-selectivity predicate (one that matches most of the table) every index match triggers a potentially random heap fetch, and reading 60 percent of a table through an index can be slower than one ordered sequential pass. PostgreSQL always keeps a sequential scan plan available for exactly this reason [PG planner]. The optimizer in week 11 decides between them using estimated selectivity.
Deeper: who owns the tuple in the buffer, and why Volcano avoids temp files
When next() returns a record it returns a status plus a Next-Record structure: an
RID and an address of the record pinned in the buffer pool. The ownership invariant is that each
pinned record is owned by exactly one operator at a time. On receiving a record an operator may hold
it (in a hash table), unfix it (when a predicate fails), or pass it on; operators that build new
records, such as a join, fix their output and unfix their inputs
[Graefe 1994, Section III.B]. This is the
pin-count machinery from week 4 seen from the executor's
side.
Volcano deliberately does not pass intermediate results through temporary files between operators, as some textbooks suggest, because that carries a substantial performance penalty. Intermediate streams live on virtual devices whose pages exist only in the buffer and vanish when unpinned, so the same buffer machinery serves both permanent and intermediate data while keeping it in memory [Graefe 1994, Sections III.A, III.B]. The one place data does hit disk is a pipeline breaker that overflows memory, which is the spill behavior of week 10.
Paper viva: Volcano
This week's assigned paper is Graefe's "Volcano: An Extensible and Parallel Query Evaluation System" (IEEE TKDE 6(1), 1994). The examiner will not ask you to recite it. They will ask you to defend why the design is shaped the way it is. Try to answer each before opening the model answer.
Name the three procedures of the iterator interface, say what each does, and say who initiates the work in a running plan.
Model answer
The three procedures are open, next, and close. open initializes the operator's
state (for example allocates a hash table) and recursively opens all of its inputs, so every
iterator in a query is initiated recursively. next produces and returns one result
record at a time, calling next on its inputs only when it needs more input, and signals
end-of-stream when its source is exhausted; this is lazy, demand-driven evaluation. close
releases resources and recursively shuts down the inputs. The consumer initiates the work: the root
operator's next is called repeatedly, and each operator pulls from below only on demand.
So control flows top-down and data flows bottom-up over the same call stack
[Graefe 1994, Section III.B].
Why must every operator share exactly one interface, and what does that uniformity buy? Use the terms anonymous streams and state records in your answer.
Model answer
Because an operator only ever calls open / next / close on its input, it never needs to know what produces that input. Graefe calls these anonymous inputs or streams. Uniformity buys three things. Composability: any operators nest in any combination, so the optimizer can reorder and stack them freely. Extensibility: a new operator is added without touching existing ones, because they depend only on the interface. Orthogonal parallelism: a single special operator (exchange) can be inserted anywhere to parallelize a subtree, with no change to the operators around it. The state record makes this concrete: all of an iterator's arguments and runtime state live in its state record with no static variables, and operators are linked by input pointers stored in those records. That is why the same algorithm, such as sort, can appear several times in one plan, each instance with its own state record [Graefe 1994, Sections III.B, IV, VI].
What is a support function, and why is it the key to Volcano being data-model independent? Tie it to how SeqScan, Filter, Project, and Join map onto Volcano's operators.
Model answer
A support function is an imported function entry point plus a typeless argument that performs all
item-level work: predicates, hashing, comparison, projection, and side effects. Volcano's iterators
are empty algorithm shells without them. Because all interpretation of data items is imported this
way, Volcano needs no instance type system and works for any data type or model; adding a new
abstract data type does not change the Volcano code at all
[Graefe 1994, Sections III.A, IV]. The course
operators map directly. SeqScan is the file-scan iterator. Filter is the
filter operator's predicate function and Project is its transform function, so Volcano
fuses selection, projection, and side effects into one filter operator. Join is the
one-to-one match operator, with hybrid hash and sort-merge implementations whose build, probe, and
flush phases map onto open (build) and next (probe then flush)
[Graefe 1994, Sections III.B.1, III.B.2].
Check yourself
Which statement about the Volcano iterator model is false?
Volcano is pull, not push. The consumer calls next() and the producer responds on demand. Pushing tuples eagerly upward is the opposite control flow and is the alternative model (the basis of query compilation), not Volcano. The other three statements all describe pull correctly.
Why can the same operator code, such as sort, appear several times in one query plan?
Volcano keeps all of an iterator instance's arguments and runtime state in its own state record, with no static variables. So two sort instances in one plan simply get two state records and never collide. Global or static state would make a second instance overwrite the first.
Which statement is false about nested loop join cost?
The block variant rereads the inner table once per outer block, not per outer tuple. Per-tuple rereading is the naive form. That distinction is exactly why adding buffers (a bigger block) shrinks the block join and leaves the naive join untouched.
A Filter operator sits above a SeqScan. How much state does Filter hold while running?
Filter is a streaming operator with O(1) state. It pulls one tuple, tests the predicate, returns it or drops it and loops. It never needs to see more than the current tuple. The operators that must buffer the whole input are pipeline breakers such as sort and the hash build side.
Which statement about an index scan versus a sequential scan is false?
An index scan is not always faster. When the predicate matches a large fraction of the table, each match triggers a potentially random heap fetch, and that can cost more than one ordered sequential pass. PostgreSQL keeps a sequential scan plan available precisely for this case.
Why does the ORDER BY in our running query force a pipeline breaker?
A sort cannot know which tuple is smallest until it has consumed every input tuple, so it must materialize the whole input (spilling via external merge sort if it overflows memory) before returning the first sorted row. That inability to emit early is what defines a pipeline breaker.
Read Section III.B for the open / next / close protocol, anonymous streams, and state records, and Section III.A for support functions. This is the origin of the iterator model and the assigned paper for this week's viva. doi.org/10.1109/69.273032. Open-access mirror: volcano.pdf.
Want me to trace a plan with a join through the stepper, contrast pull with the push and vectorized models you will meet in week 15, or quiz you harder on the nested loop cost formulas? Ask. I am your teacher for this course, not just a document.