Field Guide / Part V · Query processing / Week 09
Query processor

Query execution: the Volcano iterator model

The optimizer just handed you a tree of operators. How does that tree turn into rows on the wire without writing a temporary table at every step, and without each operator knowing what feeds it?

By the end you can

  • Explain why materializing each operator's full output is correct but wasteful, and what the iterator model trades it for.
  • Walk the open / next / close protocol on a real plan tree and say who pulls whom.
  • State why one uniform interface buys composability, bounded memory, and pipelining, and what the per-tuple call costs.
  • Compute the I/O cost of naive, block, and index nested loop joins and say when an index scan loses to a sequential scan.
  • Answer the Volcano paper's viva questions on state records, support functions, and anonymous streams.

Keep the course's running query in view:

SELECT name FROM emp WHERE salary > 75000 ORDER BY name;

By the time it reaches this lesson the string has already been turned into a tree. Week 8 lexed, parsed, and bound it into a typed query tree and a logical plan: projection(name) over selection(salary > 75000) over scan(emp), with a sort for the ORDER BY. Week 11 will turn that logical plan into a physical plan, choosing a sequential scan or an index scan on salary and deciding whether the sort can be skipped. This week sits between them in the pipeline but is taught first, because you cannot reason about which plan is cheapest until you know how a plan actually runs.

So the layer above (the optimizer) hands this layer a finished physical plan: a tree of concrete operators. This layer hands down a stream of page requests. When the scan at the bottom of the tree needs data, it asks the access method (weeks 6 and 7) for tuples, which asks the buffer pool (weeks 4 and 5) for pages by id, which asks the storage manager (weeks 2 and 3) to read bytes off disk. The whole stack is the spine from week 1. Our job here is the middle of journey 1: the executor.

Intuition

Picture a bucket brigade, but reversed. Instead of the bottom of the line passing buckets up whether the top wants them or not, the person at the top says "next" and that request travels all the way down to the well. One bucket of water comes back up the line, hand to hand. Nobody fills a reservoir. Nobody moves until asked. That single-bucket, on-demand discipline is the iterator model.

The problem: do not build a table you were never asked for

The executor receives a tree of physical operators and must produce tuples. The most obvious way to run a tree is to evaluate it bottom up. Run the scan, write all of its output to a temporary table. Run the filter over that table, write its output to another temporary table. Run the projection, write a third. Hand the last table to the client. This is the materialization model, and it is perfectly correct.

It has two costs that sink it. The first is memory and I/O. Each operator writes its whole output and the next operator rereads it. A five-operator plan over a billion-row table writes and rereads a billion rows four times to produce results nobody asked to keep. The second cost is uniformity. If the join operator hard-codes the fact that its input is a temporary file on disk, you cannot feed it the output of an index scan or another join without rewriting it. Operators stop composing.

Why it works this way

The constraint: the executor must compose arbitrary operators and must not pay for intermediate results the query never requested. The naive fix (materialize each step) breaks both. So invert the control flow. Instead of each operator producing its whole result and pushing it, let each operator expose one tiny interface that produces exactly one tuple when asked, and pulls from its own input one tuple at a time. Then a tuple can travel from scan to filter to projection without ever touching disk, and any operator can sit on top of any other because they all speak the same three-call protocol. That protocol is open / next / close, the Volcano iterator interface [Graefe 1994].

The mechanism: open, next, close

In the iterator model every operator, leaf or internal, implements the same three procedures. Graefe frames them as the four parts of a loop over a result set: initialization, the increment, the termination test, and the final housekeeping [Graefe 1994, Section III.B].

Execution is demand-driven and top-down by control, bottom-up by data. The consumer at the root calls next(). That call recurses down to the leaves. One tuple bubbles back up per call. The PostgreSQL executor states the same contract: each node, when called, produces the next tuple in its output sequence or NULL if none remain, and a non-leaf node calls its child nodes in turn to obtain input [PostgreSQL executor README]. Parents pull. Children never push.

Projection(name) Filter(sal>75000) SeqScan(emp) next() next() control flows down one tuple one tuple tuples flow up ask access method / buffer pool for the next heap page
Figure 1. The same call stack carries control down (left, indigo) and data up (right, green). One next() at the root pulls exactly one tuple from the leaf. Notice the two arrows are the same edges traversed in opposite senses: the recursion descends, then unwinds with a tuple.

Two design choices in that picture deserve a name. First, an operator never learns what kind of operator feeds it. It just calls next() on an input pointer. Graefe calls these anonymous inputs or streams, and they are the reason any operator nests inside any other [Graefe 1994, Section III.B]. Second, all of an iterator's per-instance data lives in a state record, with no static variables. That invariant is what lets the same operator code (say, sort) appear several times in one plan: each occurrence gets its own state record [Graefe 1994, Section III.B].

Keep this one thing

The executor is a tree of iterators that all speak open / next / close. The root pulls one tuple at a time and the request recurses to the leaves. Pull, not push. One tuple, not a table.

Watch one tuple bubble up

The simulator below runs our query's plan: SeqScan(emp) feeding Filter(salary > 75000) feeding Projection(name). Click Step (or Play) to issue one next() at the root and watch it travel down and a tuple travel back up. Each operator shows the tuple it currently holds, and a counter tallies how many next() calls each operator has fielded. Filter and Projection never hold more than one tuple, which is the whole point.

Single-tuple iterator stepper press Step to pull one tuple through the plan
Experiment to run: step several times and notice that the scan is called more often than the projection. Rows that fail salary > 75000 are dropped at the filter, so the scan must be pulled repeatedly before one tuple reaches the top. That gap is the filter's selectivity made visible.
Exam trap

It is tempting to say the Volcano model pushes tuples from producers up to consumers. It does not. Volcano is demand-driven: the consumer calls next() and the producer responds. Push is the opposite control flow and is the alternative model (the basis of query compilation), not Volcano [Graefe 1994].

One interface, many operators

The reason a handful of operators cover so much SQL is that Volcano splits each operator into an algorithm shell and its support functions. The shell is the open / next / close loop. The support functions do all the item-level work: evaluate a predicate, hash a key, compare two records, compute a projected column. The shell is empty without them; a support function is just a function pointer plus a typeless argument, and because all interpretation of data items is imported this way, Volcano needs no type system for instances and is data-model independent [Graefe 1994, Sections III.A, IV].

That split is why three of our operators are really one. Volcano's filter operator carries three optional support functions: a predicate (selection), a transform (projection), and an apply function called once per record for side effects (updates, printing). So Filter and Project in the course are the same physical operator with a different support function plugged in [Graefe 1994, Section III.B.1]. The SeqScan is the file-scan iterator. And the entire join family (join, semi-join, outer joins, intersection, union, difference, aggregation, duplicate elimination) is one operator, one-to-one match, because they all reduce to the same step: separate the matching from the non-matching subsets of two inputs and emit a selected, possibly transformed subset [Graefe 1994, Section III.B.2].

Iterator open() · next() · close() · inputs[] SeqScan leaf, predicate Filter predicate fn Project transform fn Join one-to-one match
Figure 2. Every operator subclasses the same iterator interface. Filter and Project are the same shell with a different support function. The join is one general matching operator. New operators slot in without touching the others, because the interface, not the operator type, is what each parent depends on.
# the iterator contract, as pseudocode. Every operator looks like this.
class Filter(Iterator):
    def open(self):
        self.child.open()                 # recurse down
    def next(self):
        while True:
            row = self.child.next()        # pull one tuple from the anonymous input
            if row is EOF:
                return EOF
            if self.predicate(row):       # support function, imported
                return row                 # pass it up; O(1) state, one row held
            # else drop it and loop: pull again
    def close(self):
        self.child.close()                 # recurse down
Why pull is the right abstraction

Three properties fall out of pull, and they are the reason every textbook executor uses it. Composability: a join's next() just calls its children's next(), so it does not care what the child is. Bounded memory: Filter, Projection, and the probe side of a join hold one tuple at a time, not the whole input, so a streaming plan runs in tiny memory. Pipelining: a tuple can travel scan to filter to projection without ever being written to disk [PG planner]. The cost is one function call per operator per tuple. For a billion tuples that is billions of virtual calls, which is exactly why modern analytic engines move to vectorized or compiled push execution. You will meet those in week 15.

Where the pipeline has to stop

Not every operator can stream. A pipeline breaker must consume its entire input before it can emit a single output tuple. Our running query has one: the ORDER BY name sort. Sort cannot know which tuple is smallest until it has seen the last input tuple, so it pulls every tuple from below, buffers them (spilling to disk via external merge sort if they exceed memory), and only then starts returning them in order. The build side of a hash join is the other classic breaker: the hash table must be fully built before any probe can match [CMU 15-445 joins].

Breakers are where a query's memory budget is set, because that is where a whole input (or a hash table, or sorted runs) must live. In PostgreSQL that budget is work_mem, the per-operation memory before a sort or hash spills to temporary disk files [PG resource config]. External sort, the hash build, and grace partitioning are the subject of week 10; here just hold the distinction: Filter and Projection are streaming with O(1) state, Sort and the hash build are breakers.

Streaming operators versus pipeline breakers.
OperatorState heldStreams or breaks
SeqScan, IndexScanone cursorstreams
Filter, Projectionone tuplestreams
Index nested loop, merge stepone tuple per sidestreams
Sort (ORDER BY)whole input or sorted runsbreaks
Hash join build sidewhole hash tablebreaks
Aggregation, DISTINCTone entry per group, or whole sortbreaks

The cost of the simplest join

Our running query has no join, but most queries do, and the join is where the iterator model first meets real I/O cost. A join of R and S is conceptually the Cartesian product filtered by a predicate. Done literally that is |R| times |S| comparisons. The nested loop join is the most direct way to run it as iterators, and its three variants show how much the same logical join can cost depending on the physical choice.

Let R (the outer) have M pages holding m tuples, and S (the inner) have N pages. The naive nested loop rereads all of S once per outer tuple:

naive nested loop:  M + (m × N)

The m × N term is the killer. The block nested loop rereads S once per outer block instead of per tuple, giving the outer scan B - 2 buffer frames (one for an inner page, one for output):

block nested loop:  M + (ceil(M / (B - 2)) × N)

The index nested loop replaces the inner scan with an index probe at roughly constant cost C per outer tuple:

index nested loop:  M + (m × C)

On the standard CMU example (M = 1000, m = 100000, N = 500, B = 100, 0.1 ms per I/O) the naive form is about 1.4 hours, the block form about 6.5 seconds, and the right index turns it into a few hundred milliseconds [CMU 15-445 joins]. Same logical join, four orders of magnitude apart. The simulator below lets you feel the block term move as you change the buffer count.

Nested loop cost calculator drag the sliders and watch the three variants diverge
Experiment: push the buffer count B up and watch only the block bar shrink. The naive bar ignores B entirely (it rereads per tuple), and the index bar ignores B too (it probes, it does not scan). More memory helps the block join and nothing else here.
In real systems

PostgreSQL is a textbook pull-based Volcano executor. A plan is a tree of Plan nodes; at execution each becomes a PlanState. ExecInitNode recursively initializes the tree (the open step), ExecProcNode pulls the next tuple from a node (the next step), and ExecEndNode tears it down (the close step). Tuples travel between operators in a TupleTableSlot, one per call, exactly the iterator contract [PostgreSQL executor README]. SQLite is different: it compiles each statement to bytecode for a virtual machine (the VDBE) and the executor is that interpreter, not a tree of operator objects. SQLite also implements joins as nested loops only, with no hash or merge join, and leans on indexes and a polynomial-time join-ordering algorithm to make them fast [SQLite query optimizer overview].

Exam trap

An index scan is not always faster than a sequential scan. For a low-selectivity predicate (one that matches most of the table) every index match triggers a potentially random heap fetch, and reading 60 percent of a table through an index can be slower than one ordered sequential pass. PostgreSQL always keeps a sequential scan plan available for exactly this reason [PG planner]. The optimizer in week 11 decides between them using estimated selectivity.

Deeper: who owns the tuple in the buffer, and why Volcano avoids temp files

When next() returns a record it returns a status plus a Next-Record structure: an RID and an address of the record pinned in the buffer pool. The ownership invariant is that each pinned record is owned by exactly one operator at a time. On receiving a record an operator may hold it (in a hash table), unfix it (when a predicate fails), or pass it on; operators that build new records, such as a join, fix their output and unfix their inputs [Graefe 1994, Section III.B]. This is the pin-count machinery from week 4 seen from the executor's side.

Volcano deliberately does not pass intermediate results through temporary files between operators, as some textbooks suggest, because that carries a substantial performance penalty. Intermediate streams live on virtual devices whose pages exist only in the buffer and vanish when unpinned, so the same buffer machinery serves both permanent and intermediate data while keeping it in memory [Graefe 1994, Sections III.A, III.B]. The one place data does hit disk is a pipeline breaker that overflows memory, which is the spill behavior of week 10.

Paper viva: Volcano

This week's assigned paper is Graefe's "Volcano: An Extensible and Parallel Query Evaluation System" (IEEE TKDE 6(1), 1994). The examiner will not ask you to recite it. They will ask you to defend why the design is shaped the way it is. Try to answer each before opening the model answer.

Viva prompt 1

Name the three procedures of the iterator interface, say what each does, and say who initiates the work in a running plan.

Model answer

The three procedures are open, next, and close. open initializes the operator's state (for example allocates a hash table) and recursively opens all of its inputs, so every iterator in a query is initiated recursively. next produces and returns one result record at a time, calling next on its inputs only when it needs more input, and signals end-of-stream when its source is exhausted; this is lazy, demand-driven evaluation. close releases resources and recursively shuts down the inputs. The consumer initiates the work: the root operator's next is called repeatedly, and each operator pulls from below only on demand. So control flows top-down and data flows bottom-up over the same call stack [Graefe 1994, Section III.B].

Viva prompt 2

Why must every operator share exactly one interface, and what does that uniformity buy? Use the terms anonymous streams and state records in your answer.

Model answer

Because an operator only ever calls open / next / close on its input, it never needs to know what produces that input. Graefe calls these anonymous inputs or streams. Uniformity buys three things. Composability: any operators nest in any combination, so the optimizer can reorder and stack them freely. Extensibility: a new operator is added without touching existing ones, because they depend only on the interface. Orthogonal parallelism: a single special operator (exchange) can be inserted anywhere to parallelize a subtree, with no change to the operators around it. The state record makes this concrete: all of an iterator's arguments and runtime state live in its state record with no static variables, and operators are linked by input pointers stored in those records. That is why the same algorithm, such as sort, can appear several times in one plan, each instance with its own state record [Graefe 1994, Sections III.B, IV, VI].

Viva prompt 3

What is a support function, and why is it the key to Volcano being data-model independent? Tie it to how SeqScan, Filter, Project, and Join map onto Volcano's operators.

Model answer

A support function is an imported function entry point plus a typeless argument that performs all item-level work: predicates, hashing, comparison, projection, and side effects. Volcano's iterators are empty algorithm shells without them. Because all interpretation of data items is imported this way, Volcano needs no instance type system and works for any data type or model; adding a new abstract data type does not change the Volcano code at all [Graefe 1994, Sections III.A, IV]. The course operators map directly. SeqScan is the file-scan iterator. Filter is the filter operator's predicate function and Project is its transform function, so Volcano fuses selection, projection, and side effects into one filter operator. Join is the one-to-one match operator, with hybrid hash and sort-merge implementations whose build, probe, and flush phases map onto open (build) and next (probe then flush) [Graefe 1994, Sections III.B.1, III.B.2].

Check yourself

Which statement about the Volcano iterator model is false?

Volcano is pull, not push. The consumer calls next() and the producer responds on demand. Pushing tuples eagerly upward is the opposite control flow and is the alternative model (the basis of query compilation), not Volcano. The other three statements all describe pull correctly.

Why can the same operator code, such as sort, appear several times in one query plan?

Volcano keeps all of an iterator instance's arguments and runtime state in its own state record, with no static variables. So two sort instances in one plan simply get two state records and never collide. Global or static state would make a second instance overwrite the first.

Which statement is false about nested loop join cost?

The block variant rereads the inner table once per outer block, not per outer tuple. Per-tuple rereading is the naive form. That distinction is exactly why adding buffers (a bigger block) shrinks the block join and leaves the naive join untouched.

A Filter operator sits above a SeqScan. How much state does Filter hold while running?

Filter is a streaming operator with O(1) state. It pulls one tuple, tests the predicate, returns it or drops it and loops. It never needs to see more than the current tuple. The operators that must buffer the whole input are pipeline breakers such as sort and the hash build side.

Which statement about an index scan versus a sequential scan is false?

An index scan is not always faster. When the predicate matches a large fraction of the table, each match triggers a potentially random heap fetch, and that can cost more than one ordered sequential pass. PostgreSQL keeps a sequential scan plan available precisely for this case.

Why does the ORDER BY in our running query force a pipeline breaker?

A sort cannot know which tuple is smallest until it has consumed every input tuple, so it must materialize the whole input (spilling via external merge sort if it overflows memory) before returning the first sorted row. That inability to emit early is what defines a pipeline breaker.

In one sentence: what does next() do, and which direction do control and data flow?
next() returns one tuple (or end-of-stream), calling next() on its inputs only when it needs more; control flows top-down (parents call children) and data flows bottom-up (one tuple bubbles to the root), over the same call stack.
Name two pipeline breakers and say what each must buffer before emitting a tuple.
Sort must buffer the whole input (or sorted runs) before returning the smallest tuple. The build side of a hash join must build the entire hash table before any probe can match. Aggregation and DISTINCT break too.
Primary source
Volcano: An Extensible and Parallel Query Evaluation System (Graefe, 1994)

Read Section III.B for the open / next / close protocol, anonymous streams, and state records, and Section III.A for support functions. This is the origin of the iterator model and the assigned paper for this week's viva. doi.org/10.1109/69.273032. Open-access mirror: volcano.pdf.

Ask your teacher

Want me to trace a plan with a join through the stepper, contrast pull with the push and vectorized models you will meet in week 15, or quiz you harder on the nested loop cost formulas? Ask. I am your teacher for this course, not just a document.