Query parsing: lexing, parsing, binding, and logical plans
A SQL statement arrives as a flat string of bytes. The engine cannot execute a string. This lesson follows that string as it becomes tokens, then a syntax tree, then a name-resolved and typed query tree, and finally a logical plan that says what to compute. Every later stage trusts this one to have told the truth.
By the end you can
- Separate the four jobs at the front of the pipeline: lex, parse, bind, rewrite, and say which one catches which kind of error.
- Explain why parsing makes no catalog lookups and why binding must, and tie that to transactions.
- Say why recursive descent (LL) loops on left recursion while a shift-reduce (LR) parser does not.
- Draw the difference between a raw parse tree, a bound query tree, and a logical plan.
- State the logical versus physical boundary and why there is no one-to-one mapping.
Hold the running query in mind:
SELECT name FROM emp WHERE salary > 75000 ORDER BY name;
In week 1 we watched this string arrive at the server: the client communications manager took the bytes, the process manager assigned a worker and ran admission control, and then it handed the raw SQL string to the relational query processor. That hand-off is where this lesson starts. The layer above gives us a string and nothing else. The layer below, the optimizer in week 11, wants something it can cost and search over: a tree of relational-algebra operators with every name resolved and every type known. Our job is to build exactly that. We turn text into a logical plan, projection(name) over selection(salary > 75000) over scan(emp) with a sort for ORDER BY, and hand it down.
This is milestone M2: the front of the pipeline is complete once you finish this week. After it, the
executor (week 9) and the optimizer never touch the SQL
string again. They work only on the structured object we produce. If we resolve emp to the
wrong table, or miss that salary is a money type, every stage downstream is operating on a
lie and the bug surfaces far from its cause.
Think of an inbound letter written in a language the engine half-knows. First you cut it into words (lexing). Then you work out the grammar, which clause is the subject and which is the condition (parsing). Then you check the words against reality: does the person named actually exist, are you allowed to read their file, does "salary times 1.15" even make sense for this column's type (binding). Only after all that do you rephrase the request into a clean internal to-do list (the logical plan). Each step assumes the previous one finished correctly.
Four jobs, in order
The naive approach is to pattern-match keywords: see SELECT, grab the words after it, see
WHERE, grab the rest. This recovers no structure. You cannot push a predicate below a join if
you do not even know which token run is the predicate. You cannot tell whether name is a
column, an alias, or a function. And you certainly cannot decide whether salary * 1.15 uses
integer or floating multiply, because a string has no types. So the engine builds structure first, then
layers meaning onto it.
The decomposition that PostgreSQL uses, and that the "Architecture of a Database System" survey describes, has a deliberate seam in it. Pure syntax checking needs no database access. Name resolution does. PostgreSQL splits raw parsing from analysis for one concrete reason: "system catalog lookups can only be done within a transaction" [PG: The Parser Stage]. Parsing a string for well-formedness needs no transaction. Checking that a table exists does. That single constraint is why the line is drawn where it is.
emp means reading the system catalog. The optimizer (week 11)
takes the final query tree.Lexing: characters become tokens
The lexer (also called the scanner or tokenizer) reads the character stream and emits a stream of
tokens: keywords like SELECT, identifiers like emp, literals like
75000 or 'foo', and operators and punctuation like > and
,. Whitespace and comments are recognized and thrown away. SQLite's documentation states the
rule plainly: whitespace and comment tokens are discarded; all others proceed to the parser
[SQLite: how it works].
A lexer is a finite automaton. Each token class is a regular expression, the union of those regexes
compiles to a deterministic finite automaton, and that DFA runs in time linear in the input length with a
constant amount of work per character. PostgreSQL generates this DFA as C code from a flex file
(scan.l); SQLite hand-writes it in tokenize.c. Either way it is the cheapest
stage in the pipeline.
SQL has hundreds of keywords. If every keyword were reserved, a user could never name a column
name or value or level. So the lexer cannot simply decide once and
for all that a word is a keyword. SQLite's Lemon grammar uses fallback tokens: a word first tried as a
keyword can fall back to being an identifier when the grammar context demands it
[SQLite: Lemon]. This keeps the reserved-word set
smaller than the keyword set, which is why SELECT name FROM emp works even though some
engines treat name as a soft keyword elsewhere.
Parsing: tokens become a tree
The parser consumes the token stream and, guided by a context-free grammar, builds a tree that captures the syntactic structure: this run of tokens is the target list, this is the FROM list, this is the WHERE expression. There are two dominant strategies, and the exam loves the contrast.
Recursive descent (top-down, LL). One function per grammar nonterminal, and the call stack
mirrors the parse tree. It is usually hand-written, easy to read, and easy to attach good error messages
to. Its classic weakness is left recursion. A rule like expr := expr '+' term makes the
expr function call itself with no token consumed, so it loops forever. Left recursion must be
rewritten into iteration or right recursion before a descent parser can handle it.
Table-driven bottom-up (LR, in practice LALR(1)). A generator such as bison, yacc, or SQLite's Lemon computes parse tables from the grammar. The parser then runs a shift-reduce loop: it shifts tokens onto a stack, and when the top of the stack matches the right-hand side of a grammar rule, it reduces that run to the rule's nonterminal. LALR(1) accepts a larger grammar class than LL, handles left recursion natively, and still runs in linear time over the tokens. The cost is debuggability: grammar conflicts show up in terms of the generated automaton, not your source.
Whichever strategy is used, parsing is a fixed-rules operation. PostgreSQL is explicit: "The parser stage creates a parse tree using only fixed rules about the syntactic structure of SQL. It does not make any lookups in the system catalogs" [PG: The Parser Stage]. The output is a raw parse tree, also called an abstract syntax tree. At this point names are still strings and nothing has been checked against the database.
Two directions get swapped constantly. Recursive descent is top-down and LL. LALR and LR are bottom-up and shift-reduce. And "recursive descent handles left recursion" is false: it is exactly what breaks a descent parser. If an option pairs "top-down" with "shift-reduce", or claims an LL parser eats left recursion directly, that is the wrong answer.
Watch a string become a tree
The simulator below is the heart of this lesson. Pick one of three example statements. Pane one shows
the token chips the lexer produces, with whitespace dropped. Pane two animates the parse, and you can flip
between the LL recursive-descent call stack and the LR shift-reduce stack to feel why left recursion behaves
differently. Pane three shows the bound query tree against a tiny catalog: names resolve or turn red, and
the type of salary drives which multiply operator a bonus expression picks. Flip the privilege
toggle to see the exact step where a syntactically perfect query is rejected for a semantic reason.
salary to money and watch the
bonus expression bind to the money multiply, then revoke SELECT on emp and step until the bind
fails. The failure lands in the binder, never in the lexer or parser.Binding: where meaning enters
Binding (PostgreSQL calls it analysis or transformation; many courses call the component the binder) walks the raw tree and produces a validated, typed query tree. The survey lists four jobs: confirm the query is correctly specified, resolve names and references, convert it into the optimizer's internal format, and verify the user is authorized [Architecture of a DB System, sec. 4.1]. Concretely:
- Canonicalize table names. Each FROM reference is expanded to a fully qualified name, in the survey's
four-part form
server.database.schema.table, reduced toschema.tablein simpler systems. Aliases and search-path defaults are resolved here, soempbecomespublic.emp. - Check existence. The analyzer calls the catalog manager to confirm each table is registered and caches
its metadata, then checks that every column reference is a real column of an in-scope table. A typo like
naemdies here, not in the parser. - Type-check and resolve overloads. Column types drive which operator and function implementations get
used. In
(EMP.salary * 1.15) < 75000the actual multiply and the assumed type of1.15depend on whethersalaryis integer, float, or money. PostgreSQL shows the same mechanism: aFuncCallnode "might be transformed to either a FuncExpr or Aggref node depending on whether the referenced name turns out to be an ordinary function or an aggregate function" [PG: The Parser Stage]. - Run structural semantic checks: consistent tuple variables, union-compatibility of set operations, legal aggregate usage, correct subquery nesting.
- Authorize. Confirm the user holds the needed privilege on each object. Some checks (row-level security) are deferred to execution because they depend on data values.
The output is the query tree. PostgreSQL: it "is structurally similar to the raw parse tree in most
places, but it has many differences in detail," with types attached
[PG: The Parser Stage]. Note what
binding does not read: it never reads table data. It reads only the query and the catalog.
Counting how many rows pass salary > 75000 is selectivity estimation, which belongs to the
cost-based optimizer in week 11, not here.
The parser checks shape, the binder checks meaning. "Table does not exist," "no such column," "wrong type," and "permission denied" are all binder errors, not parser errors. The seam exists because catalog lookups need a transaction and pure syntax checking does not.
Rewrite: normalize without changing meaning
The rewriter simplifies and normalizes the query without changing what it returns. It can rely
only on the query and on catalog metadata, and cannot read table data
[Architecture, sec. 4.2]. Its main
jobs are view expansion (splice a view's definition into the query, recursively, until no views remain;
PostgreSQL calls this the rewrite system's most important use, the realization of views), constant folding
(10+2+R.y becomes 12+R.y), predicate rewriting
(NOT salary > 1000000 becomes salary <= 1000000 so it matches an index; a
contradiction like salary < 75000 AND salary > 1000000 collapses to FALSE), semantic
optimization using integrity constraints (redundant-join elimination), and subquery flattening.
Rewrite is a logical step, not always a separate module. DB2 has a stand-alone rewriter; SQL Server folds it into an early phase of the optimizer. The boundary is conceptual.
"Query rewrite reads table data to simplify the query" is false. Rewrite uses only the query and the catalog. Reading data and estimating costs is the optimizer's job. Equally false: "view expansion happens in the optimizer." View expansion is driven by catalog view definitions and runs before cost-based optimization.
From query tree to logical plan
The validated, rewritten query tree is translated into a logical plan: a tree of relational-algebra operators. As CMU puts it, the logical plan is roughly the relational algebra expression for the query [CMU 15-445 L15]. The operators are the algebra primitives: selection (filter), projection (column list), cross product and join, grouping and aggregation, set operations, sort, and limit. Our running query maps to projection(name) over selection(salary > 75000) over scan(emp), with a sort on top for the ORDER BY.
The logical plan says only what relation to produce. It does not say how to scan emp
(sequential or index), which join algorithm to use, or in what order to join. That is the physical plan's
job, and choosing among physical options is the optimizer's job in week 11.
salary; the selection can fold into the scan or stand alone. CMU
states it outright: there does not always exist a one-to-one mapping from logical to physical plans. The
optimizer searches these physical choices and picks the cheapest.| Artifact | Produced by | Knows | Does not know |
|---|---|---|---|
| Token stream | Lexer | token classes | structure, names, types |
| Raw parse tree | Parser | syntactic structure | whether names exist, types |
| Query tree | Binder | names, types, privileges | row counts, costs |
| Logical plan | Translate / rewrite | algebra shape (what) | algorithms, access paths (how) |
PostgreSQL keeps explicit, separately named tree stages: string to flex tokens to bison raw parse tree
to analyzed query tree to rewritten query tree to plan tree, then a tree-walking iterator executor
[PG: The Path of a Query]. SQLite
collapses analysis through code generation into one pass and compiles the AST straight to VDBE bytecode
for a register virtual machine; EXPLAIN prints that bytecode
[SQLite: how it works]. SQLite also uses Lemon
rather than yacc, where the tokenizer calls the parser, which makes it reentrant and threadsafe
[SQLite: Lemon]. Different middles, identical front:
both start with a tokenizer, then a grammar-driven parser, then an AST.
Deeper: why Lemon inverts the call direction
In yacc the parser calls the tokenizer to pull the next token, and yacc keeps parser state in global
variables, so it is neither reentrant nor threadsafe. In Lemon the tokenizer calls the parser, feeding
it one token at a time, so all state is passed explicitly and the parser is reentrant. SQLite needs this
because it parses recursively: handling CREATE TABLE invokes the parser again to generate
the INSERT into the sqlite_schema table
[SQLite: Lemon]. Lemon also has nonterminal destructors
to reclaim memory on a syntax error, and fallback tokens for keyword/identifier collisions.
# the front of the pipeline, as four stages with a hard seam
def front_of_pipeline(sql_text):
tokens = lex(sql_text) # DFA, O(n), no catalog
raw_tree = parse(tokens) # grammar only, no catalog
# ---- seam: everything below needs a transaction ----
qtree = bind(raw_tree, catalog) # resolve names, types, authorize
qtree = rewrite(qtree, catalog) # views, folding; never reads data
return to_logical_plan(qtree) # algebra: WHAT, not HOW
Check yourself
Which statement is false?
Existence and name resolution are binding, not parsing. The pure parser makes no catalog lookups; PostgreSQL says the parse stage "does not make any lookups in the system catalogs." The trap conflates parsing (shape) with binding (meaning).
Which statement about parsing strategies is false?
Left recursion is exactly what breaks recursive descent: a rule like
expr := expr + term calls itself with no token consumed and loops forever, so the grammar
must be refactored. LR/LALR handle left recursion natively.
Which statement about logical and physical plans is false?
Join algorithm and access method are physical decisions. A logical plan says join, not hash-join; scan, not index-scan. One logical join can become nested-loop, hash, or merge; one logical scan can become a sequential or several index scans.
Which statement about query rewrite is false?
Rewrite uses only the query and the catalog, never table data. Reading data and estimating cost is the cost-based optimizer's job, a later stage. View expansion and predicate folding both run on the query and catalog alone.
PostgreSQL separates raw parsing from analysis. Why?
The seam is drawn so that pure syntactic parsing, which needs no transaction, is separated from analysis, which calls the catalog and therefore must run inside a transaction. The optimizer takes the rewritten query tree, not the raw parse tree.
In (EMP.salary * 1.15) < 75000, what determines which multiply operator is chosen?
Overload resolution during binding uses the column's type to pick the operator and to type the literal. Row counts are selectivity (optimizer), and the scan choice is physical planning; neither selects the multiply operator.
Highest-trust source for this lesson: it states the rule that parsing makes no catalog
lookups, names flex (scan.l) and bison (gram.y), distinguishes the raw parse tree
from the query tree, and gives the FuncCall to FuncExpr/Aggref transformation as the worked example of
binding. Pair it with The Path of a Query for the five-stage overview.
Want me to trace a different statement through all four stages, draw the LL versus LR stacks for a real left-recursive expression grammar, or quiz you on which error each stage catches? Ask, and I will go deeper or harder. I am your teacher for this course, not just a document.