Field Guide / Part V · Query processing / Week 08

Query processor · front of the pipeline

Query parsing: lexing, parsing, binding, and logical plans

A SQL statement arrives as a flat string of bytes. The engine cannot execute a string. This lesson follows that string as it becomes tokens, then a syntax tree, then a name-resolved and typed query tree, and finally a logical plan that says what to compute. Every later stage trusts this one to have told the truth.

By the end you can

Separate the four jobs at the front of the pipeline: lex, parse, bind, rewrite, and say which one catches which kind of error.
Explain why parsing makes no catalog lookups and why binding must, and tie that to transactions.
Say why recursive descent (LL) loops on left recursion while a shift-reduce (LR) parser does not.
Draw the difference between a raw parse tree, a bound query tree, and a logical plan.
State the logical versus physical boundary and why there is no one-to-one mapping.

Hold the running query in mind:

SELECT name FROM emp WHERE salary > 75000 ORDER BY name;

In week 1 we watched this string arrive at the server: the client communications manager took the bytes, the process manager assigned a worker and ran admission control, and then it handed the raw SQL string to the relational query processor. That hand-off is where this lesson starts. The layer above gives us a string and nothing else. The layer below, the optimizer in week 11, wants something it can cost and search over: a tree of relational-algebra operators with every name resolved and every type known. Our job is to build exactly that. We turn text into a logical plan, projection(name) over selection(salary > 75000) over scan(emp) with a sort for ORDER BY, and hand it down.

This is milestone M2: the front of the pipeline is complete once you finish this week. After it, the executor (week 9) and the optimizer never touch the SQL string again. They work only on the structured object we produce. If we resolve emp to the wrong table, or miss that salary is a money type, every stage downstream is operating on a lie and the bug surfaces far from its cause.

Intuition

Think of an inbound letter written in a language the engine half-knows. First you cut it into words (lexing). Then you work out the grammar, which clause is the subject and which is the condition (parsing). Then you check the words against reality: does the person named actually exist, are you allowed to read their file, does "salary times 1.15" even make sense for this column's type (binding). Only after all that do you rephrase the request into a clean internal to-do list (the logical plan). Each step assumes the previous one finished correctly.

Four jobs, in order

The naive approach is to pattern-match keywords: see SELECT, grab the words after it, see WHERE, grab the rest. This recovers no structure. You cannot push a predicate below a join if you do not even know which token run is the predicate. You cannot tell whether name is a column, an alias, or a function. And you certainly cannot decide whether salary * 1.15 uses integer or floating multiply, because a string has no types. So the engine builds structure first, then layers meaning onto it.

The decomposition that PostgreSQL uses, and that the "Architecture of a Database System" survey describes, has a deliberate seam in it. Pure syntax checking needs no database access. Name resolution does. PostgreSQL splits raw parsing from analysis for one concrete reason: "system catalog lookups can only be done within a transaction" [PG: The Parser Stage]. Parsing a string for well-formedness needs no transaction. Checking that a table exists does. That single constraint is why the line is drawn where it is.

Figure 1. The PostgreSQL chain. Notice the seam: everything up to the raw parse tree uses only fixed syntactic rules and touches no catalog. Analysis is the first stage that needs a transaction, because resolving emp means reading the system catalog. The optimizer (week 11) takes the final query tree.

Lexing: characters become tokens

The lexer (also called the scanner or tokenizer) reads the character stream and emits a stream of tokens: keywords like SELECT, identifiers like emp, literals like 75000 or 'foo', and operators and punctuation like > and ,. Whitespace and comments are recognized and thrown away. SQLite's documentation states the rule plainly: whitespace and comment tokens are discarded; all others proceed to the parser [SQLite: how it works].

A lexer is a finite automaton. Each token class is a regular expression, the union of those regexes compiles to a deterministic finite automaton, and that DFA runs in time linear in the input length with a constant amount of work per character. PostgreSQL generates this DFA as C code from a flex file (scan.l); SQLite hand-writes it in tokenize.c. Either way it is the cheapest stage in the pipeline.

Why keyword versus identifier is not obvious

SQL has hundreds of keywords. If every keyword were reserved, a user could never name a column name or value or level. So the lexer cannot simply decide once and for all that a word is a keyword. SQLite's Lemon grammar uses fallback tokens: a word first tried as a keyword can fall back to being an identifier when the grammar context demands it [SQLite: Lemon]. This keeps the reserved-word set smaller than the keyword set, which is why SELECT name FROM emp works even though some engines treat name as a soft keyword elsewhere.

Parsing: tokens become a tree

The parser consumes the token stream and, guided by a context-free grammar, builds a tree that captures the syntactic structure: this run of tokens is the target list, this is the FROM list, this is the WHERE expression. There are two dominant strategies, and the exam loves the contrast.

Recursive descent (top-down, LL). One function per grammar nonterminal, and the call stack mirrors the parse tree. It is usually hand-written, easy to read, and easy to attach good error messages to. Its classic weakness is left recursion. A rule like expr := expr '+' term makes the expr function call itself with no token consumed, so it loops forever. Left recursion must be rewritten into iteration or right recursion before a descent parser can handle it.

Table-driven bottom-up (LR, in practice LALR(1)). A generator such as bison, yacc, or SQLite's Lemon computes parse tables from the grammar. The parser then runs a shift-reduce loop: it shifts tokens onto a stack, and when the top of the stack matches the right-hand side of a grammar rule, it reduces that run to the rule's nonterminal. LALR(1) accepts a larger grammar class than LL, handles left recursion natively, and still runs in linear time over the tokens. The cost is debuggability: grammar conflicts show up in terms of the generated automaton, not your source.

Whichever strategy is used, parsing is a fixed-rules operation. PostgreSQL is explicit: "The parser stage creates a parse tree using only fixed rules about the syntactic structure of SQL. It does not make any lookups in the system catalogs" [PG: The Parser Stage]. The output is a raw parse tree, also called an abstract syntax tree. At this point names are still strings and nothing has been checked against the database.

Exam trap

Two directions get swapped constantly. Recursive descent is top-down and LL. LALR and LR are bottom-up and shift-reduce. And "recursive descent handles left recursion" is false: it is exactly what breaks a descent parser. If an option pairs "top-down" with "shift-reduce", or claims an LL parser eats left recursion directly, that is the wrong answer.

Watch a string become a tree

The simulator below is the heart of this lesson. Pick one of three example statements. Pane one shows the token chips the lexer produces, with whitespace dropped. Pane two animates the parse, and you can flip between the LL recursive-descent call stack and the LR shift-reduce stack to feel why left recursion behaves differently. Pane three shows the bound query tree against a tiny catalog: names resolve or turn red, and the type of salary drives which multiply operator a bonus expression picks. Flip the privilege toggle to see the exact step where a syntactically perfect query is rejected for a semantic reason.

Lexer · parser · binder tracer step to advance, toggle the parser view and the catalog

Experiment: set salary to money and watch the bonus expression bind to the money multiply, then revoke SELECT on emp and step until the bind fails. The failure lands in the binder, never in the lexer or parser.

Binding: where meaning enters

Binding (PostgreSQL calls it analysis or transformation; many courses call the component the binder) walks the raw tree and produces a validated, typed query tree. The survey lists four jobs: confirm the query is correctly specified, resolve names and references, convert it into the optimizer's internal format, and verify the user is authorized [Architecture of a DB System, sec. 4.1]. Concretely:

Canonicalize table names. Each FROM reference is expanded to a fully qualified name, in the survey's four-part form server.database.schema.table, reduced to schema.table in simpler systems. Aliases and search-path defaults are resolved here, so emp becomes public.emp.
Check existence. The analyzer calls the catalog manager to confirm each table is registered and caches its metadata, then checks that every column reference is a real column of an in-scope table. A typo like naem dies here, not in the parser.
Type-check and resolve overloads. Column types drive which operator and function implementations get used. In (EMP.salary * 1.15) < 75000 the actual multiply and the assumed type of 1.15 depend on whether salary is integer, float, or money. PostgreSQL shows the same mechanism: a FuncCall node "might be transformed to either a FuncExpr or Aggref node depending on whether the referenced name turns out to be an ordinary function or an aggregate function" [PG: The Parser Stage].
Run structural semantic checks: consistent tuple variables, union-compatibility of set operations, legal aggregate usage, correct subquery nesting.
Authorize. Confirm the user holds the needed privilege on each object. Some checks (row-level security) are deferred to execution because they depend on data values.

The output is the query tree. PostgreSQL: it "is structurally similar to the raw parse tree in most places, but it has many differences in detail," with types attached [PG: The Parser Stage]. Note what binding does not read: it never reads table data. It reads only the query and the catalog. Counting how many rows pass salary > 75000 is selectivity estimation, which belongs to the cost-based optimizer in week 11, not here.

Keep this one thing

The parser checks shape, the binder checks meaning. "Table does not exist," "no such column," "wrong type," and "permission denied" are all binder errors, not parser errors. The seam exists because catalog lookups need a transaction and pure syntax checking does not.

Rewrite: normalize without changing meaning

The rewriter simplifies and normalizes the query without changing what it returns. It can rely only on the query and on catalog metadata, and cannot read table data [Architecture, sec. 4.2]. Its main jobs are view expansion (splice a view's definition into the query, recursively, until no views remain; PostgreSQL calls this the rewrite system's most important use, the realization of views), constant folding (10+2+R.y becomes 12+R.y), predicate rewriting (NOT salary > 1000000 becomes salary <= 1000000 so it matches an index; a contradiction like salary < 75000 AND salary > 1000000 collapses to FALSE), semantic optimization using integrity constraints (redundant-join elimination), and subquery flattening.

Rewrite is a logical step, not always a separate module. DB2 has a stand-alone rewriter; SQL Server folds it into an early phase of the optimizer. The boundary is conceptual.

Exam trap

"Query rewrite reads table data to simplify the query" is false. Rewrite uses only the query and the catalog. Reading data and estimating costs is the optimizer's job. Equally false: "view expansion happens in the optimizer." View expansion is driven by catalog view definitions and runs before cost-based optimization.

From query tree to logical plan

The validated, rewritten query tree is translated into a logical plan: a tree of relational-algebra operators. As CMU puts it, the logical plan is roughly the relational algebra expression for the query [CMU 15-445 L15]. The operators are the algebra primitives: selection (filter), projection (column list), cross product and join, grouping and aggregation, set operations, sort, and limit. Our running query maps to projection(name) over selection(salary > 75000) over scan(emp), with a sort on top for the ORDER BY.

The logical plan says only what relation to produce. It does not say how to scan emp (sequential or index), which join algorithm to use, or in what order to join. That is the physical plan's job, and choosing among physical options is the optimizer's job in week 11.

Figure 2. The logical-to-physical boundary. One logical scan can become a sequential scan or an index scan on salary; the selection can fold into the scan or stand alone. CMU states it outright: there does not always exist a one-to-one mapping from logical to physical plans. The optimizer searches these physical choices and picks the cheapest.

The four artifacts at the front of the pipeline, and what each knows.
Artifact	Produced by	Knows	Does not know
Token stream	Lexer	token classes	structure, names, types
Raw parse tree	Parser	syntactic structure	whether names exist, types
Query tree	Binder	names, types, privileges	row counts, costs
Logical plan	Translate / rewrite	algebra shape (what)	algorithms, access paths (how)

In real systems

PostgreSQL keeps explicit, separately named tree stages: string to flex tokens to bison raw parse tree to analyzed query tree to rewritten query tree to plan tree, then a tree-walking iterator executor [PG: The Path of a Query]. SQLite collapses analysis through code generation into one pass and compiles the AST straight to VDBE bytecode for a register virtual machine; EXPLAIN prints that bytecode [SQLite: how it works]. SQLite also uses Lemon rather than yacc, where the tokenizer calls the parser, which makes it reentrant and threadsafe [SQLite: Lemon]. Different middles, identical front: both start with a tokenizer, then a grammar-driven parser, then an AST.

Deeper: why Lemon inverts the call direction

In yacc the parser calls the tokenizer to pull the next token, and yacc keeps parser state in global variables, so it is neither reentrant nor threadsafe. In Lemon the tokenizer calls the parser, feeding it one token at a time, so all state is passed explicitly and the parser is reentrant. SQLite needs this because it parses recursively: handling CREATE TABLE invokes the parser again to generate the INSERT into the sqlite_schema table [SQLite: Lemon]. Lemon also has nonterminal destructors to reclaim memory on a syntax error, and fallback tokens for keyword/identifier collisions.

# the front of the pipeline, as four stages with a hard seam
def front_of_pipeline(sql_text):
    tokens   = lex(sql_text)              # DFA, O(n), no catalog
    raw_tree = parse(tokens)              # grammar only, no catalog
    # ---- seam: everything below needs a transaction ----
    qtree    = bind(raw_tree, catalog)    # resolve names, types, authorize
    qtree    = rewrite(qtree, catalog)    # views, folding; never reads data
    return to_logical_plan(qtree)         # algebra: WHAT, not HOW

Check yourself

Which statement is false?

Existence and name resolution are binding, not parsing. The pure parser makes no catalog lookups; PostgreSQL says the parse stage "does not make any lookups in the system catalogs." The trap conflates parsing (shape) with binding (meaning).

Which statement about parsing strategies is false?

Left recursion is exactly what breaks recursive descent: a rule like expr := expr + term calls itself with no token consumed and loops forever, so the grammar must be refactored. LR/LALR handle left recursion natively.

Which statement about logical and physical plans is false?

Join algorithm and access method are physical decisions. A logical plan says join, not hash-join; scan, not index-scan. One logical join can become nested-loop, hash, or merge; one logical scan can become a sequential or several index scans.

Which statement about query rewrite is false?

Rewrite uses only the query and the catalog, never table data. Reading data and estimating cost is the cost-based optimizer's job, a later stage. View expansion and predicate folding both run on the query and catalog alone.

PostgreSQL separates raw parsing from analysis. Why?

The seam is drawn so that pure syntactic parsing, which needs no transaction, is separated from analysis, which calls the catalog and therefore must run inside a transaction. The optimizer takes the rewritten query tree, not the raw parse tree.

In (EMP.salary * 1.15) < 75000, what determines which multiply operator is chosen?

Overload resolution during binding uses the column's type to pick the operator and to type the literal. Row counts are selectivity (optimizer), and the scan choice is physical planning; neither selects the multiply operator.

Where is the seam in the front of the pipeline, and why is it there?

Between parsing and binding. Lexing and parsing use fixed syntactic rules and need no catalog and no transaction. Binding resolves names and types against the catalog, and catalog lookups can only happen inside a transaction.

Name the four artifacts in order and the one fact each adds.

Token stream (classes) to raw parse tree (structure) to query tree (names + types + authorization) to logical plan (relational-algebra shape, what not how).

Primary source

Read this next: PostgreSQL documentation, The Parser Stage

Highest-trust source for this lesson: it states the rule that parsing makes no catalog lookups, names flex (scan.l) and bison (gram.y), distinguishes the raw parse tree from the query tree, and gives the FuncCall to FuncExpr/Aggref transformation as the worked example of binding. Pair it with The Path of a Query for the five-stage overview.

Ask your teacher

Want me to trace a different statement through all four stages, draw the LL versus LR stacks for a real left-recursive expression grammar, or quiz you on which error each stage catches? Ask, and I will go deeper or harder. I am your teacher for this course, not just a document.