Modern architectures: LSM trees, columnar storage, DuckDB
For fourteen weeks we built one engine: a B+tree over a buffer pool over slotted pages, write-ahead logged and MVCC concurrent. This week we put that engine under two pressures it was not built for, write-heavy ingest and column-wide analytic scans, and watch the comfortable design lose to two others.
By the end you can
- Explain write amplification on an update-in-place B+tree and why the LSM tree turns random writes into sequential ones.
- Walk the LSM read and write paths, and reason about the read, write, and space amplification trade-off (the RUM conjecture) when you switch compaction policy.
- Say why a Bloom filter is safe for skipping SSTables, and exactly what it can and cannot get wrong.
- Contrast row (NSM) and column (DSM) layout and explain why columnar plus vectorized execution wins OLAP, and DuckDB as a concrete embedded example.
- Defend the Bigtable write path in a viva, and pick out the false statement an MCQ will hide.
This is the first of two weeks that do not add a new layer. They revisit the whole stack under new
pressure. Recall the two journeys. The read journey took
SELECT name FROM emp WHERE salary > 75000 ORDER BY name down through the parser
(week 8), the optimizer (week 11),
the Volcano executor (weeks 9 and
10), the B+tree index (weeks 6
and 7), the buffer pool (weeks 4
and 5), and the slotted heap pages
(weeks 2 and 3).
The write journey ran an UPDATE through MVCC (week 13) and
the ARIES write-ahead log (week 14).
Nothing new is handed in from above and nothing new is handed down. Instead three subsystems get replaced. The update-in-place storage and index layers (weeks 2, 3, 6, 7) give way to a log-structured design: a memtable plus a write-ahead log plus immutable sorted files plus background compaction. The row-oriented slotted page gives way to a column layout for analytic scans. And the pull-based Volcano iterator of week 9 gives way to vectorized, push-based execution. The buffer pool, the WAL durability idea, and MVCC all reappear, in new clothes. Next week, week 16, takes this same machinery and spreads it across machines, using the same paper (Bigtable) as the bridge.
A B+tree edits the disk in place: to add one row you find its leaf, read that page, change a few bytes, write the page back. When writes pour in faster than the working set fits in cache, the leaf you need is almost never the leaf in memory, so each small insert becomes a random read followed by a random write. The LSM tree refuses to edit in place. It collects writes in memory, then dumps them to disk in one big sorted file, sequentially, and cleans up later in the background. Many tiny random writes become a few large sequential ones. You pay for that on reads, because the newest copy of a key could be in any of those files.
Three pressures on the comfortable design
The single-node engine we built is read-optimized and row-oriented. Three workloads break it, and each one points to a different fix.
A single logical insert of a small row into a B+tree can force a page read, an in-memory modification, a page write, a WAL record, and eventual checkpoint flushing. Under a write-heavy load the leaf you need is rarely cached, so inserts become random reads followed by random writes. On spinning disks random writes are the worst case; even on flash they burn endurance and trigger the device's own garbage collection. The B+tree cannot turn those random writes into sequential ones, because its whole correctness story is keeping the on-disk tree sorted and balanced at all times. The fix is to stop keeping the on-disk structure perfectly sorted at all times and instead batch and defer. That is the log-structured merge tree [O'Neil, Cheng, Gawlick, O'Neil 1996].
The second pressure is the analytic scan. A row-store keeps a tuple's columns contiguous, which is ideal when a query reads whole rows: fetch this order, update this account. An analytic query does the opposite. It touches two or three columns of a billion-row table and aggregates them. A row layout drags every other column through the memory hierarchy to reach the two it needs, wasting bandwidth and cache lines. The fix is to lay data out by column, which makes column projection free at the storage layer and unlocks far better compression.
The third pressure is that one machine has finite disk, finite memory, and a single failure domain. That is next week. For now the lesson is that no amount of tuning the buffer pool or the B+tree addresses the first two pressures. They are the wrong shape.
The log-structured merge tree
The LSM tree was introduced by O'Neil, Cheng, Gawlick, and O'Neil in 1996 as a multi-component structure: a memory-resident component and one or more disk-resident components, with a rolling merge that continuously migrates entries from the smaller component to the larger one [Acta Informatica 1996]. The point is to defer and batch index changes so that many logical inserts are amortized into one large sequential disk pass. Modern implementations (RocksDB, LevelDB, Cassandra, Bigtable) settled on four concrete pieces.
- Memtable. An in-memory sorted structure, commonly a skip list, that absorbs all writes. A write is an append to the memtable plus an append to the log. Both are cheap and sequential.
- Write-ahead log. An on-disk append-only log written before or alongside the memtable update, so the volatile memtable can be recovered after a crash. This is the same durability idea as the relational WAL from week 14, used here only to protect the in-memory memtable, not to undo or redo pages.
- Immutable SSTables. When the memtable fills, it is frozen, a fresh memtable takes over, and the frozen one is flushed to disk as a sorted string table: a persistent, ordered, immutable map from key to value, the exact definition from the Bigtable paper [Chang et al. 2006, Section 4]. Immutability is the whole trick. Files are written once, sequentially, and never updated in place. Updates and deletes are new entries in newer SSTables; a delete writes a tombstone marker.
- Compaction. A background process that merges SSTables, drops superseded versions and tombstones, and keeps the number of files (and so the read cost) bounded.
The write path is therefore append to the WAL, insert into the memtable, and occasionally flush an SSTable. All sequential. The cost is paid on reads and on compaction. A point lookup must check the memtable, then potentially several SSTables from newest to oldest, because the key could live in any of them and the newest copy wins. Without help, a read touches one file per sorted run. Two mechanisms cut that down: a per-SSTable block index turns a within-file lookup into a binary search, and a per-SSTable Bloom filter lets a read skip files that cannot contain the key without any I/O.
A B+tree does roughly one logarithmic-depth random write per insert and one random read per lookup: balanced, update-in-place, read-optimized. An LSM tree makes writes cheap and sequential but pays in read amplification and space amplification, plus background compaction I/O. Neither is strictly better. The choice is the workload.
The three amplifications and the RUM conjecture
An LSM tree pays its write debt in three different currencies, and you cannot minimize all three at once.
| Amplification | What it measures | Pushed up by |
|---|---|---|
| Write | bytes physically written / bytes logically inserted | leveled compaction (often >10 in RocksDB) |
| Read | files or pages read to answer one lookup | tiered compaction (many runs to probe) |
| Space | bytes on disk / bytes of live data | tiered compaction (dead data lingers, transient copies) |
This is the read-update-memory (RUM) conjecture in practice: you optimize at most two of read, update (write), and memory (space). RocksDB states the policies as direct trade-offs. Leveled compaction "minimizes space amplification at the cost of read and write amplification"; universal (tiered) compaction "minimizes write amplification at the cost of read and space amplification" [RocksDB wiki, Compaction].
Tiered (size-tiered, RocksDB's "universal") groups SSTables of similar size. When enough runs of one size accumulate, they merge into one larger run at the next tier, so each level holds several sorted runs. Few rewrites per byte means low write amplification, but a read may have to check many runs (high read amplification), and two copies of a large run can coexist during a merge (high transient space amplification). Leveled organizes data into levels L0, L1, L2, and so on, where every non-zero level is a single sorted run, range-partitioned across fixed-size files with non-overlapping key ranges. Each level has a size target a fixed multiplier larger than the one above it (the default multiplier is 10, giving roughly 16 KB, 160 KB, 1.6 MB, 16 MB targets from a 16 KB base) [RocksDB wiki, Leveled Compaction]. A read checks at most one file per level (low read amplification) and dead data is bounded (low space amplification), but a byte is rewritten roughly once per level it descends, so write amplification grows with the number of levels. One line: tiered favors write-heavy workloads and accepts slower reads; leveled favors read-heavy workloads and accepts more write churn.
"Leveled compaction has lower write amplification than tiered" is false, and it is the reverse of the truth. Leveled minimizes space and read amplification at the cost of higher write amplification (the >10 figure in RocksDB); tiered minimizes write amplification at the cost of higher read and space amplification [RocksDB wiki]. A second trap: "LSM trees are strictly faster than B-trees." Also false. LSM optimizes writes and pays on reads and space, and compaction consumes background I/O and CPU. The honest answer is workload-dependent.
Bloom filters: the asymmetry that makes them safe
A Bloom filter answers one question about an SSTable: is key X possibly in this file? It answers either "definitely not" or "maybe". It never produces a false negative, only a false positive. That asymmetry is exactly what an LSM read needs. A "definitely not" lets the reader skip an entire SSTable with zero I/O, and a false positive only costs a wasted lookup that returns nothing.
The mechanics are a bit array of m bits and k independent hash functions. To insert a key, hash it k ways and set those k bits. To test a key, hash it k ways and check those k bits. If any is 0 the key is absent; if all are 1 the key is probably present. The false-positive probability after inserting n keys is approximately (1 - e^(-kn/m))^k, minimized at k = (m/n) ln 2, and at that optimum the false-positive rate is about (0.6185)^(m/n) [Bloom filter false-positive derivation]. RocksDB's common configuration is about 10 bits per key, which gives a false-positive rate near 1 percent.
"Bloom filters can give false negatives" is false. A 0 bit means at least one of the key's hash positions was never set, which is impossible if the key were present, so absence is certain. They give only false positives. A related trap: you cannot delete a key from a standard Bloom filter by clearing its bits, because a bit might be shared with another key, which would introduce false negatives. Deletion needs a counting Bloom filter, not the standard one.
Columnar storage: transpose the table
A row-store (the n-ary storage model, NSM) stores the columns of one tuple contiguously: row 1's a, b,
c, then row 2's a, b, c. This is the slotted page from week
2. A column-store (the decomposition storage model, DSM) stores all values of column a contiguously,
then all of column b, then all of column c. The same logical table, physically transposed. For the
running example SELECT name FROM emp WHERE salary > 75000, an analytic version over a
fifty-column emp table touches only name and salary. Under DSM the
scan reads two column segments and ignores forty-eight. Under NSM it must read every page, because each
page carries all fifty columns, to extract the two.
Columnar wins OLAP for three reasons that compound. First, I/O and bandwidth: a query reading 2 of 50 columns reads only those 2 segments. Second, compression: a column holds one type and often low cardinality or sortedness, so run-length encoding, dictionary encoding, bit-packing, and frame-of-reference all work far better than on a heterogeneous row, which means fewer bytes moved and more data per cache line. Third, vectorized scans: with one type laid out contiguously, the engine processes a batch of values in a tight loop the CPU can keep in registers and SIMD lanes.
"Columnar storage is always better" is false. Row-stores still win OLTP. A point insert, update, or delete of a whole tuple touches one contiguous place in NSM, whereas a column-store would scatter that one row across many column segments. The row-versus-column choice maps onto the OLTP-versus-OLAP split, which is why hybrid (HTAP) systems keep both layouts.
Vectorized, push-based execution and DuckDB
Recall the Volcano model from week 9: every operator is an iterator with open, next, close, and the root pulls one tuple at a time up the tree. The cost is one virtual call per operator per tuple. For an analytic scan over a billion rows, that per-tuple interpreter overhead dominates. Vectorized execution amortizes it. Instead of one tuple per next call, an operator processes a whole vector (a batch) of values in one call, so the interpreter dispatch is paid once per batch, not once per row.
DuckDB is the concrete embedded example. It is an in-process analytical database, deliberately "SQLite for analytics": it "does not run as a separate process, but completely embedded within a host process," and its engine is a "columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a vector) are processed in one operation" [DuckDB, why_duckdb]. The unit of work is a Vector (one column's values) and a DataChunk (a set of vectors forming a horizontal slice of columns), with a default vector size of 2048 tuples [DuckDB docs, Execution Format]. DuckDB carries several physical vector encodings through execution: flat (uncompressed), constant (one repeated value), dictionary (a child vector plus a selection vector of indices), and sequence (offset plus increment, for row ids). Keeping data dictionary-encoded inside the engine is late materialization in action: a filter runs over the compressed column directly, producing a selection vector of qualifying positions, and full rows are only stitched together at the very end.
DuckDB also moved from the pull-based Volcano model (which it used before 2021) to a push-based model, where source operators push chunks up into the pipeline. Push makes it easier to add operators and to run several pipelines concurrently, and it pairs with morsel-driven parallelism, where work is split into small morsels of a pipeline's input that worker threads grab dynamically. And despite being columnar and analytic, DuckDB is still transactional: it uses a bulk-optimized, HyPer-style MVCC for ACID, the same multi-version idea from week 13 in new clothes.
"Vectorized execution means compiling the query to machine code" is false, at least for DuckDB. DuckDB queries are still interpreted. Vectorization means processing a batch (a vector, default 2048 values) per operator call to amortize interpreter overhead, which is distinct from query compilation [DuckDB, why_duckdb]. Two different techniques attack the same per-tuple-overhead problem; do not conflate them.
Core PostgreSQL is a row-store with an update-in-place heap and a B+tree (nbtree), not an LSM tree; its MVCC keeps old row versions in the heap rather than in stacked SSTables, which is exactly why it needs VACUUM (week 3). Columnar and LSM behavior in the PostgreSQL world come from extensions and forks (column-store extensions, OrioleDB), a useful contrast: PostgreSQL chose read-optimized update-in-place and pays the VACUUM cost, rather than the LSM write-optimized path. SQLite is the embedded OLTP row-store baseline that DuckDB consciously mirrors for OLAP, and SQLite's WAL mode maps cleanly onto the LSM append idea even though SQLite is a B-tree store: in WAL mode a COMMIT appends a commit record to a separate WAL file, and a checkpoint later moves those pages into the main database file [SQLite, Write-Ahead Logging]. The mnemonic: SQLite is embedded row-store OLTP, DuckDB is embedded column-store OLAP.
The LSM write and read paths, step by step
- Write: append the mutation to the WAL, then insert it into the sorted in-memory memtable. Both are sequential. The write is durable once the log write commits.
- Flush: when the memtable hits its size threshold, freeze it, start a fresh memtable, and write the frozen one to disk as a new immutable SSTable. This is Bigtable's minor compaction.
- Compact: a background pass merges several SSTables (and the memtable) into one, dropping superseded versions and tombstones, bounding the number of files a read must merge.
- Read: merge the memtable with the SSTables from newest to oldest; the newest copy of the key wins. A per-SSTable Bloom filter skips files that cannot hold the key, and a block index turns the within-file search into a binary search.
- Delete: write a tombstone into a new SSTable that suppresses older live values during the merge. The data physically disappears only when a major compaction rewrites everything into one file with no tombstones.
# LSM put and get, the shape RocksDB and Bigtable share
def put(key, value):
wal.append(key, value) # sequential, durable
memtable.insert(key, value) # sorted, in memory
if memtable.size >= THRESHOLD:
flush(memtable) # freeze -> new immutable SSTable
def get(key):
if key in memtable: return memtable[key]
for sst in sstables_newest_first():
if not sst.bloom.might_contain(key): continue # zero I/O skip
v = sst.lookup(key) # binary search via block index
if v is TOMBSTONE: return NOT_FOUND
if v is not None: return v
return NOT_FOUND
Deeper: why immutability is the load-bearing property
Everything good about the LSM read and write paths follows from SSTables never being edited in place. Because SSTables are immutable, reads need no file-system synchronization, so concurrency control on stored data is nearly free; the only mutable structure read and written together is the memtable, which Bigtable makes copy-on-write so reads and writes run in parallel [Chang et al. 2006, Section 6]. Deletion becomes garbage collection of obsolete SSTables, done as mark-and-sweep. And a range split is cheap because the two children share the parent's SSTables instead of rewriting them. The price of immutability is precisely the read and space amplification we have been measuring: you cannot overwrite a stale value, so it lingers in an old file until compaction removes it, and a read may have to look past several stale copies to find the live one.
Paper viva: Bigtable
This week's assigned paper is "Bigtable: A Distributed Storage System for Structured Data" (Chang et al., OSDI 2006). It is the canonical production LSM write path and the bridge to next week's distributed material. The examiner will not ask you to recite it. They will ask you to walk its mechanism and defend why it is shaped that way. Try each before opening the model answer.
Walk the Bigtable write path and the read path. Why is the read a cheap merge, and what keeps the number of files a read must touch bounded?
Model answer
On a write, the tablet server first checks the mutation is well-formed and the sender is authorized (reading the permitted-writers list from a Chubby file, almost always a cache hit), writes the mutation to the per-server commit log using group commit, and only after that commit inserts it into the in-memory sorted memtable [Chang et al. 2006, Section 5.3]. The data is durable once the log write commits; the memtable insert just makes it readable. A read executes over a merged view of the memtable plus the tablet's sequence of SSTables. Both the memtable and every SSTable are lexicographically sorted, so the read is an efficient merge of sorted streams rather than a scan of unsorted files. What keeps the file count bounded is compaction (Section 5.4): minor compaction flushes a full memtable to a new SSTable; merging compaction folds a few SSTables plus the memtable into one; major compaction rewrites all SSTables into exactly one. Per-locality-group Bloom filters and the block and scan caches further cut how many SSTables a read actually touches (Section 6).
SSTables are immutable. Define the three compaction types, and explain how Bigtable deletes data given that nothing is ever overwritten in place.
Model answer
Minor compaction freezes a full memtable and writes it to GFS as a new SSTable, shrinking tablet-server memory and the commit-log replay needed on recovery. Merging compaction reads a few SSTables plus the memtable and writes one new SSTable, bounding how many files a read must merge. Major compaction is a merging compaction that rewrites all of a tablet's SSTables into exactly one with no deletion entries and no deleted data [Chang et al. 2006, Section 5.4]. A delete cannot overwrite, so it writes a deletion entry (a tombstone) into a newer SSTable that suppresses older live values during the merged read. The data is physically removed only when a major compaction produces an SSTable without any deletion entries. Bigtable cycles major compactions over all tablets so deleted data disappears in bounded time, which matters for sensitive data. This is the direct ancestor of RocksDB tombstones and full compaction.
The paper says its design is "analogous to" the log-structured merge tree. Make that connection precise, and name what immutability buys beyond cheap writes.
Model answer
Bigtable's memtable-plus-immutable-SSTable design with compaction is an LSM-tree storage engine: writes buffer in a sorted in-memory structure, flush to immutable sorted files, and reads merge memory and disk. Section 10 of the paper states explicitly that this is "analogous to the way that the Log-Structured Merge Tree stores updates," citing O'Neil et al. 1996 [Chang et al. 2006, Section 10]. Immutability buys three things beyond cheap sequential writes (Section 6). Reads need no file-system synchronization, so row concurrency control is cheap, the only mutable shared structure being the copy-on-write memtable. Permanent deletion becomes mark-and-sweep garbage collection of obsolete SSTables. And tablet splits are cheap because the child tablets share the parent's SSTables instead of rewriting them. The cost is read amplification and space amplification, the same trade-off the RUM conjecture describes: a read may have to merge many SSTables, and a small random read that ships a full 64 KB block to return a 1 KB value is Bigtable's worst case (Section 7).
Check yourself
Which statement about LSM trees versus B+trees is false?
The false one is the strict-faster claim. The LSM tree buys cheap writes at the cost of read amplification and space amplification, plus background compaction I/O; B+trees win read-heavy and point-lookup workloads. The right answer is workload-dependent, not a blanket ordering. The other three describe the write batching, the in-place B+tree cost, and the merged read correctly.
Which statement about a Bloom filter on an SSTable is false?
A Bloom filter never gives false negatives. A 0 bit at any of the key's hash positions proves the key was never inserted, so absence is certain; the structure errs only by false positives. That one-sided guarantee is exactly why "definitely not" is safe for skipping files. The other three statements are all true.
Which statement about leveled and tiered compaction is false?
The false one reverses the trade-off. Leveled minimizes space and read amplification at the cost of higher write amplification (often over ten in RocksDB), because a byte is rewritten roughly once per level it descends. Tiered minimizes write amplification at the cost of read and space amplification. The first three statements are correct.
Which statement about row versus column storage is false?
The false one is the OLTP claim. A single-row write under a column-store scatters that one row across many column segments, so the row-store wins there. Column-stores win scan-and- aggregate OLAP over a few columns. The other three statements correctly describe projection, compression, and the row layout.
Which statement about DuckDB's vectorized execution is false?
The false one is the compilation claim. DuckDB queries are still interpreted; vectorization amortizes per-tuple interpreter overhead by working on a 2048-value vector per call, which is a different technique from compiling the query to machine code. The other three statements are accurate.
In Bigtable, what happens at the instant a write is durable?
Durability comes from the commit log. The tablet server writes the mutation to the commit log (with group commit) and only after that commit inserts it into the memtable, which is an in-memory structure and would be lost on a crash without the log. The flush and the major compaction happen much later and are about read cost and space, not durability.
The canonical production LSM write path: commit log plus memtable plus immutable SSTables, with minor, merging, and major compaction and tombstones. Focus on Section 5.3 (the write and read paths), Section 5.4 (the three compactions), and Section 6 (Bloom filters and exploiting immutability). It is also next week's bridge into the distributed world. research.google.com/archive/bigtable-osdi06.pdf
Want to push harder? Ask me to derive the (0.6185)^(m/n) Bloom-filter optimum, to walk a concrete leveled compaction cascade with the multiplier of 10, or to trace our running query through a DuckDB vectorized pipeline with a selection vector. I am your teacher for this course, not just a document.