Field Guide / Part VIII · Modern and distributed / Week 15
Modern architectures

Modern architectures: LSM trees, columnar storage, DuckDB

For fourteen weeks we built one engine: a B+tree over a buffer pool over slotted pages, write-ahead logged and MVCC concurrent. This week we put that engine under two pressures it was not built for, write-heavy ingest and column-wide analytic scans, and watch the comfortable design lose to two others.

By the end you can

  • Explain write amplification on an update-in-place B+tree and why the LSM tree turns random writes into sequential ones.
  • Walk the LSM read and write paths, and reason about the read, write, and space amplification trade-off (the RUM conjecture) when you switch compaction policy.
  • Say why a Bloom filter is safe for skipping SSTables, and exactly what it can and cannot get wrong.
  • Contrast row (NSM) and column (DSM) layout and explain why columnar plus vectorized execution wins OLAP, and DuckDB as a concrete embedded example.
  • Defend the Bigtable write path in a viva, and pick out the false statement an MCQ will hide.

This is the first of two weeks that do not add a new layer. They revisit the whole stack under new pressure. Recall the two journeys. The read journey took SELECT name FROM emp WHERE salary > 75000 ORDER BY name down through the parser (week 8), the optimizer (week 11), the Volcano executor (weeks 9 and 10), the B+tree index (weeks 6 and 7), the buffer pool (weeks 4 and 5), and the slotted heap pages (weeks 2 and 3). The write journey ran an UPDATE through MVCC (week 13) and the ARIES write-ahead log (week 14).

Nothing new is handed in from above and nothing new is handed down. Instead three subsystems get replaced. The update-in-place storage and index layers (weeks 2, 3, 6, 7) give way to a log-structured design: a memtable plus a write-ahead log plus immutable sorted files plus background compaction. The row-oriented slotted page gives way to a column layout for analytic scans. And the pull-based Volcano iterator of week 9 gives way to vectorized, push-based execution. The buffer pool, the WAL durability idea, and MVCC all reappear, in new clothes. Next week, week 16, takes this same machinery and spreads it across machines, using the same paper (Bigtable) as the bridge.

Intuition

A B+tree edits the disk in place: to add one row you find its leaf, read that page, change a few bytes, write the page back. When writes pour in faster than the working set fits in cache, the leaf you need is almost never the leaf in memory, so each small insert becomes a random read followed by a random write. The LSM tree refuses to edit in place. It collects writes in memory, then dumps them to disk in one big sorted file, sequentially, and cleans up later in the background. Many tiny random writes become a few large sequential ones. You pay for that on reads, because the newest copy of a key could be in any of those files.

Three pressures on the comfortable design

The single-node engine we built is read-optimized and row-oriented. Three workloads break it, and each one points to a different fix.

Why the B+tree loses on write-heavy ingest

A single logical insert of a small row into a B+tree can force a page read, an in-memory modification, a page write, a WAL record, and eventual checkpoint flushing. Under a write-heavy load the leaf you need is rarely cached, so inserts become random reads followed by random writes. On spinning disks random writes are the worst case; even on flash they burn endurance and trigger the device's own garbage collection. The B+tree cannot turn those random writes into sequential ones, because its whole correctness story is keeping the on-disk tree sorted and balanced at all times. The fix is to stop keeping the on-disk structure perfectly sorted at all times and instead batch and defer. That is the log-structured merge tree [O'Neil, Cheng, Gawlick, O'Neil 1996].

The second pressure is the analytic scan. A row-store keeps a tuple's columns contiguous, which is ideal when a query reads whole rows: fetch this order, update this account. An analytic query does the opposite. It touches two or three columns of a billion-row table and aggregates them. A row layout drags every other column through the memory hierarchy to reach the two it needs, wasting bandwidth and cache lines. The fix is to lay data out by column, which makes column projection free at the storage layer and unlocks far better compression.

The third pressure is that one machine has finite disk, finite memory, and a single failure domain. That is next week. For now the lesson is that no amount of tuning the buffer pool or the B+tree addresses the first two pressures. They are the wrong shape.

write path (all sequential) WAL append-only memtable sorted, in RAM flush when full SSTable (immutable) compaction merges sorted files in the background fewer, larger sorted runs read path (the price) get(key) check memtable, then SSTables newest first Bloom filter per SSTable "definitely not" skips a file with zero I/O newest copy of a key wins; tombstones mark deletes
Figure 1. The LSM split. Writes only ever append (to the WAL and the in-memory memtable) and flush whole sorted files; nothing is edited in place. Reads pay for that by potentially consulting many files, which Bloom filters and compaction keep bounded.

The log-structured merge tree

The LSM tree was introduced by O'Neil, Cheng, Gawlick, and O'Neil in 1996 as a multi-component structure: a memory-resident component and one or more disk-resident components, with a rolling merge that continuously migrates entries from the smaller component to the larger one [Acta Informatica 1996]. The point is to defer and batch index changes so that many logical inserts are amortized into one large sequential disk pass. Modern implementations (RocksDB, LevelDB, Cassandra, Bigtable) settled on four concrete pieces.

The write path is therefore append to the WAL, insert into the memtable, and occasionally flush an SSTable. All sequential. The cost is paid on reads and on compaction. A point lookup must check the memtable, then potentially several SSTables from newest to oldest, because the key could live in any of them and the newest copy wins. Without help, a read touches one file per sorted run. Two mechanisms cut that down: a per-SSTable block index turns a within-file lookup into a binary search, and a per-SSTable Bloom filter lets a read skip files that cannot contain the key without any I/O.

Keep this one thing

A B+tree does roughly one logarithmic-depth random write per insert and one random read per lookup: balanced, update-in-place, read-optimized. An LSM tree makes writes cheap and sequential but pays in read amplification and space amplification, plus background compaction I/O. Neither is strictly better. The choice is the workload.

The three amplifications and the RUM conjecture

An LSM tree pays its write debt in three different currencies, and you cannot minimize all three at once.

The three amplifications, what each measures, and which compaction policy each favors.
AmplificationWhat it measuresPushed up by
Writebytes physically written / bytes logically insertedleveled compaction (often >10 in RocksDB)
Readfiles or pages read to answer one lookuptiered compaction (many runs to probe)
Spacebytes on disk / bytes of live datatiered compaction (dead data lingers, transient copies)

This is the read-update-memory (RUM) conjecture in practice: you optimize at most two of read, update (write), and memory (space). RocksDB states the policies as direct trade-offs. Leveled compaction "minimizes space amplification at the cost of read and write amplification"; universal (tiered) compaction "minimizes write amplification at the cost of read and space amplification" [RocksDB wiki, Compaction].

Why leveled and tiered sit at opposite ends

Tiered (size-tiered, RocksDB's "universal") groups SSTables of similar size. When enough runs of one size accumulate, they merge into one larger run at the next tier, so each level holds several sorted runs. Few rewrites per byte means low write amplification, but a read may have to check many runs (high read amplification), and two copies of a large run can coexist during a merge (high transient space amplification). Leveled organizes data into levels L0, L1, L2, and so on, where every non-zero level is a single sorted run, range-partitioned across fixed-size files with non-overlapping key ranges. Each level has a size target a fixed multiplier larger than the one above it (the default multiplier is 10, giving roughly 16 KB, 160 KB, 1.6 MB, 16 MB targets from a 16 KB base) [RocksDB wiki, Leveled Compaction]. A read checks at most one file per level (low read amplification) and dead data is bounded (low space amplification), but a byte is rewritten roughly once per level it descends, so write amplification grows with the number of levels. One line: tiered favors write-heavy workloads and accepts slower reads; leveled favors read-heavy workloads and accepts more write churn.

LSM write and compaction visualizer stream writes, flush SSTables, compact, and watch the three amplifications move
Experiment: hold the same write stream and flip compaction policy. Tiered keeps many runs per level (cheap writes, expensive reads). Leveled keeps one run per non-zero level (cheap reads, more rewrites). Then drag bits-per-key on the Bloom filter and run a negative lookup: watch the number of SSTables actually probed collapse.
Exam trap

"Leveled compaction has lower write amplification than tiered" is false, and it is the reverse of the truth. Leveled minimizes space and read amplification at the cost of higher write amplification (the >10 figure in RocksDB); tiered minimizes write amplification at the cost of higher read and space amplification [RocksDB wiki]. A second trap: "LSM trees are strictly faster than B-trees." Also false. LSM optimizes writes and pays on reads and space, and compaction consumes background I/O and CPU. The honest answer is workload-dependent.

Bloom filters: the asymmetry that makes them safe

A Bloom filter answers one question about an SSTable: is key X possibly in this file? It answers either "definitely not" or "maybe". It never produces a false negative, only a false positive. That asymmetry is exactly what an LSM read needs. A "definitely not" lets the reader skip an entire SSTable with zero I/O, and a false positive only costs a wasted lookup that returns nothing.

The mechanics are a bit array of m bits and k independent hash functions. To insert a key, hash it k ways and set those k bits. To test a key, hash it k ways and check those k bits. If any is 0 the key is absent; if all are 1 the key is probably present. The false-positive probability after inserting n keys is approximately (1 - e^(-kn/m))^k, minimized at k = (m/n) ln 2, and at that optimum the false-positive rate is about (0.6185)^(m/n) [Bloom filter false-positive derivation]. RocksDB's common configuration is about 10 bits per key, which gives a false-positive rate near 1 percent.

bit array (m bits), k hashes 1 0 1 1 0 1 0 1 test key Y -> hashes hit a 0 bit definitely NOT here test key Z -> all k bits are 1 MAYBE here (verify) no false negatives ever; only false positives, tuned by m/n cannot delete a key by clearing bits (a bit may be shared)
Figure 2. The one-sided guarantee. A 0 bit proves absence, so "definitely not" is always correct and safe for skipping a file. All bits set only means "maybe," so a match must still be verified by reading the SSTable.
Exam trap

"Bloom filters can give false negatives" is false. A 0 bit means at least one of the key's hash positions was never set, which is impossible if the key were present, so absence is certain. They give only false positives. A related trap: you cannot delete a key from a standard Bloom filter by clearing its bits, because a bit might be shared with another key, which would introduce false negatives. Deletion needs a counting Bloom filter, not the standard one.

Columnar storage: transpose the table

A row-store (the n-ary storage model, NSM) stores the columns of one tuple contiguously: row 1's a, b, c, then row 2's a, b, c. This is the slotted page from week 2. A column-store (the decomposition storage model, DSM) stores all values of column a contiguously, then all of column b, then all of column c. The same logical table, physically transposed. For the running example SELECT name FROM emp WHERE salary > 75000, an analytic version over a fifty-column emp table touches only name and salary. Under DSM the scan reads two column segments and ignores forty-eight. Under NSM it must read every page, because each page carries all fifty columns, to extract the two.

NSM (row-store): tuples contiguous name sal dept hire a 2-column scan still drags dept, hire through cache DSM (column-store): columns contiguous name salary dept hire the scan reads only the two highlighted segments; each column compresses well (one type)
Figure 3. The same table laid out two ways. Column projection becomes free at the storage layer under DSM, and a single-type segment compresses far better (run-length, dictionary, bit-packing) than a heterogeneous row.

Columnar wins OLAP for three reasons that compound. First, I/O and bandwidth: a query reading 2 of 50 columns reads only those 2 segments. Second, compression: a column holds one type and often low cardinality or sortedness, so run-length encoding, dictionary encoding, bit-packing, and frame-of-reference all work far better than on a heterogeneous row, which means fewer bytes moved and more data per cache line. Third, vectorized scans: with one type laid out contiguously, the engine processes a batch of values in a tight loop the CPU can keep in registers and SIMD lanes.

Exam trap

"Columnar storage is always better" is false. Row-stores still win OLTP. A point insert, update, or delete of a whole tuple touches one contiguous place in NSM, whereas a column-store would scatter that one row across many column segments. The row-versus-column choice maps onto the OLTP-versus-OLAP split, which is why hybrid (HTAP) systems keep both layouts.

Vectorized, push-based execution and DuckDB

Recall the Volcano model from week 9: every operator is an iterator with open, next, close, and the root pulls one tuple at a time up the tree. The cost is one virtual call per operator per tuple. For an analytic scan over a billion rows, that per-tuple interpreter overhead dominates. Vectorized execution amortizes it. Instead of one tuple per next call, an operator processes a whole vector (a batch) of values in one call, so the interpreter dispatch is paid once per batch, not once per row.

DuckDB is the concrete embedded example. It is an in-process analytical database, deliberately "SQLite for analytics": it "does not run as a separate process, but completely embedded within a host process," and its engine is a "columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a vector) are processed in one operation" [DuckDB, why_duckdb]. The unit of work is a Vector (one column's values) and a DataChunk (a set of vectors forming a horizontal slice of columns), with a default vector size of 2048 tuples [DuckDB docs, Execution Format]. DuckDB carries several physical vector encodings through execution: flat (uncompressed), constant (one repeated value), dictionary (a child vector plus a selection vector of indices), and sequence (offset plus increment, for row ids). Keeping data dictionary-encoded inside the engine is late materialization in action: a filter runs over the compressed column directly, producing a selection vector of qualifying positions, and full rows are only stitched together at the very end.

DuckDB also moved from the pull-based Volcano model (which it used before 2021) to a push-based model, where source operators push chunks up into the pipeline. Push makes it easier to add operators and to run several pipelines concurrently, and it pairs with morsel-driven parallelism, where work is split into small morsels of a pipeline's input that worker threads grab dynamically. And despite being columnar and analytic, DuckDB is still transactional: it uses a bulk-optimized, HyPer-style MVCC for ACID, the same multi-version idea from week 13 in new clothes.

Exam trap

"Vectorized execution means compiling the query to machine code" is false, at least for DuckDB. DuckDB queries are still interpreted. Vectorization means processing a batch (a vector, default 2048 values) per operator call to amortize interpreter overhead, which is distinct from query compilation [DuckDB, why_duckdb]. Two different techniques attack the same per-tuple-overhead problem; do not conflate them.

Row versus column scan race pick how many columns a query touches and watch the bytes actually read
Experiment: drop the number of touched columns toward 2 and watch the column-store bar shrink while the row-store bar stays full (it must read every column on every page). Then add compression to the column segments and watch only the column bar shrink further.
In real systems

Core PostgreSQL is a row-store with an update-in-place heap and a B+tree (nbtree), not an LSM tree; its MVCC keeps old row versions in the heap rather than in stacked SSTables, which is exactly why it needs VACUUM (week 3). Columnar and LSM behavior in the PostgreSQL world come from extensions and forks (column-store extensions, OrioleDB), a useful contrast: PostgreSQL chose read-optimized update-in-place and pays the VACUUM cost, rather than the LSM write-optimized path. SQLite is the embedded OLTP row-store baseline that DuckDB consciously mirrors for OLAP, and SQLite's WAL mode maps cleanly onto the LSM append idea even though SQLite is a B-tree store: in WAL mode a COMMIT appends a commit record to a separate WAL file, and a checkpoint later moves those pages into the main database file [SQLite, Write-Ahead Logging]. The mnemonic: SQLite is embedded row-store OLTP, DuckDB is embedded column-store OLAP.

The LSM write and read paths, step by step

  1. Write: append the mutation to the WAL, then insert it into the sorted in-memory memtable. Both are sequential. The write is durable once the log write commits.
  2. Flush: when the memtable hits its size threshold, freeze it, start a fresh memtable, and write the frozen one to disk as a new immutable SSTable. This is Bigtable's minor compaction.
  3. Compact: a background pass merges several SSTables (and the memtable) into one, dropping superseded versions and tombstones, bounding the number of files a read must merge.
  4. Read: merge the memtable with the SSTables from newest to oldest; the newest copy of the key wins. A per-SSTable Bloom filter skips files that cannot hold the key, and a block index turns the within-file search into a binary search.
  5. Delete: write a tombstone into a new SSTable that suppresses older live values during the merge. The data physically disappears only when a major compaction rewrites everything into one file with no tombstones.
# LSM put and get, the shape RocksDB and Bigtable share
def put(key, value):
    wal.append(key, value)            # sequential, durable
    memtable.insert(key, value)       # sorted, in memory
    if memtable.size >= THRESHOLD:
        flush(memtable)               # freeze -> new immutable SSTable

def get(key):
    if key in memtable: return memtable[key]
    for sst in sstables_newest_first():
        if not sst.bloom.might_contain(key): continue   # zero I/O skip
        v = sst.lookup(key)           # binary search via block index
        if v is TOMBSTONE: return NOT_FOUND
        if v is not None: return v
    return NOT_FOUND
Deeper: why immutability is the load-bearing property

Everything good about the LSM read and write paths follows from SSTables never being edited in place. Because SSTables are immutable, reads need no file-system synchronization, so concurrency control on stored data is nearly free; the only mutable structure read and written together is the memtable, which Bigtable makes copy-on-write so reads and writes run in parallel [Chang et al. 2006, Section 6]. Deletion becomes garbage collection of obsolete SSTables, done as mark-and-sweep. And a range split is cheap because the two children share the parent's SSTables instead of rewriting them. The price of immutability is precisely the read and space amplification we have been measuring: you cannot overwrite a stale value, so it lingers in an old file until compaction removes it, and a read may have to look past several stale copies to find the live one.

Paper viva: Bigtable

This week's assigned paper is "Bigtable: A Distributed Storage System for Structured Data" (Chang et al., OSDI 2006). It is the canonical production LSM write path and the bridge to next week's distributed material. The examiner will not ask you to recite it. They will ask you to walk its mechanism and defend why it is shaped that way. Try each before opening the model answer.

Viva prompt 1

Walk the Bigtable write path and the read path. Why is the read a cheap merge, and what keeps the number of files a read must touch bounded?

Model answer

On a write, the tablet server first checks the mutation is well-formed and the sender is authorized (reading the permitted-writers list from a Chubby file, almost always a cache hit), writes the mutation to the per-server commit log using group commit, and only after that commit inserts it into the in-memory sorted memtable [Chang et al. 2006, Section 5.3]. The data is durable once the log write commits; the memtable insert just makes it readable. A read executes over a merged view of the memtable plus the tablet's sequence of SSTables. Both the memtable and every SSTable are lexicographically sorted, so the read is an efficient merge of sorted streams rather than a scan of unsorted files. What keeps the file count bounded is compaction (Section 5.4): minor compaction flushes a full memtable to a new SSTable; merging compaction folds a few SSTables plus the memtable into one; major compaction rewrites all SSTables into exactly one. Per-locality-group Bloom filters and the block and scan caches further cut how many SSTables a read actually touches (Section 6).

Viva prompt 2

SSTables are immutable. Define the three compaction types, and explain how Bigtable deletes data given that nothing is ever overwritten in place.

Model answer

Minor compaction freezes a full memtable and writes it to GFS as a new SSTable, shrinking tablet-server memory and the commit-log replay needed on recovery. Merging compaction reads a few SSTables plus the memtable and writes one new SSTable, bounding how many files a read must merge. Major compaction is a merging compaction that rewrites all of a tablet's SSTables into exactly one with no deletion entries and no deleted data [Chang et al. 2006, Section 5.4]. A delete cannot overwrite, so it writes a deletion entry (a tombstone) into a newer SSTable that suppresses older live values during the merged read. The data is physically removed only when a major compaction produces an SSTable without any deletion entries. Bigtable cycles major compactions over all tablets so deleted data disappears in bounded time, which matters for sensitive data. This is the direct ancestor of RocksDB tombstones and full compaction.

Viva prompt 3

The paper says its design is "analogous to" the log-structured merge tree. Make that connection precise, and name what immutability buys beyond cheap writes.

Model answer

Bigtable's memtable-plus-immutable-SSTable design with compaction is an LSM-tree storage engine: writes buffer in a sorted in-memory structure, flush to immutable sorted files, and reads merge memory and disk. Section 10 of the paper states explicitly that this is "analogous to the way that the Log-Structured Merge Tree stores updates," citing O'Neil et al. 1996 [Chang et al. 2006, Section 10]. Immutability buys three things beyond cheap sequential writes (Section 6). Reads need no file-system synchronization, so row concurrency control is cheap, the only mutable shared structure being the copy-on-write memtable. Permanent deletion becomes mark-and-sweep garbage collection of obsolete SSTables. And tablet splits are cheap because the child tablets share the parent's SSTables instead of rewriting them. The cost is read amplification and space amplification, the same trade-off the RUM conjecture describes: a read may have to merge many SSTables, and a small random read that ships a full 64 KB block to return a 1 KB value is Bigtable's worst case (Section 7).

Check yourself

Which statement about LSM trees versus B+trees is false?

The false one is the strict-faster claim. The LSM tree buys cheap writes at the cost of read amplification and space amplification, plus background compaction I/O; B+trees win read-heavy and point-lookup workloads. The right answer is workload-dependent, not a blanket ordering. The other three describe the write batching, the in-place B+tree cost, and the merged read correctly.

Which statement about a Bloom filter on an SSTable is false?

A Bloom filter never gives false negatives. A 0 bit at any of the key's hash positions proves the key was never inserted, so absence is certain; the structure errs only by false positives. That one-sided guarantee is exactly why "definitely not" is safe for skipping files. The other three statements are all true.

Which statement about leveled and tiered compaction is false?

The false one reverses the trade-off. Leveled minimizes space and read amplification at the cost of higher write amplification (often over ten in RocksDB), because a byte is rewritten roughly once per level it descends. Tiered minimizes write amplification at the cost of read and space amplification. The first three statements are correct.

Which statement about row versus column storage is false?

The false one is the OLTP claim. A single-row write under a column-store scatters that one row across many column segments, so the row-store wins there. Column-stores win scan-and- aggregate OLAP over a few columns. The other three statements correctly describe projection, compression, and the row layout.

Which statement about DuckDB's vectorized execution is false?

The false one is the compilation claim. DuckDB queries are still interpreted; vectorization amortizes per-tuple interpreter overhead by working on a 2048-value vector per call, which is a different technique from compiling the query to machine code. The other three statements are accurate.

In Bigtable, what happens at the instant a write is durable?

Durability comes from the commit log. The tablet server writes the mutation to the commit log (with group commit) and only after that commit inserts it into the memtable, which is an in-memory structure and would be lost on a crash without the log. The flush and the major compaction happen much later and are about read cost and space, not durability.

State the RUM conjecture in one sentence, and place leveled and tiered compaction on it.
You can optimize at most two of read, update (write), and memory (space) amplification. Leveled minimizes read and space amplification at the cost of write amplification; tiered minimizes write amplification at the cost of read and space amplification.
Why is a Bloom filter's "definitely not" answer always safe, but its "maybe" not?
A 0 bit at any of the key's hash positions could only happen if the key was never inserted, so a negative is certain (no false negatives). All bits being 1 can happen by collision with other keys, so a positive is only "maybe" and must be verified by reading the file.
Primary source
Read this next: Bigtable: A Distributed Storage System for Structured Data (Chang et al., OSDI 2006)

The canonical production LSM write path: commit log plus memtable plus immutable SSTables, with minor, merging, and major compaction and tombstones. Focus on Section 5.3 (the write and read paths), Section 5.4 (the three compactions), and Section 6 (Bloom filters and exploiting immutability). It is also next week's bridge into the distributed world. research.google.com/archive/bigtable-osdi06.pdf

Ask your teacher

Want to push harder? Ask me to derive the (0.6185)^(m/n) Bloom-filter optimum, to walk a concrete leveled compaction cascade with the multiplier of 10, or to trace our running query through a DuckDB vectorized pipeline with a selection vector. I am your teacher for this course, not just a document.