Field Guide / Reference / Storage engine trade-offs
Cheat sheet

LSM tree vs B-tree

A B+tree updates in place and optimizes reads. An LSM tree buffers writes in memory, flushes immutable sorted files, and merges them in the background, trading read and space amplification for cheap sequential writes. Neither is strictly faster: the right answer is workload-dependent.

The one thing to remember

B+tree: one logarithmic random write per insert, one logarithmic random read per lookup, balanced and update-in-place. LSM: writes are sequential and cheap, but you pay on reads (probe several sorted runs) and on space (dead data lingers until compaction). This is the read-update-memory (RUM) conjecture in practice: optimize at most two of read, write, and space amplification.

Side by side

Update-in-place B+tree versus log-structured merge tree, axis by axis
AxisB+tree (update-in-place)LSM tree (RocksDB, LevelDB, Bigtable)
Write path Find leaf (often a random read, since the needed leaf is rarely cached), modify page in place, write page back, plus a WAL record and eventual checkpoint. Small random writes; splits propagate upward. Append to WAL, insert into in-memory sorted memtable (skip list). When full, freeze the memtable and flush it whole as a new immutable SSTable. All sequential. No in-place mutation ever.
Read path One root-to-leaf traversal, about logb(n) page reads, mostly cache hits in upper levels. One sorted structure, so a lookup is deterministic and read-optimized. Check memtable, then SSTables newest to oldest (the newest copy wins). May touch O(number of sorted runs) files. Per-SSTable sparse block index plus a Bloom filter cut this down.
Update / delete Overwrite the value in place; delete removes the entry and may merge underfull leaves. Write a new entry in a newer SSTable; delete writes a tombstone. Old versions and tombstones are dropped only during compaction.
Write amplification Low-to-moderate: one page rewrite per insert plus WAL, but every small change pays a full page write (and triggers SSD garbage collection). High: a byte is rewritten once per level it descends. RocksDB leveled compaction is "often larger than 10". Tiered is much lower.
Read amplification Low: about one logarithmic-depth traversal of a single tree. Higher: a lookup may probe multiple files across multiple levels. Bloom filters reduce it toward one real file read for point lookups.
Space amplification Low for live data, but partly full leaves and (in MVCC heaps) dead row versions cost space; PostgreSQL pays this as VACUUM. Dead (superseded) data sits in older SSTables until compaction reclaims it. Leveled keeps it bounded; tiered lets it grow and spikes transiently during merges.
I/O pattern Random reads then random writes. Catastrophic on spinning disks, costs endurance on flash. Sequential appends and large sequential merges. Turns small random writes into big sequential ones, the whole point.
Concurrency / durability Page latches; WAL protects pages (undo/redo). Locking on shared pages. Immutable files need no read synchronization; only the memtable is mutable (copy-on-write). WAL protects only the volatile memtable.
When it wins Read-heavy and point-lookup workloads, range scans, mixed OLTP, predictable low-latency reads. Write-heavy ingest, high insert/update rates, time-series and logging, flash endurance sensitivity.
Exam traps

"LSM is strictly faster than B-tree." False. LSM optimizes writes at the cost of read and space amplification, and compaction burns background I/O and CPU. B+trees win read-heavy and point-lookup workloads.

"Leveled compaction has lower write amplification than tiered." False, it is the reverse. Leveled minimizes space and read amplification at higher write amplification (>10 in RocksDB); tiered minimizes write amplification at higher read and space amplification.

The three amplifications

TermDefinitionDriven by
Write amplification Bytes physically written / bytes logically inserted. Number of times compaction rewrites a byte. Grows with the number of levels.
Read amplification Files or pages read to answer one lookup. Number of sorted runs probed. Cut by Bloom filters and per-SSTable indexes.
Space amplification Bytes stored on disk / bytes of live data. Dead and superseded data awaiting compaction; transient copies during a merge.
Why you cannot win all three

The RUM conjecture says pick at most two of read, update (write), and memory (space) amplification. RocksDB states it directly: leveled "minimizes space amplification at the cost of read and write amplification"; universal (tiered) "minimizes write amplification at the cost of read and space amplification". The compaction policies are just points on this curve.

Leveled versus tiered compaction

LEVELED one sorted run per non-zero level, non-overlapping key ranges L0 may overlap L1 1 run, ~10x L2 ~100x + read checks at most one file per level + dead data bounded (low space amp) − byte rewritten once per level (write amp >10) target = base x multiplier (default 10): 16KB → 160KB → 1.6MB → 16MB Favors read-heavy workloads. TIERED (size-tiered / universal) several similar-sized runs per tier, merged up when enough pile up T0 4 runs T1 2 runs T2 1 run + few rewrites per byte (low write amp) − read may check many runs (high read amp) − two copies of a big run during a merge (space amp) Favors write-heavy workloads.
Leveled keeps each non-zero level as one sorted run with non-overlapping SSTable ranges, so a read checks at most one file per level (low read and space amplification) but rewrites a byte once per level (write amplification >10). Tiered lets several similar sized runs accumulate per tier and merges them up when enough collect, so few rewrites (low write amplification) but a read may scan many runs and a merge transiently doubles a run's space. L0 is special in leveled: files come straight from memtable flushes and may overlap. RocksDB also has a tiered+leveled hybrid with less write amp than leveled and less space amp than tiered.
Compaction policy at a glance
PolicyWrite ampRead ampSpace ampBest for
Leveledhigh (>10)lowlowread-heavy
Tiered (universal)lowhighhighwrite-heavy
Tiered + leveled (hybrid)mediummediummediumbalanced

Bloom filter: making LSM reads cheap

A Bloom filter is a probabilistic set-membership structure that answers "is key X possibly in this SSTable?" with either definitely not or maybe. It has no false negatives, only false positives. That asymmetry is exactly what an LSM read needs: a "definitely not" skips an entire SSTable with zero I/O, and a false positive only costs one wasted lookup that returns nothing.

test key "X" X h1 h2 h3 m-bit array 0 1 0 1 0 1 0 all k bits set → maybe present, read the SSTable any bit 0 → definitely absent, skip the file with zero I/O false-positive rate at optimal k = (m/n) ln 2 is about (0.6185)^(m/n). RocksDB default ~10 bits/key (m/n = 10) gives ~1% false positives.
A bit array of m bits and k hash functions. Insert sets the k bits; a test that finds any of the k bits clear proves the key is absent. More bits per key lowers the false-positive rate. Standard Bloom filters are not deletable: clearing a bit could clear one shared with another key, which would create false negatives. In Bigtable the filter is per locality group; in RocksDB it is per SSTable.
Bloom filter traps

"Bloom filters can give false negatives." False. A "definitely not" is always correct; only false positives occur. That is exactly why they are safe for skipping SSTables.

"You can delete a key by clearing its bits." False. Clearing a shared bit introduces false negatives. Deletion needs a counting Bloom filter, not the standard one.

Where the real systems sit

LSM-family engines and a B-tree baseline
SystemFamilyWhat grounds it
LevelDB LSM, leveled The compact reference embedded LSM: memtable plus WAL plus immutable SSTables plus leveled compaction. The lineage RocksDB forked from.
RocksDB LSM, pluggable The reference production LSM: memtable, WAL, immutable SSTables, pluggable compaction (leveled, universal/tiered, FIFO), per-SSTable Bloom filters, block cache. The storage engine embedded inside many distributed databases.
Bigtable LSM, distributed The canonical production LSM and the bridge to the distributed world. Writes go to a per-server commit log plus a memtable; the memtable flushes to an immutable SSTable; reads merge the memtable with SSTables. Minor / merging / major compaction reclaim space and remove tombstones. The paper states the design is "analogous to the way the Log-Structured Merge Tree stores updates". Tablets are dynamic range partitions on the row key across tablet servers.
PostgreSQL, SQLite B-tree baseline Update-in-place row-stores over a B+tree. PostgreSQL keeps old MVCC versions in the heap (paying VACUUM) rather than in stacked SSTables; it chose read-optimized update-in-place over the LSM write-optimized path. SQLite is a B-tree store whose WAL mode echoes the LSM append-on-commit idea.
Bigtable compaction trio (the original-source vocabulary)

Minor compaction flushes a frozen memtable to a new SSTable on GFS, shrinking memory and recovery cost. Merging compaction folds a few SSTables plus the memtable into one new SSTable to bound how many files a read must merge. Major compaction rewrites all SSTables into exactly one, dropping every deletion entry and deleted value, so sensitive data physically disappears in bounded time. Deletes are tombstones that suppress older live values until a major compaction removes them.

Picking one in an interview

If the workload is write-heavy, append-mostly, or on flash you care about endurance, reach for an LSM and tune compaction toward tiered. If it is read-heavy with point lookups and range scans and you want predictable low read latency, reach for a B+tree (or an LSM with leveled compaction and fat Bloom filters). Then name the cost you are accepting: read and space amplification for the LSM, write amplification and random I/O for the B-tree.