LSM tree vs B-tree · Reference · Field Guide to the Database Engine

The one thing to remember

B+tree: one logarithmic random write per insert, one logarithmic random read per lookup, balanced and update-in-place. LSM: writes are sequential and cheap, but you pay on reads (probe several sorted runs) and on space (dead data lingers until compaction). This is the read-update-memory (RUM) conjecture in practice: optimize at most two of read, write, and space amplification.

Side by side

Update-in-place B+tree versus log-structured merge tree, axis by axis
Axis	B+tree (update-in-place)	LSM tree (RocksDB, LevelDB, Bigtable)
Write path	Find leaf (often a random read, since the needed leaf is rarely cached), modify page in place, write page back, plus a WAL record and eventual checkpoint. Small random writes; splits propagate upward.	Append to WAL, insert into in-memory sorted memtable (skip list). When full, freeze the memtable and flush it whole as a new immutable SSTable. All sequential. No in-place mutation ever.
Read path	One root-to-leaf traversal, about log_b(n) page reads, mostly cache hits in upper levels. One sorted structure, so a lookup is deterministic and read-optimized.	Check memtable, then SSTables newest to oldest (the newest copy wins). May touch O(number of sorted runs) files. Per-SSTable sparse block index plus a Bloom filter cut this down.
Update / delete	Overwrite the value in place; delete removes the entry and may merge underfull leaves.	Write a new entry in a newer SSTable; delete writes a tombstone. Old versions and tombstones are dropped only during compaction.
Write amplification	Low-to-moderate: one page rewrite per insert plus WAL, but every small change pays a full page write (and triggers SSD garbage collection).	High: a byte is rewritten once per level it descends. RocksDB leveled compaction is "often larger than 10". Tiered is much lower.
Read amplification	Low: about one logarithmic-depth traversal of a single tree.	Higher: a lookup may probe multiple files across multiple levels. Bloom filters reduce it toward one real file read for point lookups.
Space amplification	Low for live data, but partly full leaves and (in MVCC heaps) dead row versions cost space; PostgreSQL pays this as VACUUM.	Dead (superseded) data sits in older SSTables until compaction reclaims it. Leveled keeps it bounded; tiered lets it grow and spikes transiently during merges.
I/O pattern	Random reads then random writes. Catastrophic on spinning disks, costs endurance on flash.	Sequential appends and large sequential merges. Turns small random writes into big sequential ones, the whole point.
Concurrency / durability	Page latches; WAL protects pages (undo/redo). Locking on shared pages.	Immutable files need no read synchronization; only the memtable is mutable (copy-on-write). WAL protects only the volatile memtable.
When it wins	Read-heavy and point-lookup workloads, range scans, mixed OLTP, predictable low-latency reads.	Write-heavy ingest, high insert/update rates, time-series and logging, flash endurance sensitivity.

Exam traps

"LSM is strictly faster than B-tree." False. LSM optimizes writes at the cost of read and space amplification, and compaction burns background I/O and CPU. B+trees win read-heavy and point-lookup workloads.

"Leveled compaction has lower write amplification than tiered." False, it is the reverse. Leveled minimizes space and read amplification at higher write amplification (>10 in RocksDB); tiered minimizes write amplification at higher read and space amplification.

The three amplifications

Term	Definition	Driven by
Write amplification	Bytes physically written / bytes logically inserted.	Number of times compaction rewrites a byte. Grows with the number of levels.
Read amplification	Files or pages read to answer one lookup.	Number of sorted runs probed. Cut by Bloom filters and per-SSTable indexes.
Space amplification	Bytes stored on disk / bytes of live data.	Dead and superseded data awaiting compaction; transient copies during a merge.

Why you cannot win all three

The RUM conjecture says pick at most two of read, update (write), and memory (space) amplification. RocksDB states it directly: leveled "minimizes space amplification at the cost of read and write amplification"; universal (tiered) "minimizes write amplification at the cost of read and space amplification". The compaction policies are just points on this curve.

Leveled versus tiered compaction

Leveled keeps each non-zero level as one sorted run with non-overlapping SSTable ranges, so a read checks at most one file per level (low read and space amplification) but rewrites a byte once per level (write amplification >10). Tiered lets several similar sized runs accumulate per tier and merges them up when enough collect, so few rewrites (low write amplification) but a read may scan many runs and a merge transiently doubles a run's space. L0 is special in leveled: files come straight from memtable flushes and may overlap. RocksDB also has a tiered+leveled hybrid with less write amp than leveled and less space amp than tiered.

Compaction policy at a glance
Policy	Write amp	Read amp	Space amp	Best for
Leveled	high (>10)	low	low	read-heavy
Tiered (universal)	low	high	high	write-heavy
Tiered + leveled (hybrid)	medium	medium	medium	balanced

Bloom filter: making LSM reads cheap

A Bloom filter is a probabilistic set-membership structure that answers "is key X possibly in this SSTable?" with either definitely not or maybe. It has no false negatives, only false positives. That asymmetry is exactly what an LSM read needs: a "definitely not" skips an entire SSTable with zero I/O, and a false positive only costs one wasted lookup that returns nothing.

A bit array of m bits and k hash functions. Insert sets the k bits; a test that finds any of the k bits clear proves the key is absent. More bits per key lowers the false-positive rate. Standard Bloom filters are not deletable: clearing a bit could clear one shared with another key, which would create false negatives. In Bigtable the filter is per locality group; in RocksDB it is per SSTable.

Bloom filter traps

"Bloom filters can give false negatives." False. A "definitely not" is always correct; only false positives occur. That is exactly why they are safe for skipping SSTables.

"You can delete a key by clearing its bits." False. Clearing a shared bit introduces false negatives. Deletion needs a counting Bloom filter, not the standard one.

Where the real systems sit

LSM-family engines and a B-tree baseline
System	Family	What grounds it
LevelDB	LSM, leveled	The compact reference embedded LSM: memtable plus WAL plus immutable SSTables plus leveled compaction. The lineage RocksDB forked from.
RocksDB	LSM, pluggable	The reference production LSM: memtable, WAL, immutable SSTables, pluggable compaction (leveled, universal/tiered, FIFO), per-SSTable Bloom filters, block cache. The storage engine embedded inside many distributed databases.
Bigtable	LSM, distributed	The canonical production LSM and the bridge to the distributed world. Writes go to a per-server commit log plus a memtable; the memtable flushes to an immutable SSTable; reads merge the memtable with SSTables. Minor / merging / major compaction reclaim space and remove tombstones. The paper states the design is "analogous to the way the Log-Structured Merge Tree stores updates". Tablets are dynamic range partitions on the row key across tablet servers.
PostgreSQL, SQLite	B-tree baseline	Update-in-place row-stores over a B+tree. PostgreSQL keeps old MVCC versions in the heap (paying VACUUM) rather than in stacked SSTables; it chose read-optimized update-in-place over the LSM write-optimized path. SQLite is a B-tree store whose WAL mode echoes the LSM append-on-commit idea.

Bigtable compaction trio (the original-source vocabulary)

Minor compaction flushes a frozen memtable to a new SSTable on GFS, shrinking memory and recovery cost. Merging compaction folds a few SSTables plus the memtable into one new SSTable to bound how many files a read must merge. Major compaction rewrites all SSTables into exactly one, dropping every deletion entry and deleted value, so sensitive data physically disappears in bounded time. Deletes are tombstones that suppress older live values until a major compaction removes them.

Picking one in an interview

If the workload is write-heavy, append-mostly, or on flash you care about endurance, reach for an LSM and tune compaction toward tiered. If it is read-heavy with point lookups and range scans and you want predictable low read latency, reach for a B+tree (or an LSM with leveled compaction and fat Bloom filters). Then name the cost you are accepting: read and space amplification for the LSM, write amplification and random I/O for the B-tree.