Field Guide / Part VIII · Modern and distributed / Week 16
Distributed databases

Distributed databases: partitioning, replication, 2PC, CAP

Once data outgrows one machine and the network can drop messages, a transaction has to give something up. This last lesson takes the single-node engine you built over fifteen weeks and spreads it across machines, then names exactly what breaks and what you trade to fix it.

By the end you can

  • Choose hash versus range partitioning and explain why one fans out range queries and the other hotspots.
  • Reason about leader/follower and quorum replication, and read the W + R > N overlap rule.
  • Walk two-phase commit, locate the in-doubt state, and explain why a coordinator crash blocks.
  • State the CAP theorem precisely (consistency means linearizability, not the C in ACID) and add the PACELC else-branch.
  • Explain how consensus (Raft, equivalent to Multi-Paxos) replicates the coordinator log so a minority can fail without blocking.

Hold the two journeys of this course in mind. Journey 1 sent SELECT name FROM emp WHERE salary > 75000 ORDER BY name down the stack: parser (week 8), optimizer (week 11), executor (weeks 9 and 10), index (weeks 6 and 7), buffer pool (weeks 4 and 5), storage (weeks 2 and 3). Journey 2 sent UPDATE account SET balance = balance - 100 WHERE id = 42 through concurrency control (week 13) and the write-ahead log (week 14). Every one of those layers assumed a single machine. Last week (week 15) we kept the single machine but changed its storage shape to the log-structured merge tree.

This week does not add a layer below storage. It takes the whole engine and copies it onto many machines connected by a network that can drop and delay messages. Three new problems appear and each maps back onto something you already know. Splitting emp across machines is the hash versus B+tree trade-off from weeks 6 and 7, now at cluster scale. Keeping copies of a partition is durability (week 14) extended across nodes. Committing the UPDATE when account rows for two transfers live on two machines is atomicity (week 12) extended across nodes, and that is where the network can hurt you. The layer above hands in a transaction or a query; this layer decides which machines hold the rows, how many copies agree before a write returns, and what answer a read gets when the network splits the cluster in two.

Intuition

On one machine, "did the write happen" has a yes or no answer the moment fsync returns. Across machines, the question becomes "did enough of them agree, and can they still talk to each other." A dropped message turns a local fact into a negotiation. Distributed databases are mostly about what you do when that negotiation cannot finish in time.

Partitioning: split the rows by hash or by range

One node has finite disk, finite memory, and one failure domain. When the table or the request rate exceeds it, you split the rows across machines. This is horizontal partitioning, also called sharding: each row lives on exactly one shard, chosen by a partition key. Two strategies dominate, and they are the same two ideas as a hash index versus a B+tree.

Why two strategies, not one

A B+tree (week 6) keeps keys in order, so it answers range queries by descending once and walking linked leaves. A hash index (week 7) scatters keys for O(1) point lookups but cannot range, because hashing destroys order. Sharding inherits exactly this fork. Hash partitioning sends a row to hash(key) mod N, spreading load evenly but scattering adjacent keys across all shards. Range partitioning gives each shard a contiguous key interval, so a range query hits a contiguous set of shards, but monotonically increasing keys (timestamps, auto-increment ids) all land on the last shard and create a write hotspot.

hash partitioning range partitioning keys 1..12 (a range query) shard A 3 7 10 shard B 1 5 8 11 shard C 2 4 6 9 12 BETWEEN 4 AND 9 fans out to all three shards keys 1..12 (a range query) shard A 1 2 3 4 shard B 5 6 7 8 shard C 9 10 11 12 BETWEEN 4 AND 9 touches a contiguous A, B set even load, no ranges ← trade-off → ordered scans, monotonic-key hotspot
Figure 1. The same twelve keys under hash and range partitioning. Notice that the cost of a range query swaps with the risk of a hotspot. There is no layout that gives even spread and cheap ranges at once, which is the week 6 to 7 lesson restated for a cluster.
In real systems

Bigtable range-partitions every table by its lexicographic row key into tablets, each roughly 100 to 200 MB, auto-splitting as they grow, and locates them through a three-level structure (a Chubby file points to a root tablet, which points to METADATA tablets, which point to user tablets) that addresses up to 261 bytes of data [Bigtable, OSDI 2006, Section 5.1]. Because keys are sorted, the Webtable trick reverses hostnames so maps.google.com/index.html is stored as com.google.maps/index.html, putting one domain's pages in adjacent rows. PostgreSQL supports PARTITION BY RANGE, LIST, and HASH but only within a single server; cross-node sharding needs an extension or a fork [PG docs, Table Partitioning].

Replication: keep copies, then choose how many must agree

Partitioning spreads load but makes every shard a new single point of failure: lose its machine and you lose that slice of the table. So you keep copies. Replication serves two ends at once, surviving a node failure and scaling reads, but it raises a question a single node never had: when a write arrives, how many copies must store it before you tell the client "committed"?

The simplest scheme is leader/follower (also primary/replica or single-leader). One replica is the leader and takes all writes; it ships its change log to followers, which replay it. Reads can go to followers. The whole design rests on one knob.

Synchronous versus asynchronous replication is the durability/latency knob.
ModeCommit waits forOn leader crashCost
Synchronousa follower to acknowledgeno committed data losthigher commit latency
Asynchronousnothing, returns at oncelast writes can be loststale reads on followers

This is the same write-ahead durability promise from week 14, now stretched across machines. On one node, COMMIT forces the log to stable storage. Across nodes, synchronous commit forces the log to a follower too, so the promise survives the leader dying. PostgreSQL streaming replication exposes exactly this as synchronous_commit: with it on, "primary failure will never lose data"; with it off there is "no waiting for multiple servers" but "possible data loss during fail over" [PG docs, replication solutions].

Leaderless systems (the Dynamo lineage) drop the single leader and use quorums. With N replicas, a write must be acknowledged by W of them and a read must consult R of them. The one rule to remember:

Keep this one thing

If W + R > N, the write set and the read set must overlap in at least one replica, so a read is guaranteed to touch a node that has the latest acknowledged write. Tuning W and R slides you between write availability and read freshness without changing N.

Exam trap

Quorums do not have to be majorities. The overlap requirement is W + R > N, which N = 3, W = 3, R = 1 satisfies just as well as the common W = R = 2. The claim "quorum reads and writes always need a majority on each side" is false. A majority quorum is one valid point on the line, not the only one.

Two-phase commit: atomicity across nodes, and the blocking problem

Now take journey 2's UPDATE and make it span machines. A transfer debits account id 42 on shard A and credits id 99 on shard B. If each shard just commits locally, one could commit while the other aborts, and money vanishes or doubles. Atomicity must hold across the two shards: all commit or all abort. Two-phase commit (2PC) is the protocol that coordinates that all-or-nothing decision.

  1. Phase 1, prepare and vote. A coordinator asks every participant to prepare. A participant that votes yes must durably log a prepare record and promise it can commit no matter what. It can no longer abort on its own. It now sits in an uncertain, in-doubt state, holding its locks.
  2. Phase 2, commit or abort. If every participant voted yes, the coordinator logs and broadcasts commit; if any voted no, it broadcasts abort. Participants apply the decision and release their locks.
Why 2PC blocks, derived

The whole point of the prepare phase is that a yes-voter surrenders its right to abort, so the coordinator can later force commit and know every participant will comply. That surrender is also the trap. Suppose the coordinator crashes after participants prepared but before it delivers the decision. A prepared participant cannot decide alone: committing might break atomicity if the coordinator had chosen abort, and aborting might break it if the coordinator had chosen commit. So it waits, holding locks, until the coordinator recovers. This is the blocking problem. 2PC is correct but not available under a coordinator failure [PG prepared transactions].

2PC, coordinator alive coord P1 P2 prepare yes / yes commit, locks released coordinator crashes after prepare coord P1 P2 crash in-doubt in-doubt prepared, holding locks, cannot decide
Figure 2. Left, the happy path: prepare, all vote yes, commit. Right, the coordinator dies in the gap between the votes and the decision, and the prepared participants are stuck in-doubt holding locks. The fix is not a cleverer protocol but a coordinator that cannot lose its decision, which is what consensus buys.
Exam trap

"Two-phase commit guarantees the transaction never blocks" is false, and "three-phase commit is non-blocking on any network" is also false. 3PC inserts a pre-commit phase to reduce blocking, but its non-blocking property assumes a synchronous network with bounded message delay, which real asynchronous networks violate. That is why 3PC is rarely deployed and why production systems instead replicate the coordinator's decision log with consensus.

Consensus: replicate the coordinator so a minority can fail

The blocking problem comes from one fragile coordinator owning a decision. The principled fix is to make the coordinator's decision log itself replicated, so no single crash loses it. Consensus is how a set of nodes agree on an ordered log of commands despite crashes and message loss, as long as a majority survive.

Paxos (Lamport) is the foundational protocol; Multi-Paxos extends single-value agreement to a log. Raft (Ongaro and Ousterhout, 2014) was designed to be understandable and produces a result equivalent to Multi-Paxos. It splits consensus into leader election, log replication, and safety. One leader is elected per term; clients send commands to the leader; the leader appends to its log and replicates entries to followers; an entry is committed once a majority has stored it, after which it is applied to the state machine. The majority requirement guarantees any two committed decisions share a node, which keeps the agreed log consistent. A five-node Raft group survives two failures [Raft].

Keep this one thing

Put the 2PC coordinator's state behind a Raft group and the in-doubt block disappears: a single machine crash no longer loses the decision, because a majority of the Raft group still holds it. This is how etcd, CockroachDB, and Spanner-style systems make replication and distributed commit non-blocking.

The CAP theorem and PACELC

Partitioning, replication, and 2PC all assume the network mostly works. CAP names what happens when it does not. Eric Brewer conjectured it and Gilbert and Lynch proved it in 2002. It concerns three properties of a distributed store:

The three CAP properties. Consistency here is linearizability, not the C in ACID.
PropertyMeaning
Consistencylinearizability: every read returns the most recent completed write, or an error
Availabilityevery request to a non-failing node gets a non-error response, freshness not guaranteed
Partition tolerancethe system keeps operating when the network drops or delays messages between nodes

The theorem says you can have at most two of the three at once. The honest reading is not "pick any two at design time." Partitions are not optional, the network can always split, so partition tolerance is effectively mandatory for any real distributed store. The real choice is forced only during a partition: keep consistency and refuse some requests, or stay available and risk returning stale data. When there is no partition you can have both [CAP theorem, Gilbert-Lynch summary].

Exam trap

The most common CAP error: "the C in CAP is the same as the C in ACID." False. CAP consistency is linearizability, a property of distributed reads and writes. ACID consistency (week 12) is a transaction preserving integrity constraints. They are different concepts that happen to share a letter. An MCQ that conflates them is testing whether you noticed.

PACELC (Abadi, 2010) fixes what CAP leaves out. CAP only describes a partition. PACELC says: if there is a Partition, trade Availability against Consistency (the CAP case); Else, in normal operation, trade Latency against Consistency. The else-branch captures the cost CAP hides. Even with a healthy network, stronger consistency means waiting for synchronous replication or a quorum, which costs latency, and you can buy lower latency by relaxing consistency. A Dynamo-style store is PA/EL; a system that always prioritizes consistency is PC/EC. This is why the synchronous versus asynchronous knob from the replication table is the same trade-off as CAP's choice, just on a healthy network.

C (linearizability) A (availability) P (partition tolerance) during a partition: keep two, drop one PACELC P? → A vs C Else → Latency vs C
Figure 3. CAP forces a choice only on the partition edge. PACELC adds the else-branch: even when the network is healthy, consistency still costs latency. The synchronous replication knob lives in that else-branch.

Simulator: a partition sandbox with quorums and 2PC

This is the experiment to run. You have a five-node cluster with a leader, replication that is synchronous or asynchronous, and tunable quorum W and R. Cut the network and watch a CP setting refuse the minority side while an AP setting serves stale data. With no partition, watch synchronous commit add latency for nothing. Then switch on 2PC, kill the coordinator mid-protocol to see participants block in-doubt, and replace the coordinator with a Raft group to watch the block disappear.

CAP / PACELC partition sandbox with 2PC cut the network, tune the quorum, kill the coordinator
Try this in order: set mode to CP and W = 3, partition the network, then read from the minority side and watch it refuse. Switch to AP and read again to get a stale value. Heal, turn on synchronous replication, and watch write latency rise with no partition at all. Then run 2PC, crash the coordinator, and finally enable the Raft coordinator.

Bigtable: the whole picture in one production system

Bigtable is the bridge that makes this lesson concrete and connects it to last week. It is a sparse, distributed, persistent, multidimensional sorted map keyed by (row, column, time), and it composes everything here. The write path is the LSM design from week 15: a mutation goes to a per-tablet-server commit log (with group commit) and then into an in-memory sorted memtable, flushed to immutable SSTables and merged by minor, merging, and major compaction [Bigtable, Section 5.3 and 5.4].

On the distributed side, Bigtable shows every idea above living together. Tables are range-partitioned into tablets. Replication lives below Bigtable, not in its data plane: GFS replicates the commit log and SSTable files, and Chubby replicates coordination state via Paxos. Membership and failure detection use a lock service: each tablet server holds an exclusive Chubby lock on a uniquely named file, and the master detects a dead server, grabs its lock, deletes its file so it can never serve again (fencing), and reassigns its tablets. The master self-fences if its own Chubby session expires, and master death does not move tablets, so there is no split brain (Section 5.2). Notice the deliberate scope cut: Bigtable offers only single-row atomic operations, no general cross-row transactions, because most applications needed only single-row atomicity and distributed transactions are expensive (Section 3 and Section 9). That is the CAP and 2PC cost showing up as a product decision.

In real systems

The survey that opened this course (week 1) already named the cluster forms: shared-nothing partitions tables horizontally and still needs two-phase commit and distributed deadlock detection across nodes [Architecture of a Database System, Section 3.2]. Sixteen weeks later, those two sentences have unpacked into this whole lesson.

The four problems of going distributed, each rooted in a single-node topic you already know.
Distributed problemSingle-node ancestorWhat you trade
Partitioning (hash vs range)hash index vs B+tree (weeks 6, 7)even spread vs cheap ranges
Replication (sync vs async)WAL durability (week 14)commit latency vs data loss window
Two-phase commitatomicity of a transaction (week 12)atomicity across nodes vs blocking on coordinator loss
CAP during a partitionisolation and consistency (week 12)consistency vs availability

Check yourself

Which statement about the CAP theorem is false?

The false statement is the third. CAP consistency is linearizability, a property of distributed reads and writes. ACID consistency is a transaction preserving integrity constraints. They share a letter and nothing else, which is the trap most MCQs use.

A coordinator crashes after every participant has voted yes in two-phase commit. What happens?

A yes vote surrenders the right to abort, so a prepared participant cannot decide alone: committing might violate atomicity if the coordinator chose abort, aborting might violate it if it chose commit. It blocks in-doubt until the coordinator recovers. This is exactly the 2PC blocking problem.

Which claim about partitioning is false?

The first is false. Hashing destroys key order, so a range query under hash partitioning cannot localize and must scan all shards. This is the same reason a hash index cannot serve a range, just at cluster scale.

With N replicas, what does the quorum condition W + R > N guarantee?

W + R > N forces the read set and write set to overlap in at least one node, so a read touches a replica holding the latest acknowledged write. It does not require a majority on either side (N=3, W=3, R=1 satisfies it), and it says nothing about failure tolerance or staleness elsewhere.

Which statement about consensus and the 2PC blocking problem is false?

The third is false. 3PC's non-blocking property assumes a synchronous network with bounded message delay; on a real asynchronous network it can still fail to terminate. The practical fix is to replicate the coordinator's decision via consensus, which Raft and Multi-Paxos both provide with equivalent guarantees.

In Bigtable, how is a dead tablet server detected and prevented from serving again?

Each tablet server holds an exclusive Chubby lock on a uniquely named file. When a server is unreachable or has lost its lock, the master acquires that lock and deletes the file, which guarantees the old server can never serve again, then reassigns its tablets. The master also self-fences if its own Chubby session expires.

Why does two-phase commit block, and what removes the block?
A prepared participant has surrendered its right to abort, so if the coordinator crashes before delivering the decision it cannot decide alone and waits in-doubt holding locks. Replicating the coordinator's decision log with consensus (Raft or Multi-Paxos) removes the block: a majority still holds the decision, so a new leader resumes phase 2.
State PACELC in one sentence and say what it adds to CAP.
If there is a Partition, trade Availability against Consistency; Else, in normal operation, trade Latency against Consistency. It adds the else-branch CAP ignores: even on a healthy network, stronger consistency costs latency.
Viva prompt 1

Bigtable is a distributed database, yet the paper says it replicates nothing in its own data plane. So where do durability and cross-cluster availability come from, and what does that tell you about how Bigtable layers its system?

Model answer

Durability comes from below Bigtable. GFS stores both the commit log and the SSTable files and replicates them, and Chubby replicates its coordination state via Paxos across five replicas (Section 4). Bigtable itself does not copy data between tablet servers; a tablet is assigned to exactly one tablet server at a time. Cross-cluster availability is handled above the single-cluster data plane: Personalized Search first used client-side eventual-consistency replication and later a server-side replication subsystem, and the paper notes planned cross-data-center multi-master replication (Section 8.3 and Section 11).

The lesson is that Bigtable is composed, not monolithic. It pushes replication down into GFS and Chubby and keeps its own layer focused on the sorted-map data model, range-partitioned tablets, and the LSM write path. This is why a single tablet server is not a data-loss risk even though it is the only server for its tablets: the bytes are safe in GFS, and the master can reassign the tablets to another server, which rebuilds the memtable by replaying the commit log from the recorded redo points (Section 5.3).

Viva prompt 2

Walk the three-level tablet location hierarchy and justify two design choices in it: why the root tablet is never split, and why a stale client cache can cost up to six round-trips instead of three.

Model answer

The hierarchy has three levels (Section 5.1, Figure 4). Level one is a file in Chubby holding the location of the root tablet. The root tablet is the first tablet of a special METADATA table and holds the locations of all other METADATA tablets. Each other METADATA tablet holds the locations of a set of user tablets. A METADATA row keys a tablet by an encoding of (table identifier, end row) and holds about 1 KB in memory, so with 128 MB METADATA tablets the scheme addresses 234 tablets, i.e. 261 bytes.

The root tablet is never split precisely to cap the hierarchy at three levels. If it could split, you would need a fourth level to locate its pieces, and the addressing math and the bounded lookup cost both depend on staying at three.

A cold cache costs three round-trips, one of which is the Chubby read for the root location, then the root tablet, then the METADATA tablet. A stale cache can cost up to six because stale entries are only discovered on a miss: the client chases a wrong location, misses, and has to walk back up the hierarchy, doubling the worst-case path. The client library prefetches more than one tablet's metadata per METADATA read to make stale entries rarer.

Viva prompt 3

Bigtable supports only single-row atomic operations, with no general cross-row transactions in its base API. Connect that scope decision to two-phase commit and the CAP theorem. Was it a limitation or a deliberate trade?

Model answer

It was deliberate. Every read or write under a single row key is atomic across any number of columns, but cross-row writes can only be batched at the client (Section 2 and Section 3). The lessons section explains the reasoning: most applications needed only single-row atomicity, and the main demand for distributed transactions was secondary indexes, which they planned to serve with a narrower mechanism (Section 9).

The connection to this lesson is direct. A general cross-row transaction spanning tablets on different tablet servers would require two-phase commit, which brings the blocking problem: a coordinator crash after participants prepare leaves them in-doubt holding locks. It also collides with CAP, because during a network partition a cross-tablet transaction cannot both stay available and stay consistent. By restricting atomicity to a single row, which lives entirely on one tablet server, Bigtable keeps every atomic operation local and sidesteps distributed commit and the CAP choice for the common case. The price is pushed to the application, which must design its row keys so that data needing atomicity shares a row. That is the same kind of trade as choosing AP over CP: you do not escape the cost, you decide who pays it.

Primary source

Focus on Section 5: the three components (master, tablet servers, client library), the three-level tablet location hierarchy in 5.1, tablet assignment and Chubby-based failure detection in 5.2, and the commit-log plus memtable plus SSTable write path in 5.3. It is the one paper that shows the LSM write path of week 15 living inside a range-partitioned, lock-service-coordinated distributed system, which is exactly this lesson.

Ask your teacher

This is the last lesson, so it is also the place to test whether the islands joined up. Ask me to trace the SELECT or the UPDATE across a sharded, replicated cluster end to end, to quiz you on any of the four papers before the viva, or to draw the CAP and 2PC pictures differently until they click. I am your teacher for this course, not just a document.