Distributed databases: partitioning, replication, 2PC, CAP
Once data outgrows one machine and the network can drop messages, a transaction has to give something up. This last lesson takes the single-node engine you built over fifteen weeks and spreads it across machines, then names exactly what breaks and what you trade to fix it.
By the end you can
- Choose hash versus range partitioning and explain why one fans out range queries and the other hotspots.
- Reason about leader/follower and quorum replication, and read the W + R > N overlap rule.
- Walk two-phase commit, locate the in-doubt state, and explain why a coordinator crash blocks.
- State the CAP theorem precisely (consistency means linearizability, not the C in ACID) and add the PACELC else-branch.
- Explain how consensus (Raft, equivalent to Multi-Paxos) replicates the coordinator log so a minority can fail without blocking.
Hold the two journeys of this course in mind. Journey 1 sent
SELECT name FROM emp WHERE salary > 75000 ORDER BY name down the stack: parser
(week 8), optimizer (week 11),
executor (weeks 9 and 10),
index (weeks 6 and 7),
buffer pool (weeks 4 and 5),
storage (weeks 2 and 3).
Journey 2 sent UPDATE account SET balance = balance - 100 WHERE id = 42 through
concurrency control (week 13) and the write-ahead log
(week 14). Every one of those layers assumed a single
machine. Last week (week 15) we kept the single machine
but changed its storage shape to the log-structured merge tree.
This week does not add a layer below storage. It takes the whole engine and copies it onto many
machines connected by a network that can drop and delay messages. Three new problems appear and each
maps back onto something you already know. Splitting emp across machines is the hash
versus B+tree trade-off from weeks 6 and 7, now at cluster scale. Keeping copies of a partition is
durability (week 14) extended across nodes. Committing
the UPDATE when account rows for two transfers live on two machines is
atomicity (week 12) extended across nodes, and that is
where the network can hurt you. The layer above hands in a transaction or a query; this layer decides
which machines hold the rows, how many copies agree before a write returns, and what answer a read
gets when the network splits the cluster in two.
On one machine, "did the write happen" has a yes or no answer the moment fsync returns. Across machines, the question becomes "did enough of them agree, and can they still talk to each other." A dropped message turns a local fact into a negotiation. Distributed databases are mostly about what you do when that negotiation cannot finish in time.
Partitioning: split the rows by hash or by range
One node has finite disk, finite memory, and one failure domain. When the table or the request rate exceeds it, you split the rows across machines. This is horizontal partitioning, also called sharding: each row lives on exactly one shard, chosen by a partition key. Two strategies dominate, and they are the same two ideas as a hash index versus a B+tree.
A B+tree (week 6) keeps keys in order, so it answers
range queries by descending once and walking linked leaves. A hash index
(week 7) scatters keys for O(1) point lookups but cannot
range, because hashing destroys order. Sharding inherits exactly this fork. Hash partitioning sends
a row to hash(key) mod N, spreading load evenly but scattering adjacent keys across all
shards. Range partitioning gives each shard a contiguous key interval, so a range query hits a
contiguous set of shards, but monotonically increasing keys (timestamps, auto-increment ids) all
land on the last shard and create a write hotspot.
Bigtable range-partitions every table by its lexicographic row key into tablets, each
roughly 100 to 200 MB, auto-splitting as they grow, and locates them through a three-level structure
(a Chubby file points to a root tablet, which points to METADATA tablets, which point to user tablets)
that addresses up to 261 bytes of data
[Bigtable, OSDI 2006, Section 5.1].
Because keys are sorted, the Webtable trick reverses hostnames so maps.google.com/index.html
is stored as com.google.maps/index.html, putting one domain's pages in adjacent rows.
PostgreSQL supports PARTITION BY RANGE, LIST, and HASH but only
within a single server; cross-node sharding needs an extension or a fork
[PG docs, Table Partitioning].
Replication: keep copies, then choose how many must agree
Partitioning spreads load but makes every shard a new single point of failure: lose its machine and you lose that slice of the table. So you keep copies. Replication serves two ends at once, surviving a node failure and scaling reads, but it raises a question a single node never had: when a write arrives, how many copies must store it before you tell the client "committed"?
The simplest scheme is leader/follower (also primary/replica or single-leader). One replica is the leader and takes all writes; it ships its change log to followers, which replay it. Reads can go to followers. The whole design rests on one knob.
| Mode | Commit waits for | On leader crash | Cost |
|---|---|---|---|
| Synchronous | a follower to acknowledge | no committed data lost | higher commit latency |
| Asynchronous | nothing, returns at once | last writes can be lost | stale reads on followers |
This is the same write-ahead durability promise from week
14, now stretched across machines. On one node, COMMIT forces the log to stable storage. Across
nodes, synchronous commit forces the log to a follower too, so the promise survives the leader dying.
PostgreSQL streaming replication exposes exactly this as synchronous_commit: with it on,
"primary failure will never lose data"; with it off there is "no waiting for multiple servers" but
"possible data loss during fail over"
[PG docs, replication solutions].
Leaderless systems (the Dynamo lineage) drop the single leader and use quorums. With N replicas, a write must be acknowledged by W of them and a read must consult R of them. The one rule to remember:
If W + R > N, the write set and the read set must overlap in at least one replica, so a read is guaranteed to touch a node that has the latest acknowledged write. Tuning W and R slides you between write availability and read freshness without changing N.
Quorums do not have to be majorities. The overlap requirement is W + R > N, which
N = 3, W = 3, R = 1 satisfies just as well as the common W = R = 2. The
claim "quorum reads and writes always need a majority on each side" is false. A majority quorum is one
valid point on the line, not the only one.
Two-phase commit: atomicity across nodes, and the blocking problem
Now take journey 2's UPDATE and make it span machines. A transfer debits
account id 42 on shard A and credits id 99 on shard B. If each shard just commits locally,
one could commit while the other aborts, and money vanishes or doubles. Atomicity must hold across the
two shards: all commit or all abort. Two-phase commit (2PC) is the protocol that coordinates that
all-or-nothing decision.
- Phase 1, prepare and vote. A coordinator asks every participant to prepare. A participant that votes yes must durably log a prepare record and promise it can commit no matter what. It can no longer abort on its own. It now sits in an uncertain, in-doubt state, holding its locks.
- Phase 2, commit or abort. If every participant voted yes, the coordinator logs and broadcasts commit; if any voted no, it broadcasts abort. Participants apply the decision and release their locks.
The whole point of the prepare phase is that a yes-voter surrenders its right to abort, so the coordinator can later force commit and know every participant will comply. That surrender is also the trap. Suppose the coordinator crashes after participants prepared but before it delivers the decision. A prepared participant cannot decide alone: committing might break atomicity if the coordinator had chosen abort, and aborting might break it if the coordinator had chosen commit. So it waits, holding locks, until the coordinator recovers. This is the blocking problem. 2PC is correct but not available under a coordinator failure [PG prepared transactions].
"Two-phase commit guarantees the transaction never blocks" is false, and "three-phase commit is non-blocking on any network" is also false. 3PC inserts a pre-commit phase to reduce blocking, but its non-blocking property assumes a synchronous network with bounded message delay, which real asynchronous networks violate. That is why 3PC is rarely deployed and why production systems instead replicate the coordinator's decision log with consensus.
Consensus: replicate the coordinator so a minority can fail
The blocking problem comes from one fragile coordinator owning a decision. The principled fix is to make the coordinator's decision log itself replicated, so no single crash loses it. Consensus is how a set of nodes agree on an ordered log of commands despite crashes and message loss, as long as a majority survive.
Paxos (Lamport) is the foundational protocol; Multi-Paxos extends single-value agreement to a log. Raft (Ongaro and Ousterhout, 2014) was designed to be understandable and produces a result equivalent to Multi-Paxos. It splits consensus into leader election, log replication, and safety. One leader is elected per term; clients send commands to the leader; the leader appends to its log and replicates entries to followers; an entry is committed once a majority has stored it, after which it is applied to the state machine. The majority requirement guarantees any two committed decisions share a node, which keeps the agreed log consistent. A five-node Raft group survives two failures [Raft].
Put the 2PC coordinator's state behind a Raft group and the in-doubt block disappears: a single machine crash no longer loses the decision, because a majority of the Raft group still holds it. This is how etcd, CockroachDB, and Spanner-style systems make replication and distributed commit non-blocking.
The CAP theorem and PACELC
Partitioning, replication, and 2PC all assume the network mostly works. CAP names what happens when it does not. Eric Brewer conjectured it and Gilbert and Lynch proved it in 2002. It concerns three properties of a distributed store:
| Property | Meaning |
|---|---|
| Consistency | linearizability: every read returns the most recent completed write, or an error |
| Availability | every request to a non-failing node gets a non-error response, freshness not guaranteed |
| Partition tolerance | the system keeps operating when the network drops or delays messages between nodes |
The theorem says you can have at most two of the three at once. The honest reading is not "pick any two at design time." Partitions are not optional, the network can always split, so partition tolerance is effectively mandatory for any real distributed store. The real choice is forced only during a partition: keep consistency and refuse some requests, or stay available and risk returning stale data. When there is no partition you can have both [CAP theorem, Gilbert-Lynch summary].
The most common CAP error: "the C in CAP is the same as the C in ACID." False. CAP consistency is linearizability, a property of distributed reads and writes. ACID consistency (week 12) is a transaction preserving integrity constraints. They are different concepts that happen to share a letter. An MCQ that conflates them is testing whether you noticed.
PACELC (Abadi, 2010) fixes what CAP leaves out. CAP only describes a partition. PACELC says: if there is a Partition, trade Availability against Consistency (the CAP case); Else, in normal operation, trade Latency against Consistency. The else-branch captures the cost CAP hides. Even with a healthy network, stronger consistency means waiting for synchronous replication or a quorum, which costs latency, and you can buy lower latency by relaxing consistency. A Dynamo-style store is PA/EL; a system that always prioritizes consistency is PC/EC. This is why the synchronous versus asynchronous knob from the replication table is the same trade-off as CAP's choice, just on a healthy network.
Simulator: a partition sandbox with quorums and 2PC
This is the experiment to run. You have a five-node cluster with a leader, replication that is synchronous or asynchronous, and tunable quorum W and R. Cut the network and watch a CP setting refuse the minority side while an AP setting serves stale data. With no partition, watch synchronous commit add latency for nothing. Then switch on 2PC, kill the coordinator mid-protocol to see participants block in-doubt, and replace the coordinator with a Raft group to watch the block disappear.
Bigtable: the whole picture in one production system
Bigtable is the bridge that makes this lesson concrete and connects it to last week. It is a sparse,
distributed, persistent, multidimensional sorted map keyed by (row, column, time), and it
composes everything here. The write path is the LSM design from
week 15: a mutation goes to a per-tablet-server commit
log (with group commit) and then into an in-memory sorted memtable, flushed to immutable SSTables and
merged by minor, merging, and major compaction
[Bigtable, Section 5.3 and 5.4].
On the distributed side, Bigtable shows every idea above living together. Tables are range-partitioned into tablets. Replication lives below Bigtable, not in its data plane: GFS replicates the commit log and SSTable files, and Chubby replicates coordination state via Paxos. Membership and failure detection use a lock service: each tablet server holds an exclusive Chubby lock on a uniquely named file, and the master detects a dead server, grabs its lock, deletes its file so it can never serve again (fencing), and reassigns its tablets. The master self-fences if its own Chubby session expires, and master death does not move tablets, so there is no split brain (Section 5.2). Notice the deliberate scope cut: Bigtable offers only single-row atomic operations, no general cross-row transactions, because most applications needed only single-row atomicity and distributed transactions are expensive (Section 3 and Section 9). That is the CAP and 2PC cost showing up as a product decision.
The survey that opened this course (week 1) already named the cluster forms: shared-nothing partitions tables horizontally and still needs two-phase commit and distributed deadlock detection across nodes [Architecture of a Database System, Section 3.2]. Sixteen weeks later, those two sentences have unpacked into this whole lesson.
| Distributed problem | Single-node ancestor | What you trade |
|---|---|---|
| Partitioning (hash vs range) | hash index vs B+tree (weeks 6, 7) | even spread vs cheap ranges |
| Replication (sync vs async) | WAL durability (week 14) | commit latency vs data loss window |
| Two-phase commit | atomicity of a transaction (week 12) | atomicity across nodes vs blocking on coordinator loss |
| CAP during a partition | isolation and consistency (week 12) | consistency vs availability |
Check yourself
Which statement about the CAP theorem is false?
The false statement is the third. CAP consistency is linearizability, a property of distributed reads and writes. ACID consistency is a transaction preserving integrity constraints. They share a letter and nothing else, which is the trap most MCQs use.
A coordinator crashes after every participant has voted yes in two-phase commit. What happens?
A yes vote surrenders the right to abort, so a prepared participant cannot decide alone: committing might violate atomicity if the coordinator chose abort, aborting might violate it if it chose commit. It blocks in-doubt until the coordinator recovers. This is exactly the 2PC blocking problem.
Which claim about partitioning is false?
The first is false. Hashing destroys key order, so a range query under hash partitioning cannot localize and must scan all shards. This is the same reason a hash index cannot serve a range, just at cluster scale.
With N replicas, what does the quorum condition W + R > N guarantee?
W + R > N forces the read set and write set to overlap in at least one node, so a read touches a replica holding the latest acknowledged write. It does not require a majority on either side (N=3, W=3, R=1 satisfies it), and it says nothing about failure tolerance or staleness elsewhere.
Which statement about consensus and the 2PC blocking problem is false?
The third is false. 3PC's non-blocking property assumes a synchronous network with bounded message delay; on a real asynchronous network it can still fail to terminate. The practical fix is to replicate the coordinator's decision via consensus, which Raft and Multi-Paxos both provide with equivalent guarantees.
In Bigtable, how is a dead tablet server detected and prevented from serving again?
Each tablet server holds an exclusive Chubby lock on a uniquely named file. When a server is unreachable or has lost its lock, the master acquires that lock and deletes the file, which guarantees the old server can never serve again, then reassigns its tablets. The master also self-fences if its own Chubby session expires.
Bigtable is a distributed database, yet the paper says it replicates nothing in its own data plane. So where do durability and cross-cluster availability come from, and what does that tell you about how Bigtable layers its system?
Model answer
Durability comes from below Bigtable. GFS stores both the commit log and the SSTable files and replicates them, and Chubby replicates its coordination state via Paxos across five replicas (Section 4). Bigtable itself does not copy data between tablet servers; a tablet is assigned to exactly one tablet server at a time. Cross-cluster availability is handled above the single-cluster data plane: Personalized Search first used client-side eventual-consistency replication and later a server-side replication subsystem, and the paper notes planned cross-data-center multi-master replication (Section 8.3 and Section 11).
The lesson is that Bigtable is composed, not monolithic. It pushes replication down into GFS and Chubby and keeps its own layer focused on the sorted-map data model, range-partitioned tablets, and the LSM write path. This is why a single tablet server is not a data-loss risk even though it is the only server for its tablets: the bytes are safe in GFS, and the master can reassign the tablets to another server, which rebuilds the memtable by replaying the commit log from the recorded redo points (Section 5.3).
Walk the three-level tablet location hierarchy and justify two design choices in it: why the root tablet is never split, and why a stale client cache can cost up to six round-trips instead of three.
Model answer
The hierarchy has three levels (Section 5.1, Figure 4). Level one is a file in Chubby holding the location of the root tablet. The root tablet is the first tablet of a special METADATA table and holds the locations of all other METADATA tablets. Each other METADATA tablet holds the locations of a set of user tablets. A METADATA row keys a tablet by an encoding of (table identifier, end row) and holds about 1 KB in memory, so with 128 MB METADATA tablets the scheme addresses 234 tablets, i.e. 261 bytes.
The root tablet is never split precisely to cap the hierarchy at three levels. If it could split, you would need a fourth level to locate its pieces, and the addressing math and the bounded lookup cost both depend on staying at three.
A cold cache costs three round-trips, one of which is the Chubby read for the root location, then the root tablet, then the METADATA tablet. A stale cache can cost up to six because stale entries are only discovered on a miss: the client chases a wrong location, misses, and has to walk back up the hierarchy, doubling the worst-case path. The client library prefetches more than one tablet's metadata per METADATA read to make stale entries rarer.
Bigtable supports only single-row atomic operations, with no general cross-row transactions in its base API. Connect that scope decision to two-phase commit and the CAP theorem. Was it a limitation or a deliberate trade?
Model answer
It was deliberate. Every read or write under a single row key is atomic across any number of columns, but cross-row writes can only be batched at the client (Section 2 and Section 3). The lessons section explains the reasoning: most applications needed only single-row atomicity, and the main demand for distributed transactions was secondary indexes, which they planned to serve with a narrower mechanism (Section 9).
The connection to this lesson is direct. A general cross-row transaction spanning tablets on different tablet servers would require two-phase commit, which brings the blocking problem: a coordinator crash after participants prepare leaves them in-doubt holding locks. It also collides with CAP, because during a network partition a cross-tablet transaction cannot both stay available and stay consistent. By restricting atomicity to a single row, which lives entirely on one tablet server, Bigtable keeps every atomic operation local and sidesteps distributed commit and the CAP choice for the common case. The price is pushed to the application, which must design its row keys so that data needing atomicity shares a row. That is the same kind of trade as choosing AP over CP: you do not escape the cost, you decide who pays it.
Focus on Section 5: the three components (master, tablet servers, client library), the three-level tablet location hierarchy in 5.1, tablet assignment and Chubby-based failure detection in 5.2, and the commit-log plus memtable plus SSTable write path in 5.3. It is the one paper that shows the LSM write path of week 15 living inside a range-partitioned, lock-service-coordinated distributed system, which is exactly this lesson.
This is the last lesson, so it is also the place to test whether the islands joined up. Ask me to
trace the SELECT or the UPDATE across a sharded, replicated cluster end to
end, to quiz you on any of the four papers before the viva, or to draw the CAP and 2PC pictures
differently until they click. I am your teacher for this course, not just a document.