Field Guide / Part VII · Recovery / Week 14
Recovery

Logging and recovery: WAL and ARIES

The machine can lose power between any two instructions. Yet a committed transfer must survive and a half-finished one must vanish. This lesson shows how a single append-only log, plus three careful restart passes, turns a disk in any state back into a correct database.

By the end you can

  • Explain why steal and no-force buffer policies make undo and redo both necessary, and state the WAL rule that makes them safe.
  • Read a log of update, commit, and compensation records and say what each LSN means.
  • Walk the three ARIES restart passes (Analysis, Redo, Undo) and say what each one produces.
  • Explain why redo is conditional on page_LSN but undo is unconditional, and why a CLR is never undone.
  • Answer the exam traps: WAL ordering, the steal/no-force to undo/redo mapping, and repeating history.

We are on the second journey of the course, the write path. Last week (strict 2PL, deadlock, and MVCC) the transaction earned the right to modify a page: it took an exclusive lock or wrote a new version, and the page now sits dirty in the buffer pool. This week answers the question that journey never had to ask on a pure read: what happens when the lights go out before that dirty page reaches disk, or after it reaches disk but before the transaction commits. The layer above hands us a page about to change. The layer below (the buffer pool) is allowed to write that page to disk whenever it likes, even before commit, and is allowed to keep it in memory long after commit. Those two freedoms are the whole problem. We hand down a rule the buffer pool must obey (write the log first) and a log it can replay to fix any mess a crash leaves behind. Next week (LSM trees and columnar storage) reuses this exact write-ahead log idea in a new shape: the memtable plus WAL of a log-structured store.

Keep the running transaction in view:

BEGIN;
UPDATE account SET balance = balance - 100 WHERE id = 42;
COMMIT;
Intuition

You are editing a long document and the power is unreliable. Editing the file in place is dangerous: a crash mid-save leaves it corrupt. So before you touch the file, you jot every intended change into a notebook, in order, and you never let the file get ahead of the notebook. After a crash you reopen the notebook and replay it. Anything you marked done that the file is missing, you redo. Anything you started but never marked committed, you cross out. The notebook is the log. The discipline of writing it first is write-ahead logging.

Why a crash can leave disk in any state

A transaction is one logical step built from many physical page writes. Two buffer-pool freedoms from weeks 4 and 5 decide how bad a crash can be.

Steal. The buffer pool may evict a page dirtied by an uncommitted transaction, writing it to disk to free the frame. No-steal would forbid this, but then a long transaction could pin the whole pool. So real systems steal. The consequence: at crash time, disk may hold changes from a transaction that never committed. Recovery must be able to undo them.

No-force. At commit, the buffer pool is not required to flush that transaction's dirty pages to disk. Force would make every commit a random write storm. So real systems use no-force. The consequence: at crash time, disk may be missing changes from a transaction that did commit. Recovery must be able to redo them.

Steal plus no-force is the fast combination, and it is the one ARIES targets [Mohan et al. 1992, Section 4]. The price is that both undo and redo become necessary. The naive alternative, shadow paging, sidesteps the problem by writing each new page version to a fresh location and keeping the old one as the recovery copy. It avoids undo and redo, but it scatters pages, wastes space, and perturbs every checkpoint and image copy [Mohan et al. 1992, Section 1.1]. In-place update with a log is the design that won.

Why a log and not the data file

The data file is large, random-access, and updated in place, so you cannot make a write to it atomic at the granularity a transaction needs. The log is append-only and sequential, so forcing it to stable storage is cheap and a single record either landed or did not. By making the log the source of truth and the data file a lazily-updated cache of it, you move the atomicity problem to the one place where it is easy: appending one record to the end of a sequential file.

buffer pool dirty page P, LSN 30 data file on disk page P log on stable storage ... LSN 30 forced 1. force log up to page_LSN 2. only then write page never write the page first
Figure 1. The WAL ordering rule. A dirty page may not pass its own log record on the way to disk. The buffer manager forces the log up to the page's page_LSN before stealing the page. Notice the log write is the gate, not an afterthought.

Write-ahead logging and the page_LSN invariant

Every log record gets a log sequence number, the LSN, a monotonically increasing identifier. Each page on disk carries a field called the page_LSN, defined as the LSN of the log record that describes the most recent update applied to that page [Mohan et al. 1992, Section 4.2]. That one field is what lets recovery line up a page with the log. If a page's page_LSN is 30, then every logged change up to and including LSN 30 is already baked into the page, and nothing after it is. This is the bookkeeping that makes redo safe: you never reapply an effect that is already present.

The WAL protocol has two parts [Mohan et al. 1992, Section 1.1]:

  1. Before a changed page is written to the durable data file, the log records describing its changes (at least the undo portions) must already be on stable storage. The buffer manager enforces this, forcing the log up to the page's page_LSN before it steals the page.
  2. A transaction is not committed until its commit record, and all of its log up to that LSN, are forced to stable storage. Commit is exactly the moment the log says so.
Keep this one thing

The log is the truth. The data file is a stale cache of the log. Two rules keep them honest: the log record reaches disk before the page it describes (so undo is always possible), and the commit record reaches disk at commit (so durability is real). Everything in ARIES exists to exploit those two rules.

Each transaction's records are chained backward by a PrevLSN field, so you can walk one transaction's history from its latest record to its first. Compensation records, written during undo, carry an extra field, UndoNxtLSN, which we reach below.

The three principles of ARIES

ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) rests on three ideas [Mohan et al. 1992, abstract and Section 3].

  1. Write-ahead logging. The log record reaches stable storage before the data page, and the commit record reaches it at commit. Already covered.
  2. Repeating history during redo. On restart, before undoing anything, ARIES replays every logged update whose effect did not reach disk, for all transactions including the losers. This rebuilds the exact database state as of the crash, and only then are losers rolled back. This is the move that distinguishes ARIES from every method before it.
  3. Logging changes during undo. When ARIES undoes an update, it writes that undo as a Compensation Log Record (CLR). CLRs are redo-only: they are never themselves undone. A CLR carries UndoNxtLSN, pointing at the next record of that transaction still to be undone, so rollback is bounded and can resume after a crash without redoing undo work.
Why repeat history, even for losers

It looks wasteful: ARIES redoes a doomed transaction's updates during redo, then immediately undoes them. The point is precision. Earlier methods tried to be clever and skip a loser's redo, which forced them into bugs: undoing an update that was never on the page, or undoing then redoing a committed change already on disk. By first restoring the exact crash-time state (every logged effect present), the undo pass faces a known, consistent world. Every loser update is provably present, so undo never has to ask whether it is. The cost is a few redundant page touches; the gain is correctness with fine-grained locking and operation logging [Mohan et al. 1992, Section 2].

LSN 10 update P LSN 20 update Q LSN 30 update R LSN 40 CLR undo of 30 PrevLSN chain (backward) UndoNxtLSN: skip 30, resume undo at 20
Figure 2. A transaction's records chain backward by PrevLSN. The CLR at LSN 40 records the undo of LSN 30 and sets UndoNxtLSN to 20, so undo jumps straight past the record it just compensated. If a crash hits after LSN 40, recovery resumes undo at 20 and never re-undoes 30.

The two tables recovery rebuilds

ARIES keeps two in-memory tables during normal running, writes them into checkpoint records, and rebuilds them at restart [Mohan et al. 1992, Sections 4.3, 4.4].

The state recovery tracks. Both are reconstructed by the Analysis pass.
TablePer-entry fieldsWhat it answers
Transaction TableTransID, State, LastLSN, UndoNxtLSNWhich transactions were live, and where to start undoing each.
Dirty Pages TablePageID, RecLSNWhich pages might have changes not yet on disk, and from what log point.

RecLSN is the subtle one. When a clean page is first dirtied in the buffer, its RecLSN is set to the current end-of-log LSN, the LSN of the next record to be written. RecLSN marks the earliest log point whose updates to that page might not be on disk yet. The minimum RecLSN over the whole Dirty Pages Table is RedoLSN, the point where the redo pass must begin: any change logged before that is guaranteed already on disk.

Exam trap

RecLSN is set when the page is first dirtied, not when it is written to disk, and it is the end-of-log LSN at that instant (the next record to be written), not the LSN of the update that dirtied it. Mixing these up makes you compute RedoLSN wrong and start redo too late, missing changes the crash actually lost.

Simulator: the ARIES log replayer

Below is a small but real ARIES engine. The left panel is a log you can build: append updates and commits for two transactions, then crash at a chosen position. The right panels show the on-disk page_LSN values, the Transaction Table, and the Dirty Pages Table. Run Analysis, then Redo, then Undo, one step at a time, and watch the engine reconstruct the crash-time state and then roll back the losers, writing CLRs as it goes.

ARIES log replayer build a log, crash, then replay Analysis / Redo / Undo
Log (stable storage). The crash point is the red line.
On-disk page_LSN
Transaction Table
Dirty Pages Table
The default log commits T1 but leaves a second writer of page P uncommitted at the crash. T1 is a winner, the second writer is a loser, and both touched P. Watch Redo bring back both changes to P, then Undo reverse only the loser's and write a CLR.

The three restart passes

On restart, ARIES makes exactly three passes over the log [Mohan et al. 1992, Section 6].

checkpoint RedoLSN end of log (crash) Analysis rebuild tables, find losers, compute RedoLSN Redo repeat history (conditional on page_LSN) Undo roll back losers backward, write CLRs (unconditional)
Figure 3. Analysis scans forward from the last checkpoint. Redo scans forward from RedoLSN. Undo scans backward over losers. Redo and Analysis go the same direction; only Undo reverses.

1. Analysis

Scan forward from the most recent checkpoint to the end of the log. Rebuild the Transaction Table and Dirty Pages Table. Identify the losers, the transactions that had neither committed nor reached the in-doubt (prepared) state of two-phase commit. Compute RedoLSN as the minimum RecLSN across the Dirty Pages Table. Analysis produces the worklist for the next two passes.

2. Redo

Scan forward from RedoLSN, repeating history. For each redoable record whose page is in the Dirty Pages Table, fetch the page and compare. Redo is conditional: reapply the change only if the page's page_LSN is less than the record's LSN, then set page_LSN to the record's LSN. If page_LSN already covers the record, the change is on disk and is skipped. Nothing is logged during redo, because redo only repeats history that is already logged. After redo, the database is exactly as it was the instant before the crash, losers and all.

3. Undo

Roll back every loser in reverse chronological order, in one sweep, by repeatedly taking the largest not-yet-undone LSN across the losers. For each loser update, apply the inverse and write a CLR. Undo is unconditional: it does not consult page_LSN, because repeating history already guaranteed every loser update is present. When the sweep meets a CLR, it jumps to that CLR's UndoNxtLSN, skipping work already undone. When no loser has any record left to undo, recovery is complete.

Why redo is conditional but undo is unconditional

Redo runs against a disk in an unknown state: some changes reached it, some did not, and page_LSN is the only way to tell which. So redo must check before acting. Undo runs after redo has restored a fully known state, so it does not have to check: every loser change is provably present, and undo simply reverses each. The asymmetry is not arbitrary; it falls directly out of repeating history [Mohan et al. 1992, Sections 6.2, 6.3].

# ARIES restart, in three passes (pseudocode)
def restart(log, checkpoint):
    txn_tab, dpt = analysis(log, checkpoint)      # rebuild tables, find losers
    redo_lsn = min(e.rec_lsn for e in dpt)     # earliest possibly-lost change

    for rec in log.forward_from(redo_lsn):     # REDO: repeat history
        page = fetch(rec.page)
        if page.page_lsn < rec.lsn:               # conditional
            apply(rec, page); page.page_lsn = rec.lsn

    for rec in log.backward_over(losers):       # UNDO: roll back losers
        if rec.is_clr: continue                   # CLRs are never undone
        clr = write_clr(rec, undo_nxt=rec.prev_lsn)
        apply_inverse(rec)                       # unconditional
In real systems

PostgreSQL uses physiological WAL records with LSNs stamped into every page (pd_lsn in the page header), and its checkpointer plus background writer enforce the WAL ordering before a dirty buffer is written [PostgreSQL, Reliability and WAL]. PostgreSQL does not write per-action CLRs the way ARIES does, because its MVCC design makes most rollbacks a matter of marking tuple versions dead rather than physically reversing pages; abort just stamps the transaction aborted and lets VACUUM reclaim the versions (week 13). SQLite offers a rollback journal (copy the old page out before changing it, the no-steal-style undo log) and a WAL mode (append new page images to a side file, the redo-style approach) [SQLite, Write-Ahead Logging]. The two modes are a clean view of the undo-versus-redo split this lesson is built on.

Prompt: in one sentence, what invariant does the page_LSN field guarantee?
Answer: a page reflects exactly the logged changes up to and including its page_LSN and none after it, so redo can tell at a glance whether a logged change is already on the page.
Prompt: steal forces which capability, and no-force forces which?
Answer: steal forces undo (uncommitted changes may already be on disk), no-force forces redo (committed changes may not yet be on disk).

Paper viva: ARIES (Mohan et al., 1992)

This is a paper week. The examiner will press on the parts of ARIES that are easy to half-remember. Answer out loud before opening each model answer.

Viva prompt 1

State the write-ahead logging protocol precisely, and explain why the steal and no-force buffer policies are exactly what make undo and redo necessary at restart.

Model answer

WAL has two clauses. First, before a dirty page is written to the durable database, the log records describing its changes, at least the undo portions, must already be on stable storage; the buffer manager enforces this, sometimes forcing the log up to the page's page_LSN before stealing the page. Second, a transaction is not committed until its commit record and all earlier log up to that LSN are forced to stable storage [Mohan et al. 1992, Section 1.1]. Steal lets an uncommitted transaction's dirty page reach disk before commit, so a crash can leave uncommitted changes on disk, which is why undo capability is required. No-force lets a committed transaction's dirty pages stay in the buffer, so a crash can leave committed changes missing from disk, which is why redo capability is required. The first WAL clause makes undo possible (the old value is logged before the page moves); the second makes durability real (commit is the moment the log says so).

Viva prompt 2

What is repeating history, and why does ARIES redo even the losers' updates rather than skipping them? What concrete bugs in earlier methods does this avoid?

Model answer

Repeating history means the redo pass reapplies every logged update whose effect did not reach disk, for all transactions including the losers, before any undo runs. This reestablishes the exact database state as of the failure [Mohan et al. 1992, Section 3]. Redoing the losers looks wasteful, since they are about to be undone, but it guarantees a precise, fully known state at the start of undo. Earlier WAL methods that tried to skip loser redo fell into the bugs ARIES lists in Section 2: undoing an update that was never actually present on the page, undoing the same logged action more than once, or undoing a committed update that had already reached disk and then redoing it. By making every loser update provably present before undo, ARIES makes undo unconditional and sound, and it is what allows fine-granularity locking together with operation logging, where you must never redo a present effect or undo an absent one.

Viva prompt 3

What is a Compensation Log Record, why is it redo-only, and how does its UndoNxtLSN keep the amount of logging bounded under repeated or nested rollbacks?

Model answer

A CLR is the log record written when ARIES undoes an update; it describes the undo that was performed [Mohan et al. 1992, Section 3]. It is redo-only because it is never itself undone: once undo work is logged it is permanent, and rollback proceeds forward through the remaining records to undo. Each CLR carries UndoNxtLSN, pointing to the predecessor of the record it just compensated, that is, the next record of that transaction still to be undone. So the UndoNxtLSN of the most recently written CLR records exactly how far rollback has progressed. If a system failure or a nested rollback interrupts undo, recovery resumes at that UndoNxtLSN and skips every record already compensated, so no record is undone twice and no extra CLRs are generated for already-undone work. That is what bounds logging during rollback, in contrast to IMS, which could undo the same non-CLR record multiple times, generating unbounded log. Because the CLR records what the undo actually did, the undo action also need not be the physical inverse of the original on the same page, which is what enables logical undo and the high concurrency of record-level operations.

Check yourself

Which statement about write-ahead logging is false?

WAL means the log goes first. The describing log record (at least its undo portion) must be on stable storage before the dirty page it describes is written to the data file, so option C states the rule backwards and is the false one. The other three are the genuine clauses: commit forces the log, stealing a page forces the log up to its page_LSN, and undo information is logged before the page moves.

Which mapping from buffer policy to recovery capability is correct?

Steal allows an uncommitted dirty page to reach disk, so its changes must be reversible: undo. No-force allows a committed transaction's pages to stay in memory, so their changes must be replayable: redo. ARIES targets steal plus no-force precisely because it is the fast combination, and it pays for that speed with both undo and redo. Options A and B swap the two; option D describes the slow combination that ARIES does not require.

During the redo pass, when does ARIES actually reapply a logged update to a page?

Redo is conditional on page_LSN. If page_LSN is below the record's LSN, the change has not yet reached this page, so reapply it and advance page_LSN. If page_LSN already covers the record, the change is on disk and is skipped. Option A drops the condition; option D reverses the comparison; options C and E miss that history is repeated for all transactions, losers included, before any undo.

Which statement about Compensation Log Records is false?

CLRs are redo-only and are never undone. Once an undo is logged as a CLR it is permanent; rollback continues forward through the remaining records by following UndoNxtLSN. That is exactly what bounds logging under nested or repeated rollbacks. The other three statements are all true descriptions of how CLRs work.

How is RedoLSN, the starting point of the redo pass, computed?

RecLSN for a page is the end-of-log LSN at the moment that page was first dirtied, the earliest log point whose updates to it might not be on disk. The minimum RecLSN across the whole Dirty Pages Table is therefore the earliest point at which redo could matter, so redo begins there. Starting at the checkpoint LSN would redo too much; starting later would miss lost changes.

Why is undo unconditional while redo is conditional?

Repeating history during redo rebuilds the precise crash-time state, so by the time undo runs, every loser update is provably present on its page. Undo can therefore reverse each one without checking. Redo, by contrast, faces a disk in an unknown mixed state and must consult page_LSN to avoid reapplying a change already there. Option B reverses the pass order; A and C misdescribe what undo does.

Primary source

The canonical reference for WAL, the page_LSN invariant, steal and no-force, repeating history, Compensation Log Records, and the three-pass restart. Focus on Section 1.1 (WAL and the LSN), Section 3 (the three principles and CLRs), and Section 6 (the Analysis, Redo, and Undo passes). This is the assigned paper for the week 14 viva.

Ask your teacher

Want to see a crash land in the middle of the undo pass and watch UndoNxtLSN make recovery idempotent? Or trace how a fuzzy checkpoint shortens Analysis without quiescing the system? Ask me to extend the simulator, draw the timeline differently, or quiz you harder on the page_LSN comparisons.