Field Guide / Part III · Buffer pool / Week 04
Buffer pool

Buffer pool design: frames, pins, dirty flags, replacement

The CPU touches only memory, the disk is a hundred thousand times slower, and the database does not fit in RAM. The buffer pool is the lie that makes the whole database look resident. Here is how it keeps the lie consistent and decides what to throw out.

By the end you can

  • Explain why a disk-oriented engine needs its own pool instead of leaning on mmap.
  • Walk the fetch / pin / unpin / evict cycle and say what each per-frame field is for.
  • Distinguish the in-memory page table from the on-disk page directory without mixing them up.
  • Compare Belady's MIN, LRU, and CLOCK, and explain why only unpinned dirty victims cost a write.
  • Answer the "which statement is false" traps the exam likes (pinning, write-back, mmap).

Stay on the running query. We have been tracing one statement down the engine:

SELECT name FROM emp WHERE salary > 75000 ORDER BY name;

By the time control reaches this layer the executor (week 9) has a physical plan, and a scan or a B+tree access method (weeks 6 to 7) is asking for specific pages by page id. It does not know or care whether a page is on disk. It calls one function, "give me page P," and expects a pointer into memory. The buffer pool answers that call. Below it, the storage manager (week 3) reads and writes whole pages in the heap file whose slotted layout week 2 defined.

So this layer sits between "I want page 42" from above and "read or write block 42 of this file" below. It has one job stated two ways: cache the hot pages so most requests never touch disk, and when memory is full, choose a page to evict without losing data or corrupting a page someone is still reading. The same machinery carries the write path too. When the UPDATE account SET balance = balance - 100 statement modifies a page, that page lives here, marked dirty, until the buffer pool writes it back. That write-back freedom is exactly what forces ARIES recovery (week 14) to log undo and redo, which is the hand-off we set up at the end of this lesson.

Intuition

Think of a desk with room for eight open books and a library behind you with ten thousand. You read from the open books instantly. To open a new one you must close one first, and if you scribbled notes in the margin of the book you are closing, you have to copy those notes back into the library before you reshelve it. The desk is the buffer pool, each book slot is a frame, "scribbled in" is the dirty flag, and "do not close this one, I am still reading it" is a pin.

Why the engine builds its own pool

The lazy fix is to mmap the data file and let the operating system page cache do the caching. It almost works, and it breaks in ways that matter. The OS flushes dirty pages whenever it likes, which can write an uncommitted change to disk before its log record is durable and break the write-ahead rule. The DBMS cannot see which pages are resident, so it cannot schedule around a stall it could have avoided. A failed page-in surfaces as a SIGBUS the engine must catch. And the OS cannot use a replacement policy that exploits the query plan. So the engine keeps its own array of frames and its own metadata, and controls flush ordering itself. [CMU 15-445 L06]

In memory (not persisted) page table page id -> frame number 42 -> frame 3 17 -> frame 0 On disk (persisted) page directory page id -> disk location 42 -> file emp, block 42 17 -> file emp, block 17 "Give me page 42": probe the page table. Hit -> here is frame 3. Miss -> ask the page directory where 42 lives on disk, read it into a free frame.
Figure 1. Two mappings the exam loves to swap. The page table answers "is it cached and where," lives only in RAM, and is rebuilt from scratch after a restart. The page directory answers "where on disk," must survive a crash, and is what a miss consults to find the block. Confusing the two is a classic trap.

The pieces inside a frame

The pool is an array of fixed-size frames, each holding one page. Caching alone is not enough, because two problems appear the moment more than one thing happens at once. First, while the access method holds a pointer into frame 3 and walks the slot array, the pool must not reuse frame 3 for a different page, or the bytes under the pointer would change mid-read. Second, when a frame is reused, any change made to its page must reach disk, or the modification is lost. Three small fields per frame solve both.

Exam trap

"Pinning a page prevents other transactions from reading it." False. A pin only forbids eviction. Several threads can pin and read the same page at the same time. Whether they are allowed to is a question for the lock manager and latches (week 13), a completely separate mechanism. Pin count is about frame reuse, not about who may see the data.

free frame pin 0, clean resident, pinned pin>0, ref=1 dirty, unpinned pin 0, dirty clean, unpinned pin 0, clean fetch+pin modify, unpin unpin write back
Figure 2. The fields are a small state machine. Only the two unpinned states on the right can be evicted. A dirty victim must travel down the "write back" edge first (one disk write); a clean victim is simply dropped. That single edge is the whole reason a dirty miss costs a write plus a read while a clean miss costs only a read.

Simulator: fetch, pin, evict, and watch the CLOCK hand

This is a four-frame pool running a CLOCK replacement policy. Request a page by id; pin and unpin frames; modify a page to dirty it; then watch what happens when the pool is full and a new page arrives. The CLOCK hand only chooses among unpinned frames, gives a page with its reference bit set a second chance, and writes a dirty victim back before reusing its frame. Try to get an out-of-memory error by pinning everything.

Buffer pool fetch / pin / evict stepper request a page, then pin some frames and request another
Experiment: fill all four frames, pin frames 0 and 1, dirty frame 2, then request a brand new page. Watch the hand skip the pinned frames, clear a reference bit on its first pass, and write back the dirty victim. Then pin all four and request a fifth page to see the out-of-memory case.

The fetch / pin / unpin / evict cycle

  1. Upper layer requests page P. The pool probes the page table (an in-memory hash table).
  2. Hit: P is already in some frame. Increment its pin count, set its reference bit, return the frame pointer. Cost: zero disk I/O. This is the common case and the whole point of the pool.
  3. Miss: pick a frame. Use a free one if any exists. Otherwise run the replacement policy over the unpinned frames to choose a victim. If every frame is pinned, throw out of memory.
  4. If the victim is dirty, write it back first (one disk write). A clean victim is dropped with no write. Remove the victim's entry from the page table.
  5. Read P from disk into the frame (one disk read), insert P into the page table, pin it, return the pointer.
  6. When the caller finishes, it unpins P and sets the dirty flag if it wrote to the page. The page stays resident until a later eviction or a background writer flushes it.
# the buffer pool's central call, grounded in the CMU 15-445 description
def fetch_page(P):
    f = page_table.get(P)
    if f is not None:           # HIT: no I/O
        f.pin += 1; f.ref = True
        return f
    f = free_frame() or evict_victim()   # victim is unpinned only
    if f is None:
        raise OutOfMemory          # every frame pinned
    if f.dirty:
        write_to_disk(f.page)      # write-back, +1 write
    page_table.pop(f.page, None)
    f.page = P; read_from_disk(P)  # +1 read
    f.pin = 1; f.dirty = False; f.ref = True
    page_table[P] = f
    return f
Keep this one thing

Only unpinned frames can be evicted, and only a dirty victim costs a write. A hit is free, a clean miss is one read, a dirty miss is one write plus one read. Everything about buffer pool performance follows from keeping hits common and keeping the dirty-victim slow path rare.

Choosing the victim: from the impossible optimum to CLOCK

When a victim must be chosen, the policy wants to evict the page that will not be needed for the longest time. There is a provably optimal answer and a reason it is useless in practice, and the real policies are approximations that trade accuracy for cheap metadata.

Three replacement policies, from the unreachable ideal down to what engines actually run.
PolicyRuleMetadata per pageCatch
Belady's MIN (OPT)Evict the page used furthest in the future.knowledge of the futureOptimal, but needs the future reference string, so it is only a yardstick.
LRUEvict the oldest last-access time.a timestamp, plus a sorted orderGood, but a timestamp per page and a sorted structure are costly at scale.
CLOCKSweep a hand; give a referenced page one second chance, else evict.one reference bitApproximates LRU with one bit. Shares LRU's blind spots, not different ones.

CLOCK lays the frames in a circle with a hand. On each eviction the hand advances: if the current page's reference bit is 1, clear it to 0 and move on (a second chance); if it is 0, evict it. The hand position is remembered between evictions, so a page that keeps getting touched keeps having its bit re-set and survives many sweeps. That is the policy the simulator above runs, and it is essentially what PostgreSQL does, except PostgreSQL uses a small usage count from 0 to 5 instead of a single bit (the next lesson covers that and the scan-flooding problem CLOCK and LRU both share).

p10 ref 1 p11 ref 0 p12 ref 1 p13 ref 0
Figure 3. The hand points at p11 with ref 0, so p11 is the victim. If the hand had landed on p10 (ref 1) it would clear that bit to 0 and advance, giving p10 a second chance. Pages touched between sweeps keep resetting their bit and survive.
Deeper: why a dirty victim is more expensive, and what a background writer fixes

A clean victim's frame can be reused immediately, because the on-disk copy already matches what is in memory. A dirty victim's in-memory copy is the only correct version, so it has to be written out before the frame is handed to the incoming page. That puts a synchronous disk write on the critical path of a query that only wanted to read. To keep the eviction fast path common, engines run a background writer: a separate thread that walks the frames, writes out dirty pages ahead of time, and clears their dirty flags (or evicts them) so that when a foreground query needs a victim it usually finds a clean one. The writer must still obey the write-ahead rule: a dirty data page cannot be flushed before the log record describing its change is durable, which is the constraint week 14 formalizes as the page_LSN invariant.

In real systems

PostgreSQL's pool size is shared_buffers, default 128MB, with a recommended start of about 25% of RAM and little benefit above 40%, because PostgreSQL also relies on the OS page cache and would otherwise double-buffer. It uses clock-sweep with a per-buffer usage_count that saturates at 5, not plain LRU, and a shared atomic nextVictimBuffer as the hand [freelist README]. SQLite's page cache (pcache1.c) uses LRU eviction, sized by PRAGMA cache_size where a negative value such as the default -2000 means a memory budget of about 2MB rather than a page count [SQLite PRAGMA].

What this layer hands down, and the link forward

Below, the pool calls the storage manager with "read block 42 of file emp into this frame" or "write this frame back to block 42." The slotted-page decoding that turns those bytes into the name and salary columns is week 2's job; finding the right page in the heap and its free space is week 3's. This layer never parses a tuple. It moves whole pages and tracks which ones are resident, pinned, and dirty.

Forward, the write-back freedom defined here is the seed of recovery. Because the pool may evict an uncommitted dirty page to disk (a steal policy) and may not flush a committed page at commit time (a no-force policy), a crash can leave disk in a mixed state. Steal forces the log to carry undo information; no-force forces it to carry redo information. That is the exact bridge to ARIES (week 14), and it is why the engine insists on owning flush ordering instead of trusting mmap.

Check yourself

A frame is resident, has pin count 2, and its dirty flag is set. Which statement is true?

A pin count above zero blocks eviction, so it cannot be chosen now. The dirty flag means the in-memory copy is the only correct one, so eviction must write it back. Write-back means the write is deferred (not on every modification), and dirty pages are written back, not dropped. Pinning does not stop other threads from reading.

Which statement correctly separates the page table from the page directory?

The in-memory page table maps page id to the frame currently caching it and is thrown away on restart. The on-disk page directory maps page id to a disk location and must persist. The wrong options swap those roles. This is the swap the exam plants.

When is a dirty page actually written to disk in a write-back buffer pool?

Write-back defers the write to eviction time, and a background writer may do it sooner to keep the eviction fast path clean. Write-through (the first option) would write on every modification, which is what write-back avoids. The shutdown-only and same-page options are both wrong.

Which of these is false?

The false statement is the first. The optimal policy is Belady's MIN, which evicts the page used furthest in the future and needs knowledge of the future. LRU is a heuristic that approximates it and is itself approximated by CLOCK. The other three statements are all true.

Which stated reason for not just using mmap and the OS page cache is false?

The false reason is the third: the OS reads and writes file blocks perfectly well on its own. The real objections are about control: flush timing that respects the log, visibility of resident pages, and a policy that can exploit the plan. PostgreSQL in fact still uses the OS cache on top of its own pool.

The pool is full, every frame is dirty and unpinned, and a new page is requested. What happens?

Dirty does not block eviction; only a pin does. The policy picks an unpinned victim, writes it back (the slow path: one write), then reads the new page in (one read). Out of memory happens only when every frame is pinned. The pool is fixed size; it does not grow.

In one sentence: what does a pin count guarantee, and what does it not?
It guarantees the frame will not be evicted while pin count is above zero; it does not control who may read the page (that is the lock manager's job).
Hit, clean miss, dirty miss: what does each cost in disk I/O?
Hit: zero. Clean miss: one read. Dirty miss: one write (of the victim) plus one read (of the new page).
Primary source

The clearest single source for the buffer pool definition, frames, the page table versus page directory distinction, pin counts and dirty flags, the fetch and evict cycle, and the LRU and CLOCK policies. Focus on the sections on buffer pool metadata and replacement policies.

Ask your teacher

Want me to walk a specific request sequence through the simulator step by step, contrast CLOCK with plain LRU on a worked trace, or trace exactly how an evicted dirty page connects to the write-ahead rule? Ask me to go deeper or quiz you harder. I am your teacher for this course, not just a page.