Buffer pool design: frames, pins, dirty flags, replacement
The CPU touches only memory, the disk is a hundred thousand times slower, and the database does not fit in RAM. The buffer pool is the lie that makes the whole database look resident. Here is how it keeps the lie consistent and decides what to throw out.
By the end you can
- Explain why a disk-oriented engine needs its own pool instead of leaning on
mmap. - Walk the fetch / pin / unpin / evict cycle and say what each per-frame field is for.
- Distinguish the in-memory page table from the on-disk page directory without mixing them up.
- Compare Belady's MIN, LRU, and CLOCK, and explain why only unpinned dirty victims cost a write.
- Answer the "which statement is false" traps the exam likes (pinning, write-back, mmap).
Stay on the running query. We have been tracing one statement down the engine:
SELECT name FROM emp WHERE salary > 75000 ORDER BY name;
By the time control reaches this layer the executor (week 9) has a physical plan, and a scan or a B+tree access method (weeks 6 to 7) is asking for specific pages by page id. It does not know or care whether a page is on disk. It calls one function, "give me page P," and expects a pointer into memory. The buffer pool answers that call. Below it, the storage manager (week 3) reads and writes whole pages in the heap file whose slotted layout week 2 defined.
So this layer sits between "I want page 42" from above and "read or write block 42 of this file" below. It
has one job stated two ways: cache the hot pages so most requests never touch disk, and when memory is full,
choose a page to evict without losing data or corrupting a page someone is still reading. The same machinery
carries the write path too. When the UPDATE account SET balance = balance - 100 statement modifies
a page, that page lives here, marked dirty, until the buffer pool writes it back. That write-back freedom is
exactly what forces ARIES recovery (week 14) to log undo and
redo, which is the hand-off we set up at the end of this lesson.
Think of a desk with room for eight open books and a library behind you with ten thousand. You read from the open books instantly. To open a new one you must close one first, and if you scribbled notes in the margin of the book you are closing, you have to copy those notes back into the library before you reshelve it. The desk is the buffer pool, each book slot is a frame, "scribbled in" is the dirty flag, and "do not close this one, I am still reading it" is a pin.
The lazy fix is to mmap the data file and let the operating system page cache do the
caching. It almost works, and it breaks in ways that matter. The OS flushes dirty pages whenever it likes,
which can write an uncommitted change to disk before its log record is durable and break the write-ahead
rule. The DBMS cannot see which pages are resident, so it cannot schedule around a stall it could have
avoided. A failed page-in surfaces as a SIGBUS the engine must catch. And the OS cannot use a
replacement policy that exploits the query plan. So the engine keeps its own array of frames and its own
metadata, and controls flush ordering itself. [CMU 15-445 L06]
The pieces inside a frame
The pool is an array of fixed-size frames, each holding one page. Caching alone is not enough, because two problems appear the moment more than one thing happens at once. First, while the access method holds a pointer into frame 3 and walks the slot array, the pool must not reuse frame 3 for a different page, or the bytes under the pointer would change mid-read. Second, when a frame is reused, any change made to its page must reach disk, or the modification is lost. Three small fields per frame solve both.
- Pin count. How many threads are using this page right now. A thread increments it before touching the page and decrements it when done. The invariant: a frame with pin count above zero is never an eviction candidate. Pinning blocks eviction, nothing else.
- Dirty flag. Set when a thread modifies the page in memory. It tells the storage manager the page must be written back before its frame is reused. The pool is write-back, not write-through: the write is deferred to eviction time or to a background writer.
- Reference bit (or usage count). A recency signal set on access and cleared by the eviction scan. The replacement policy reads it to guess which page is least worth keeping.
"Pinning a page prevents other transactions from reading it." False. A pin only forbids eviction. Several threads can pin and read the same page at the same time. Whether they are allowed to is a question for the lock manager and latches (week 13), a completely separate mechanism. Pin count is about frame reuse, not about who may see the data.
Simulator: fetch, pin, evict, and watch the CLOCK hand
This is a four-frame pool running a CLOCK replacement policy. Request a page by id; pin and unpin frames; modify a page to dirty it; then watch what happens when the pool is full and a new page arrives. The CLOCK hand only chooses among unpinned frames, gives a page with its reference bit set a second chance, and writes a dirty victim back before reusing its frame. Try to get an out-of-memory error by pinning everything.
The fetch / pin / unpin / evict cycle
- Upper layer requests page P. The pool probes the page table (an in-memory hash table).
- Hit: P is already in some frame. Increment its pin count, set its reference bit, return the frame pointer. Cost: zero disk I/O. This is the common case and the whole point of the pool.
- Miss: pick a frame. Use a free one if any exists. Otherwise run the replacement policy over the unpinned frames to choose a victim. If every frame is pinned, throw out of memory.
- If the victim is dirty, write it back first (one disk write). A clean victim is dropped with no write. Remove the victim's entry from the page table.
- Read P from disk into the frame (one disk read), insert P into the page table, pin it, return the pointer.
- When the caller finishes, it unpins P and sets the dirty flag if it wrote to the page. The page stays resident until a later eviction or a background writer flushes it.
# the buffer pool's central call, grounded in the CMU 15-445 description
def fetch_page(P):
f = page_table.get(P)
if f is not None: # HIT: no I/O
f.pin += 1; f.ref = True
return f
f = free_frame() or evict_victim() # victim is unpinned only
if f is None:
raise OutOfMemory # every frame pinned
if f.dirty:
write_to_disk(f.page) # write-back, +1 write
page_table.pop(f.page, None)
f.page = P; read_from_disk(P) # +1 read
f.pin = 1; f.dirty = False; f.ref = True
page_table[P] = f
return f
Only unpinned frames can be evicted, and only a dirty victim costs a write. A hit is free, a clean miss is one read, a dirty miss is one write plus one read. Everything about buffer pool performance follows from keeping hits common and keeping the dirty-victim slow path rare.
Choosing the victim: from the impossible optimum to CLOCK
When a victim must be chosen, the policy wants to evict the page that will not be needed for the longest time. There is a provably optimal answer and a reason it is useless in practice, and the real policies are approximations that trade accuracy for cheap metadata.
| Policy | Rule | Metadata per page | Catch |
|---|---|---|---|
| Belady's MIN (OPT) | Evict the page used furthest in the future. | knowledge of the future | Optimal, but needs the future reference string, so it is only a yardstick. |
| LRU | Evict the oldest last-access time. | a timestamp, plus a sorted order | Good, but a timestamp per page and a sorted structure are costly at scale. |
| CLOCK | Sweep a hand; give a referenced page one second chance, else evict. | one reference bit | Approximates LRU with one bit. Shares LRU's blind spots, not different ones. |
CLOCK lays the frames in a circle with a hand. On each eviction the hand advances: if the current page's reference bit is 1, clear it to 0 and move on (a second chance); if it is 0, evict it. The hand position is remembered between evictions, so a page that keeps getting touched keeps having its bit re-set and survives many sweeps. That is the policy the simulator above runs, and it is essentially what PostgreSQL does, except PostgreSQL uses a small usage count from 0 to 5 instead of a single bit (the next lesson covers that and the scan-flooding problem CLOCK and LRU both share).
Deeper: why a dirty victim is more expensive, and what a background writer fixes
A clean victim's frame can be reused immediately, because the on-disk copy already matches what is in
memory. A dirty victim's in-memory copy is the only correct version, so it has to be written out before the
frame is handed to the incoming page. That puts a synchronous disk write on the critical path of a query
that only wanted to read. To keep the eviction fast path common, engines run a background writer: a
separate thread that walks the frames, writes out dirty pages ahead of time, and clears their dirty flags
(or evicts them) so that when a foreground query needs a victim it usually finds a clean one. The writer
must still obey the write-ahead rule: a dirty data page cannot be flushed before the log record describing
its change is durable, which is the constraint week 14
formalizes as the page_LSN invariant.
PostgreSQL's pool size is shared_buffers, default 128MB, with a recommended start of about
25% of RAM and little benefit above 40%, because PostgreSQL also relies on the OS page cache and would
otherwise double-buffer. It uses clock-sweep with a per-buffer usage_count that saturates at 5,
not plain LRU, and a shared atomic nextVictimBuffer as the hand
[freelist README].
SQLite's page cache (pcache1.c) uses LRU eviction, sized by PRAGMA cache_size where a
negative value such as the default -2000 means a memory budget of about 2MB rather than a page
count [SQLite PRAGMA].
What this layer hands down, and the link forward
Below, the pool calls the storage manager with "read block 42 of file emp into this frame" or "write this
frame back to block 42." The slotted-page decoding that turns those bytes into the name and
salary columns is week 2's job; finding the right page in the heap and its free space is week 3's.
This layer never parses a tuple. It moves whole pages and tracks which ones are resident, pinned, and dirty.
Forward, the write-back freedom defined here is the seed of recovery. Because the pool may evict an
uncommitted dirty page to disk (a steal policy) and may not flush a committed page at commit time (a
no-force policy), a crash can leave disk in a mixed state. Steal forces the log to carry undo
information; no-force forces it to carry redo information. That is the exact bridge to
ARIES (week 14), and it is why the engine insists on owning
flush ordering instead of trusting mmap.
Check yourself
A frame is resident, has pin count 2, and its dirty flag is set. Which statement is true?
A pin count above zero blocks eviction, so it cannot be chosen now. The dirty flag means the in-memory copy is the only correct one, so eviction must write it back. Write-back means the write is deferred (not on every modification), and dirty pages are written back, not dropped. Pinning does not stop other threads from reading.
Which statement correctly separates the page table from the page directory?
The in-memory page table maps page id to the frame currently caching it and is thrown away on restart. The on-disk page directory maps page id to a disk location and must persist. The wrong options swap those roles. This is the swap the exam plants.
When is a dirty page actually written to disk in a write-back buffer pool?
Write-back defers the write to eviction time, and a background writer may do it sooner to keep the eviction fast path clean. Write-through (the first option) would write on every modification, which is what write-back avoids. The shutdown-only and same-page options are both wrong.
Which of these is false?
The false statement is the first. The optimal policy is Belady's MIN, which evicts the page used furthest in the future and needs knowledge of the future. LRU is a heuristic that approximates it and is itself approximated by CLOCK. The other three statements are all true.
Which stated reason for not just using mmap and the OS page cache is false?
The false reason is the third: the OS reads and writes file blocks perfectly well on its own. The real objections are about control: flush timing that respects the log, visibility of resident pages, and a policy that can exploit the plan. PostgreSQL in fact still uses the OS cache on top of its own pool.
The pool is full, every frame is dirty and unpinned, and a new page is requested. What happens?
Dirty does not block eviction; only a pin does. The policy picks an unpinned victim, writes it back (the slow path: one write), then reads the new page in (one read). Out of memory happens only when every frame is pinned. The pool is fixed size; it does not grow.
The clearest single source for the buffer pool definition, frames, the page table versus page directory distinction, pin counts and dirty flags, the fetch and evict cycle, and the LRU and CLOCK policies. Focus on the sections on buffer pool metadata and replacement policies.
Want me to walk a specific request sequence through the simulator step by step, contrast CLOCK with plain LRU on a worked trace, or trace exactly how an evicted dirty page connects to the write-ahead rule? Ask me to go deeper or quiz you harder. I am your teacher for this course, not just a page.