Field Guide / Part III · Buffer pool / Week 05
Buffer pool

LRU-K, sequential flooding, and prefetching

Last week we built the buffer pool and let it evict by recency. This week we watch that decision fail. One large table scan can throw out the pages everyone else needs, because recency is the wrong signal for a page that gets read once and never again.

By the end you can

  • Explain sequential flooding and why LRU and CLOCK are fooled by a one-pass scan.
  • Define Backward K-distance and walk an LRU-2 eviction by hand.
  • Say why LRU-K is not LFU, and what the Correlated Reference Period fixes.
  • Describe how PostgreSQL bounds flooding with clock-sweep usage counts and ring buffers.
  • Answer MCQ-style traps on optimality, MRU for scans, and what real systems use.

Follow the course's running query down to this layer: SELECT name FROM emp WHERE salary > 75000 ORDER BY name. By step 6 of journey 1, the executor and the access method are asking the buffer pool for pages by id. Week 4 handed us the mechanism: frames, a page table mapping page id to frame, pin counts, dirty flags, and an eviction cycle that picks an unpinned victim and writes it back if dirty. It also handed us two approximations of the unreachable optimum: LRU, and CLOCK as a cheap stand-in for LRU.

The layer above (the access method, weeks 6 to 7) does not care how we choose a victim. It just wants its working set to stay resident: the B+tree root, the upper internal levels, the small dimension table it keeps probing. The layer below (storage, weeks 2 to 3) charges us a real disk read for every miss and a write for every dirty eviction. So the victim choice decides our miss rate, and the miss rate decides whether the query feels instant or stalls on I/O. This lesson is where we make that choice smart enough to survive a scan. It is Milestone M1: the buffer pool is now complete.

Intuition

Picture a small bookshelf you keep your current books on. You usually grab the same five. Then you have to skim a 400-page reference once, cover to cover. If your rule is "keep the book I touched most recently," every page of that reference is now the freshest thing on the shelf, so it shoves off your five regulars. You finish the skim, never open the reference again, and now have to go back to the library for the five books you actually use. That is sequential flooding. The fix is to notice that a page you have seen exactly once is probably a one-timer, no matter how recent it is.

Why recency alone fails

LRU and CLOCK both decide eviction from a single number: the time of the last reference. The LRU-K paper puts the complaint plainly: LRU "decides which page to drop from buffer based on too little information, limiting itself to only the time of last reference" [O'Neil et al. 1993]. Two access patterns expose the weakness.

The first is frequency blindness. A B+tree leaf might be touched once every few hundred page references, while a heap data page is touched once every twenty thousand. Under LRU, if their last accesses happen to line up, they get treated identically, and the index page that deserves to stay can be evicted. LRU cannot see that one page is hot over the long run and the other is not.

The second, and the one this lesson is built around, is sequential flooding. A large sequential scan reads many pages once, in quick succession. Under LRU every one of those pages now carries the most recent timestamp, so they sit at the head of the queue and push out the genuinely hot working set. The scan pages will almost certainly never be read again. So LRU has thrown out the pages it will need and kept the pages it will not. The most recent timestamp is exactly the wrong signal here.

Why recency inverts for a scan

For a one-pass scan the future is the mirror image of LRU's assumption. LRU bets that the most recently used page is the most likely to be reused. In a scan, the most recently read page is the one you just finished with and will never revisit. So the right policy for a pure scan is the opposite of LRU: evict the most recently used page (MRU), or confine the scan to a small ring of frames so it can only ever evict itself. Recency is not wrong in general; it is wrong for the pattern where the next reference is to a brand new page, not an old one.

Reference string over time (hot set H1 H2 H3, then a scan S1..S6) H1 H2 H3 S1 S2 S3 S4 S5 S6 H1 H2 After the scan, LRU holds the freshest pages: the scan, not the hot set S3 S4 S5 S6 H1 and H2 were evicted, so they miss when re-referenced LRU-2 holds the hot set: scan pages have only one reference, so a huge K-distance H1 H2 H3 S6 H1 and H2 survive, so they hit when re-referenced
Figure 1. The same reference string, two outcomes. The hot pages H1 H2 H3 are used before and after the scan. LRU lets the six scan pages overwrite them; LRU-2 keeps them because a page seen only once has no measurable reuse interval and is dropped first.

LRU-K: count to K, not to one

The fix is to look at more than the last reference. LRU-K records the times of the last K references to each page. From those it estimates how often the page is really used, and evicts the page whose reuse looks slowest. The key quantity is the Backward K-distance.

Keep this one thing

Backward K-distance b(p, K) is how far back in the reference string the K-th most recent access to page p sits. LRU-K evicts the page with the largest Backward K-distance. If a page has been seen fewer than K times, its K-distance is infinity, so it is the first to go. Plain LRU is exactly LRU-1.

Read that definition against the scan. A scan page has been referenced once. With K = 2 we need its second most recent reference to compute a finite distance, and there is none, so b(p, 2) is infinity. A hot page has been referenced many times, so its second most recent reference is close behind, giving a small finite distance. LRU-2 evicts the infinite-distance page, which is the scan page. The hot set survives. The single reference that made the scan page look fresh to LRU is exactly what makes it look disposable to LRU-2.

When several pages all have infinite K-distance, because none has been seen K times yet, LRU-K cannot rank them by the rule above, so it falls back to a subsidiary policy (plain LRU is the usual choice) to break the tie. LRU-2 is the common setting in practice: it is the smallest K that can measure an actual interval between two references rather than guess from a single timestamp, and the paper shows it already captures most of the benefit.

Reference string (newest on the right). Same last-access time, different K-distance. older now Page A (hot): references close together b(A,2) is small: 2nd-newest is just behind Page B (one-timer + old hit): references far apart b(B,2) is large: 2nd-newest is far back, so B is evicted
Figure 2. Two pages whose most recent access is identical (rightmost dot). LRU cannot tell them apart. LRU-2 measures the gap back to the second most recent access: page A's gap is short (steady reuse), page B's is long (its earlier touch was a one-off), so B is evicted first.
Exam trap

LRU-K is not LFU (least frequently used). LFU counts every reference a page has ever had and keeps the page with the highest lifetime count, so it cannot forget popularity that has gone stale. LRU-K looks only at the last K references, which is a built-in aging window: a page that was hot last hour and is cold now ages out, because its K-th most recent reference slides into the distant past. The "which statement conflates LRU-K with LFU" item turns on exactly this. LRU-K adapts to changing access patterns; LFU does not.

The Correlated Reference Period

LRU-K has a subtle failure of its own. Suppose a transaction reads a row and then updates it a few milliseconds later. That is two references to the same page in quick succession, but they are one logical event, not evidence that the page is popular over time. If LRU-2 treated them as two independent references, the page's Backward 2-distance would collapse to almost zero and the page would look intensely hot. The next correlated burst on a different page would then evict it.

The paper's fix is the Correlated Reference Period: a short timeout window (canonically a couple of seconds) within which repeated references to the same page collapse into a single logical reference. Read-then-update inside one transaction, a retried statement, successive operations in one process: all of these fold into one. Only references separated by more than the window count as distinct evidence of reuse. Without this, LRU-K would be fooled in the opposite direction from LRU, mistaking a tight correlated burst for long-term popularity.

Race the policies on one trace

Here is the experiment to run. Build a reference string that hits a small hot working set, inject a sequential scan of a length you choose, and run OPT, LRU, CLOCK, and LRU-2 on it side by side with the same pool size. Watch the hit and miss tallies. Then turn on the ring buffer for the scan and watch the hot set stop dying. OPT is the floor nobody can beat, because it cheats by reading the future.

Replacement-policy race step to advance the shared reference string through all four pools
Try this: set frames to 3, scan length to 6, ring off, and run to the end. LRU and CLOCK miss on every re-reference of the hot set after the scan, while LRU-2 keeps it. Now turn the ring on and rerun: LRU recovers because the scan can only evict itself.

LRU-2 eviction, step by step

  1. On a reference to page p, record the current logical time, but first apply the Correlated Reference Period: if the last reference to p is within the window, treat this as the same logical reference instead of a new one.
  2. If p is resident, return its frame. No eviction.
  3. If a free frame exists, load p there.
  4. Otherwise compute each resident page's Backward K-distance: the gap back to its K-th most recent reference, or infinity if it has been seen fewer than K times.
  5. Evict the page with the largest Backward K-distance. Break ties among infinite-distance pages with the subsidiary policy (plain LRU on the single reference each has).
# LRU-2 victim selection, the heart of the policy
def victim(resident, now, hist):
    best, best_dist = None, -1
    for p in resident:
        h = hist[p]                          # last K reference times, newest first
        kdist = (now - h[K-1]) if len(h) >= K else INF
        if kdist > best_dist:            # INF (seen < K times) wins, so a scan page goes first
            best, best_dist = p, kdist
    return best

What real systems actually do

LRU-K is the theory. Production engines rarely implement it literally, because tracking K timestamps per page and keeping them sorted is expensive at the scale of millions of frames. They reach the same goal, scan resistance, by cheaper means.

In real systems

PostgreSQL does not use plain LRU and does not use LRU-K either. Its buffer manager uses clock-sweep with a per-buffer usage_count that increments on each pin and saturates at BM_MAX_USAGE_COUNT = 5. The clock hand (nextVictimBuffer) sweeps, decrements the count of any unpinned buffer, and takes the first one whose count and pin count are both zero [PG buffer README]. A small usage count gives recently and frequently touched pages several reprieves, which approximates LRU-K's intent without storing timestamps. Against floods it adds ring buffers: a sequential scan is confined to a 256 KB ring (BAS_BULKREAD, small enough to fit in L2 cache), bulk writes to 16 MB, and VACUUM to its own ring [freelist.c]. The README states the motive directly: a page touched only by a big scan "is unlikely to be needed again soon, so instead of running the normal clock-sweep algorithm and blowing out the entire buffer cache," the scan recycles a bounded set of frames. SQLite, by contrast, does use LRU in its default page cache (pcache1.c) [SQLite pragmas].

Main pool: clock-sweep, usage_count 0..5, hand decrements then evicts a 0 5 3 0 2 4 1 hand victim: pin 0 and usage 0 Scan ring (BAS_BULKREAD, ~256 KB): scan can only evict itself S7 S8 S9 S10 The hot pages in the main pool keep their usage counts and survive. A new scan page recycles the oldest ring slot, never touching the clock.
Figure 3. PostgreSQL's two mechanisms. The clock-sweep with a saturating usage count approximates frequency-aware LRU cheaply. The separate ring buffer bounds a scan so it cannot flood the main pool, which is the practical form of scan resistance LRU-K gives you in theory.

Prefetching and scan sharing

Eviction policy decides what to keep. Prefetching decides when to read, and it is the other half of making scans cheap. A disk read is slow because of latency, not just bandwidth, so issuing one read at a time and waiting wastes the device. Prefetch (read-ahead) reads pages the engine predicts it will need before they are requested, overlapping that I/O with the CPU work on pages already in memory. The CMU lecture gives two flavors: sequential prefetch during a scan (read the next blocks while processing the current ones) and index prefetch (fetch the next logical leaf in an index scan, which need not be the next physical page on disk) [CMU 15-445 L06]. In PostgreSQL, effective_io_concurrency (default 16) tells the engine how many concurrent reads it can expect and sets the prefetch distance [PG resource config].

Two related tricks share work instead of duplicating it. Scan sharing (synchronized scans) lets a second query that wants to scan the same table attach its cursor to a scan already in flight, so the two share the pages one read brings in; the engine tracks where the latecomer joined so it can wrap around and finish the part it missed. Buffer pool bypass goes the other way: a large one-shot scan reads pages into query-local memory and never inserts them into the shared pool at all, which is flood prevention by refusing to participate. The ring buffer is the in-between version that PostgreSQL ships by default.

How each policy treats a one-pass scan page versus a hot page.
PolicySignal usedScan pageHot page under flood
OPTfull futureevicted (next use is never)kept
LRU / CLOCKlast referencekept (looks fresh)evicted (flooded out)
MRUlast reference, invertedevicted (just used)kept
LRU-22nd most recent referenceevicted (K-distance infinite)kept
Ring bufferstructural confinementrecycled within ringkept (outside ring)
Deeper: why LRU-2 and not LRU-3 or LRU-10

Larger K remembers more history and reacts more slowly to change, which can help a very stable workload but hurts when access patterns shift, and it costs more storage and comparison work. LRU-2 is the smallest K that measures an actual interval between two references rather than inferring reuse from one timestamp, so it captures the bulk of the benefit at the lowest cost. The paper reports diminishing returns past K = 2 on its workloads, which is why LRU-2 is the practical default and why production systems approximate even that with a small saturating counter rather than true K-timestamp tracking.

Check yourself

Which statement is false?

The false claim is that LRU is optimal. The optimal policy is Belady's MIN, which evicts the page used furthest in the future and requires knowledge no real system has. LRU is a heuristic on the last reference, and a scan defeats it. CLOCK is a cheaper stand-in for LRU.

Which statement conflates LRU-K with LFU?

Keeping the page with the highest lifetime count is LFU, not LRU-K. LRU-K deliberately forgets: it considers only the last K references, so a page that was hot long ago but is cold now ages out. LFU cannot forget stale popularity. That built-in aging is the whole difference.

A query does one full pass over a table and never revisits a page. Which policy fits best, and why?

For a one-pass scan the next reference is to a new page, so the most recently used page is the least likely to be reused. MRU evicts it; a ring buffer confines the scan to a few frames so it can only evict itself. LRU keeps scan pages and evicts the hot set, the exact failure.

Which statement is false?

PostgreSQL does not use plain LRU. It uses clock-sweep with a per-buffer usage_count capped at BM_MAX_USAGE_COUNT (5), plus ring buffers for bulk reads, writes, and VACUUM. The other claims are accurate, including that SQLite's default pcache1 module does use LRU.

What does the Correlated Reference Period prevent?

The Correlated Reference Period collapses references close together in time (read then update in one transaction, a retry, successive operations) into one logical reference, so a tight burst does not shrink the Backward K-distance and wrongly mark the page as hot. The WAL ordering rule in the last option belongs to week 14.

Two resident pages have the same most recent access time. How does LRU-2 choose between them?

LRU cannot distinguish pages with the same last access, but LRU-2 looks back to the second most recent access. The page whose second-newest reference is further back has the larger Backward 2-distance and is evicted. Dirtiness affects the cost of eviction, not the victim choice.

Define Backward K-distance and state the LRU-K eviction rule.
Backward K-distance b(p, K) is the distance back to the K-th most recent reference to page p, or infinity if p has been seen fewer than K times. LRU-K evicts the page with the largest b(p, K), so pages seen fewer than K times go first.
Why is MRU, not LRU, the right policy for a one-pass scan?
In a one-pass scan the next reference is always to a new page, so the page you just read is the one you are least likely to touch again. MRU evicts it and protects the working set; LRU keeps it and evicts the working set.
Primary source
The LRU-K Page Replacement Algorithm for Database Disk Buffering (O'Neil, O'Neil, Weikum, SIGMOD 1993)

Read sections 1 and 2 for the cache-swamping example, the Backward K-distance definition, LRU-2 versus LRU-1, the Correlated Reference Period, and the argument that LRU-K is not LFU. PDF mirror. Pair it with CMU 15-445 Lecture 06 for sequential flooding, prefetching, and scan sharing.

Ask your teacher

Want to step through an LRU-2 eviction on your own reference string, or see exactly how PostgreSQL's usage_count differs from true LRU-K? Ask me to extend the simulator, draw the ring buffer against the clock, or quiz you harder on the K-distance arithmetic. I am your teacher for this course, not just a document.