LRU-K, sequential flooding, and prefetching
Last week we built the buffer pool and let it evict by recency. This week we watch that decision fail. One large table scan can throw out the pages everyone else needs, because recency is the wrong signal for a page that gets read once and never again.
By the end you can
- Explain sequential flooding and why LRU and CLOCK are fooled by a one-pass scan.
- Define Backward K-distance and walk an LRU-2 eviction by hand.
- Say why LRU-K is not LFU, and what the Correlated Reference Period fixes.
- Describe how PostgreSQL bounds flooding with clock-sweep usage counts and ring buffers.
- Answer MCQ-style traps on optimality, MRU for scans, and what real systems use.
Follow the course's running query down to this layer:
SELECT name FROM emp WHERE salary > 75000 ORDER BY name. By step 6 of
journey 1, the executor and the access method are asking
the buffer pool for pages by id. Week 4 handed us the
mechanism: frames, a page table mapping page id to frame, pin counts, dirty flags, and an
eviction cycle that picks an unpinned victim and writes it back if dirty. It also handed us two
approximations of the unreachable optimum: LRU, and CLOCK as a cheap stand-in for LRU.
The layer above (the access method, weeks 6 to 7) does not care how we choose a victim. It just wants its working set to stay resident: the B+tree root, the upper internal levels, the small dimension table it keeps probing. The layer below (storage, weeks 2 to 3) charges us a real disk read for every miss and a write for every dirty eviction. So the victim choice decides our miss rate, and the miss rate decides whether the query feels instant or stalls on I/O. This lesson is where we make that choice smart enough to survive a scan. It is Milestone M1: the buffer pool is now complete.
Picture a small bookshelf you keep your current books on. You usually grab the same five. Then you have to skim a 400-page reference once, cover to cover. If your rule is "keep the book I touched most recently," every page of that reference is now the freshest thing on the shelf, so it shoves off your five regulars. You finish the skim, never open the reference again, and now have to go back to the library for the five books you actually use. That is sequential flooding. The fix is to notice that a page you have seen exactly once is probably a one-timer, no matter how recent it is.
Why recency alone fails
LRU and CLOCK both decide eviction from a single number: the time of the last reference. The LRU-K paper puts the complaint plainly: LRU "decides which page to drop from buffer based on too little information, limiting itself to only the time of last reference" [O'Neil et al. 1993]. Two access patterns expose the weakness.
The first is frequency blindness. A B+tree leaf might be touched once every few hundred page references, while a heap data page is touched once every twenty thousand. Under LRU, if their last accesses happen to line up, they get treated identically, and the index page that deserves to stay can be evicted. LRU cannot see that one page is hot over the long run and the other is not.
The second, and the one this lesson is built around, is sequential flooding. A large sequential scan reads many pages once, in quick succession. Under LRU every one of those pages now carries the most recent timestamp, so they sit at the head of the queue and push out the genuinely hot working set. The scan pages will almost certainly never be read again. So LRU has thrown out the pages it will need and kept the pages it will not. The most recent timestamp is exactly the wrong signal here.
For a one-pass scan the future is the mirror image of LRU's assumption. LRU bets that the most recently used page is the most likely to be reused. In a scan, the most recently read page is the one you just finished with and will never revisit. So the right policy for a pure scan is the opposite of LRU: evict the most recently used page (MRU), or confine the scan to a small ring of frames so it can only ever evict itself. Recency is not wrong in general; it is wrong for the pattern where the next reference is to a brand new page, not an old one.
LRU-K: count to K, not to one
The fix is to look at more than the last reference. LRU-K records the times of the last K references to each page. From those it estimates how often the page is really used, and evicts the page whose reuse looks slowest. The key quantity is the Backward K-distance.
Backward K-distance b(p, K) is how far back in the reference string the K-th most recent access to page p sits. LRU-K evicts the page with the largest Backward K-distance. If a page has been seen fewer than K times, its K-distance is infinity, so it is the first to go. Plain LRU is exactly LRU-1.
Read that definition against the scan. A scan page has been referenced once. With K = 2 we need its second most recent reference to compute a finite distance, and there is none, so b(p, 2) is infinity. A hot page has been referenced many times, so its second most recent reference is close behind, giving a small finite distance. LRU-2 evicts the infinite-distance page, which is the scan page. The hot set survives. The single reference that made the scan page look fresh to LRU is exactly what makes it look disposable to LRU-2.
When several pages all have infinite K-distance, because none has been seen K times yet, LRU-K cannot rank them by the rule above, so it falls back to a subsidiary policy (plain LRU is the usual choice) to break the tie. LRU-2 is the common setting in practice: it is the smallest K that can measure an actual interval between two references rather than guess from a single timestamp, and the paper shows it already captures most of the benefit.
LRU-K is not LFU (least frequently used). LFU counts every reference a page has ever had and keeps the page with the highest lifetime count, so it cannot forget popularity that has gone stale. LRU-K looks only at the last K references, which is a built-in aging window: a page that was hot last hour and is cold now ages out, because its K-th most recent reference slides into the distant past. The "which statement conflates LRU-K with LFU" item turns on exactly this. LRU-K adapts to changing access patterns; LFU does not.
The Correlated Reference Period
LRU-K has a subtle failure of its own. Suppose a transaction reads a row and then updates it a few milliseconds later. That is two references to the same page in quick succession, but they are one logical event, not evidence that the page is popular over time. If LRU-2 treated them as two independent references, the page's Backward 2-distance would collapse to almost zero and the page would look intensely hot. The next correlated burst on a different page would then evict it.
The paper's fix is the Correlated Reference Period: a short timeout window (canonically a couple of seconds) within which repeated references to the same page collapse into a single logical reference. Read-then-update inside one transaction, a retried statement, successive operations in one process: all of these fold into one. Only references separated by more than the window count as distinct evidence of reuse. Without this, LRU-K would be fooled in the opposite direction from LRU, mistaking a tight correlated burst for long-term popularity.
Race the policies on one trace
Here is the experiment to run. Build a reference string that hits a small hot working set, inject a sequential scan of a length you choose, and run OPT, LRU, CLOCK, and LRU-2 on it side by side with the same pool size. Watch the hit and miss tallies. Then turn on the ring buffer for the scan and watch the hot set stop dying. OPT is the floor nobody can beat, because it cheats by reading the future.
LRU-2 eviction, step by step
- On a reference to page p, record the current logical time, but first apply the Correlated Reference Period: if the last reference to p is within the window, treat this as the same logical reference instead of a new one.
- If p is resident, return its frame. No eviction.
- If a free frame exists, load p there.
- Otherwise compute each resident page's Backward K-distance: the gap back to its K-th most recent reference, or infinity if it has been seen fewer than K times.
- Evict the page with the largest Backward K-distance. Break ties among infinite-distance pages with the subsidiary policy (plain LRU on the single reference each has).
# LRU-2 victim selection, the heart of the policy
def victim(resident, now, hist):
best, best_dist = None, -1
for p in resident:
h = hist[p] # last K reference times, newest first
kdist = (now - h[K-1]) if len(h) >= K else INF
if kdist > best_dist: # INF (seen < K times) wins, so a scan page goes first
best, best_dist = p, kdist
return best
What real systems actually do
LRU-K is the theory. Production engines rarely implement it literally, because tracking K timestamps per page and keeping them sorted is expensive at the scale of millions of frames. They reach the same goal, scan resistance, by cheaper means.
PostgreSQL does not use plain LRU and does not use LRU-K either. Its buffer manager uses
clock-sweep with a per-buffer usage_count that increments on each pin and saturates at
BM_MAX_USAGE_COUNT = 5. The clock hand (nextVictimBuffer) sweeps, decrements
the count of any unpinned buffer, and takes the first one whose count and pin count are both zero
[PG buffer README].
A small usage count gives recently and frequently touched pages several reprieves, which approximates
LRU-K's intent without storing timestamps. Against floods it adds ring buffers: a sequential scan is
confined to a 256 KB ring (BAS_BULKREAD, small enough to fit in L2 cache), bulk writes to
16 MB, and VACUUM to its own ring
[freelist.c].
The README states the motive directly: a page touched only by a big scan "is unlikely to be needed
again soon, so instead of running the normal clock-sweep algorithm and blowing out the entire buffer
cache," the scan recycles a bounded set of frames. SQLite, by contrast, does use LRU in its default
page cache (pcache1.c) [SQLite pragmas].
Prefetching and scan sharing
Eviction policy decides what to keep. Prefetching decides when to read, and it is the other half of
making scans cheap. A disk read is slow because of latency, not just bandwidth, so issuing one read at
a time and waiting wastes the device. Prefetch (read-ahead) reads pages the engine predicts it will
need before they are requested, overlapping that I/O with the CPU work on pages already in memory. The
CMU lecture gives two flavors: sequential prefetch during a scan (read the next blocks while processing
the current ones) and index prefetch (fetch the next logical leaf in an index scan, which need not be
the next physical page on disk)
[CMU 15-445 L06].
In PostgreSQL, effective_io_concurrency (default 16) tells the engine how many concurrent
reads it can expect and sets the prefetch distance
[PG resource config].
Two related tricks share work instead of duplicating it. Scan sharing (synchronized scans) lets a second query that wants to scan the same table attach its cursor to a scan already in flight, so the two share the pages one read brings in; the engine tracks where the latecomer joined so it can wrap around and finish the part it missed. Buffer pool bypass goes the other way: a large one-shot scan reads pages into query-local memory and never inserts them into the shared pool at all, which is flood prevention by refusing to participate. The ring buffer is the in-between version that PostgreSQL ships by default.
| Policy | Signal used | Scan page | Hot page under flood |
|---|---|---|---|
| OPT | full future | evicted (next use is never) | kept |
| LRU / CLOCK | last reference | kept (looks fresh) | evicted (flooded out) |
| MRU | last reference, inverted | evicted (just used) | kept |
| LRU-2 | 2nd most recent reference | evicted (K-distance infinite) | kept |
| Ring buffer | structural confinement | recycled within ring | kept (outside ring) |
Deeper: why LRU-2 and not LRU-3 or LRU-10
Larger K remembers more history and reacts more slowly to change, which can help a very stable workload but hurts when access patterns shift, and it costs more storage and comparison work. LRU-2 is the smallest K that measures an actual interval between two references rather than inferring reuse from one timestamp, so it captures the bulk of the benefit at the lowest cost. The paper reports diminishing returns past K = 2 on its workloads, which is why LRU-2 is the practical default and why production systems approximate even that with a small saturating counter rather than true K-timestamp tracking.
Check yourself
Which statement is false?
The false claim is that LRU is optimal. The optimal policy is Belady's MIN, which evicts the page used furthest in the future and requires knowledge no real system has. LRU is a heuristic on the last reference, and a scan defeats it. CLOCK is a cheaper stand-in for LRU.
Which statement conflates LRU-K with LFU?
Keeping the page with the highest lifetime count is LFU, not LRU-K. LRU-K deliberately forgets: it considers only the last K references, so a page that was hot long ago but is cold now ages out. LFU cannot forget stale popularity. That built-in aging is the whole difference.
A query does one full pass over a table and never revisits a page. Which policy fits best, and why?
For a one-pass scan the next reference is to a new page, so the most recently used page is the least likely to be reused. MRU evicts it; a ring buffer confines the scan to a few frames so it can only evict itself. LRU keeps scan pages and evicts the hot set, the exact failure.
Which statement is false?
PostgreSQL does not use plain LRU. It uses clock-sweep with a per-buffer usage_count capped at BM_MAX_USAGE_COUNT (5), plus ring buffers for bulk reads, writes, and VACUUM. The other claims are accurate, including that SQLite's default pcache1 module does use LRU.
What does the Correlated Reference Period prevent?
The Correlated Reference Period collapses references close together in time (read then update in one transaction, a retry, successive operations) into one logical reference, so a tight burst does not shrink the Backward K-distance and wrongly mark the page as hot. The WAL ordering rule in the last option belongs to week 14.
Two resident pages have the same most recent access time. How does LRU-2 choose between them?
LRU cannot distinguish pages with the same last access, but LRU-2 looks back to the second most recent access. The page whose second-newest reference is further back has the larger Backward 2-distance and is evicted. Dirtiness affects the cost of eviction, not the victim choice.
Read sections 1 and 2 for the cache-swamping example, the Backward K-distance definition, LRU-2 versus LRU-1, the Correlated Reference Period, and the argument that LRU-K is not LFU. PDF mirror. Pair it with CMU 15-445 Lecture 06 for sequential flooding, prefetching, and scan sharing.
Want to step through an LRU-2 eviction on your own reference string, or see exactly how PostgreSQL's usage_count differs from true LRU-K? Ask me to extend the simulator, draw the ring buffer against the clock, or quiz you harder on the K-distance arithmetic. I am your teacher for this course, not just a document.