Field Guide / Part II · Storage / Week 03
Storage engine

Heap files, free space, and deleting tuples

Last week we packed rows into one page. A real table is thousands of pages. Where does a new row go without reading all of them, and what happens to the space when a row is deleted?

By the end you can

  • Explain why a heap file needs free-space tracking and why a linear page walk does not scale.
  • Describe the free space map as a tree that turns "find a page with N free bytes" into a logarithmic lookup.
  • Say exactly what DELETE does (mark dead, do not erase) and why immediate erasure is the wrong default under MVCC.
  • List the four jobs of VACUUM and tell plain VACUUM apart from VACUUM FULL.
  • Explain how a value too wide for a page is stored out of line by TOAST rather than spanning pages.

Stay on the course's running query, SELECT name FROM emp WHERE salary > 75000 ORDER BY name. When the executor wants the rows of emp, the access method asks the buffer pool (week 4) for heap pages by page id, and the buffer pool asks the storage manager to read them off disk. Last week we saw the layout inside one of those pages: a header, a slot array growing down, tuple data growing up. This week zooms out one level. A table is not one page; it is a heap file of many pages. Two new questions appear that a single page never raised. On the write path of the second journey (UPDATE account SET balance = balance - 100), the engine must find a page with room for the new row version. On any path, when a row dies, the engine must decide what to do with the hole.

So this layer takes a request for a page by id from the buffer pool above, and below it owns the file: which pages exist, how much free space each one has, and which tuples in them are still alive. It hands decoded tuples back up. The dead-tuple story here is also the bottom half of MVCC, which you meet properly in week 13: the reason DELETE leaves a tombstone is that a reader on an older snapshot may still need the old row.

Intuition

A heap file is a pile of pages with no order. Inserting a row is like finding an unfilled box in a warehouse: you do not want to open every box to check. Deleting a row is like crossing a name off a ledger rather than tearing out the page, because someone might still be reading the old entry, and tearing pages out mid-shift is slow. A janitor (VACUUM) comes through later to actually reclaim the crossed-out space.

Problem one: where does an inserted row go?

A heap file is, in the words of the CMU storage notes, "an unordered collection of pages where tuples are stored in random order" [CMU 15-445 L03]. Unordered is deliberate: it makes insertion cheap because a new row can go anywhere there is room. But "anywhere there is room" hides a search. The engine has to locate a page whose free space is at least the size of the row plus its slot.

Why a list walk does not scale

The naive heap organization is a linked list of pages: a header page points at a chain of data pages, each page links to the next. To insert, you walk the chain asking each page "do you have room?" until one says yes. On a cold table that is fine. On a large table where most pages are full, you may touch hundreds of pages, each a potential disk read, to place one row. The cost is linear in the number of pages. The fix is to keep a compact summary of per-page free space that you can search without reading the pages themselves.

The first improvement is a page directory: special directory pages map each page id to its location and record how much free space it holds, so a lookup is a directory probe instead of a chain walk [CMU 15-445 L03]. PostgreSQL goes further and keeps the free-space summary in its own structure, a tree, so the search is logarithmic rather than a scan of even the directory.

Heap file (the _main fork): pages 0..5 pg0 free 12 pg1 free 200 pg2 free 40 pg3 free 8 Free space map (the _fsm fork): each node holds the max of its children root 200 200 40 12 200 40 8
Figure 1. Each leaf of the FSM stores one page's free space (quantized to a byte). Every internal node holds the larger of its children, so the root holds the maximum free space anywhere. To place a 150-byte row you descend the tree, always toward a child whose value is at least 150, and reach a fitting page without reading any heap page. PostgreSQL builds this in a separate _fsm fork [PG Free Space Map].
In real systems

PostgreSQL stores the free space map as a separate fork file named <filenode>_fsm. The bottom level keeps one byte of free-space information per heap page (256 levels, so it is approximate, not exact). Within each FSM page is a binary tree where each non-leaf node holds the larger of its children, and upper FSM levels aggregate the lower ones, so a search for "a page with at least N free bytes" walks down the tree rather than scanning a list [PG Free Space Map]. The detail lives in src/backend/storage/freespace/README.

Problem two: what happens to a deleted row?

The obvious answer, erase the bytes and slide the survivors up to close the hole, is the wrong default, for two reasons. First, under multi-version concurrency control another transaction running on an older snapshot may still need to see the row you just deleted; erasing it would break that reader. Second, immediate compaction is expensive and conflicts with anyone reading the page right now. So PostgreSQL marks the row dead instead of erasing it. It sets the deleting transaction id in the tuple header field t_xmax, and only later does the line pointer move to a dead state and the space become reusable [PG Routine Vacuuming].

This is the tombstone idea. The deletion is recorded, the readers that still need the old version can find it, and a separate maintenance pass reclaims the space once no live transaction can see it. The trade is explicit: deletes and updates are cheap and do not block, at the price of bloat that something else must clean up, and at the price of readers skipping dead versions until then. An UPDATE is delete-plus-insert under the hood: it writes a new version and stamps the old one's t_xmax, which is why heavy update workloads bloat tables [PG Routine Vacuuming]. You will see the snapshot machinery that depends on this in week 13.

live xmax = 0 DELETE deleted xmax set, kept prune dead pointer LP_DEAD VACUUM reusable free again
Figure 2. A row does not vanish on DELETE. It becomes a tombstone (xmax set), then a dead line pointer once no snapshot needs it, and only VACUUM returns the space to the page's free pool. The bytes can sit dead for a while; that gap is what the term "bloat" names.

Simulator: the heap insert router with FSM and VACUUM

Run two experiments. First, watch the cost of finding a page. Switch routing between a linked-list walk and an FSM tree jump and insert rows; the I/O counter tells the story. Second, watch bloat appear and get cleaned: delete rows to create dead tuples, then press VACUUM to reclaim them and possibly free a trailing page back to the file.

Heap insert router with FSM and VACUUM insert and delete rows, switch routing, then vacuum the bloat
Experiment: insert several rows under linked-list routing and note the pages touched per insert, then switch to FSM and insert again. Then delete a few rows to grow dead space and press VACUUM.
Keep this one thing

Inserting a row is a search for free space, and the free space map makes that search logarithmic instead of linear. Deleting a row is a mark, not an erase, and VACUUM is the only thing that turns the marked space back into usable space.

VACUUM does four jobs, not one

It is tempting to read VACUUM as "the disk-space cleaner," but the PostgreSQL documentation lists four purposes, and three of them have nothing to do with shrinking files [PG Routine Vacuuming].

The four jobs of VACUUM.
JobWhat it doesWhy it matters
Reclaim dead spaceMarks space from dead row versions reusableBounds table bloat from UPDATE and DELETE
Update statisticsRefreshes planner stats (via ANALYZE)The optimizer needs current row counts and distributions (week 11)
Maintain the visibility mapFlags pages where all tuples are visible to everyoneLets later vacuums skip clean pages and lets index-only scans avoid heap fetches
Freeze old XIDsMarks sufficiently old rows as frozenPrevents 32-bit transaction id wraparound, which would otherwise make old rows appear to be in the future
Why freezing exists at all

Transaction ids in PostgreSQL are 32 bits, and they are compared modulo 2^32: for every normal XID, two billion XIDs count as older and two billion as newer. Let the counter lap without maintenance and old committed rows suddenly compare as "in the future" and become invisible, which is silent data loss. The defense is to mark old, all-visible rows as frozen with a special XID that always compares as older than every normal one, so wraparound cannot touch them. The hard rule: every table must be vacuumed at least once every roughly two billion transactions [PG Routine Vacuuming].

Plain VACUUM and VACUUM FULL are not two settings of the same operation. Plain VACUUM marks dead space reusable in place, takes a SHARE UPDATE EXCLUSIVE lock so it runs alongside queries, and generally does not return space to the operating system. It only shrinks the file when whole trailing pages happen to be empty, which is exactly the truncation step the simulator shows. VACUUM FULL rewrites the entire table into a new file, does return space to the OS, but takes an ACCESS EXCLUSIVE lock that blocks all access to the table while it runs. The documentation recommends against VACUUM FULL in routine operation [PG Routine Vacuuming].

Plain VACUUM versus VACUUM FULL.
PropertyPlain VACUUMVACUUM FULL
How it reclaimsMarks dead space reusable in placeRewrites the whole table into a new file
Returns space to OSRarely (only empty trailing pages)Yes, the old file is dropped
Lock heldSHARE UPDATE EXCLUSIVE (queries continue)ACCESS EXCLUSIVE (blocks all access)
Routine useYes, this is the normal pathNo, the docs advise avoiding it
Exam trap

"DELETE frees disk space immediately" is false, and so is "VACUUM FULL is the normal, non-blocking way to reclaim space." DELETE only tombstones the row. Plain VACUUM reclaims in place and usually does not shrink the file. VACUUM FULL does shrink it but locks the whole table. Mixing these up is the most common storage-layer mistake.

When a value is too big for a page

A tuple cannot span pages. So what happens to a 5 KB text value when the page is 8 KB and a row also carries other columns? PostgreSQL uses TOAST, The Oversized-Attribute Storage Technique. A row wider than about 2 KB (the page divided by four) triggers it. The wide value is compressed, and if still too big it is pushed out of line into a separate TOAST table whose rows are chunks of about 2000 bytes, sized so four chunks fit on a page. What stays in the original row is an 18-byte pointer datum recording the TOAST table, the value id, and its sizes. A TOAST-able field can hold up to 1 GB this way [PG TOAST].

heap row in emp id name bio (18-byte TOAST pointer) emp_toast table: chunks of ~2000 bytes chunk 0 chunk 1 chunk 2
Figure 3. The row keeps a small fixed pointer; the bulky value lives elsewhere in chunks. Reading name never drags the wide bio through the buffer pool, and the row still fits on one page. This is why a value too large for a page does not make a row span pages.

The insert algorithm, end to end

  1. Compute the bytes the new row needs, including its slot and any alignment pad.
  2. If the row is wider than the TOAST threshold, compress and move oversized columns out of line first, leaving an 18-byte pointer.
  3. Ask the free space map for a page with at least that many free bytes; descend the tree to a fitting page in logarithmic time.
  4. If no page fits, extend the heap file with a fresh page at the end.
  5. Pin the chosen page in the buffer pool, allocate a slot, write the tuple, and update pd_lower and pd_upper.
  6. Update the page's free-space entry in the FSM so the next search sees the new value.
# insert into a heap file, FSM-routed (grounded in PostgreSQL behavior)
def heap_insert(rel, row):
    row = toast_if_oversized(rel, row)        # wide columns go out of line
    need = len(row) + SLOT_SIZE
    page_id = fsm_find_page(rel, need)        # logarithmic, not a list walk
    if page_id is None:
        page_id = rel.extend()                # add a page at the end of the file
    page = buffer_pool.pin(page_id)
    slot = page.alloc_slot(row)               # pd_lower up, pd_upper down
    fsm_update(rel, page_id, page.free_bytes())
    return (page_id, slot)                     # the stable record id

def heap_delete(page, slot, xid):
    tup = page.tuple(slot)
    tup.t_xmax = xid                          # tombstone: mark dead, do not erase
    # space becomes reusable only later, in VACUUM
Deeper: why the FSM is approximate, and what HOT does

The FSM stores free space as one byte per page, so it tracks 256 levels rather than exact byte counts. That is a deliberate space-for-precision trade: an approximate map is small enough to cache and walk cheaply, and the cost of being slightly wrong is that a search may occasionally land on a page that turns out a little too small, in which case the insert retries. Exact per-page free byte counts would bloat the map for little benefit.

Heap-only tuple (HOT) updates are a related optimization. When an UPDATE does not change any indexed column and the new version fits on the same page, PostgreSQL chains the new version to the old one through the line pointer (the LP_REDIRECT state) so the indexes do not need new entries. The index still points at the original line pointer, which redirects to the current version on the same page. This keeps index maintenance and bloat down on update-heavy tables. The four line-pointer states (LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD) are defined in src/include/storage/itemid.h.

Check yourself

Which statement about DELETE in a PostgreSQL heap file is false?

DELETE marks the row dead by setting xmax; it does not erase the bytes or free disk. Under MVCC older readers may still need the version, and reclaiming space is VACUUM's job. The other three statements are accurate.

How does the free space map make "find a page with N free bytes" cheap?

The FSM is a tree where each internal node holds the larger of its children, so the root holds the maximum free space and a descent toward a child whose value is at least N reaches a fitting page in logarithmic time. The heap stays unordered; the FSM is a separate fork.

Which claim about plain VACUUM versus VACUUM FULL is false?

VACUUM FULL takes an ACCESS EXCLUSIVE lock that blocks all access and rewrites the table; the docs advise against using it routinely. Plain VACUUM is the normal path. The other statements are correct.

Beyond reclaiming dead space, which is something VACUUM does not do?

VACUUM's four jobs are reclaiming dead space, updating statistics, maintaining the visibility map, and freezing old XIDs. It does not rebalance B+tree indexes; index structure maintenance is the index access method's own concern.

A row in emp has a 5 KB bio column on an 8 KB page. What happens on insert?

A tuple cannot span pages. TOAST compresses the wide value and, if needed, stores it out of line in a TOAST table as ~2000-byte chunks, leaving an 18-byte pointer in the heap row. Page size is fixed at compile time and is not changed per table.

Why is a linked-list heap organization a poor way to find free space at scale?

A list walk asks each page in turn whether it has room, and each check can be a disk read, so the worst case is linear in the number of pages. The FSM replaces that with a logarithmic tree search. Heaps are unordered, so the key-order and re-sort options are wrong.

In one sentence: what does DELETE actually do to a row, and when is the space reusable?
DELETE stamps the deleting transaction id in xmax to mark the row dead (a tombstone); the space becomes reusable only when a later VACUUM reclaims it, because older snapshots may still need the version.
State the structure and the guarantee of the free space map.
A separate fork storing one byte of free space per page in a tree where each internal node holds the max of its children, so "find a page with N free bytes" is a logarithmic descent rather than a linear page-list walk.
Primary source

The authoritative account of the four purposes of VACUUM, the roughly two-billion transaction XID wraparound bound and freezing, and the difference between plain VACUUM (in place, SHARE UPDATE EXCLUSIVE) and VACUUM FULL (rewrite, ACCESS EXCLUSIVE). Pair it with the Free Space Map and TOAST pages for the insert and wide-value mechanics.

Ask your teacher

Want to see the FSM tree descent drawn step by step, or trace what happens to a row's xmin and xmax across an UPDATE that becomes a HOT chain? Ask me to go deeper or quiz you harder. I am your teacher for this course, not just a document.