Heap files, free space, and deleting tuples
Last week we packed rows into one page. A real table is thousands of pages. Where does a new row go without reading all of them, and what happens to the space when a row is deleted?
By the end you can
- Explain why a heap file needs free-space tracking and why a linear page walk does not scale.
- Describe the free space map as a tree that turns "find a page with N free bytes" into a logarithmic lookup.
- Say exactly what DELETE does (mark dead, do not erase) and why immediate erasure is the wrong default under MVCC.
- List the four jobs of VACUUM and tell plain VACUUM apart from VACUUM FULL.
- Explain how a value too wide for a page is stored out of line by TOAST rather than spanning pages.
Stay on the course's running query, SELECT name FROM emp WHERE salary > 75000 ORDER BY name.
When the executor wants the rows of emp, the access method asks the buffer pool
(week 4) for heap pages by page id, and the buffer pool asks
the storage manager to read them off disk. Last week we
saw the layout inside one of those pages: a header, a slot array growing down, tuple data growing up.
This week zooms out one level. A table is not one page; it is a heap file of many pages. Two new questions
appear that a single page never raised. On the write path of the second journey
(UPDATE account SET balance = balance - 100), the engine must find a page with room for the
new row version. On any path, when a row dies, the engine must decide what to do with the hole.
So this layer takes a request for a page by id from the buffer pool above, and below it owns the file: which pages exist, how much free space each one has, and which tuples in them are still alive. It hands decoded tuples back up. The dead-tuple story here is also the bottom half of MVCC, which you meet properly in week 13: the reason DELETE leaves a tombstone is that a reader on an older snapshot may still need the old row.
A heap file is a pile of pages with no order. Inserting a row is like finding an unfilled box in a warehouse: you do not want to open every box to check. Deleting a row is like crossing a name off a ledger rather than tearing out the page, because someone might still be reading the old entry, and tearing pages out mid-shift is slow. A janitor (VACUUM) comes through later to actually reclaim the crossed-out space.
Problem one: where does an inserted row go?
A heap file is, in the words of the CMU storage notes, "an unordered collection of pages where tuples are stored in random order" [CMU 15-445 L03]. Unordered is deliberate: it makes insertion cheap because a new row can go anywhere there is room. But "anywhere there is room" hides a search. The engine has to locate a page whose free space is at least the size of the row plus its slot.
The naive heap organization is a linked list of pages: a header page points at a chain of data pages, each page links to the next. To insert, you walk the chain asking each page "do you have room?" until one says yes. On a cold table that is fine. On a large table where most pages are full, you may touch hundreds of pages, each a potential disk read, to place one row. The cost is linear in the number of pages. The fix is to keep a compact summary of per-page free space that you can search without reading the pages themselves.
The first improvement is a page directory: special directory pages map each page id to its location and record how much free space it holds, so a lookup is a directory probe instead of a chain walk [CMU 15-445 L03]. PostgreSQL goes further and keeps the free-space summary in its own structure, a tree, so the search is logarithmic rather than a scan of even the directory.
_fsm fork [PG Free Space Map].PostgreSQL stores the free space map as a separate fork file named <filenode>_fsm.
The bottom level keeps one byte of free-space information per heap page (256 levels, so it is approximate,
not exact). Within each FSM page is a binary tree where each non-leaf node holds the larger of its
children, and upper FSM levels aggregate the lower ones, so a search for "a page with at least N free
bytes" walks down the tree rather than scanning a list
[PG Free Space Map]. The detail
lives in src/backend/storage/freespace/README.
Problem two: what happens to a deleted row?
The obvious answer, erase the bytes and slide the survivors up to close the hole, is the wrong default,
for two reasons. First, under multi-version concurrency control another transaction running on an older
snapshot may still need to see the row you just deleted; erasing it would break that reader. Second,
immediate compaction is expensive and conflicts with anyone reading the page right now. So PostgreSQL marks
the row dead instead of erasing it. It sets the deleting transaction id in the tuple header field
t_xmax, and only later does the line pointer move to a dead state and the space become reusable
[PG Routine Vacuuming].
This is the tombstone idea. The deletion is recorded, the readers that still need the old version can
find it, and a separate maintenance pass reclaims the space once no live transaction can see it. The trade
is explicit: deletes and updates are cheap and do not block, at the price of bloat that something else must
clean up, and at the price of readers skipping dead versions until then. An UPDATE is delete-plus-insert
under the hood: it writes a new version and stamps the old one's t_xmax, which is why heavy
update workloads bloat tables [PG Routine Vacuuming].
You will see the snapshot machinery that depends on this in week 13.
Simulator: the heap insert router with FSM and VACUUM
Run two experiments. First, watch the cost of finding a page. Switch routing between a linked-list walk and an FSM tree jump and insert rows; the I/O counter tells the story. Second, watch bloat appear and get cleaned: delete rows to create dead tuples, then press VACUUM to reclaim them and possibly free a trailing page back to the file.
Inserting a row is a search for free space, and the free space map makes that search logarithmic instead of linear. Deleting a row is a mark, not an erase, and VACUUM is the only thing that turns the marked space back into usable space.
VACUUM does four jobs, not one
It is tempting to read VACUUM as "the disk-space cleaner," but the PostgreSQL documentation lists four purposes, and three of them have nothing to do with shrinking files [PG Routine Vacuuming].
| Job | What it does | Why it matters |
|---|---|---|
| Reclaim dead space | Marks space from dead row versions reusable | Bounds table bloat from UPDATE and DELETE |
| Update statistics | Refreshes planner stats (via ANALYZE) | The optimizer needs current row counts and distributions (week 11) |
| Maintain the visibility map | Flags pages where all tuples are visible to everyone | Lets later vacuums skip clean pages and lets index-only scans avoid heap fetches |
| Freeze old XIDs | Marks sufficiently old rows as frozen | Prevents 32-bit transaction id wraparound, which would otherwise make old rows appear to be in the future |
Transaction ids in PostgreSQL are 32 bits, and they are compared modulo 2^32: for every normal XID, two billion XIDs count as older and two billion as newer. Let the counter lap without maintenance and old committed rows suddenly compare as "in the future" and become invisible, which is silent data loss. The defense is to mark old, all-visible rows as frozen with a special XID that always compares as older than every normal one, so wraparound cannot touch them. The hard rule: every table must be vacuumed at least once every roughly two billion transactions [PG Routine Vacuuming].
Plain VACUUM and VACUUM FULL are not two settings of the same operation. Plain VACUUM marks dead space reusable in place, takes a SHARE UPDATE EXCLUSIVE lock so it runs alongside queries, and generally does not return space to the operating system. It only shrinks the file when whole trailing pages happen to be empty, which is exactly the truncation step the simulator shows. VACUUM FULL rewrites the entire table into a new file, does return space to the OS, but takes an ACCESS EXCLUSIVE lock that blocks all access to the table while it runs. The documentation recommends against VACUUM FULL in routine operation [PG Routine Vacuuming].
| Property | Plain VACUUM | VACUUM FULL |
|---|---|---|
| How it reclaims | Marks dead space reusable in place | Rewrites the whole table into a new file |
| Returns space to OS | Rarely (only empty trailing pages) | Yes, the old file is dropped |
| Lock held | SHARE UPDATE EXCLUSIVE (queries continue) | ACCESS EXCLUSIVE (blocks all access) |
| Routine use | Yes, this is the normal path | No, the docs advise avoiding it |
"DELETE frees disk space immediately" is false, and so is "VACUUM FULL is the normal, non-blocking way to reclaim space." DELETE only tombstones the row. Plain VACUUM reclaims in place and usually does not shrink the file. VACUUM FULL does shrink it but locks the whole table. Mixing these up is the most common storage-layer mistake.
When a value is too big for a page
A tuple cannot span pages. So what happens to a 5 KB text value when the page is 8 KB and a
row also carries other columns? PostgreSQL uses TOAST, The Oversized-Attribute Storage Technique. A row
wider than about 2 KB (the page divided by four) triggers it. The wide value is compressed, and if still too
big it is pushed out of line into a separate TOAST table whose rows are chunks of about 2000 bytes, sized so
four chunks fit on a page. What stays in the original row is an 18-byte pointer datum recording the TOAST
table, the value id, and its sizes. A TOAST-able field can hold up to 1 GB this way
[PG TOAST].
name never drags the wide bio through the buffer pool, and the
row still fits on one page. This is why a value too large for a page does not make a row span pages.The insert algorithm, end to end
- Compute the bytes the new row needs, including its slot and any alignment pad.
- If the row is wider than the TOAST threshold, compress and move oversized columns out of line first, leaving an 18-byte pointer.
- Ask the free space map for a page with at least that many free bytes; descend the tree to a fitting page in logarithmic time.
- If no page fits, extend the heap file with a fresh page at the end.
- Pin the chosen page in the buffer pool, allocate a slot, write the tuple, and update
pd_lowerandpd_upper. - Update the page's free-space entry in the FSM so the next search sees the new value.
# insert into a heap file, FSM-routed (grounded in PostgreSQL behavior)
def heap_insert(rel, row):
row = toast_if_oversized(rel, row) # wide columns go out of line
need = len(row) + SLOT_SIZE
page_id = fsm_find_page(rel, need) # logarithmic, not a list walk
if page_id is None:
page_id = rel.extend() # add a page at the end of the file
page = buffer_pool.pin(page_id)
slot = page.alloc_slot(row) # pd_lower up, pd_upper down
fsm_update(rel, page_id, page.free_bytes())
return (page_id, slot) # the stable record id
def heap_delete(page, slot, xid):
tup = page.tuple(slot)
tup.t_xmax = xid # tombstone: mark dead, do not erase
# space becomes reusable only later, in VACUUM
Deeper: why the FSM is approximate, and what HOT does
The FSM stores free space as one byte per page, so it tracks 256 levels rather than exact byte counts. That is a deliberate space-for-precision trade: an approximate map is small enough to cache and walk cheaply, and the cost of being slightly wrong is that a search may occasionally land on a page that turns out a little too small, in which case the insert retries. Exact per-page free byte counts would bloat the map for little benefit.
Heap-only tuple (HOT) updates are a related optimization. When an UPDATE does not change any indexed
column and the new version fits on the same page, PostgreSQL chains the new version to the old one
through the line pointer (the LP_REDIRECT state) so the indexes do not need new entries.
The index still points at the original line pointer, which redirects to the current version on the same
page. This keeps index maintenance and bloat down on update-heavy tables. The four line-pointer states
(LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD) are
defined in src/include/storage/itemid.h.
Check yourself
Which statement about DELETE in a PostgreSQL heap file is false?
DELETE marks the row dead by setting xmax; it does not erase the bytes or free disk. Under MVCC older readers may still need the version, and reclaiming space is VACUUM's job. The other three statements are accurate.
How does the free space map make "find a page with N free bytes" cheap?
The FSM is a tree where each internal node holds the larger of its children, so the root holds the maximum free space and a descent toward a child whose value is at least N reaches a fitting page in logarithmic time. The heap stays unordered; the FSM is a separate fork.
Which claim about plain VACUUM versus VACUUM FULL is false?
VACUUM FULL takes an ACCESS EXCLUSIVE lock that blocks all access and rewrites the table; the docs advise against using it routinely. Plain VACUUM is the normal path. The other statements are correct.
Beyond reclaiming dead space, which is something VACUUM does not do?
VACUUM's four jobs are reclaiming dead space, updating statistics, maintaining the visibility map, and freezing old XIDs. It does not rebalance B+tree indexes; index structure maintenance is the index access method's own concern.
A row in emp has a 5 KB bio column on an 8 KB page. What happens on insert?
A tuple cannot span pages. TOAST compresses the wide value and, if needed, stores it out of line in a TOAST table as ~2000-byte chunks, leaving an 18-byte pointer in the heap row. Page size is fixed at compile time and is not changed per table.
Why is a linked-list heap organization a poor way to find free space at scale?
A list walk asks each page in turn whether it has room, and each check can be a disk read, so the worst case is linear in the number of pages. The FSM replaces that with a logarithmic tree search. Heaps are unordered, so the key-order and re-sort options are wrong.
The authoritative account of the four purposes of VACUUM, the roughly two-billion transaction XID wraparound bound and freezing, and the difference between plain VACUUM (in place, SHARE UPDATE EXCLUSIVE) and VACUUM FULL (rewrite, ACCESS EXCLUSIVE). Pair it with the Free Space Map and TOAST pages for the insert and wide-value mechanics.
Want to see the FSM tree descent drawn step by step, or trace what happens to a row's xmin and xmax across an UPDATE that becomes a HOT chain? Ask me to go deeper or quiz you harder. I am your teacher for this course, not just a document.