Storage engine: disk layout, pages, and slotted records
A disk moves whole blocks and a row can be any size. So how do you pack variable-length rows into a fixed-size page, find any one of them in a single read, and keep its address stable even after you shuffle the bytes around? The slotted page is the answer, and once you see the constraint it solves, it stops looking arbitrary.
By the end you can
- Say why a flat append-only file breaks on deletes and on addressing, and what a page buys you.
- Draw a slotted page from memory: header, a slot array closing down, tuple data closing up, free space in the gap.
- Explain why a record id is (page, slot) and not a byte offset, and why that indirection lets tuples move.
- Decode a tuple's bytes: header, NULL bitmap, fixed and variable fields, and the alignment pad that makes column order matter.
- Answer MCQ-style traps about pd_lower / pd_upper, line pointer states, and the 32 KB heap page cap.
Where this sits in the engine
Hold the course's running query in your head:
SELECT name FROM emp WHERE salary > 75000 ORDER BY name;
In week 1 we watched that string fall through five components
to the bottom of the stack. This week is the bottom. When the executor's scan asks the buffer pool for a page,
and the buffer pool reads that page off disk, this layer is what those bytes mean. The storage manager
hands the layer above a fixed-size page; inside that page, the layout we build here lets it find each
emp row, pull out the name and salary columns, and decide whether
salary > 75000.
What gets handed in, and what gets handed down:
- From above (the buffer pool, week 4): a request for a page by id, and later a frame of memory holding that page's exact bytes.
- Down to the device: a read or write of one fixed-size block. The disk knows nothing about rows. It moves blocks.
So the storage engine's whole job is to be the translator between "rows and columns" up top and "fixed blocks of bytes" down below. Next week (heap files and free space) zooms out from one page to the file of many pages: where a new row goes, and what a delete leaves behind. This week we stay inside a single page.
The problem, stated honestly
Three facts about persistent storage force everything below.
First, the device is block-addressable, not byte-addressable. To read one salary you cannot pull one number; the hardware drags in a whole block. The CMU storage notes put it bluntly: to read a value at some offset "the program first has to load the 4 KB page into memory" [CMU 15-445 L03]. So the unit of work is the page, never the byte.
Second, disk is slow on a scale that is hard to feel. If an L1 cache reference took one second, the same notes say, an SSD read would take about 4.4 hours and a spinning disk about 3.3 weeks. Every design choice below is about cutting the number of block transfers a query needs.
Third, rows have variable length and you will delete them. That last fact is what kills the obvious design.
Think of a page as a single index card you can only ever pick up, scribble on, and put back as a whole card. You cannot edit one word in place on the shelf. You take the card down, rewrite it, and reshelve it. The slotted page is just a smart way to arrange writing on a card so you can fit a lot, erase some, and still point a friend at "the third entry" even after you have recopied the card to clean it up.
Naive fix: append each row to the end of a flat file. Two things break immediately. Deletes leave holes you cannot reuse: the next row is variable length and may not fit the gap, so the file grows forever or fragments. Addressing breaks the moment a row moves: if you point at a row by its byte position and then compact the file to reclaim a hole, every pointer to every row after it is now wrong. An index that stored those positions would all be stale at once. You need a fixed unit you can rewrite whole (the page) and an address that survives the bytes moving (the slot). That is the entire derivation of the slotted page.
Three things called "page"
The word page is overloaded across three layers, and the gap between them is where a real bug lives.
- Hardware page, usually 4 KB. The device guarantees that a write of this size is all-or-nothing.
- OS page, typically 4 KB, the unit of virtual memory.
- Database page, commonly 1 to 16 KB; PostgreSQL fixes 8 KB at compile time, SQLite defaults to 4 KB.
When the database page (8 KB) is bigger than the hardware page (4 KB), one logical write is two atomic writes.
A crash in between leaves a torn page: half the new version, half the old. PostgreSQL defends against this by
writing the full page image to the write-ahead log the first time it is touched after a checkpoint, which is also why
every page header carries pd_lsn, the log position of its last change. We meet that machinery properly in
week 14; for now, just notice the header field is there for crash safety,
not for finding rows.
"The database page size must equal the OS or hardware page size." False. They are three independent numbers. The only consequence of the database page being larger is that you need extra protection (full-page writes) against torn writes.
The slotted page
The slotted page is the layout almost every disk database uses inside a page. The shape is simple and the reason for the shape is the whole lesson.
pd_lower and
pd_upper. The page is full when those two meet. Because slot0 stores tuple0's offset, the engine
can slide tuple0 anywhere in the page and only the offset in slot0 changes.Read the structure top to bottom:
- A header at the top with bookkeeping: where free space starts, where it ends, the crash-recovery LSN.
- A slot array just below it, one small fixed-size entry per row, growing downward as you insert.
- Free space in the middle, shrinking from both ends.
- Tuple data at the bottom, growing upward, with the newest tuple nearest the free gap.
The two halves close toward each other. In PostgreSQL the header field pd_lower marks the end of the
slot array and pd_upper marks the start of tuple data; free bytes are pd_upper - pd_lower
[PG page layout].
A slot stores a tuple's offset, not the tuple. So the record id an index holds is
(page id, slot number), and it stays valid even when the engine slides the tuple's bytes
around inside the page. The indirection is the point.
Suppose an index entry pointed straight at a tuple's byte offset. The moment you compact the page to close a
deletion hole, every tuple after the hole moves, and every index pointing at those tuples is wrong. You would have
to rewrite the index on every compaction. The slot array is one cheap level of indirection that absorbs all that
motion: tuples move, slot offsets update, but the external (page, slot) address never changes. CMU's notes
warn that an application "cannot rely on these ids to mean anything"; they are internal addresses, deliberately opaque
[CMU 15-445 L03].
Try it: a slotted page you can edit
This is one page. Insert rows of a chosen length, delete by slot, and watch the slot array and tuple data close toward each other. Delete a middle row and you get a hole; press compact and the tuples slide together while the record ids in the slot column stay put. Toggle MAXALIGN to see padding inflate every row, and reorder the demo columns to watch a badly ordered row waste pad bytes that a good order avoids.
What a record id actually is
When the index in week 6 returns a match for
salary > 75000, it does not return a row. It returns a record id, called a TID
(tuple id) in PostgreSQL: the pair (page id, slot number). To fetch the row the access
method reads that page, indexes into the slot array at that slot number, follows the stored offset, and decodes the bytes.
One page read, then constant-time arithmetic.
"A record id is a byte offset you can do arithmetic on." False. It is (page, slot). The slot indirects
to the real offset, which can move on compaction. The MCQ loves to offer "the second part of a TID is the tuple's
byte position in the file" as a plausible wrong answer.
PostgreSQL's line pointer, exactly
PostgreSQL's slot is a 4-byte line pointer (ItemIdData) packed as bitfields
[itemid.h]:
typedef struct ItemIdData {
unsigned lp_off:15, /* byte offset of the tuple within the page */
lp_flags:2, /* state: UNUSED, NORMAL, REDIRECT, DEAD */
lp_len:15; /* length of the tuple in bytes */
} ItemIdData;
Two details matter for the exam. The four lp_flags states are LP_UNUSED (free for reuse),
LP_NORMAL (points at a live tuple), LP_REDIRECT (forwards to a newer version on the same page,
used by HOT update chains, which you meet in week 13), and LP_DEAD
(the tuple is gone, the slot awaits reclamation by VACUUM, week 3).
And lp_off is 15 bits, so it can address offsets up to 215 = 32768. That is why a PostgreSQL heap
page cannot exceed 32 KB: the line pointer simply cannot name a byte past that.
SQLite reaches the same destination by a different road. It has no heap; every table is a B-tree, and each B-tree page uses a cell pointer array growing down with cell content growing up. The same slotted invariant, different name [SQLite file format]. The slotted page is convergent design: the constraint (fixed block, variable rows, stable addresses) forces roughly the same answer everywhere.
Inside one tuple
A tuple is "essentially a sequence of bytes" that only the DBMS knows how to decode. The schema lives in the catalog, not in the row. A heap tuple in PostgreSQL carries, in order: a fixed header, an optional NULL bitmap, then the column data.
NULLs are a bitmap, not a value
How do you store NULL? Not as zero, and not as a sentinel, because any sentinel you pick could be a legal value
someone actually stored. PostgreSQL keeps one bit per column in the header (present only when the
HEAP_HASNULL flag is set): the bit says present or null, and a null column then occupies no bytes
at all in the data area [PG page layout].
"A NULL is stored as a special zero or sentinel in the column's bytes." False. It is a bit in the header bitmap, and the column contributes zero data bytes. The trap relies on you imagining a column always occupies its declared width.
Alignment makes column order matter
CPUs read aligned words faster, and some architectures require it, so the DBMS pads fields to natural boundaries. In PostgreSQL the rule is MAXALIGN, typically 8 bytes. The consequence is concrete and testable: the same columns in a different order can take a different number of bytes on disk.
| Order | Layout | Pad | Row bytes (data area) |
|---|---|---|---|
| bad | bool (1) , bigint (8) , name | 7 after the bool | larger |
| good | bigint (8) , bool (1) , name | 0 interior | smaller |
Put the 1-byte bool first and the 8-byte bigint must start on an 8-byte boundary, so 7 pad
bytes are wasted between them. Put the bigint first and the gap vanishes. Multiply by millions of rows and
it is real disk. The simulator's "Swap column order" button shows exactly this.
The constraint is the CPU, not the disk. A misaligned 8-byte load either traps (some architectures) or costs extra cycles (others), and the storage layer would rather burn a few bytes than make every read of that column slow. So it aligns. The bytes are the price of fast in-memory access once the page is in the buffer pool. The cure is free: order columns widest-first and the pad disappears.
Deeper: the 24-byte page header and the 23-byte tuple header
PostgreSQL's PageHeaderData is 24 bytes. The fields that matter here:
pd_lsn (8 bytes, last WAL change, gates flushing), pd_lower (2 bytes, end of the line
pointer array), pd_upper (2 bytes, start of tuple data), plus checksum, flags, special-space offset, and
version. Free space is exactly pd_upper - pd_lower, which is the slotted invariant written as two header
fields [PG docs].
The heap tuple header (HeapTupleHeaderData) has a documented minimum of 23 bytes: t_xmin
and t_xmax (4 each, the inserting and deleting transaction ids, where t_xmax is the
tombstone you will study next week), t_ctid (6 bytes, pointer to this or a newer version),
t_infomask2 and t_infomask (2 each, flags including HEAP_HASNULL), and
t_hoff (1 byte, the offset to user data, itself a multiple of MAXALIGN). The NULL bitmap, when present,
sits right after this fixed header and before user data.
Costs of the slotted operations
| Operation | Cost | Why |
|---|---|---|
| Fetch by record id | 1 page read + O(1) | index the slot array, follow the offset |
| Insert (room exists) | O(1) | append data at pd_upper, append a slot at pd_lower |
| Delete | O(1) | mark the slot dead, leave the bytes for later |
| Compact | O(tuples on page) | slide live tuples, rewrite their slot offsets |
Check yourself
In a slotted page, which statement is false?
False is the third option. The slot array grows down from the header while tuple data grows up from the page end; they close toward each other (pd_lower up, pd_upper down) and the page is full when they meet. Same-direction growth would make "free space is the gap between them" meaningless.
A B+tree index returns a match. What does the record id (TID) it hands back actually contain?
It is (page id, slot number). The slot indirects to the in-page offset, which can change when the page is compacted. Storing the byte offset directly (option D) would break every index entry the moment a tuple moved, which is the whole reason the slot indirection exists.
How does PostgreSQL store a NULL in a row?
Correct is the last option. A NULL bitmap (present only when HEAP_HASNULL is set) marks the column null and that column then occupies no data bytes. A sentinel value cannot work because any value you pick could be a legal stored value.
Which statement about alignment and column order is false?
The first option is false. Column order changes row size precisely because of alignment padding: a bool then a bigint wastes 7 pad bytes that a bigint then a bool avoids. The other four are all true.
Why can a PostgreSQL heap page not exceed 32 KB?
Correct is the third option. lp_off is a 15-bit bitfield, so it can name byte offsets only up to 2^15 = 32768. A larger page would have bytes the line pointer cannot address. The hardware page is unrelated (usually 4 KB), and there is no fixed 32 KB tuple region.
The database page is 8 KB and the hardware page is 4 KB. What is the risk this creates?
Correct is the second. The device guarantees atomicity only per hardware page, so one 8 KB logical write becomes two 4 KB atomic writes; a crash between them tears the page. PostgreSQL defends with WAL full-page writes, which is why pd_lsn is in the header.
Active recall
The authoritative account of the 8 KB page, its five parts, the 24-byte page header (pd_lsn, pd_lower, pd_upper), the 4-byte line pointers, and the heap tuple header with its NULL bitmap and the MAXALIGN rule. Read the page-structure table and the HeapTupleHeaderData table; they are the ground truth for everything in this lesson.
Want to see the actual bytes? Ask me to walk a real pageinspect dump of a heap page, or to derive the
exact row size for a table you describe, or to redraw the slotted page as an animation. I am your teacher for this
course, not just a document.