Field Guide / Part II · Storage / Week 02
Storage engine

Storage engine: disk layout, pages, and slotted records

A disk moves whole blocks and a row can be any size. So how do you pack variable-length rows into a fixed-size page, find any one of them in a single read, and keep its address stable even after you shuffle the bytes around? The slotted page is the answer, and once you see the constraint it solves, it stops looking arbitrary.

By the end you can

  • Say why a flat append-only file breaks on deletes and on addressing, and what a page buys you.
  • Draw a slotted page from memory: header, a slot array closing down, tuple data closing up, free space in the gap.
  • Explain why a record id is (page, slot) and not a byte offset, and why that indirection lets tuples move.
  • Decode a tuple's bytes: header, NULL bitmap, fixed and variable fields, and the alignment pad that makes column order matter.
  • Answer MCQ-style traps about pd_lower / pd_upper, line pointer states, and the 32 KB heap page cap.

Where this sits in the engine

Hold the course's running query in your head:

SELECT name FROM emp WHERE salary > 75000 ORDER BY name;

In week 1 we watched that string fall through five components to the bottom of the stack. This week is the bottom. When the executor's scan asks the buffer pool for a page, and the buffer pool reads that page off disk, this layer is what those bytes mean. The storage manager hands the layer above a fixed-size page; inside that page, the layout we build here lets it find each emp row, pull out the name and salary columns, and decide whether salary > 75000.

What gets handed in, and what gets handed down:

So the storage engine's whole job is to be the translator between "rows and columns" up top and "fixed blocks of bytes" down below. Next week (heap files and free space) zooms out from one page to the file of many pages: where a new row goes, and what a delete leaves behind. This week we stay inside a single page.

The problem, stated honestly

Three facts about persistent storage force everything below.

First, the device is block-addressable, not byte-addressable. To read one salary you cannot pull one number; the hardware drags in a whole block. The CMU storage notes put it bluntly: to read a value at some offset "the program first has to load the 4 KB page into memory" [CMU 15-445 L03]. So the unit of work is the page, never the byte.

Second, disk is slow on a scale that is hard to feel. If an L1 cache reference took one second, the same notes say, an SSD read would take about 4.4 hours and a spinning disk about 3.3 weeks. Every design choice below is about cutting the number of block transfers a query needs.

Third, rows have variable length and you will delete them. That last fact is what kills the obvious design.

Intuition

Think of a page as a single index card you can only ever pick up, scribble on, and put back as a whole card. You cannot edit one word in place on the shelf. You take the card down, rewrite it, and reshelve it. The slotted page is just a smart way to arrange writing on a card so you can fit a lot, erase some, and still point a friend at "the third entry" even after you have recopied the card to clean it up.

Why the naive design fails

Naive fix: append each row to the end of a flat file. Two things break immediately. Deletes leave holes you cannot reuse: the next row is variable length and may not fit the gap, so the file grows forever or fragments. Addressing breaks the moment a row moves: if you point at a row by its byte position and then compact the file to reclaim a hole, every pointer to every row after it is now wrong. An index that stored those positions would all be stale at once. You need a fixed unit you can rewrite whole (the page) and an address that survives the bytes moving (the slot). That is the entire derivation of the slotted page.

Three things called "page"

The word page is overloaded across three layers, and the gap between them is where a real bug lives.

database page (8 KB) DBMS logical page OS page (4 KB) virtual memory unit hw page 4 KB hw page 4 KB atomic write unit one 8 KB DB page straddles two atomic writes: torn-page risk
Figure 1. The device promises atomicity only at the hardware page. When the database page is larger, one logical write becomes two physical writes, and a crash between them leaves the page half old, half new. Notice that the three sizes are independent: nothing forces the database page to match the hardware page.

When the database page (8 KB) is bigger than the hardware page (4 KB), one logical write is two atomic writes. A crash in between leaves a torn page: half the new version, half the old. PostgreSQL defends against this by writing the full page image to the write-ahead log the first time it is touched after a checkpoint, which is also why every page header carries pd_lsn, the log position of its last change. We meet that machinery properly in week 14; for now, just notice the header field is there for crash safety, not for finding rows.

Exam trap

"The database page size must equal the OS or hardware page size." False. They are three independent numbers. The only consequence of the database page being larger is that you need extra protection (full-page writes) against torn writes.

The slotted page

The slotted page is the layout almost every disk database uses inside a page. The shape is simple and the reason for the shape is the whole lesson.

page header (pd_lsn, pd_lower, pd_upper, ...) slot0 slot1 slot2 slot array grows down (pd_lower) free space (the gap) tuple0 tuple1 tuple2 tuple data grows up (pd_upper)
Figure 2. Slot i does not contain a tuple; it points at one. The slot array grows down from the header, tuple data grows up from the end, and free space is whatever is left between pd_lower and pd_upper. The page is full when those two meet. Because slot0 stores tuple0's offset, the engine can slide tuple0 anywhere in the page and only the offset in slot0 changes.

Read the structure top to bottom:

  1. A header at the top with bookkeeping: where free space starts, where it ends, the crash-recovery LSN.
  2. A slot array just below it, one small fixed-size entry per row, growing downward as you insert.
  3. Free space in the middle, shrinking from both ends.
  4. Tuple data at the bottom, growing upward, with the newest tuple nearest the free gap.

The two halves close toward each other. In PostgreSQL the header field pd_lower marks the end of the slot array and pd_upper marks the start of tuple data; free bytes are pd_upper - pd_lower [PG page layout].

Keep this one thing

A slot stores a tuple's offset, not the tuple. So the record id an index holds is (page id, slot number), and it stays valid even when the engine slides the tuple's bytes around inside the page. The indirection is the point.

Why a slot and not a direct pointer

Suppose an index entry pointed straight at a tuple's byte offset. The moment you compact the page to close a deletion hole, every tuple after the hole moves, and every index pointing at those tuples is wrong. You would have to rewrite the index on every compaction. The slot array is one cheap level of indirection that absorbs all that motion: tuples move, slot offsets update, but the external (page, slot) address never changes. CMU's notes warn that an application "cannot rely on these ids to mean anything"; they are internal addresses, deliberately opaque [CMU 15-445 L03].

Try it: a slotted page you can edit

This is one page. Insert rows of a chosen length, delete by slot, and watch the slot array and tuple data close toward each other. Delete a middle row and you get a hole; press compact and the tuples slide together while the record ids in the slot column stay put. Toggle MAXALIGN to see padding inflate every row, and reorder the demo columns to watch a badly ordered row waste pad bytes that a good order avoids.

Slotted page byte editor insert, delete, compact; watch ids stay stable while bytes move
Experiment to run: insert three rows, delete slot 1, note the fragmentation counter rise, then compact and confirm slot 0 and slot 2 still name the same rows. Then flip the column order and watch the same data take fewer bytes.

What a record id actually is

When the index in week 6 returns a match for salary > 75000, it does not return a row. It returns a record id, called a TID (tuple id) in PostgreSQL: the pair (page id, slot number). To fetch the row the access method reads that page, indexes into the slot array at that slot number, follows the stored offset, and decodes the bytes. One page read, then constant-time arithmetic.

Exam trap

"A record id is a byte offset you can do arithmetic on." False. It is (page, slot). The slot indirects to the real offset, which can move on compaction. The MCQ loves to offer "the second part of a TID is the tuple's byte position in the file" as a plausible wrong answer.

PostgreSQL's line pointer, exactly

PostgreSQL's slot is a 4-byte line pointer (ItemIdData) packed as bitfields [itemid.h]:

typedef struct ItemIdData {
    unsigned lp_off:15,   /* byte offset of the tuple within the page */
             lp_flags:2,  /* state: UNUSED, NORMAL, REDIRECT, DEAD */
             lp_len:15;   /* length of the tuple in bytes */
} ItemIdData;

Two details matter for the exam. The four lp_flags states are LP_UNUSED (free for reuse), LP_NORMAL (points at a live tuple), LP_REDIRECT (forwards to a newer version on the same page, used by HOT update chains, which you meet in week 13), and LP_DEAD (the tuple is gone, the slot awaits reclamation by VACUUM, week 3). And lp_off is 15 bits, so it can address offsets up to 215 = 32768. That is why a PostgreSQL heap page cannot exceed 32 KB: the line pointer simply cannot name a byte past that.

In real systems

SQLite reaches the same destination by a different road. It has no heap; every table is a B-tree, and each B-tree page uses a cell pointer array growing down with cell content growing up. The same slotted invariant, different name [SQLite file format]. The slotted page is convergent design: the constraint (fixed block, variable rows, stable addresses) forces roughly the same answer everywhere.

Inside one tuple

A tuple is "essentially a sequence of bytes" that only the DBMS knows how to decode. The schema lives in the catalog, not in the row. A heap tuple in PostgreSQL carries, in order: a fixed header, an optional NULL bitmap, then the column data.

tuple header xmin, xmax, t_ctid, t_hoff NULL bitmap pad salary len name (varchar) low address high address pad bytes appear only because column order forces alignment a varchar needs a length prefix; you cannot find the next column without it
Figure 3. Fixed-length columns sit at computable offsets. A variable-length column needs a length prefix, because you cannot find where the next column starts without knowing how long this one is. The red block is pure waste introduced by alignment, and it is avoidable by reordering columns.

NULLs are a bitmap, not a value

How do you store NULL? Not as zero, and not as a sentinel, because any sentinel you pick could be a legal value someone actually stored. PostgreSQL keeps one bit per column in the header (present only when the HEAP_HASNULL flag is set): the bit says present or null, and a null column then occupies no bytes at all in the data area [PG page layout].

Exam trap

"A NULL is stored as a special zero or sentinel in the column's bytes." False. It is a bit in the header bitmap, and the column contributes zero data bytes. The trap relies on you imagining a column always occupies its declared width.

Alignment makes column order matter

CPUs read aligned words faster, and some architectures require it, so the DBMS pads fields to natural boundaries. In PostgreSQL the rule is MAXALIGN, typically 8 bytes. The consequence is concrete and testable: the same columns in a different order can take a different number of bytes on disk.

Same three columns, two orders. The pad bytes are the whole difference.
OrderLayoutPadRow bytes (data area)
badbool (1) , bigint (8) , name7 after the boollarger
goodbigint (8) , bool (1) , name0 interiorsmaller

Put the 1-byte bool first and the 8-byte bigint must start on an 8-byte boundary, so 7 pad bytes are wasted between them. Put the bigint first and the gap vanishes. Multiply by millions of rows and it is real disk. The simulator's "Swap column order" button shows exactly this.

Why padding at all

The constraint is the CPU, not the disk. A misaligned 8-byte load either traps (some architectures) or costs extra cycles (others), and the storage layer would rather burn a few bytes than make every read of that column slow. So it aligns. The bytes are the price of fast in-memory access once the page is in the buffer pool. The cure is free: order columns widest-first and the pad disappears.

Deeper: the 24-byte page header and the 23-byte tuple header

PostgreSQL's PageHeaderData is 24 bytes. The fields that matter here: pd_lsn (8 bytes, last WAL change, gates flushing), pd_lower (2 bytes, end of the line pointer array), pd_upper (2 bytes, start of tuple data), plus checksum, flags, special-space offset, and version. Free space is exactly pd_upper - pd_lower, which is the slotted invariant written as two header fields [PG docs].

The heap tuple header (HeapTupleHeaderData) has a documented minimum of 23 bytes: t_xmin and t_xmax (4 each, the inserting and deleting transaction ids, where t_xmax is the tombstone you will study next week), t_ctid (6 bytes, pointer to this or a newer version), t_infomask2 and t_infomask (2 each, flags including HEAP_HASNULL), and t_hoff (1 byte, the offset to user data, itself a multiple of MAXALIGN). The NULL bitmap, when present, sits right after this fixed header and before user data.

Costs of the slotted operations

Why this layout is fast where it needs to be.
OperationCostWhy
Fetch by record id1 page read + O(1)index the slot array, follow the offset
Insert (room exists)O(1)append data at pd_upper, append a slot at pd_lower
DeleteO(1)mark the slot dead, leave the bytes for later
CompactO(tuples on page)slide live tuples, rewrite their slot offsets

Check yourself

In a slotted page, which statement is false?

False is the third option. The slot array grows down from the header while tuple data grows up from the page end; they close toward each other (pd_lower up, pd_upper down) and the page is full when they meet. Same-direction growth would make "free space is the gap between them" meaningless.

A B+tree index returns a match. What does the record id (TID) it hands back actually contain?

It is (page id, slot number). The slot indirects to the in-page offset, which can change when the page is compacted. Storing the byte offset directly (option D) would break every index entry the moment a tuple moved, which is the whole reason the slot indirection exists.

How does PostgreSQL store a NULL in a row?

Correct is the last option. A NULL bitmap (present only when HEAP_HASNULL is set) marks the column null and that column then occupies no data bytes. A sentinel value cannot work because any value you pick could be a legal stored value.

Which statement about alignment and column order is false?

The first option is false. Column order changes row size precisely because of alignment padding: a bool then a bigint wastes 7 pad bytes that a bigint then a bool avoids. The other four are all true.

Why can a PostgreSQL heap page not exceed 32 KB?

Correct is the third option. lp_off is a 15-bit bitfield, so it can name byte offsets only up to 2^15 = 32768. A larger page would have bytes the line pointer cannot address. The hardware page is unrelated (usually 4 KB), and there is no fixed 32 KB tuple region.

The database page is 8 KB and the hardware page is 4 KB. What is the risk this creates?

Correct is the second. The device guarantees atomicity only per hardware page, so one 8 KB logical write becomes two 4 KB atomic writes; a crash between them tears the page. PostgreSQL defends with WAL full-page writes, which is why pd_lsn is in the header.

Active recall

Why is a record id (page, slot) rather than a byte offset?
So tuples can move inside a page (on compaction) without invalidating any index entry. The slot absorbs the motion: only its stored offset changes, never the external (page, slot) address.
In PostgreSQL, what do pd_lower and pd_upper mark, and what is between them?
pd_lower is the end of the slot (line pointer) array; pd_upper is the start of tuple data. Free space is pd_upper minus pd_lower. The page is full when they meet.
Primary source

The authoritative account of the 8 KB page, its five parts, the 24-byte page header (pd_lsn, pd_lower, pd_upper), the 4-byte line pointers, and the heap tuple header with its NULL bitmap and the MAXALIGN rule. Read the page-structure table and the HeapTupleHeaderData table; they are the ground truth for everything in this lesson.

Ask your teacher

Want to see the actual bytes? Ask me to walk a real pageinspect dump of a heap page, or to derive the exact row size for a table you describe, or to redraw the slotted page as an animation. I am your teacher for this course, not just a document.