Field Guide / Part IV · Index structures / Week 06
Index layer

B+tree invariants, search, insertion, and splits

A balanced binary tree is optimal for comparisons and terrible for disk. Widen each node until one page read makes a few hundred branching decisions and the same height that needed thirty reads now needs three. This lesson derives that shape, then watches it stay balanced as you insert.

By the end you can

  • Explain why the cost that matters for a disk index is page reads, not comparisons, and why that kills the binary search tree.
  • State the four B+tree invariants and use the routing rule Ki < v <= Ki+1 to search by hand.
  • Insert keys with splits, and say exactly why a leaf split copies a key up while an internal split pushes one up.
  • Predict when the tree grows in height (only on a root split) and why most inserts touch one leaf and stop.
  • Spot the classic exam traps: log base 2 versus fanout, the root exemption, and B-tree versus B+tree.

Where this sits on the journey

Our running query is SELECT name FROM emp WHERE salary > 75000 ORDER BY name. By the time control reaches this layer the optimizer (week 11) has already decided whether to scan the whole heap or to use an index on salary, and the executor (week 9) is pulling tuples through an access method. If the chosen path is an index scan, this layer is what the access method runs. It takes a key or a key range and hands back row ids (TIDs), which it gets by descending a B+tree.

What the layer above hands in: a SARG, the searchable predicate salary > 75000, plus a request for matching rows in key order. What this layer hands down: requests for specific index pages by page id, served by the buffer pool (week 5), which in turn reads them off disk through the storage layer (week 2). The B+tree's whole low-I/O argument leans on that buffer pool keeping the top levels resident, so this lesson sits directly on top of the one before it. Next week (week 7) closes the structure: deletion, bulk loading, composite keys, and the hash alternative.

Intuition

Think of an index as a phone book with tabbed dividers. You do not flip page by page. You jump to the divider for the right letter range, then to a sub-divider, then you are on the page. Each divider is a node, each tab is a separator key, and the trick is that one glance at a divider (one page read) eliminates almost everything. A binary tree is a phone book with exactly one tab per page: technically sorted, but you pay a full lookup for every yes-or-no.

Why a wide tree and not a binary tree

An index earns its keep by turning an O(N) heap scan into something sublinear. So the obvious move is a balanced binary search tree: an AVL or red-black tree gives O(log2 N) comparisons, which is about 30 steps for a billion keys. That sounds fine until you remember where the data lives.

Why it works this way

The constraint is the device. A disk or SSD moves a fixed block per I/O, commonly 4 KB or 8 KB, and that single I/O costs five to six orders of magnitude more than a comparison in RAM. A binary tree node holds one key and two child pointers, so one page read buys you exactly one bit of branching and then you throw the rest of the 8 KB away. Thirty page reads for a billion keys is thirty random I/Os. The fix is not a cleverer balance rule. It is to stop wasting the page: pack a few hundred keys and a few hundred child pointers into each node so one page read makes a few-hundred-way decision. With fanout around a few hundred, the height to index N keys is logm N, and three or four levels reach hundreds of millions of rows. The B+tree is a binary search tree reshaped so the unit of work is the page transfer, not the comparison. [CMU 15-445 L08]

binary tree: 1 bit per page read ...continues ~30 levels deep for 10^9 keys ~30 page reads B+tree: hundreds of keys per page read root: ~hundreds of separators linked leaves (all data here) ~3 to 4 page reads
Figure 1. Same billion keys, two shapes. The binary tree spends one page read per bit of branching and goes deep. The B+tree spends one page read per few-hundred-way decision and stays short. Notice the B+tree keeps all data in the leaves; the upper nodes only route.
Height collapses as fanout grows, for the same key count N.
StructureBranching per readHeight for 109 keysReads per lookup
Binary tree2log2 109 ≈ 30~30
B+tree, m = 100100log100 109 ≈ 4.54 to 5
B+tree, m = 500500log500 109 ≈ 3.33 to 4

The four invariants

A B+tree of order m keeps four properties true at all times. They are not decoration; each one is the reason a guarantee holds. [Comer 1979, CMU 15-445 L08]

  1. All data lives in the leaves. Internal nodes hold only separator keys and child pointers. They route a search; they never store a record or a row pointer. This is the one line that separates a B+tree from a plain B-tree, and it is why a separator can be copied rather than moved on a leaf split.
  2. Leaves are linked in key order. Once a search lands on the lower bound of a range, it walks leaf to leaf sideways without climbing back up. PostgreSQL keeps the leaf chain doubly linked so scans run in either direction.
  3. The tree is height-balanced. Every root-to-leaf path has the same length. Balance is a structural invariant maintained by splits and merges, not a periodic rebuild, so the worst-case lookup equals the average-case lookup.
  4. Every non-root node is at least half full. For order m a non-root node holds between ceil(m/2) - 1 and m - 1 keys. This minimum occupancy is what bounds the height: it forces N keys into at most about logm/2 N levels, so the logarithmic I/O bound holds in the worst case, not only on average. The root is exempt and may hold a single key.
Exam trap

"Every node including the root is at least half full" is false. The root is the exception. It can hold as few as one key (two children), or zero keys when the whole tree is a single leaf. A node-fill rule stated without the root carve-out is the trap.

Search: follow Ki < v <= Ki+1

Searching is a descent. At each internal node you binary-search the separators to find the one child whose key range contains your target, then follow that pointer down. The exact rule between adjacent separators Ki and Ki+1 is that a search value v takes the child between them when Ki < v <= Ki+1. PostgreSQL states this inequality verbatim in its nbtree README. [nbtree README] You repeat until you hit a leaf, binary-search the leaf, and you are done. One page read per level, so O(logm N) I/Os. A range scan does this descent once to land on the lower bound, then rides the leaf links rightward until it passes the upper bound. That is exactly how salary > 75000 gets served: descend to the first leaf at or above 75000, then walk leaves.

[ 60000 | 90000 ] [ 40000 ] [ 72000 | 80000 ] [ 110000 ] v=75000 > 60000, <= 90000 72000 < v <= 80000 75000 78000 82000 ... land here, then walk leaves rightward for the range
Figure 2. Descending for salary > 75000. At the root, 75000 falls between 60000 and 90000, so take the middle child. At that internal node it falls between 72000 and 80000, so take that child. The leaf is the lower bound; the range scan then follows the dashed leaf links rightward.

Insertion: fit, or split

Insertion starts as a search: descend to the leaf where the key belongs. Then one of two things happens.

  1. If the leaf has room, write the entry in sorted position and stop. No structural change. Most inserts end here, which is why inserts are cheap on average.
  2. If the leaf is full, split it. Allocate a new leaf, move the upper half of the entries into it, link it into the leaf chain, and copy up the first key of the new right leaf as a separator into the parent. The word is copy: the key still exists down in the leaf, because all data must stay in leaves.
  3. That copied-up separator may overflow the parent internal node. Split the parent too, but here you push up the middle key (move it, do not copy it) into the grandparent, because an internal node only routes and need not keep a copy.
  4. Splits propagate up only as far as overflow continues. The moment a node has room for the incoming separator, propagation stops. If the root itself splits, a new root is allocated with the two halves as children. That is the only way the tree grows taller, and it grows from the top, which is precisely why every leaf stays at the same depth.
Keep this one thing

Leaf split copies the separator up (data must stay in the leaf). Internal split pushes the middle key up (a router keeps no copy). Height changes only when the root splits.

leaf split: COPY up [ ... 50 ... ] 30 40 50 60 70 50 is copied into the parent AND still present in the right leaf internal split: PUSH up [ ... 50 ... ] 20 35 70 85 50 moves up to the parent and is GONE from both children
Figure 3. The two split kinds side by side. On the left the separator 50 is both in the parent and in the right leaf (copy). On the right the separator 50 is in the parent only, removed from both children (push). Confusing these two is the most common B+tree error.

Build a B+tree by hand

Set a small order, then insert keys and watch the structure react. Use the worked sequence button to replay a deterministic run that forces a leaf split, then an internal split, then a new root. The half-full boundary is drawn on every node, the height and I/O counters tick per operation, and a split is announced as copy-up or push-up so you can see the difference rather than take it on faith.

Animated B+tree inserter insert keys, force splits, watch the root grow

Experiment: with m = 4, insert 10, 20, 30, then 40. The fourth key overflows the single leaf and splits it. Keep inserting and find the key that splits the root, the only moment the height counter goes up.

In real systems

PostgreSQL's default index is nbtree, a B+tree built on the Lehman and Yao high-concurrency design. Every page carries a right-link to its right sibling and a high key that upper-bounds the page's keys. A search that arrives during a concurrent split can detect it (the target key now exceeds the page's high key) and follow the right-link to the moved data, so searches need almost no locking. nbtree also uses pivot tuples that exist only to navigate, and suffix truncation drops unneeded trailing columns of a separator at split time to raise fanout. [nbtree README, PG B-Tree Implementation]

Insertion in pseudocode

# Insert key k. Descend, then fit or split. Splits bubble up only on overflow.
def insert(tree, k):
    leaf, path = descend_to_leaf(tree.root, k)   # Ki < v <= Ki+1 at each level
    leaf.put_sorted(k)
    if len(leaf.keys) <= MAX_KEYS:
        return                                   # common case: no structural change
    right = leaf.split_upper_half()              # link right into leaf chain
    sep   = right.keys[0]                       # COPY UP: key stays in the leaf too
    promote(path, leaf, sep, right)

def promote(path, left, sep, right):
    if path.is_empty():                          # left was the root
        tree.root = Internal(keys=[sep], kids=[left, right])   # only place height grows
        return
    parent = path.pop()
    parent.insert(sep, right)
    if len(parent.keys) <= MAX_KEYS:
        return
    up, rnode = parent.split_middle()            # PUSH UP: middle key removed from children
    promote(path, parent, up, rnode)
Deeper: why splits are rare on average

A split divides a full node into two halves, so right after a split each half is about half full. To split again the same node would have to refill from roughly m/2 keys back up to m, which takes about m/2 more inserts. So in a steady stream of inserts, splits are amortized over many cheap inserts, and the analysis gives O(1) amortized structural work per insert on top of the O(logm N) descent. The descent dominates. This is also why fillfactor matters at build time: packing leaves to 90 percent rather than 100 percent leaves slack so the first inserts after a bulk load do not immediately split. [PG B-Tree Implementation]

Intuition

Why is the marginal cost of a real lookup closer to one or two reads than to the full height? Because every search on the index touches the root and the level below it, so those upper pages are always hot and the buffer pool keeps them resident. The only pages that are likely cold are the leaf you finally land on. That is the week 5 buffer pool doing the index a favor: a four-level tree behaves like a one-or-two-I/O structure in practice.

Check yourself

A B+tree on a column with N rows answers a point lookup in how many page reads, and why?

The base of the log is the fanout m, not 2. That is the entire reason a B+tree beats a binary tree on disk: one page read makes a high-fanout decision, so the height is logm N and three or four levels reach hundreds of millions of rows. Saying log base 2 is the classic trap; it is the right answer for a binary tree and the wrong cost model for a paged index.

On a leaf split in a B+tree, what happens to the separator key?

A leaf split copies the separator up. The key has to stay in the leaf because in a B+tree all data lives in the leaves, so the parent gets a copy for routing while the leaf keeps the real entry. Moving it up (option A) is what an internal-node split does, not a leaf split. Confusing copy-up with push-up is the single most common B+tree mistake.

Which statement about B+tree occupancy is false?

The root is exempt from the half-full rule, so option D is false. The root can hold a single key (two children), or zero keys when the tree is one leaf. The other three are true: the minimum occupancy on non-root nodes is exactly what caps height in the worst case, and a split halves a full node so both pieces start around half full.

Which statement about a B+tree is false?

Option A is false and is the B-tree versus B+tree distinction. In a B+tree all data entries live in the leaves; internal nodes hold only separators and route. A plain B-tree may keep data in internal nodes, but the B+tree deliberately does not, which is what makes the leaf chain a clean sorted list for range scans. The other three are core B+tree invariants.

During a long series of inserts, when does a B+tree increase in height?

Height grows only when the root splits. A leaf or internal split below the root just pushes a separator into an existing parent that had room, so the path length is unchanged. When the root overflows there is no parent to absorb the separator, so a new root is allocated above the two halves, and that single event lifts every leaf one level deeper at once. This top-down growth is why all leaves stay at equal depth.

Why is the buffer pool relevant to the I/O cost of a B+tree lookup?

Every search touches the root and the level just below it, so those pages are always hot and the buffer pool keeps them in frames. The leaf you finally land on is the page most likely to be cold. So a four-level tree behaves like a one-or-two-I/O structure in practice. The buffer pool does not rebalance the tree, does not guarantee the whole index fits in memory, and does not reorder random leaf reads into sequential ones.

State the difference between copy-up and push-up, and when each happens.
A leaf split copies its separator up: the key goes into the parent and stays in the right leaf, because data must remain in leaves. An internal split pushes its middle key up: the key moves into the parent and is removed from both children, because a router keeps no copy.
For order m, how many keys may a non-root node hold, and what is the root allowed to hold?
A non-root node holds between ceil(m/2) - 1 and m - 1 keys (the half-full invariant). The root is exempt: it may hold as few as one key, or zero keys when the whole tree is a single leaf.
Primary source
Read this next: CMU 15-445 Lecture 08, Tree Indexes (Fall 2023)

The clearest treatment of the B+tree as an M-way search tree, with fanout as the central parameter and the search, insert, and split algorithms worked step by step. Focus on the M-way framing and the copy-up versus push-up split rules. 15445.courses.cs.cmu.edu/fall2023/notes/08-trees.pdf

Ask your teacher

Want me to trace a specific insert sequence on the board, derive the exact key that splits the root for a given order, or contrast nbtree's right-link search with the textbook descent? Ask me to go deeper or quiz you harder. I am your teacher for this course, not just a document.