B+tree invariants, search, insertion, and splits
A balanced binary tree is optimal for comparisons and terrible for disk. Widen each node until one page read makes a few hundred branching decisions and the same height that needed thirty reads now needs three. This lesson derives that shape, then watches it stay balanced as you insert.
By the end you can
- Explain why the cost that matters for a disk index is page reads, not comparisons, and why that kills the binary search tree.
- State the four B+tree invariants and use the routing rule Ki < v <= Ki+1 to search by hand.
- Insert keys with splits, and say exactly why a leaf split copies a key up while an internal split pushes one up.
- Predict when the tree grows in height (only on a root split) and why most inserts touch one leaf and stop.
- Spot the classic exam traps: log base 2 versus fanout, the root exemption, and B-tree versus B+tree.
Where this sits on the journey
Our running query is SELECT name FROM emp WHERE salary > 75000 ORDER BY name. By the
time control reaches this layer the optimizer (week 11) has
already decided whether to scan the whole heap or to use an index on salary, and the
executor (week 9) is pulling tuples through an access
method. If the chosen path is an index scan, this layer is what the access method runs. It takes a key
or a key range and hands back row ids (TIDs), which it gets by descending a B+tree.
What the layer above hands in: a SARG, the searchable predicate salary > 75000, plus
a request for matching rows in key order. What this layer hands down: requests for specific index pages
by page id, served by the buffer pool (week 5), which in turn
reads them off disk through the storage layer (week 2).
The B+tree's whole low-I/O argument leans on that buffer pool keeping the top levels resident, so this
lesson sits directly on top of the one before it. Next week (week
7) closes the structure: deletion, bulk loading, composite keys, and the hash alternative.
Think of an index as a phone book with tabbed dividers. You do not flip page by page. You jump to the divider for the right letter range, then to a sub-divider, then you are on the page. Each divider is a node, each tab is a separator key, and the trick is that one glance at a divider (one page read) eliminates almost everything. A binary tree is a phone book with exactly one tab per page: technically sorted, but you pay a full lookup for every yes-or-no.
Why a wide tree and not a binary tree
An index earns its keep by turning an O(N) heap scan into something sublinear. So the obvious move is a balanced binary search tree: an AVL or red-black tree gives O(log2 N) comparisons, which is about 30 steps for a billion keys. That sounds fine until you remember where the data lives.
The constraint is the device. A disk or SSD moves a fixed block per I/O, commonly 4 KB or 8 KB, and that single I/O costs five to six orders of magnitude more than a comparison in RAM. A binary tree node holds one key and two child pointers, so one page read buys you exactly one bit of branching and then you throw the rest of the 8 KB away. Thirty page reads for a billion keys is thirty random I/Os. The fix is not a cleverer balance rule. It is to stop wasting the page: pack a few hundred keys and a few hundred child pointers into each node so one page read makes a few-hundred-way decision. With fanout around a few hundred, the height to index N keys is logm N, and three or four levels reach hundreds of millions of rows. The B+tree is a binary search tree reshaped so the unit of work is the page transfer, not the comparison. [CMU 15-445 L08]
| Structure | Branching per read | Height for 109 keys | Reads per lookup |
|---|---|---|---|
| Binary tree | 2 | log2 109 ≈ 30 | ~30 |
| B+tree, m = 100 | 100 | log100 109 ≈ 4.5 | 4 to 5 |
| B+tree, m = 500 | 500 | log500 109 ≈ 3.3 | 3 to 4 |
The four invariants
A B+tree of order m keeps four properties true at all times. They are not decoration; each one is the reason a guarantee holds. [Comer 1979, CMU 15-445 L08]
- All data lives in the leaves. Internal nodes hold only separator keys and child pointers. They route a search; they never store a record or a row pointer. This is the one line that separates a B+tree from a plain B-tree, and it is why a separator can be copied rather than moved on a leaf split.
- Leaves are linked in key order. Once a search lands on the lower bound of a range, it walks leaf to leaf sideways without climbing back up. PostgreSQL keeps the leaf chain doubly linked so scans run in either direction.
- The tree is height-balanced. Every root-to-leaf path has the same length. Balance is a structural invariant maintained by splits and merges, not a periodic rebuild, so the worst-case lookup equals the average-case lookup.
- Every non-root node is at least half full. For order m a non-root node holds between ceil(m/2) - 1 and m - 1 keys. This minimum occupancy is what bounds the height: it forces N keys into at most about logm/2 N levels, so the logarithmic I/O bound holds in the worst case, not only on average. The root is exempt and may hold a single key.
"Every node including the root is at least half full" is false. The root is the exception. It can hold as few as one key (two children), or zero keys when the whole tree is a single leaf. A node-fill rule stated without the root carve-out is the trap.
Search: follow Ki < v <= Ki+1
Searching is a descent. At each internal node you binary-search the separators to find the one child
whose key range contains your target, then follow that pointer down. The exact rule between adjacent
separators Ki and Ki+1 is that a search value v takes the child between them when
Ki < v <= Ki+1. PostgreSQL states this inequality verbatim in its nbtree
README. [nbtree
README] You repeat until you hit a leaf, binary-search the leaf, and you are done. One page read per
level, so O(logm N) I/Os. A range scan does this descent once to land on the lower bound, then
rides the leaf links rightward until it passes the upper bound. That is exactly how
salary > 75000 gets served: descend to the first leaf at or above 75000, then walk leaves.
salary > 75000. At the root, 75000 falls
between 60000 and 90000, so take the middle child. At that internal node it falls between 72000 and
80000, so take that child. The leaf is the lower bound; the range scan then follows the dashed leaf
links rightward.Insertion: fit, or split
Insertion starts as a search: descend to the leaf where the key belongs. Then one of two things happens.
- If the leaf has room, write the entry in sorted position and stop. No structural change. Most inserts end here, which is why inserts are cheap on average.
- If the leaf is full, split it. Allocate a new leaf, move the upper half of the entries into it, link it into the leaf chain, and copy up the first key of the new right leaf as a separator into the parent. The word is copy: the key still exists down in the leaf, because all data must stay in leaves.
- That copied-up separator may overflow the parent internal node. Split the parent too, but here you push up the middle key (move it, do not copy it) into the grandparent, because an internal node only routes and need not keep a copy.
- Splits propagate up only as far as overflow continues. The moment a node has room for the incoming separator, propagation stops. If the root itself splits, a new root is allocated with the two halves as children. That is the only way the tree grows taller, and it grows from the top, which is precisely why every leaf stays at the same depth.
Leaf split copies the separator up (data must stay in the leaf). Internal split pushes the middle key up (a router keeps no copy). Height changes only when the root splits.
Build a B+tree by hand
Set a small order, then insert keys and watch the structure react. Use the worked sequence button to replay a deterministic run that forces a leaf split, then an internal split, then a new root. The half-full boundary is drawn on every node, the height and I/O counters tick per operation, and a split is announced as copy-up or push-up so you can see the difference rather than take it on faith.
Experiment: with m = 4, insert 10, 20, 30, then 40. The fourth key overflows the single leaf and splits it. Keep inserting and find the key that splits the root, the only moment the height counter goes up.
PostgreSQL's default index is nbtree, a B+tree built on the Lehman and Yao high-concurrency design. Every page carries a right-link to its right sibling and a high key that upper-bounds the page's keys. A search that arrives during a concurrent split can detect it (the target key now exceeds the page's high key) and follow the right-link to the moved data, so searches need almost no locking. nbtree also uses pivot tuples that exist only to navigate, and suffix truncation drops unneeded trailing columns of a separator at split time to raise fanout. [nbtree README, PG B-Tree Implementation]
Insertion in pseudocode
# Insert key k. Descend, then fit or split. Splits bubble up only on overflow.
def insert(tree, k):
leaf, path = descend_to_leaf(tree.root, k) # Ki < v <= Ki+1 at each level
leaf.put_sorted(k)
if len(leaf.keys) <= MAX_KEYS:
return # common case: no structural change
right = leaf.split_upper_half() # link right into leaf chain
sep = right.keys[0] # COPY UP: key stays in the leaf too
promote(path, leaf, sep, right)
def promote(path, left, sep, right):
if path.is_empty(): # left was the root
tree.root = Internal(keys=[sep], kids=[left, right]) # only place height grows
return
parent = path.pop()
parent.insert(sep, right)
if len(parent.keys) <= MAX_KEYS:
return
up, rnode = parent.split_middle() # PUSH UP: middle key removed from children
promote(path, parent, up, rnode)
Deeper: why splits are rare on average
A split divides a full node into two halves, so right after a split each half is about half full. To split again the same node would have to refill from roughly m/2 keys back up to m, which takes about m/2 more inserts. So in a steady stream of inserts, splits are amortized over many cheap inserts, and the analysis gives O(1) amortized structural work per insert on top of the O(logm N) descent. The descent dominates. This is also why fillfactor matters at build time: packing leaves to 90 percent rather than 100 percent leaves slack so the first inserts after a bulk load do not immediately split. [PG B-Tree Implementation]
Why is the marginal cost of a real lookup closer to one or two reads than to the full height? Because every search on the index touches the root and the level below it, so those upper pages are always hot and the buffer pool keeps them resident. The only pages that are likely cold are the leaf you finally land on. That is the week 5 buffer pool doing the index a favor: a four-level tree behaves like a one-or-two-I/O structure in practice.
Check yourself
A B+tree on a column with N rows answers a point lookup in how many page reads, and why?
The base of the log is the fanout m, not 2. That is the entire reason a B+tree beats a binary tree on disk: one page read makes a high-fanout decision, so the height is logm N and three or four levels reach hundreds of millions of rows. Saying log base 2 is the classic trap; it is the right answer for a binary tree and the wrong cost model for a paged index.
On a leaf split in a B+tree, what happens to the separator key?
A leaf split copies the separator up. The key has to stay in the leaf because in a B+tree all data lives in the leaves, so the parent gets a copy for routing while the leaf keeps the real entry. Moving it up (option A) is what an internal-node split does, not a leaf split. Confusing copy-up with push-up is the single most common B+tree mistake.
Which statement about B+tree occupancy is false?
The root is exempt from the half-full rule, so option D is false. The root can hold a single key (two children), or zero keys when the tree is one leaf. The other three are true: the minimum occupancy on non-root nodes is exactly what caps height in the worst case, and a split halves a full node so both pieces start around half full.
Which statement about a B+tree is false?
Option A is false and is the B-tree versus B+tree distinction. In a B+tree all data entries live in the leaves; internal nodes hold only separators and route. A plain B-tree may keep data in internal nodes, but the B+tree deliberately does not, which is what makes the leaf chain a clean sorted list for range scans. The other three are core B+tree invariants.
During a long series of inserts, when does a B+tree increase in height?
Height grows only when the root splits. A leaf or internal split below the root just pushes a separator into an existing parent that had room, so the path length is unchanged. When the root overflows there is no parent to absorb the separator, so a new root is allocated above the two halves, and that single event lifts every leaf one level deeper at once. This top-down growth is why all leaves stay at equal depth.
Why is the buffer pool relevant to the I/O cost of a B+tree lookup?
Every search touches the root and the level just below it, so those pages are always hot and the buffer pool keeps them in frames. The leaf you finally land on is the page most likely to be cold. So a four-level tree behaves like a one-or-two-I/O structure in practice. The buffer pool does not rebalance the tree, does not guarantee the whole index fits in memory, and does not reorder random leaf reads into sequential ones.
The clearest treatment of the B+tree as an M-way search tree, with fanout as the central parameter and the search, insert, and split algorithms worked step by step. Focus on the M-way framing and the copy-up versus push-up split rules. 15445.courses.cs.cmu.edu/fall2023/notes/08-trees.pdf
Want me to trace a specific insert sequence on the board, derive the exact key that splits the root for a given order, or contrast nbtree's right-link search with the textbook descent? Ask me to go deeper or quiz you harder. I am your teacher for this course, not just a document.