Field Guide / Part IV · Index structures / Week 07
Index structures

B+tree deletion, bulk loading, composite keys, and hash indexes

Last week the tree only grew. Now we let it shrink, build it in bulk, key it on several columns, and ask the harder question: when is an ordered tree the wrong tool and a hash table the right one?

By the end you can

  • Delete from a B+tree and restore the half-full invariant by redistributing from a sibling or merging with one, and say when the tree shrinks in height.
  • Explain why bulk loading bottom up beats inserting one key at a time, and what fillfactor buys.
  • Apply the leftmost-prefix rule for a composite key and predict which predicates an index on (a, b) can seek.
  • Contrast static, extendible, and linear hashing, and decide between a hash index and a B+tree for a given workload.
  • Spot the standard exam traps: hash is always faster, composite indexes help any column, the directory always doubles.

This lesson sits in the index layer, the same place as week 6. On journey 1, our running query SELECT name FROM emp WHERE salary > 75000 ORDER BY name reaches us from the executor (week 9): a scan operator asks the access method for the rows matching salary > 75000. The access method (a B+tree on salary) descends to a leaf and walks linked leaves for the range, handing back row ids. Below us, every page we touch is fetched from the buffer pool (week 4), which reads it from the heap file (week 2). Week 6 taught the read and insert paths. This week closes the index layer: what happens on delete, how the index is first built, why column order matters, and the hash alternative that the optimizer (week 11) will sometimes prefer.

There is a connection to journey 2 hiding here. When our UPDATE account SET balance = balance - 100 WHERE id = 42 runs under MVCC (week 13), the old tuple version becomes dead and its index entries must eventually disappear. That is index deletion. The mechanism below is the textbook baseline; the lazy, two-stage version PostgreSQL actually runs is the realworld note.

Intuition

Insert and delete are mirror images. Insert can overfill a node, so it splits and pushes work up. Delete can underfill a node, so it borrows from a neighbor or fuses with one, and that fusion can underfill the parent, so the fix walks up the same way a split does. The tree grows taller only when the root splits, and shorter only when the root is left with a single child.

Deletion: removing a key without breaking balance

Why underflow has to be repaired

The half-full rule is what bounds the height. Every non-root node of order m must hold at least ceil(m/2) - 1 keys. If a node could empty out to one or two keys, the tree could grow a long thin branch and the O(logm N) I/O guarantee would collapse toward O(N). So a delete that drops a node below the minimum is not allowed to stand. The naive alternatives both fail: ignore underflow and the tree degrades into a list; rebuild the whole tree on every delete and you pay O(N) per delete and block every concurrent reader. The real fix is local: pull the node back to minimum using exactly one sibling.

The algorithm

  1. Descend to the target leaf as in a search, and remove the entry.
  2. If the leaf still has at least ceil(m/2) - 1 keys, stop. Most deletes end here.
  3. On underflow, look at an adjacent sibling. If the sibling has more than the minimum, redistribute: move one entry across and update the separator in the parent. Tree shape is unchanged.
  4. If no sibling can spare an entry (every neighbor is exactly at minimum), merge: fuse the underflowing node with a sibling into one node and remove the now-unused separator from the parent.
  5. A merge removes a key from the parent, which can make the parent underflow. So merge and redistribute propagate upward, exactly like split propagation. If the root is left with a single child, that child becomes the new root and the tree loses a level.
Keep this one thing

Redistribute when a sibling is rich, merge when both are poor. A merge can cascade up; the tree shrinks in height only when the root drops to one child, the mirror of growing only when the root splits.

leaf holding only [9] underflows (min is 2 keys for m=4) Redistribute: sibling [3 5 7] is rich parent: |7| 3 5 7 9 one key moves across, separator updated to 7 Merge: sibling [5] is also at minimum parent: || 5 9 two nodes fuse into one, separator removed from parent (parent may now underflow)
Figure 1. The two repairs branch on one question: can any adjacent sibling spare a key? If yes, redistribute and stop. If no, merge, and check the parent. Notice that merge is the only repair that can cascade.

Simulator: a B+tree that shrinks, and a hash that grows

Two structures, one screen. On the left, delete keys from a B+tree (order 4, so minimum 2 keys per leaf) and watch underflow resolve as redistribute or merge, with the tree height counter ticking down on a root collapse. On the right, insert keys into an extendible hash table and watch global and local depth climb and the directory double. The button that matters most is the range query: press it and the hash side lights up every bucket it must scan, while the B+tree descends once and walks its linked leaves. That contrast is the whole point of the lesson.

B+tree delete vs extendible-hash grower delete on the left, insert on the right, then run a range query on both
Experiment: delete keys from the B+tree until a leaf underflows. When the sibling is rich you will see a borrow; when both are minimal you will see a merge that can drop the height. Then press "range query" and compare the I/O counters: the hash pays for every bucket, the tree pays for one descent plus a short leaf walk.

Bulk loading: build it bottom up, not one key at a time

Why one-at-a-time is wrong for a fresh index

Suppose you run CREATE INDEX on a table that already has 10 million rows. Inserting those keys one at a time means 10 million root-to-leaf descents, each touching the buffer pool, and a stream of random splits that leave leaves around half full and scatter the pages. The result is a tree that is taller, sparser, and physically fragmented than it needs to be. The keys are not arriving online; you have all of them at once. So sort them first, then pack.

The algorithm

  1. Sort all the entries by key (an external merge sort if they do not fit in memory, the same one from week 10).
  2. Pack the sorted entries left to right into leaves up to a target density, the fillfactor, linking each leaf to the next. Stop short of full so later inserts have room before they split.
  3. Build the parent level by taking the first key of each leaf as a separator. Repeat upward until one node remains: that is the root.
sorted entries, packed to a fillfactor, parents built from first keys 11 | 21 2 5 8 11 14 17 21 24 27 each leaf left ~10% empty sequential page writes, no splits
Figure 2. The leaves are filled in one sequential left-to-right pass to the fillfactor, then the level above is read off their first keys. No descents, no random splits. PostgreSQL's B-tree fillfactor defaults to 90, leaving a tenth of each leaf free for later in-place inserts [PG B-Tree Implementation].
In real systems

PostgreSQL builds B-tree indexes bottom up from a sort, packing leaves to fillfactor (default 90, range 10 to 100). A read-heavy index that never gets new rows can be packed to 100 for density; a hot index that takes inserts should stay near 90 so a leaf has slack before it splits [PG docs].

Composite keys: lexicographic order and the leftmost prefix

An index on (a, b) is keyed on the pair, sorted the way a dictionary sorts: compare a first, and only when two rows tie on a does b break the tie. That single fact decides every query the index can and cannot accelerate.

Intuition

Think of names in a phone book sorted by (last name, first name). Finding everyone named "Smith" is one seek and a short walk, because all Smiths are contiguous. Finding everyone whose first name is "John" is hopeless: the Johns are scattered through every last-name section. A composite index seeks a contiguous left prefix and nothing else.

leaf order for an index on (a, b): sorted by a, then by b a=1,b=2 a=1,b=9 a=2,b=1 a=2,b=4 a=2,b=8 a=3,b=1 a=3,b=5 a=4,b=2 a=1 contiguous: seekable b=1 appears in two different places: not seekable on b alone
Figure 3. Rows with a=1 sit together, so WHERE a=1 is one seek plus a walk. Rows with b=1 are split across the a=2 and a=3 regions, so WHERE b=1 cannot be served by a seek on this index.
Which predicates an index on (a, b, c) can seek.
Query predicateUsable prefixVerdict
a = 5aseek, then walk
a = 5 AND b = 2a, btighter seek
a = 5 AND b > 2a, then range on bseek plus a range
b = 2 (no a)nonefull index scan or skip the index
a = 5 AND c = 9a only (b is skipped)seek on a, filter c
Exam trap

"A composite index on (a, b) helps a query that filters only on b." False. Lexicographic ordering puts a given b value all over the key space, so there is no contiguous run to seek. Only a left prefix is seekable. Reordering the index to (b, a) is a different index with different abilities; column order is not symmetric.

One more axis pairs with composite keys: clustering. A clustered index defines the physical row order of the table, so there is at most one per table, and a range scan on it reads the heap in mostly sequential order. An unclustered (secondary) index has its own order; its leaves hold the key plus a row pointer, so a range scan reads index leaves in order but then jumps to scattered heap pages, close to one random read per matching row. How well the index order matches the heap order is the clustering factor, and it is what makes a secondary range scan cheap or ruinous [PG docs].

The hash alternative: O(1) points, but no order

Why a second index family exists at all

A B+tree answers a point lookup in O(logm N) page reads. A hash table answers it in expected O(1): hash the key, go straight to the bucket. For a workload that is pure equality on a full key (cache lookups, join keys with no range), that constant beats a logarithm. The price is total: a hash destroys order, so it cannot do ranges, cannot serve ORDER BY without sorting, and cannot use a key prefix. The design question for a hash index is not "how do I look up" but "how does the bucket array grow when the table outgrows it?"

Three ways to grow a hash table

Static, extendible, and linear hashing differ only in how they handle growth.
SchemeHow it growsCost of growth
Static hashingIt does not. Overflow chains lengthen.lookups degrade toward O(chain), only a full rebuild fixes it
Extendible hashingSplit one bucket; double the directory only when local depth equals global depthdirectory can double, but lookups stay O(1)
Linear hashingSplit buckets in round-robin order at a load threshold, no directoryone bucket split at a time, overflow chains absorb temporary skew
Keep this one thing

In extendible hashing, the global depth is how many hash bits the directory reads; a bucket's local depth is how many bits it actually distinguishes. The directory doubles only when a full bucket has local depth equal to global depth. If the bucket's local depth is smaller, two directory slots already point at it, so splitting it needs no new directory.

global depth 2: a full bucket with local depth 2 doubles the directory; local depth 1 does not directory (gd=2) 00 01 10 11 B full, ld=2 (00 only) C, ld=1 (10 and 11) B splitting: ld(2) = gd(2), so gd -> 3, directory doubles C splitting: ld(1) < gd(2), split bucket only, no doubling
Figure 4. Two directory slots (10 and 11) already point at bucket C, so C can split into two without a bigger directory. Bucket B owns a single slot, so its split needs one more directory bit, which doubles the table.
# extendible hash insert, the case that matters
def insert(key):
    b = bucket_for(directory, low_bits(hash(key), global_depth))
    if b.has_room():
        b.add(key); return
    if b.local_depth == global_depth:
        global_depth += 1          # the directory doubles
        directory = double(directory)
    split(b)                       # b.local_depth += 1, redistribute by the new bit
    insert(key)                    # retry
In real systems

PostgreSQL ships a real on-disk hash index, made crash-safe in version 10 by adding WAL logging [PG 10 notes]. It is deliberately narrow: it supports only the = operator, only single-column indexes, no uniqueness checking, and its scans are lossy because only the hash value is stored, so the heap tuple must be rechecked [PG Hash Indexes]. In practice nbtree wins almost everything; the hash index is a niche pick for large equality-only workloads with a good key distribution. SQLite has no hash access method at all: every equality lookup goes through a B-tree [SQLite file format].

The trade-off in one table

The single most exam-worthy comparison in the index layer.
PropertyB+treeHash index
Point lookup (=)O(logm N) I/Oexpected O(1) I/O
Range or inequalityyes, ordered linked leavesno, must scan every bucket
ORDER BY served without a sortyesno
Prefix of a composite keyyes, leftmost prefixno, needs the full key
Worst case under skew or growthbounded by heightlong overflow chains
Exam trap

"A hash index is always faster than a B+tree because it is O(1)." False in context. The O(1) holds only for full-key equality. A hash cannot do a range, an ORDER BY, or a prefix match, and under skew its overflow chains can make even a point lookup slower than a tree. Most real workloads need order somewhere, which is why the default index everywhere is a B+tree.

Deeper: why PostgreSQL does not eagerly merge underflowing B-tree pages

The textbook delete restores the half-full invariant on every underflow with a merge or redistribute. PostgreSQL does not. Its nbtree deletion is lazy and two-stage: a leaf is first marked half-dead with its downlink removed from the parent, then later unlinked from its siblings and fully deleted, and the rightmost page on a level is never deleted to keep traversal simple [nbtree README]. Pages are allowed to drift below half full and space is reclaimed lazily, because eager merging on every delete is expensive and fights the Lehman-Yao concurrent-access scheme from week 6. Before a page would split, nbtree also deduplicates equal keys into a posting list of heap TIDs as a last defense against bloat [PG B-Tree Implementation]. The half-full rule is the correctness baseline these systems optimize around, not always the literal runtime behavior.

Check yourself

A leaf in an order-4 B+tree drops below the minimum after a delete and its only adjacent sibling is also at the minimum. What happens?

Redistribute needs a sibling with a spare entry. When both neighbors are at the minimum, the only repair is a merge, which fuses the nodes and removes the separator above. That removal can underflow the parent, so the repair can cascade up exactly like a split.

When does a B+tree lose a level in height?

Height shrinks only when the root ends up with a single child, which then becomes the new root. That is the mirror of insertion: the tree grows taller only when the root splits. A leaf merge that stops below the root changes node count but not height.

Which query can an index on (region, dept, salary) seek efficiently?

Only a contiguous left prefix is seekable. The fourth query constrains region, then dept, then ranges on salary: a full prefix match ending in a range. The others skip region, the leading column, so the matching rows are scattered across the whole key space and there is no run to seek.

In extendible hashing, when does inserting into a full bucket double the directory?

A bucket whose local depth is below the global depth is pointed at by two or more directory slots, so it can split without a new directory bit. The directory must double only when the splitting bucket owns a single slot, which is exactly the local-depth-equals-global-depth case.

Which statement about PostgreSQL hash indexes is false?

PostgreSQL hash indexes are single-column and do not support uniqueness, so the third statement is false. The others are true: equality-only, crash-safe since v10, and lossy because only the hash value is stored, forcing a heap recheck.

Why does bulk loading produce a better tree than inserting the same keys one at a time?

Bulk loading sorts first, then fills leaves left to right to the fillfactor with sequential writes, building parents from first keys. There are no per-key descents and no random splits, so the tree is denser and shallower. It still respects occupancy (it targets a fillfactor below 100 on purpose), and it sorts rather than avoiding sorting.

State the two repairs for B+tree underflow and the one condition that chooses between them.
Redistribute (borrow one entry from an adjacent sibling, update the parent separator) when a sibling has more than the minimum; merge (fuse with a sibling, drop the separator from the parent) when every adjacent sibling is at the minimum. Merge can cascade up and can shrink the tree's height.
An index on (a, b). Which of these can it seek: a=5; b=7; a=5 AND b>7; b=7 AND a=5?
a=5 yes (prefix). b=7 no (b alone is scattered). a=5 AND b>7 yes (full prefix then a range). b=7 AND a=5 yes, same as a=5 AND b=7 (predicate order in SQL does not matter, only the index column order does).
Primary source
Read this next: CMU 15-445 Lecture 07, Hash Tables (Fall 2023)

The clearest treatment of static, extendible (global and local depth, directory doubling), and linear hashing, and of why a hash loses to a B+tree on range and prefix access. Focus on the extendible-hashing worked example and the depth bookkeeping. 15445.courses.cs.cmu.edu/fall2023/notes/07-hashtables.pdf. Pair it with Lecture 08, Tree Indexes for the deletion, bulk-loading, and leftmost-prefix material.

Ask your teacher

Want me to trace a full delete cascade up to a root collapse, walk a linear-hashing split round, or work out the exact crossover where a hash index beats the B+tree for your workload? Ask me to go deeper, redraw a figure, or quiz you harder. I am your teacher for this course, not just a document.