B+tree deletion, bulk loading, composite keys, and hash indexes
Last week the tree only grew. Now we let it shrink, build it in bulk, key it on several columns, and ask the harder question: when is an ordered tree the wrong tool and a hash table the right one?
By the end you can
- Delete from a B+tree and restore the half-full invariant by redistributing from a sibling or merging with one, and say when the tree shrinks in height.
- Explain why bulk loading bottom up beats inserting one key at a time, and what fillfactor buys.
- Apply the leftmost-prefix rule for a composite key and predict which predicates an index on (a, b) can seek.
- Contrast static, extendible, and linear hashing, and decide between a hash index and a B+tree for a given workload.
- Spot the standard exam traps: hash is always faster, composite indexes help any column, the directory always doubles.
This lesson sits in the index layer, the same place as
week 6. On journey 1, our running query
SELECT name FROM emp WHERE salary > 75000 ORDER BY name reaches us from the executor
(week 9): a scan operator asks the access method for the rows
matching salary > 75000. The access method (a B+tree on salary) descends to a leaf and
walks linked leaves for the range, handing back row ids. Below us, every page we touch is fetched from the
buffer pool (week 4), which reads it from the heap file
(week 2). Week 6 taught the read and insert paths. This week
closes the index layer: what happens on delete, how the index is first built, why column order matters, and the
hash alternative that the optimizer (week 11) will sometimes prefer.
There is a connection to journey 2 hiding here. When our
UPDATE account SET balance = balance - 100 WHERE id = 42 runs under MVCC
(week 13), the old tuple version becomes dead and its index entries
must eventually disappear. That is index deletion. The mechanism below is the textbook baseline; the lazy,
two-stage version PostgreSQL actually runs is the realworld note.
Insert and delete are mirror images. Insert can overfill a node, so it splits and pushes work up. Delete can underfill a node, so it borrows from a neighbor or fuses with one, and that fusion can underfill the parent, so the fix walks up the same way a split does. The tree grows taller only when the root splits, and shorter only when the root is left with a single child.
Deletion: removing a key without breaking balance
The half-full rule is what bounds the height. Every non-root node of order m must hold at least
ceil(m/2) - 1 keys. If a node could empty out to one or two keys, the tree could grow a long thin
branch and the O(logm N) I/O guarantee would collapse toward O(N). So a delete that drops a node below the
minimum is not allowed to stand. The naive alternatives both fail: ignore underflow and the tree degrades into a
list; rebuild the whole tree on every delete and you pay O(N) per delete and block every concurrent reader. The
real fix is local: pull the node back to minimum using exactly one sibling.
The algorithm
- Descend to the target leaf as in a search, and remove the entry.
- If the leaf still has at least
ceil(m/2) - 1keys, stop. Most deletes end here. - On underflow, look at an adjacent sibling. If the sibling has more than the minimum, redistribute: move one entry across and update the separator in the parent. Tree shape is unchanged.
- If no sibling can spare an entry (every neighbor is exactly at minimum), merge: fuse the underflowing node with a sibling into one node and remove the now-unused separator from the parent.
- A merge removes a key from the parent, which can make the parent underflow. So merge and redistribute propagate upward, exactly like split propagation. If the root is left with a single child, that child becomes the new root and the tree loses a level.
Redistribute when a sibling is rich, merge when both are poor. A merge can cascade up; the tree shrinks in height only when the root drops to one child, the mirror of growing only when the root splits.
Simulator: a B+tree that shrinks, and a hash that grows
Two structures, one screen. On the left, delete keys from a B+tree (order 4, so minimum 2 keys per leaf) and watch underflow resolve as redistribute or merge, with the tree height counter ticking down on a root collapse. On the right, insert keys into an extendible hash table and watch global and local depth climb and the directory double. The button that matters most is the range query: press it and the hash side lights up every bucket it must scan, while the B+tree descends once and walks its linked leaves. That contrast is the whole point of the lesson.
Bulk loading: build it bottom up, not one key at a time
Suppose you run CREATE INDEX on a table that already has 10 million rows. Inserting those keys one
at a time means 10 million root-to-leaf descents, each touching the buffer pool, and a stream of random splits that
leave leaves around half full and scatter the pages. The result is a tree that is taller, sparser, and physically
fragmented than it needs to be. The keys are not arriving online; you have all of them at once. So sort them first,
then pack.
The algorithm
- Sort all the entries by key (an external merge sort if they do not fit in memory, the same one from week 10).
- Pack the sorted entries left to right into leaves up to a target density, the fillfactor, linking each leaf to the next. Stop short of full so later inserts have room before they split.
- Build the parent level by taking the first key of each leaf as a separator. Repeat upward until one node remains: that is the root.
PostgreSQL builds B-tree indexes bottom up from a sort, packing leaves to fillfactor (default 90,
range 10 to 100). A read-heavy index that never gets new rows can be packed to 100 for density; a hot index that
takes inserts should stay near 90 so a leaf has slack before it splits
[PG docs].
Composite keys: lexicographic order and the leftmost prefix
An index on (a, b) is keyed on the pair, sorted the way a dictionary sorts: compare a
first, and only when two rows tie on a does b break the tie. That single fact decides every
query the index can and cannot accelerate.
Think of names in a phone book sorted by (last name, first name). Finding everyone named "Smith" is one seek and a short walk, because all Smiths are contiguous. Finding everyone whose first name is "John" is hopeless: the Johns are scattered through every last-name section. A composite index seeks a contiguous left prefix and nothing else.
a=1 sit together, so WHERE a=1 is one seek plus a
walk. Rows with b=1 are split across the a=2 and a=3 regions, so
WHERE b=1 cannot be served by a seek on this index.| Query predicate | Usable prefix | Verdict |
|---|---|---|
a = 5 | a | seek, then walk |
a = 5 AND b = 2 | a, b | tighter seek |
a = 5 AND b > 2 | a, then range on b | seek plus a range |
b = 2 (no a) | none | full index scan or skip the index |
a = 5 AND c = 9 | a only (b is skipped) | seek on a, filter c |
"A composite index on (a, b) helps a query that filters only on b." False. Lexicographic ordering puts a given
b value all over the key space, so there is no contiguous run to seek. Only a left prefix is seekable.
Reordering the index to (b, a) is a different index with different abilities; column order is not
symmetric.
One more axis pairs with composite keys: clustering. A clustered index defines the physical row order of the table, so there is at most one per table, and a range scan on it reads the heap in mostly sequential order. An unclustered (secondary) index has its own order; its leaves hold the key plus a row pointer, so a range scan reads index leaves in order but then jumps to scattered heap pages, close to one random read per matching row. How well the index order matches the heap order is the clustering factor, and it is what makes a secondary range scan cheap or ruinous [PG docs].
The hash alternative: O(1) points, but no order
A B+tree answers a point lookup in O(logm N) page reads. A hash table answers it in expected O(1): hash
the key, go straight to the bucket. For a workload that is pure equality on a full key (cache lookups, join keys with
no range), that constant beats a logarithm. The price is total: a hash destroys order, so it cannot do ranges,
cannot serve ORDER BY without sorting, and cannot use a key prefix. The design question for a hash index
is not "how do I look up" but "how does the bucket array grow when the table outgrows it?"
Three ways to grow a hash table
| Scheme | How it grows | Cost of growth |
|---|---|---|
| Static hashing | It does not. Overflow chains lengthen. | lookups degrade toward O(chain), only a full rebuild fixes it |
| Extendible hashing | Split one bucket; double the directory only when local depth equals global depth | directory can double, but lookups stay O(1) |
| Linear hashing | Split buckets in round-robin order at a load threshold, no directory | one bucket split at a time, overflow chains absorb temporary skew |
In extendible hashing, the global depth is how many hash bits the directory reads; a bucket's local depth is how many bits it actually distinguishes. The directory doubles only when a full bucket has local depth equal to global depth. If the bucket's local depth is smaller, two directory slots already point at it, so splitting it needs no new directory.
# extendible hash insert, the case that matters
def insert(key):
b = bucket_for(directory, low_bits(hash(key), global_depth))
if b.has_room():
b.add(key); return
if b.local_depth == global_depth:
global_depth += 1 # the directory doubles
directory = double(directory)
split(b) # b.local_depth += 1, redistribute by the new bit
insert(key) # retry
PostgreSQL ships a real on-disk hash index, made crash-safe in version 10 by adding WAL logging
[PG 10 notes]. It is deliberately narrow: it supports
only the = operator, only single-column indexes, no uniqueness checking, and its scans are lossy because
only the hash value is stored, so the heap tuple must be rechecked
[PG Hash Indexes]. In practice nbtree wins
almost everything; the hash index is a niche pick for large equality-only workloads with a good key distribution.
SQLite has no hash access method at all: every equality lookup goes through a B-tree
[SQLite file format].
The trade-off in one table
| Property | B+tree | Hash index |
|---|---|---|
| Point lookup (=) | O(logm N) I/O | expected O(1) I/O |
| Range or inequality | yes, ordered linked leaves | no, must scan every bucket |
| ORDER BY served without a sort | yes | no |
| Prefix of a composite key | yes, leftmost prefix | no, needs the full key |
| Worst case under skew or growth | bounded by height | long overflow chains |
"A hash index is always faster than a B+tree because it is O(1)." False in context. The O(1) holds only for
full-key equality. A hash cannot do a range, an ORDER BY, or a prefix match, and under skew its overflow
chains can make even a point lookup slower than a tree. Most real workloads need order somewhere, which is why the
default index everywhere is a B+tree.
Deeper: why PostgreSQL does not eagerly merge underflowing B-tree pages
The textbook delete restores the half-full invariant on every underflow with a merge or redistribute. PostgreSQL does not. Its nbtree deletion is lazy and two-stage: a leaf is first marked half-dead with its downlink removed from the parent, then later unlinked from its siblings and fully deleted, and the rightmost page on a level is never deleted to keep traversal simple [nbtree README]. Pages are allowed to drift below half full and space is reclaimed lazily, because eager merging on every delete is expensive and fights the Lehman-Yao concurrent-access scheme from week 6. Before a page would split, nbtree also deduplicates equal keys into a posting list of heap TIDs as a last defense against bloat [PG B-Tree Implementation]. The half-full rule is the correctness baseline these systems optimize around, not always the literal runtime behavior.
Check yourself
A leaf in an order-4 B+tree drops below the minimum after a delete and its only adjacent sibling is also at the minimum. What happens?
Redistribute needs a sibling with a spare entry. When both neighbors are at the minimum, the only repair is a merge, which fuses the nodes and removes the separator above. That removal can underflow the parent, so the repair can cascade up exactly like a split.
When does a B+tree lose a level in height?
Height shrinks only when the root ends up with a single child, which then becomes the new root. That is the mirror of insertion: the tree grows taller only when the root splits. A leaf merge that stops below the root changes node count but not height.
Which query can an index on (region, dept, salary) seek efficiently?
Only a contiguous left prefix is seekable. The fourth query constrains region, then dept, then ranges on salary: a full prefix match ending in a range. The others skip region, the leading column, so the matching rows are scattered across the whole key space and there is no run to seek.
In extendible hashing, when does inserting into a full bucket double the directory?
A bucket whose local depth is below the global depth is pointed at by two or more directory slots, so it can split without a new directory bit. The directory must double only when the splitting bucket owns a single slot, which is exactly the local-depth-equals-global-depth case.
Which statement about PostgreSQL hash indexes is false?
PostgreSQL hash indexes are single-column and do not support uniqueness, so the third statement is false. The others are true: equality-only, crash-safe since v10, and lossy because only the hash value is stored, forcing a heap recheck.
Why does bulk loading produce a better tree than inserting the same keys one at a time?
Bulk loading sorts first, then fills leaves left to right to the fillfactor with sequential writes, building parents from first keys. There are no per-key descents and no random splits, so the tree is denser and shallower. It still respects occupancy (it targets a fillfactor below 100 on purpose), and it sorts rather than avoiding sorting.
The clearest treatment of static, extendible (global and local depth, directory doubling), and linear hashing, and of why a hash loses to a B+tree on range and prefix access. Focus on the extendible-hashing worked example and the depth bookkeeping. 15445.courses.cs.cmu.edu/fall2023/notes/07-hashtables.pdf. Pair it with Lecture 08, Tree Indexes for the deletion, bulk-loading, and leftmost-prefix material.
Want me to trace a full delete cascade up to a root collapse, walk a linear-hashing split round, or work out the exact crossover where a hash index beats the B+tree for your workload? Ask me to go deeper, redraw a figure, or quiz you harder. I am your teacher for this course, not just a document.