B+ Tree
Your application uses auto-increment primary keys and inserts a few thousand rows per second. Throughput looks fine — until you add a second index on created_at, then a third on (user_id, status), and suddenly the database is spending most of its I/O budget rewriting B+ tree pages and the rightmost leaf becomes a write hotspot. Understanding how a B+ tree actually lays out pages, splits, and amplifies writes is what separates “we added an index and it got slow” from a real diagnosis.
A B+ tree is the index structure underneath PostgreSQL, MySQL (InnoDB), Oracle, and SQL Server. Database Indexes covers what queries a B-tree supports and when to use one. This file goes one level deeper: how the structure actually works on disk, why it is fast, and what happens when you write.
B-Tree vs B+ Tree
The name “B-tree” is used by every vendor, but the implementation is always a B+ tree:
| B-tree | B+ tree | |
|---|---|---|
| Data stored at | Every node (internal + leaf) | Leaf nodes only |
| Internal nodes | Keys + data + child pointers | Keys + child pointers (routing only) |
| Leaf nodes | — | Keys + data, linked in a doubly-linked list |
| Range scan | Must traverse tree | Follow leaf linked list — sequential reads |
The linked leaf list is the critical difference. It is what makes range scans and ORDER BY on indexed columns efficient.
Pages: The Unit of I/O
Every node in a B+ tree is exactly one page — a fixed-size block that the database engine reads and writes atomically. The engine never reads a single key; it reads an entire page.
| Database | Page size |
|---|---|
| PostgreSQL | 8 KB (configurable at initdb time) |
| InnoDB (MySQL) | 16 KB |
| SQL Server | 8 KB |
| Oracle | 8 KB (default; 2–32 KB configurable) |
This has two major consequences:
1. Random I/O is expensive because it is always a full page. Updating a single key in an index requires reading the 8 KB page containing it, modifying it in memory, and writing the full 8 KB back. At high write throughput, scattered random page writes become the bottleneck — this is the problem LSM trees solve.
2. Large pages = high fan-out = short trees. The more keys that fit in one page, the fewer levels the tree needs to cover the same number of rows.
Tree Height and Fan-Out
Fan-out is the number of child pointers in one internal node — equivalently, the number of keys per page.
For a PostgreSQL B+ tree on a 4-byte integer column with 6-byte child pointers:
$$\text{fan-out} = \left\lfloor \frac{8192 \text{ bytes}}{4 + 6 \text{ bytes}} \right\rfloor \approx 800 \text{ keys per internal node}$$
Real-world fan-out is lower after accounting for page headers, alignment, and the fill factor (typically 90%) — a practical figure is 200–500 keys per internal node.
| Tree height | Rows covered (fan-out = 400) |
|---|---|
| 1 | 400 |
| 2 | 160,000 |
| 3 | 64,000,000 |
| 4 | 25,600,000,000 |
A 50-million-row table is reachable in 3 I/Os. This is why O(log n) is fast in practice — the log is base 400, not base 2.
Most production B+ tree indexes for OLTP tables are 3–4 levels deep regardless of row count. The root and upper internal nodes stay hot in the buffer pool (database page cache), so only the leaf-level I/O is typically paid on a point lookup.
Node Structure
Internal node (routing only — never contains actual row data)
┌──────────────────────────────────────────────────────┐
│ Page header │ key₁ │ ptr₁ │ key₂ │ ptr₂ │ … │ keyₙ │ ptrₙ │
└──────────────────────────────────────────────────────┘
ptr₀ → subtree with keys < key₁
ptr₁ → subtree with key₁ ≤ keys < key₂
…
Leaf node (contains actual data or row pointers)
┌────────────────────────────────────────────────────────────┐
│ Page header │ prev ptr │ key₁│val₁ │ key₂│val₂ │ … │ next ptr │
└────────────────────────────────────────────────────────────┘
↑ linked list pointer to previous leaf page
↑ linked list pointer to next leaf pageval in a leaf node is either the row’s full column values (clustered index) or a pointer to the row in the heap (secondary index).
Read Path
Point lookup: WHERE id = 42
Root page (in buffer pool — no I/O)
└── Internal page (level 2 — likely in buffer pool)
└── Leaf page (I/O if not cached) → row pointer or data
└── Heap page (I/O for secondary index — random access)Clustered index: The leaf node is the row. No heap fetch. InnoDB always clusters the primary key; the table data is physically sorted by primary key on disk.
Secondary (non-clustered) index: The leaf node holds the primary key value. For InnoDB, finding the full row requires a second B+ tree traversal on the primary key index (“double dip”). For PostgreSQL, it holds a heap tuple ID (page number + offset) for a direct heap fetch.
Index-only scan: If all queried columns are in the index (covering index), the heap fetch is skipped entirely. This is the Index Only Scan in PostgreSQL’s EXPLAIN output and Using index in MySQL’s EXPLAIN Extra column.
Range scan: The database navigates to the first matching leaf page, then follows the linked leaf list sequentially — no tree traversal needed for subsequent pages. Sequential page reads are orders of magnitude faster than random reads.
Write Path
Writing to a B+ tree is more expensive than reading because it may require restructuring pages.
Find the target leaf page
Navigate the tree from root to the correct leaf page — same as a read.
Write to the page (if space available)
Modify the in-memory copy of the leaf page. The page is now dirty and will be flushed to disk by the background writer or at checkpoint time.
Also append the change to the WAL (Write-Ahead Log) before marking the page dirty — this is what makes the write durable and crash-recoverable.
Page split (if the page is full)
When a leaf page has no room for the new key:
Before split (leaf full):
[10 | 20 | 30 | 40 | 50]
After split into two leaf pages:
[10 | 20 | 30] ↔ [40 | 50]
↑
Parent internal node gains a new key (40) and pointerThe split propagates upward: the parent internal node gains a new key and pointer. If the parent is also full, it splits too — ultimately potentially splitting all the way up to the root. A root split creates a new root, increasing tree height by 1.
Fill Factor
Databases leave internal pages partly empty by default (PostgreSQL default: 90% fill factor for leaf pages, 70% for internal). The spare capacity absorbs insertions and delays splits. For append-only or monotonically increasing keys (e.g., auto-increment IDs), a fill factor of 100% is safe — splits only ever happen at the rightmost page.
Monotonically increasing insert patterns (auto-increment, ULIDs, time-sorted IDs) concentrate all writes on the rightmost leaf page. At high insert rates this creates a “right-edge hotspot” where that single page is the write bottleneck. This is why some workloads use random UUIDs as primary keys — distributing writes across all pages — at the cost of more page splits and cache fragmentation.
Write Amplification
A single logical write can generate several physical disk writes:
| Write | What it touches |
|---|---|
| WAL entry | Sequential append — always, for every write |
| Dirty leaf page | The modified page — 1 page minimum |
| Dirty internal page(s) | On page split, parent page(s) also dirtied |
| New sibling page | Created by the split |
| MVCC page version (PostgreSQL) | PostgreSQL writes a new tuple version to the heap rather than updating in place — old version stays for concurrent readers |
A table with 8 indexes amplifies this across 8 B+ trees — every INSERT must update every index. This is the fundamental write overhead of B+ trees and the core reason write-heavy workloads use LSM trees instead.
B+ Tree vs LSM Tree
| B+ Tree | LSM Tree | |
|---|---|---|
| Write I/O pattern | Random (in-place page update) | Sequential (append-only memtable → SSTable flush) |
| Read amplification | Low — single tree, O(log n) with huge fan-out | Higher — multiple SSTables checked per lookup |
| Write amplification | High — page splits, multiple dirty pages, MVCC versions | Low — batched sequential flushes |
| Range scan | ✅ Linked leaf list = sequential page reads | ✅ Within an SSTable; cross-SSTable merge adds cost |
| Space amplification | Low | Moderate (compaction temp space, tombstones) |
| Crash recovery | WAL replay | WAL (memtable replay) |
| Used by | PostgreSQL, MySQL (InnoDB), Oracle, SQL Server | Cassandra, RocksDB, LevelDB, HBase, InfluxDB |
The rule of thumb: B+ tree for read-heavy OLTP, LSM tree for write-heavy time-series or log workloads. Most OLTP databases are read-dominated (10:1 to 100:1 read/write ratio), which is why B+ trees dominate relational databases.
Interview tip: When asked “why is PostgreSQL so fast for point lookups even on a billion-row table?”, I’d say: “Because the index is a B+ tree with a fan-out of 200–500 keys per page, so even a billion rows is reachable in 3 to 4 levels — and the upper levels stay pinned in the buffer pool, so a typical lookup is one disk I/O at the leaf level.” For write-heavy workloads I’d flag the cost: every secondary index multiplies write amplification, page splits dirty multiple pages, and monotonically increasing keys create a right-edge hotspot on the rightmost leaf. That’s the moment to consider an LSM-backed engine or to deliberately scatter inserts with a less ordered key.
Test Your Understanding
A B+ tree index on a billion-row table has a fan-out of 400. How many levels does the tree have, and how many disk I/Os does a point lookup require?
Levels: Each level multiplies capacity by the fan-out. Level 1 (root) = 400 children. Level 2 = 400² = 160K. Level 3 = 400³ = 64M. Level 4 = 400⁴ = 25.6B. So 4 levels are enough for a billion rows.
Disk I/Os: One I/O per level traversed = 4 I/Os in the worst case. But the root and often level 2 are pinned in the buffer pool (they’re accessed on every lookup). In practice: 1-2 disk I/Os for the leaf level + heap fetch. This is why B+ trees are fast even at massive scale — logarithmic depth with a very high base.
You insert rows with an auto-incrementing primary key (1, 2, 3, …). The B+ tree’s rightmost leaf page splits repeatedly. Why is this a problem under concurrent inserts?
Right-edge hotspot. All inserts go to the same rightmost leaf page because keys are monotonically increasing. Under concurrency, all insert threads contend for the lock on that one page, serializing what should be parallel work.
Page splits are also concentrated: the rightmost page fills, splits, and the new rightmost page immediately starts filling. This creates latch contention and write amplification on a single page.
Fixes:
- Use a UUID or ULID (Universally Unique Lexicographically Sortable Identifier — 128-bit, timestamp-prefixed but with enough randomness to scatter) as the primary key to distribute inserts across the tree. Trade-off: random I/O on inserts and larger key size.
- Use hash partitioning to spread inserts across multiple B+ trees.
- If ordering is needed, accept the hotspot and tune the page size and buffer pool for the workload.
You have a B+ tree index on (last_name, first_name). The query is WHERE first_name = ‘Alice’. Does the index help?
No. The B+ tree is sorted by last_name first, then first_name within each last_name group. Without a predicate on last_name (the leftmost column), the index can’t narrow the search — it would have to scan every leaf page. This is the leftmost-prefix rule.
The index helps for: WHERE last_name = 'Smith', WHERE last_name = 'Smith' AND first_name = 'Alice', and WHERE last_name LIKE 'Sm%' — all start from the leftmost column.
A B+ tree stores all data in leaf nodes with leaves linked in a doubly-linked list. An LSM tree stores data in sorted immutable SSTables. Why does this make range scans faster on a B+ tree?
B+ tree: Once you find the starting leaf via tree traversal (O(log n)), you follow the linked list sequentially — each next leaf is one sequential page read. The data is in a single sorted structure.
LSM tree: Data for the same key range may be spread across multiple SSTables at different levels. A range scan must read from all relevant SSTables and merge them (like merge-sort). This is O(k × range_size) where k is the number of SSTables, vs O(range_size) for a B+ tree.
LSM trees mitigate this with compaction (reducing the number of SSTables) and Bloom filters (skipping SSTables that don’t contain the target key), but the fundamental merge cost remains.