LSM Trees
Your fleet of 10K IoT devices reports a sensor reading every second — 10K writes/sec sustained, with bursts to 100K. PostgreSQL handles it for a while, but as the table grows, every INSERT pays for a random B-tree page update plus index maintenance, and write latency starts climbing into the hundreds of milliseconds. Cassandra and RocksDB serve the same workload at a fraction of the latency because they convert every random write into a sequential append. That single architectural choice — and what it costs you on the read path — is the LSM tree.
A Log-Structured Merge-tree (LSM tree) is a write-optimized storage structure used by Cassandra, RocksDB, LevelDB, HBase, and InfluxDB. It trades read amplification for dramatically faster writes by converting random I/O into sequential I/O.
The Core Problem with B-Trees at Write Scale
B-trees store data in fixed-size pages on disk. Every write to an existing page requires a random disk seek to find that page, modify it in-place, and write it back. At high write throughput, random I/O becomes the bottleneck — HDDs especially are ~100× slower for random vs sequential writes, and even NVMe SSDs benefit from sequential write patterns for endurance.
LSM trees eliminate random writes entirely. Every write goes to memory first, then to disk sequentially in sorted batches.
Write Path
flowchart TD
W["Write (k=foo, v=bar)"]
W --> M["Memtable\n(in-memory sorted tree, RAM speed)"]
W --> WAL["WAL — Write-Ahead Log\n(sequential append to disk, crash recovery)"]
M -->|"threshold ~64 MB"| L0["SSTable L0\n(immutable, sorted by key, sequential write)"]
L0 -->|"compaction"| L1["SSTable L1, L2…\n(progressively larger, non-overlapping key ranges)"]Memtable: An in-memory sorted data structure (red-black tree or skip list). Absorbs all writes at RAM speed. Writes are also appended to a WAL for durability — if the process crashes before flushing, the WAL replays the memtable.
SSTable (Sorted String Table): An immutable, sorted-by-key file on disk. Once written, it is never modified — updates and deletes create new entries, not in-place changes. Each SSTable has:
- A data file (key-value pairs sorted by key)
- An index file (sparse index: key → byte offset in data file)
- A Bloom filter (probabilistic: “does this SSTable definitely NOT contain key X?”)
Deletion via tombstone: To delete a key, the memtable records a tombstone marker (key + deletion flag). On read, a tombstone shadows all older versions of that key. Tombstones are physically removed during compaction.
Read Path
Reads are more expensive than writes because the same key may exist in multiple SSTables at different versions.
Read key "foo"
1. Check memtable (most recent writes) → found? return
2. Check L0 SSTables newest → oldest
│ Bloom filter says "definitely not here"? → skip
│ Bloom filter says "maybe here"? → binary search in SSTable
3. Check L1, L2… in order until found or exhaustedRead amplification: In the worst case, every level must be checked. This is why LSM trees have higher read latency than B-trees for point lookups — a B-tree finds any key in O(log n) in a single structure; an LSM tree may check O(L) SSTables.
Bloom filters are the key optimization that makes reads practical. A Bloom filter can say “this key is definitely NOT in this SSTable” in O(1), allowing the database to skip the vast majority of SSTables on a point lookup. See Bloom Filters & HyperLogLog for the mechanics, false positive math, and sizing formulas.
Compaction
Without compaction, the number of SSTables grows unboundedly and reads degrade. Compaction merges SSTables, discards superseded versions and tombstones, and maintains sorted order.
Size-Tiered Compaction (Cassandra default)
Group SSTables by size. When N same-sized SSTables accumulate, merge them into one larger SSTable.
[1MB] [1MB] [1MB] [1MB] → [4MB]
[4MB] [4MB] [4MB] [4MB] → [16MB]- Write-efficient: low write amplification — each key is rewritten fewer times
- Space-inefficient: during compaction, temporary 2× space required (old + new SSTables coexist)
- Read latency variance: large SSTables take longer to scan; Bloom filters help but don’t eliminate the issue
Leveled Compaction (RocksDB / LevelDB default)
SSTables are organized into levels (L0, L1, L2…). Each level has a bounded total size (e.g., L1: 10MB, L2: 100MB, L3: 1GB, 10× per level). Within L1 and above, key ranges across SSTables never overlap.
L0: [a–z] [a–z] [a–z] ← overlapping, any key could be anywhere
L1: [a–f] [g–m] [n–z] ← non-overlapping, bounded size
L2: [a–b] [c–d] … [y–z] ← non-overlapping, 10× largerWhen L0 fills up, its SSTables are merged into L1. When L1 overflows, the affected key range is merged into L2, and so on.
- Read-efficient: at most one SSTable per level (L1+) contains any given key → bounded read amplification
- Higher write amplification: a key may be rewritten O(levels) times as it cascades down
- Space-efficient: no temporary 2× space; only the compacting SSTables are duplicated
| Size-Tiered | Leveled | |
|---|---|---|
| Write amplification | Low | High |
| Read amplification | High | Low |
| Space amplification | High (temp 2×) | Low |
| Best for | Write-heavy (Cassandra time-series) | Read-heavy (RocksDB, general-purpose) |
B-Tree vs LSM Tree
| B-Tree | LSM Tree | |
|---|---|---|
| Write pattern | Random I/O (in-place page update) | Sequential I/O (append-only) |
| Write amplification | High (page splits, COW) | Low (memtable → sequential flush) |
| Read amplification | Low (single tree, O(log n)) | Higher (multiple SSTables + Bloom filters) |
| Space amplification | Low | Moderate (compaction temp space, tombstones) |
| Crash recovery | WAL + redo log | WAL (memtable replay) |
| Update / delete cost | In-place (fast for reads after write) | Tombstone + deferred cleanup via compaction |
| Range scan | ✅ Efficient (leaf linked list) | ✅ Efficient within a single SSTable; cross-SSTable merge needed |
| Used by | PostgreSQL, MySQL (InnoDB), Oracle, SQL Server | Cassandra, RocksDB, LevelDB, HBase, InfluxDB, TiKV |
RocksDB (built on LevelDB, open-sourced by Meta) is the storage engine embedded inside many systems: CockroachDB, TiKV (TiDB), MyRocks (MySQL at Meta), Kafka log storage, and dozens of others. When an interview asks “how does Cassandra handle high write throughput?” — the answer starts with LSM trees.
Write Stalls
A write stall occurs when the memtable fills faster than it can be flushed to disk, or when L0 SSTable count exceeds the compaction threshold. RocksDB will throttle or fully stall incoming writes to prevent unbounded memory growth.
This is a real operational concern at high write throughput — tuning memtable size, flush thread count, and compaction concurrency is necessary for write-heavy workloads.
Interview tip: When discussing high-write-throughput systems, I’d say: “I’d reach for an LSM-backed engine — Cassandra, RocksDB, or HBase — because every write becomes a sequential append to the memtable plus a WAL fsync, and there’s no random page update like in a B-tree.” I’d then call out the cost honestly: reads have to merge multiple SSTables, which is what Bloom filters mitigate by letting the engine skip SSTables that definitely don’t contain the key. The compaction strategy is the next decision — leveled for read-heavy workloads where you need bounded read amplification, size-tiered for write-heavy time-series where write amplification matters more. And I’d flag write stalls as the operational gotcha: if compaction can’t keep up with ingest, the engine throttles writes, which is the real-world failure mode at extreme throughput.
Test Your Understanding
An LSM tree writes are sequential appends. A B+ tree write involves random page updates. Why does this make LSM trees faster for writes?
Sequential vs random I/O. An LSM write appends to the in-memory memtable (a sorted data structure like a skip list — a probabilistic data structure that maintains sorted order with O(log n) insert/search using layered linked lists) and fsyncs a WAL entry (sequential append). No disk page needs to be read-modified-written.
A B+ tree write must: (1) find the correct leaf page (random read if not cached), (2) modify it in place, (3) write the dirty page back (random write), and (4) potentially split the page (more random I/O). On HDDs, random I/O is ~100x slower than sequential. On SSDs the gap is smaller (~10x) but still significant at high throughput.
The cost LSM pays later: Compaction — merging and rewriting SSTables in the background. This is deferred write amplification. LSM trades immediate write speed for background compaction work.
A point read in an LSM tree might check the memtable, then L0, then L1, … up to LN. With 7 levels, that’s 7+ lookups. How do Bloom filters help, and when do they fail?
Bloom filter per SSTable. Before reading an SSTable from disk, check its Bloom filter: “could this key be in this SSTable?” If no → skip entirely (zero I/O). With a 1% false positive rate, you read an SSTable unnecessarily only 1% of the time.
For a point lookup across 7 levels with one SSTable per level: without Bloom filters = 7 disk reads. With Bloom filters = ~1 disk read (the one SSTable that actually contains the key) + 0.07 expected false positive reads.
When Bloom filters fail: Range scans. Bloom filters answer “is this exact key present?” — not “are there any keys between A and B?” For range queries, the LSM tree must check every SSTable that could overlap the range, and Bloom filters can’t help.
You’re choosing between leveled and size-tiered compaction for a Cassandra table. The table receives 500K writes/sec and occasional full-partition reads. Which strategy and why?
Size-tiered compaction (STCS) for this workload. STCS merges SSTables of similar size into larger ones. It has:
- Lower write amplification: Each SSTable is written ~log(N) times across its lifetime, vs ~10× for leveled.
- Higher read amplification: Multiple SSTables may overlap the same key range — a read must check all of them.
- Higher space amplification: Old data coexists with new data until compaction merges them.
With 500K writes/sec, minimizing write amplification is critical — leveled compaction’s 10× write amplification at this rate would require enormous I/O bandwidth.
Leveled compaction (LCS) is better when reads dominate: each level has non-overlapping SSTables, so a point read checks at most one SSTable per level. For a read-heavy table with moderate writes, LCS provides bounded read latency.