When you insert a row into a database, what actually happens to that data? If you’re using a traditional relational database, the answer involves random disk I/O, page splits, and a fundamental mismatch between how applications write data and how storage media work best. In 1996, Patrick O’Neil and his colleagues at UMass Boston and Digital Equipment Corporation identified this problem and proposed a solution that would eventually power some of the world’s largest databases.

The log-structured merge-tree, or LSM tree, wasn’t built to be faster at everything. It was built to be faster at the thing that matters most in modern applications: writes. Understanding why requires looking at what went wrong with the data structures we used before.

The Hidden Cost of B+ Tree Writes

The B+ tree has been the default index structure for relational databases for over five decades, and for good reason. Its sorted, balanced structure makes range queries efficient, and its high fanout minimizes disk reads for lookups. But B+ trees have a fundamental weakness that becomes apparent under write-heavy workloads.

Consider what happens when you insert a record into a B+ tree with 100 million entries. The new record belongs at a specific position in the sorted order, which means a specific leaf page. That page might be anywhere on disk. To insert the record, the database must:

  1. Read the leaf page from disk (random I/O)
  2. Modify it in memory
  3. Write it back to disk (another random I/O)

On a spinning hard drive, each random I/O takes roughly 10 milliseconds due to seek time and rotational latency. Even on modern SSDs, random I/O is slower than sequential I/O due to the way flash memory works. A database handling 1,000 inserts per second to random locations would need to perform 2,000 random I/Os per second—potentially saturating even high-end storage.
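The arithmetic above can be sketched in a few lines. This is a back-of-envelope calculation only; the 10 ms random-I/O cost and the 1,000-inserts-per-second target come from the text, and the variable names are illustrative.

```python
# Rough sketch of the random-I/O arithmetic for B+ tree inserts.
# Assumes ~10 ms per random I/O on a spinning disk (seek + rotation).
SEEK_MS = 10.0                # cost of one random I/O, in milliseconds
ios_per_insert = 2            # read the leaf page, then write it back
insert_rate = 1_000           # target inserts per second

required_iops = insert_rate * ios_per_insert   # random IOPS the workload demands
disk_iops = 1_000 / SEEK_MS                    # random IOPS one spindle can deliver
spindles_needed = required_iops / disk_iops    # disks needed just to keep up

print(required_iops, disk_iops, spindles_needed)  # 2000 100.0 20.0
```

Twenty spindles to sustain a thousand inserts per second is exactly the "disk arm as bottleneck" problem the 1996 paper describes.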

The 1996 LSM tree paper quantified this problem using the TPC-A benchmark. Maintaining an index on a high-insert-volume history table with a B+ tree effectively doubled the I/O cost of each transaction. The disk arm became the bottleneck, not the storage capacity. This wasn’t a minor inefficiency—it was a fundamental architectural problem.

The LSM Tree Insight: Embrace Sequential Writes

The core insight behind LSM trees is deceptively simple: stop trying to write data at random locations. Instead, write everything sequentially, and figure out where it belongs later.

An LSM tree treats all writes—inserts, updates, and deletes—as append operations. New data goes into a memory-resident structure first, then gets flushed to disk in sorted batches. The on-disk files are never modified in place. They’re merged and rewritten in the background, always sequentially.

Diagram illustrating compaction of data in a log-structured merge tree (image source: Wikipedia, Log-structured merge-tree)

This approach transforms the write problem. Instead of 2,000 random I/Os per second for 1,000 inserts, an LSM tree might perform sequential writes that are 10-50x more efficient per I/O. The original paper showed that this could reduce disk arm costs by an order of magnitude for write-heavy workloads.

Anatomy of an LSM Tree

Modern LSM implementations have evolved beyond the original two-component design, but the core architecture remains recognizable.

The Memtable: Where Writes Begin

Every write operation first hits the memtable—an in-memory sorted data structure. Common implementations use skip lists (as in RocksDB) or balanced trees. The memtable provides fast O(log n) inserts and lookups, but more importantly, it batches many writes together before they ever touch disk.

The memtable has a size limit, typically configurable from a few megabytes to hundreds of megabytes. Once it reaches this threshold, it becomes immutable and a new memtable is created to accept new writes.
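A minimal memtable can be sketched as a sorted key list plus a value map. This is a toy illustration, not RocksDB's skip list; the class name, the bisect-based sorted list, and the tiny size limit are all illustrative choices.

```python
import bisect

class Memtable:
    """Toy memtable: a sorted list of keys plus a dict of values.
    Real engines use a skip list (RocksDB) or a balanced tree; this
    bisect-based sketch just demonstrates sorted, bounded buffering."""

    def __init__(self, size_limit=4):
        self.keys = []              # keys kept in sorted order
        self.data = {}              # key -> latest value
        self.size_limit = size_limit

    def put(self, key, value):
        if key not in self.data:
            bisect.insort(self.keys, key)   # maintain sorted order on insert
        self.data[key] = value              # updates overwrite in place

    def get(self, key):
        return self.data.get(key)

    def is_full(self):
        return len(self.keys) >= self.size_limit

    def items(self):
        # Sorted iteration is what makes the later flush-to-SSTable sequential.
        return [(k, self.data[k]) for k in self.keys]
```

When `is_full()` returns true, the structure would be frozen and a fresh memtable created, as described above.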

The Write-Ahead Log: Durability Without the Wait

There’s an obvious problem with keeping writes in memory: crashes lose data. LSM implementations solve this with a write-ahead log (WAL). Every operation is appended to the WAL before being applied to the memtable. The WAL is written sequentially, making it fast even on spinning disks.

After a crash, the system replays the WAL to reconstruct the memtable. This is standard database practice, but in LSM trees, it’s particularly elegant because the WAL and the memtable have the same structure—ordered operations that can be replayed efficiently.
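The WAL's append-then-replay behavior is simple enough to sketch directly. This is a minimal illustration under assumed conventions (one JSON record per line); real WALs use binary framing, checksums, and batched syncs.

```python
import json
import os

class WAL:
    """Minimal write-ahead log sketch: one JSON record per line,
    appended and fsynced before the memtable is updated."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, key, value):
        self.f.write(json.dumps([key, value]) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())   # durable on disk before we acknowledge

    def replay(self):
        # After a crash: re-apply every record, in order, to rebuild the memtable.
        with open(self.path, encoding="utf-8") as f:
            return [tuple(json.loads(line)) for line in f]
```

Replaying the log in order reproduces exactly the sequence of memtable inserts, which is the elegance the text points out: log and memtable share the same logical structure.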

SSTables: Sorted Files on Disk

When a memtable is flushed to disk, it becomes an SSTable (Sorted String Table). An SSTable is an immutable, sorted file containing key-value pairs. The term comes from Google’s Bigtable paper, and it captures the essential property: the data is sorted by key, making both lookups and range scans efficient.

SSTables typically include:

  • Data blocks containing the actual key-value pairs
  • An index block for binary search
  • A Bloom filter (more on this later)
  • Optional compression

Once written, an SSTable is never modified. If a key is updated, a new entry is written to a newer SSTable. If a key is deleted, a tombstone marker is written. The most recent value for any key is always in the newest SSTable containing that key.
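The essential SSTable properties (sorted, immutable, tombstone-aware) can be sketched in memory. Real SSTables are on-disk files with data blocks, an index block, and a Bloom filter; the in-memory list and the `TOMBSTONE` sentinel here are illustrative stand-ins.

```python
import bisect

TOMBSTONE = object()   # sentinel marking a deleted key

class SSTable:
    """Sketch of an immutable sorted run: a list of (key, value) pairs
    in key order, never modified after construction."""

    def __init__(self, items):
        self.items = sorted(items)          # sorted once, then frozen
        self.min_key = self.items[0][0]     # key range, used to skip files
        self.max_key = self.items[-1][0]

    def get(self, key):
        # Binary search stands in for the index-block lookup.
        i = bisect.bisect_left(self.items, (key,))
        if i < len(self.items) and self.items[i][0] == key:
            return self.items[i][1]         # may be TOMBSTONE for a delete
        return None                         # key not in this run
```

A delete is just `SSTable([("k", TOMBSTONE)])` in a newer run; the tombstone shadows older values until compaction discards both.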

Levels: Organizing the Chaos

A naive LSM tree would simply accumulate SSTables indefinitely, making reads slower and slower as more files need to be checked. Real implementations organize SSTables into levels, typically numbered L0, L1, L2, and so on.

Level 0 (L0) contains SSTables freshly flushed from memtables. These files may have overlapping key ranges—the memtable is flushed as-is, without coordination with existing files. This means a read might need to check multiple L0 files.

Starting from L1, each level enforces a critical invariant: SSTables within a level have non-overlapping key ranges. A given key can appear in at most one SSTable per level. This dramatically reduces the number of files that need to be checked during a read.

Each level has a size limit, typically 10x larger than the previous level. When L0 accumulates enough SSTables, they’re merged into L1. When L1 exceeds its limit, some of its SSTables are merged into L2, and so on. This cascading merge process is the heart of LSM tree maintenance.
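The geometric growth of level capacities is easy to tabulate. The 10x fanout is from the text; the 64 MB L1 target is an illustrative assumption (RocksDB's defaults are configurable).

```python
# Sketch of the 10x level-size rule. FANOUT comes from the text;
# the 64 MB L1 target is an assumed, illustrative starting point.
FANOUT = 10
L1_TARGET_MB = 64

capacities = {f"L{n}": L1_TARGET_MB * FANOUT ** (n - 1) for n in range(1, 6)}
print(capacities)
# L1=64 MB, L2=640 MB, L3=6,400 MB, L4=64,000 MB, L5=640,000 MB
```

Five levels at this fanout already span four orders of magnitude, which is why a handful of levels suffices for very large datasets.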

The Write Path: From Memory to Disk

Understanding the write path reveals why LSM trees excel at write-heavy workloads:

Write(key, value):
  1. Append to WAL (sequential disk write)
  2. Insert into memtable (in-memory operation)
  3. Return success to client

The client receives confirmation after two operations: a sequential disk write (the WAL) and an in-memory insert. No random disk I/O is required. No pages need to be read from disk and written back. The write completes in microseconds rather than milliseconds.

When the memtable fills up, a background flush operation converts it to an SSTable:

Flush():
  1. Mark memtable as immutable
  2. Create new memtable for incoming writes
  3. Write memtable contents to new SSTable in L0 (sequential write)
  4. Delete WAL entries older than the flush point

The flush is also sequential—every key-value pair is written once, in sorted order. No random I/O, no page splits, no rebalancing.
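The Write and Flush steps above can be combined into one small sketch. The WAL and the levels are modeled as in-memory lists purely for illustration; the class name and the tiny memtable limit are assumptions, not any engine's API.

```python
class LSMWritePath:
    """Sketch of the write and flush pseudocode above. The WAL is a
    plain list standing in for the sequential log file; L0 holds
    flushed runs, newest first."""

    def __init__(self, memtable_limit=2):
        self.wal = []                    # stands in for the on-disk log
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.l0 = []                     # list of sorted runs, newest first

    def write(self, key, value):
        self.wal.append((key, value))    # 1. append to WAL (sequential)
        self.memtable[key] = value       # 2. insert into memtable (in memory)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                 # background work in real systems
        return "ok"                      # 3. acknowledge the client

    def flush(self):
        # 3-4 of Flush(): write the memtable as one sorted, immutable run
        # in L0, then drop the WAL entries it covers.
        sstable = sorted(self.memtable.items())
        self.l0.insert(0, sstable)
        self.memtable = {}
        self.wal.clear()
```

Note that nothing in `write()` reads existing data, which is the whole point: the client-visible path is append-only.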

The Read Path: The Price of Write Optimization

Fast writes come with a cost: reads become more complex. To find a key, the system must check:

  1. The active memtable
  2. The immutable memtable (if one exists)
  3. L0 SSTables (in reverse chronological order)
  4. SSTables in L1 and beyond (at most one per level)

For a key that exists in the deepest level, this might mean checking a dozen or more structures. This is the read amplification problem inherent in LSM trees.

Read(key):
  1. Check memtable → if found, return
  2. Check immutable memtable → if found, return
  3. For each L0 SSTable (newest to oldest):
     if key in key range, check SSTable
     if found, return
  4. For each level L1 and beyond:
     Find the single SSTable whose key range contains `key`
     Check that SSTable
     if found, return
  5. Return "not found"

Bloom Filters: Eliminating Impossible Lookups

The key insight for optimizing reads is that most lookups can skip most SSTables. A Bloom filter—a probabilistic data structure that can definitively say “this key is NOT in this SSTable”—eliminates most disk reads for lookups.

Before reading an SSTable, the system checks its Bloom filter. If the filter says the key isn’t present, that SSTable is skipped entirely. Bloom filters have a small false positive rate, meaning they might incorrectly suggest a key is present, but they never produce false negatives.

In practice, Bloom filters dramatically reduce read amplification. A point lookup might need to check only a handful of SSTables even when hundreds exist. The memory overhead is modest—a Bloom filter with 10 bits per key has a false positive rate around 1%.

Compaction: The Background Tax

As SSTables accumulate, two problems emerge:

  1. Read performance degrades—more files mean more structures to check
  2. Space efficiency suffers—old versions of keys and tombstones consume space

Compaction is the background process that addresses both problems. It merges multiple SSTables into new ones, discarding overwritten values and removing tombstones.

Leveled Compaction

The most common compaction strategy, used by RocksDB and LevelDB, is called leveled compaction. Each level has a target size, typically 10x the size of the previous level. When a level exceeds its target, the system selects one or more SSTables from that level and merges them with overlapping SSTables in the next level.

For example, if L1 is at capacity, an SSTable covering keys 1000-2000 from L1 would be merged with any L2 SSTables whose key ranges overlap with 1000-2000. The result is new L2 SSTables that maintain the non-overlapping invariant.

Leveled compaction minimizes read amplification—each level has at most one SSTable to check per key. But it maximizes write amplification. A key might be rewritten multiple times as it cascades through the levels.

The theoretical write amplification for leveled compaction is:

$$WA = \frac{L \cdot (T + 1)}{2}$$

Where $L$ is the number of levels and $T$ is the size ratio between levels (typically 10). For a 7-level tree with size ratio 10, this gives a write amplification of roughly 38: each byte written by the application eventually results in about 38 bytes written to disk.

Tiered (Size-Tiered) Compaction

An alternative strategy, used by Apache Cassandra, is tiered compaction. Instead of merging SSTables between levels, tiered compaction merges SSTables of similar size within the same tier.

When enough SSTables accumulate at a given size tier, they’re merged together into a larger SSTable. This approach has lower write amplification—each key is rewritten fewer times—but higher read amplification, since multiple SSTables at each tier may need to be checked.

The choice between leveled and tiered compaction represents a fundamental trade-off: optimize for reads (leveled) or optimize for writes (tiered). Modern systems often allow tuning along this spectrum, or even hybrid approaches.

The Three Amplifications: Understanding LSM Trade-offs

LSM tree performance is often analyzed through three metrics, collectively known as amplifications:

Write Amplification

The ratio of bytes written to disk versus bytes written by the application. High write amplification:

  • Consumes disk bandwidth
  • Reduces write throughput
  • Shortens SSD lifespan (due to flash wear)

Leveled compaction trades higher write amplification for lower read amplification. Tiered compaction does the opposite.

Read Amplification

The number of disk reads required for a point lookup. Higher read amplification means slower reads. Leveled compaction typically requires checking at most one SSTable per level, giving read amplification equal to the number of levels.

Space Amplification

The ratio of disk space used versus the logical size of the data. This includes:

  • Multiple versions of the same key
  • Tombstones waiting to be cleaned up
  • SSTables being merged (temporarily)

LSM trees typically have higher space amplification than B+ trees, often 10-50% more. This is the price of immutability.

These three amplifications exist in tension. Reducing one often increases another. The choice of compaction strategy, level sizes, and other parameters is fundamentally about choosing which amplification matters most for your workload.

Real-World Implementations

LSM trees have become the storage engine of choice for systems where writes matter.

RocksDB: The Standard-Bearer

Facebook developed RocksDB as a fork of Google’s LevelDB, optimized for flash storage and multi-core CPUs. It’s used at Facebook for UDB (their social graph database), and as the storage engine for MyRocks, their MySQL variant.

The MyRocks migration is instructive. Facebook moved their social graph workload from InnoDB (B+ tree) to MyRocks (LSM tree) and achieved:

  • 50% reduction in storage space
  • Similar read performance
  • Significantly better write performance

The space savings came from better compression (sequential data compresses better) and the absence of page fragmentation that plagues B+ trees under random updates.

Apache Cassandra: Distributed LSM

Cassandra uses LSM trees as its primary storage structure, with tiered compaction as the default. This choice reflects Cassandra’s origin as a write-heavy, distributed system where write throughput is paramount.

Cassandra’s implementation adds complexity for distribution—each node runs its own LSM tree, and data is partitioned across nodes using consistent hashing. But the core LSM principles remain: writes go to memtables first, flush to SSTables, and merge in the background.

InfluxDB and Time-Series Data

Time-series databases face extreme write volumes—millions of data points per second are common. InfluxDB’s storage engine uses an LSM-inspired design optimized for time-stamped data. Time-series workloads are particularly well-suited to LSM trees because:

  • Newer data is accessed more frequently (matches the LSM read pattern)
  • Writes are typically append-only
  • Time ordering naturally creates sorted runs

When to Choose LSM Over B+ Tree

The choice between LSM and B+ tree isn’t about which is “better”—it’s about which trade-offs match your workload.

Choose LSM when:

  • Write throughput is the bottleneck
  • Your workload is write-heavy (more writes than reads)
  • Sequential I/O patterns are preferred (SSD storage)
  • You can tolerate slightly higher space usage
  • Your application handles occasional read latency spikes during compaction

Choose B+ tree when:

  • Read latency must be consistently low
  • Your workload is read-heavy
  • Range queries are frequent and must be fast
  • Space efficiency is critical
  • You need predictable performance without compaction spikes

The rise of SSDs has made LSM trees more attractive. The random read penalty that hurt B+ tree writes is less severe for LSM tree reads, while the write amplification of compaction is less harmful to flash endurance than the random writes of B+ tree updates.

The Compaction Problem: When Background Work Becomes Foreground Pain

LSM trees have an Achilles’ heel: compaction can become a foreground problem.

When write rates exceed compaction throughput, SSTables accumulate faster than they can be merged. L0 grows uncontrollably, read latency spikes, and eventually the system must throttle writes to let compaction catch up.

This “compaction debt” manifests as:

  • Write stalls when L0 exceeds a threshold
  • Increased read latency due to checking more files
  • Disk I/O saturation from background merges

Modern systems like RocksDB implement sophisticated compaction scheduling to mitigate this. They may:

  • Reserve I/O bandwidth for foreground operations
  • Prioritize compaction of files containing older data
  • Dynamically adjust compaction parallelism
  • Implement “compaction filters” to actively prune data

But the fundamental tension remains: compaction is work that must be done eventually, and doing it lazily means it might interfere with foreground operations.

Looking Forward: LSM Trees in the Modern Era

Research on LSM trees continues to address their limitations. Recent innovations include:

Key-value separation (WiscKey): Large values are stored in a separate log, reducing write amplification when values are much larger than keys.

Learned indexes: Machine learning models predict key locations, potentially reducing the number of levels and amplification.

NVMe-optimized designs: New implementations exploit the parallelism and low latency of NVMe SSDs.

LSM-B+ tree hybrids: Systems like WiredTiger (MongoDB’s storage engine) offer both LSM and B+ tree modes, or hybrid approaches.

The LSM tree isn’t a silver bullet—it’s a careful engineering trade-off that happens to match modern workloads and storage characteristics. Understanding those trade-offs is essential for anyone building or operating database systems.


References

  1. O’Neil, P., Cheng, E., Gawlick, D., & O’Neil, E. (1996). The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4), 351-385.

  2. Matsunobu, Y., Dong, S., & Lee, H. (2020). MyRocks: LSM-Tree Database Storage Engine Serving Facebook’s Social Graph. Proceedings of the VLDB Endowment, 13(12), 3217-3230.

  3. Dayan, N., Twitto, M., et al. (2021). Spooky: Granulating LSM-Tree Compactions Correctly. Proceedings of the VLDB Endowment, 15(3), 307-319.

  4. Luo, C., & Carey, M. J. (2019). LSM-based Storage Techniques: A Survey. The VLDB Journal, 29, 393-418.

  5. Sears, R., & Ramakrishnan, R. (2012). bLSM: A General Purpose Log Structured Merge Tree. Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 217-228.

  6. Sarkar, P., et al. (2021). Constructing and Analyzing the LSM Compaction Design Space. Proceedings of the VLDB Endowment, 14(11), 2216-2229.