Data Structures

When 2 MB of Data Can Take Down a Server: The Hidden Mathematics of Hash Collisions

On December 28, 2011, at the 28th Chaos Communication Congress in Berlin, Alexander Klink and Julian Wälde demonstrated something that sent shockwaves through the software industry. With just 2 megabytes of carefully crafted POST data, they kept a single CPU core busy for over 40 minutes. The attack didn’t exploit buffer overflows or SQL injection—it exploited the fundamental mathematics of hash tables. The technique, dubbed HashDoS, works because hash tables have a worst-case performance that’s dramatically different from their average case. When you understand the mathematics behind this vulnerability, you’ll see why it affected virtually every major programming language and why modern hash table implementations look very different from their predecessors. ...

How LSM Trees Write 10x Faster Than B-Trees: The Hidden Architecture Behind Modern Databases

In 1996, Patrick O’Neil and his colleagues at the University of Massachusetts Boston published a paper describing a data structure that would take nearly a decade to find widespread adoption. The Log-Structured Merge-Tree (LSM-Tree) was designed to solve a problem that barely existed at the time: how to efficiently index data when writes vastly outnumber reads. Today, LSM-Trees power the storage engines of Cassandra, RocksDB, LevelDB, HBase, InfluxDB, and countless other systems that handle massive write throughput. Yet the fundamental insight remains surprisingly misunderstood: LSM-Trees don’t just “write faster”—they fundamentally restructure how data moves from memory to disk. ...

Why Your Database Writes Are Slow: The B+ Tree Problem LSM Trees Were Built to Solve

When you insert a row into a database, what actually happens to that data? If you’re using a traditional relational database, the answer involves random disk I/O, page splits, and a fundamental mismatch between how applications write data and how storage media work best. In 1996, Patrick O’Neil and his colleagues at UMass Boston and Digital Equipment Corporation identified this problem and proposed a solution that would eventually power some of the world’s largest databases. ...

How Bloom Filters Store 100 Million Items in 120 MB While Never Missing a Match

In 1970, Burton Howard Bloom faced a problem that would feel familiar to any modern software engineer working with large datasets. He needed to check whether words required special hyphenation rules, but storing 500,000 dictionary entries in memory was prohibitively expensive. His solution—a data structure that uses dramatically less space than any traditional approach—became one of the most widely deployed probabilistic data structures in computing history. The insight was radical: what if you could trade certainty for space? A Bloom filter will never tell you an item is absent when it’s actually present (no false negatives), but it might occasionally claim an item exists when it doesn’t (false positives). For many applications, this trade-off is not just acceptable—it’s transformative. ...

Why Databases Choose B+ Trees Over Hash Tables and B-Trees

When you create an index on a database table, have you ever wondered what data structure actually powers it? The answer is almost always a B+ tree. Not a hash table. Not a regular B-tree. Not a binary search tree. B+ trees have been the default index structure in nearly every major relational database for over five decades—MySQL, PostgreSQL, Oracle, SQL Server, and SQLite all use them. This isn’t coincidence or legacy inertia. It’s the result of fundamental trade-offs between disk I/O patterns, range query efficiency, and storage utilization. ...