Distributed Systems

When One Second Crashed the Internet: The Hidden Complexity of Timekeeping

At 23:59:60 UTC on June 30, 2012, a second was added to the world’s clocks. Within minutes, Reddit, LinkedIn, Mozilla, Gawker, and dozens of other major websites had crashed. Their servers were running at 100% CPU, locked in tight loops that made them completely unresponsive. The culprit wasn’t a cyberattack or a hardware failure—it was the handling of a single extra second. The Linux kernel’s high-resolution timer subsystem, called hrtimer, had gotten confused by the leap second. When the system clock stepped backward by one second, sleeping processes were awakened prematurely, flooding the CPU with activity. Java-based applications like Cassandra—the database powering Reddit—were particularly affected. The site was offline for over an hour. ...

When Round Robin Fails: The Hidden Mathematics of Load Balancing Algorithms

Imagine you’re running a service with 10 servers, each nominally rated for 1,000 requests per second. You set up a round-robin load balancer: simple, elegant, fair. Every server gets its turn in sequence. Traffic flows smoothly until suddenly, at 2 AM, your monitoring alerts start screaming. Some of your servers are overwhelmed, with queues growing and latencies spiking, while others are nearly idle. What went wrong? The servers weren’t identical. Three of them were newer machines with faster CPUs and more memory. Three were legacy boxes running older hardware. The round-robin algorithm, in its mechanical fairness, sent exactly the same number of requests to a struggling legacy server as it did to a powerful new one. The legacy servers couldn’t keep up, requests piled up in their queues, and eventually they started timing out, cascading into a partial outage that woke up half your engineering team. ...
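The mismatch described above can be reproduced in a few lines. This is a minimal sketch, not any particular load balancer's implementation: the server names and capacities are hypothetical, and the capacity-aware variant shown is a smooth weighted round-robin that uses each server's capacity as its weight.

```python
def round_robin(servers, n_requests):
    """Plain round robin: every server gets the same share, whatever its capacity."""
    names = list(servers)
    counts = {name: 0 for name in names}
    for i in range(n_requests):
        counts[names[i % len(names)]] += 1
    return counts


def smooth_weighted_round_robin(servers, n_requests):
    """Smooth weighted round robin: shares are proportional to each server's weight."""
    counts = {name: 0 for name in servers}
    current = {name: 0 for name in servers}
    total = sum(servers.values())
    for _ in range(n_requests):
        # Every server earns credit equal to its weight on each pick.
        for name, weight in servers.items():
            current[name] += weight
        # The server with the most credit is chosen and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        counts[chosen] += 1
    return counts


def utilization(counts, servers):
    """Assigned load divided by capacity; values above 1.0 mean overload."""
    return {name: counts[name] / servers[name] for name in servers}


# Hypothetical capacities in requests per second (assumed for illustration).
SERVERS = {"new-1": 1500, "new-2": 1500, "legacy-1": 500, "legacy-2": 500}

if __name__ == "__main__":
    offered = 4000  # requests arriving in one second
    print(utilization(round_robin(SERVERS, offered), SERVERS))
    print(utilization(smooth_weighted_round_robin(SERVERS, offered), SERVERS))
```

Under these assumed numbers, plain round robin hands each legacy box twice what it can serve while the newer machines sit at two thirds of capacity. The weighted variant keeps every server at the same utilization.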

When Two Nodes Cannot Agree: The FLP Impossibility That Defines Distributed Systems

In 1985, three researchers—Michael Fischer, Nancy Lynch, and Michael Paterson—published a result that would fundamentally reshape how we think about distributed systems. Their theorem, now known simply as FLP, demonstrated something unsettling: in an asynchronous distributed system where even a single process can fail, there exists no deterministic algorithm that is guaranteed to solve consensus. This wasn’t a limitation of current technology or a gap in our knowledge. It was a mathematical impossibility—a fundamental boundary that no amount of engineering cleverness can overcome. Yet today, distributed databases coordinate across continents, consensus algorithms power everything from cloud infrastructure to blockchain networks, and systems achieve agreement millions of times per second. How do we reconcile this apparent contradiction? ...

How NTP Keeps the World Synchronized: The Hidden Protocol Behind Every Network Clock

On June 30, 2012, at 23:59:60 UTC, something unusual happened. A single extra second was added to the world’s clocks to account for the Earth’s gradually slowing rotation. Within minutes, Reddit went offline. LinkedIn stopped responding. Mozilla’s servers ground to a halt. Qantas Airways reported that their check-in systems had failed, stranding passengers across Australia. The culprit wasn’t a cyberattack or a hardware failure. It was a bug in how Linux handled leap seconds—a feature that had been tested only a handful of times in the previous decade. The Network Time Protocol (NTP) had warned servers about the incoming leap second, but the kernel’s high-resolution timer subsystem got confused. Applications that were “sleeping” suddenly woke up all at once, overwhelming CPUs. ...

When One Slow Service Took Down an Entire Region: The Circuit Breaker Pattern Explained

On September 20, 2015, Amazon DynamoDB in US-East-1 went dark for over four hours. The root cause wasn’t a hardware failure or a cyberattack—it was a feedback loop. Storage servers couldn’t retrieve their partition assignments from a metadata service, so they retried. The metadata service became overwhelmed. More timeouts. More retries. More overload. Engineers eventually had to firewall the metadata service from storage servers entirely, effectively taking DynamoDB offline to break the cycle. ...
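The retry feedback loop above is the failure mode a circuit breaker is meant to interrupt. The following is a minimal sketch of the pattern, not a description of how DynamoDB was fixed. The class name, the threshold, and the timeout values are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to wait before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of adding load to a struggling service.
                raise CircuitOpenError("dependency presumed unhealthy")
            # Timeout elapsed: half-open, so this one call goes through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the slow dependency in `breaker.call(...)` and treat `CircuitOpenError` as a signal to back off or serve a fallback. While the circuit is open, retries never reach the overloaded service.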

When Serializable Is Not Serializable: The Hidden World of Transaction Isolation Levels

In 2012, a team of database researchers published a paper that would reshape how engineers think about transaction isolation. The paper, titled “Serializable Snapshot Isolation in PostgreSQL,” described a subtle anomaly that had been hiding in plain sight for decades: two transactions could both execute correctly in isolation, yet produce an incorrect result when run concurrently. The anomaly wasn’t a dirty read or a phantom—it was something called write skew, and it exposed a fundamental truth about the ANSI SQL isolation levels: the names don’t always mean what developers think they mean. ...

How Bloom Filters Store 100 Million Items in 120 MB While Never Missing a Match

In 1970, Burton Howard Bloom faced a problem that would feel familiar to any modern software engineer working with large datasets. He needed to check whether words required special hyphenation rules, but storing 500,000 dictionary entries in memory was prohibitively expensive. His solution—a data structure that uses dramatically less space than any traditional approach—became one of the most widely deployed probabilistic data structures in computing history. The insight was radical: what if you could trade certainty for space? A Bloom filter will never tell you an item is absent when it’s actually present (no false negatives), but it might occasionally claim an item exists when it doesn’t (false positives). For many applications, this trade-off is not just acceptable—it’s transformative. ...
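The headline figure can be sanity-checked with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The excerpt does not state a false-positive rate, so the 1% target below is an assumption. The small filter class is likewise only a sketch: it derives its k positions from one SHA-256 digest by double hashing, which is one common choice among several.

```python
import hashlib
import math


def bloom_size_bits(n_items, fp_rate):
    """Bits needed for n_items at the target false-positive rate."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def optimal_hash_count(m_bits, n_items):
    """Number of hash functions that minimizes false positives for this size."""
    return max(1, round(m_bits / n_items * math.log(2)))


class BloomFilter:
    def __init__(self, n_items, fp_rate):
        self.m = bloom_size_bits(n_items, fp_rate)
        self.k = optimal_hash_count(self.m, n_items)
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Double hashing: position_i = h1 + i * h2 (mod m), from one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so it is never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


if __name__ == "__main__":
    bits = bloom_size_bits(100_000_000, 0.01)  # assumed 1% false-positive target
    print(f"{bits / 8 / 1e6:.1f} MB, k = {optimal_hash_count(bits, 100_000_000)}")
```

At the assumed 1% rate, this works out to a little under 10 bits per item, which is where a figure of roughly 120 MB (decimal megabytes) for 100 million items comes from. Since bits are only ever set and never cleared, an item that was added always tests as present.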

How Git Actually Stores Your Code: The Hidden Architecture Behind Every Commit

On April 7, 2005, Linus Torvalds made the first commit to a new version control system. He had started coding it just four days earlier, on April 3rd, after the proprietary tool he had been using for Linux kernel development became unavailable. The kernel community needed something fast, distributed, and capable of handling thousands of contributors. What Torvalds built in those frantic days wasn’t just another version control system—it was a content-addressable filesystem disguised as one. ...
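To make "content-addressable" concrete: with the default SHA-1 object format, Git names a file's contents by hashing a short header (`blob`, the byte length, a NUL byte) followed by the bytes themselves. The sketch below reproduces that ID calculation. The `ObjectStore` class is a hypothetical in-memory toy for illustration, not Git's on-disk layout.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob under the SHA-1 object format."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        oid = git_blob_id(content)
        self._objects[oid] = content  # identical content always lands on the same key
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]


if __name__ == "__main__":
    store = ObjectStore()
    print(store.put(b"hello world\n"))
```

The printed ID should match what `git hash-object` reports for a file containing the same bytes. Storing the same content twice yields the same key, so duplicates cost nothing extra.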

Why Message Queues Are Harder Than They Look: The Hidden Complexity of "Simple" Asynchronous Communication

A financial services company migrated its payment processing to a message queue architecture. The design seemed straightforward: producers publish payment requests, workers consume and process them. Six months later, the company discovered that its customers had been double-charged for approximately 3% of transactions. The queue was working exactly as configured; the problem was that “working” meant something different from what the team expected. Message queues occupy a strange position in distributed systems. They appear deceptively simple on the surface: put a message in, get a message out. But beneath that simplicity lies a maze of trade-offs involving durability, ordering, delivery guarantees, and failure modes. Understanding these trade-offs isn’t academic: it’s the difference between a reliable system and one that silently corrupts data. ...

How Consistent Hashing Scales Distributed Systems: The Mathematics Behind Minimal Rebalancing

When Amazon engineers published the Dynamo paper in 2007, they revealed a technique that had been quietly powering some of the world’s largest distributed systems. The core idea—consistent hashing—originated from a 1997 MIT paper by David Karger and colleagues, but it took a decade before the industry fully embraced its elegance. Today, consistent hashing underpins Apache Cassandra, Amazon DynamoDB, Discord’s messaging infrastructure, Netflix’s content delivery network, and virtually every modern distributed database. The algorithm solves a deceptively simple problem: how do you distribute data across servers when those servers keep joining and leaving? ...
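The "servers keep joining and leaving" problem is easiest to see by counting how many keys change owner when a node is added. The sketch below is a minimal hash ring with virtual nodes, compared against naive `hash % N` placement. The node names, the virtual-node count, and the use of MD5 as the ring hash are all assumptions for illustration, not details from the Dynamo paper.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted list of (position on ring, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns many points to even out its share of the ring.
        for i in range(self.vnodes):
            bisect.insort(self._points, (stable_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        # Walk clockwise to the first point at or after the key, wrapping at the end.
        idx = bisect.bisect_right(self._points, (stable_hash(key), ""))
        return self._points[idx % len(self._points)][1]


def fraction_moved(before, after, keys):
    """Share of keys whose owner differs between two placement functions."""
    return sum(1 for k in keys if before(k) != after(k)) / len(keys)


if __name__ == "__main__":
    keys = [f"user:{i}" for i in range(10_000)]
    nodes = ["node-a", "node-b", "node-c", "node-d"]
    old_ring, new_ring = HashRing(nodes), HashRing(nodes + ["node-e"])
    print("ring:  ", fraction_moved(old_ring.lookup, new_ring.lookup, keys))
    print("modulo:", fraction_moved(lambda k: stable_hash(k) % 4,
                                    lambda k: stable_hash(k) % 5, keys))
```

Going from four nodes to five, the ring should relocate roughly one key in five, and every relocated key lands on the new node. Modulo placement reshuffles most of the key space, around four keys in five in expectation.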

How Search Engines Find a Needle in a 400-Billion-Document Haystack

When you type a query and hit enter, results appear in under half a second. Behind that instant response lies an engineering marvel: a system that must search through hundreds of billions of documents, score each one for relevance, and return the best matches—all before you can blink. The numbers are staggering. Google’s index contains approximately 400 billion documents according to testimony from their VP of Search during the 2023 antitrust trial. The index itself exceeds 100 million gigabytes. Yet the median response time for a search query remains under 200 milliseconds. ...

Why Your API Collapsed at 2AM: The Rate Limiting Algorithm You Chose Matters

At 2:17 AM on a Tuesday, a major e-commerce platform’s API went down. The incident report later revealed the root cause: a misconfigured rate limiter had allowed a burst of requests through at exactly the boundary between two time windows, overwhelming downstream services. The platform had implemented a fixed window counter—the simplest rate limiting algorithm—and paid the price for its simplicity. Rate limiting seems straightforward: allow N requests per time period. But the algorithm you choose determines not just whether your system survives traffic spikes, but how fairly it treats users, how much memory it consumes, and whether it creates new failure modes you never anticipated. The difference between algorithms isn’t academic—it’s the difference between a system that degrades gracefully and one that cascades into total failure. ...
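The boundary problem with a fixed window counter is easy to reproduce. The sketch below assumes a limit of 100 requests per 60-second window (hypothetical numbers, not taken from the incident) and compares a fixed window counter with a sliding window log, one of several alternatives, under a burst that straddles a window boundary.

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window; the counter resets at each boundary."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.current_window = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window)
        if window_id != self.current_window:
            self.current_window = window_id
            self.count = 0  # a new window forgets everything that just happened
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    """Keeps timestamps of accepted requests from the trailing window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.timestamps = deque()

    def allow(self, now):
        # Drop entries that have aged out of the trailing window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


def allowed_in_boundary_burst(limiter, burst=100):
    """Send `burst` requests just before t=60s and `burst` more just after."""
    times = [59.9] * burst + [60.1] * burst
    return sum(limiter.allow(t) for t in times)


if __name__ == "__main__":
    print(allowed_in_boundary_burst(FixedWindowLimiter(100, 60)))
    print(allowed_in_boundary_burst(SlidingWindowLog(100, 60)))
```

Within a span of 0.2 seconds, the fixed window counter admits twice the configured limit because its counter resets at the boundary. The sliding log holds the line, at the cost of storing one timestamp per accepted request.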