When Two Nodes Cannot Agree: The FLP Impossibility That Defines Distributed Systems

In 1985, three researchers—Michael Fischer, Nancy Lynch, and Michael Paterson—published a result that would fundamentally reshape how we think about distributed systems. Their theorem, now known simply as FLP, demonstrated something unsettling: in an asynchronous distributed system where even a single process can fail, there exists no deterministic algorithm that is guaranteed to solve consensus. This wasn’t a limitation of current technology or a gap in our knowledge. It was a mathematical impossibility—a fundamental boundary that no amount of engineering cleverness can overcome. Yet today, distributed databases coordinate across continents, consensus algorithms power everything from cloud infrastructure to blockchain networks, and systems achieve agreement millions of times per second. How do we reconcile this apparent contradiction? ...

10 min · 1994 words

When One Slow Service Took Down an Entire Region: The Circuit Breaker Pattern Explained

On September 20, 2015, Amazon DynamoDB in US-East-1 went dark for over four hours. The root cause wasn’t a hardware failure or a cyberattack—it was a feedback loop. Storage servers couldn’t retrieve their partition assignments from a metadata service, so they retried. The metadata service became overwhelmed. More timeouts. More retries. More overload. Engineers eventually had to firewall the metadata service from storage servers entirely, effectively taking DynamoDB offline to break the cycle. ...

14 min · 2971 words
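The retry feedback loop described above is what the circuit breaker pattern is designed to break: after repeated failures, stop calling the struggling dependency and fail fast until a cooldown passes. A minimal sketch (class and parameter names are illustrative, not from the article):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: trips open after repeated failures,
    then permits a trial call once a cooldown has elapsed (half-open)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of adding load to an overwhelmed service.
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, allow one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            # Success closes the circuit and resets the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

The key point is the fail-fast branch: callers get an immediate error rather than a timeout, so the downstream service gets breathing room instead of an ever-growing retry storm.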

How RAID Actually Survives Disk Failures: The Mathematics Behind Your Data's Safety Net

In 1987, three researchers at the University of California, Berkeley published a paper that would fundamentally change how we think about data storage. David Patterson, Garth Gibson, and Randy Katz proposed something counterintuitive: instead of buying one expensive, reliable disk drive, why not combine many cheap, unreliable ones into a system more reliable than any single drive could ever be? They called it RAID—Redundant Arrays of Inexpensive Disks. The insight was mathematical, not magical. By distributing data across multiple drives with carefully calculated redundancy, you could achieve both performance and reliability that would be impossible with a single disk. The key was a simple operation that most programmers learn in their first computer science course: XOR. ...

13 min · 2738 words
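The XOR trick the excerpt alludes to can be shown in a few lines: the parity block is the byte-wise XOR of all data blocks, and because XOR is its own inverse, XOR-ing the parity with the surviving blocks regenerates whichever single block was lost. A toy sketch (block contents here are made up for illustration):

```python
def parity(blocks):
    """Byte-wise XOR across equal-length blocks; the RAID parity operation."""
    p = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            p[i] ^= byte
    return bytes(p)

def reconstruct(surviving_blocks, parity_block):
    """Recover one lost block: XOR the parity with all surviving data blocks."""
    return parity(surviving_blocks + [parity_block])

# Three data "disks" plus one parity disk.
data = [b"disk0!", b"disk1!", b"disk2!"]
p = parity(data)

# Pretend disk 1 failed: rebuild its contents from the rest plus parity.
recovered = reconstruct([data[0], data[2]], p)
assert recovered == data[1]
```

This works because d0 ⊕ d2 ⊕ (d0 ⊕ d1 ⊕ d2) = d1: every surviving term cancels itself, leaving exactly the missing block.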