When 30% of CPU Time Disappears Into JSON.parse(): The Hidden Cost of Human-Readable Serialization

A Go service at a high-traffic company began experiencing mysterious CPU spikes. The flamegraphs revealed something unexpected: 30-40% of CPU time was spent inside json.Marshal and json.Unmarshal. No database queries were slow. No algorithms were inefficient. The serialization layer alone was consuming more than a third of the computational budget. This isn’t an anomaly. At scale, the choice of serialization format becomes a first-order performance concern. The difference between JSON and binary formats isn’t a few percentage points—it’s often 5-7x in throughput and 2-3x in payload size. ...

8 min · 1551 words
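The payload-size half of that gap is easy to demonstrate with nothing but a standard library: the same record encoded as JSON versus a fixed-layout binary encoding. This Python sketch uses the struct module as a stand-in for schema-based binary formats like Protocol Buffers; the record and its field names are invented for illustration.

```python
import json
import struct

# An invented telemetry record: (user_id, timestamp, latency_ms)
record = (184467, 1700000000, 12.5)

# Human-readable: field names and punctuation repeated in every message
as_json = json.dumps(
    {"user_id": record[0], "timestamp": record[1], "latency_ms": record[2]}
).encode("utf-8")

# Binary: a fixed schema known to both sides -- two unsigned 32-bit ints
# and a 32-bit float, 12 bytes total regardless of field names
as_binary = struct.pack("<IIf", record[0], record[1], record[2])

print(len(as_json), len(as_binary))  # the binary form is several times smaller
```

The size ratio only grows with nesting depth and field-name length, which is why the gap widens on real payloads.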

When Round Robin Fails: The Hidden Mathematics of Load Balancing Algorithms

Imagine you’re running a service with 10 servers, each nominally capable of handling 1,000 requests per second. You set up a round-robin load balancer—simple, elegant, fair. Every server gets its turn in sequence. Traffic flows smoothly until suddenly, at 2 AM, your monitoring alerts start screaming. Three of your servers are overwhelmed, queues are growing, latencies are spiking, and the rest sit nearly idle. What went wrong? The servers weren’t identical. Three of them were legacy boxes running older hardware; the rest were newer machines with faster CPUs and more memory. The round-robin algorithm, in its mechanical fairness, sent exactly the same number of requests to a struggling legacy server as it did to a powerful new one. The legacy servers couldn’t keep up, requests piled up in their queues, and eventually they started timing out—cascading into a partial outage that woke up half your engineering team. ...

12 min · 2443 words
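The simplest fix for heterogeneous capacity is static weighted round robin: repeat each server in the rotation in proportion to what it can handle. A minimal Python sketch (server names and capacities are invented; production balancers such as nginx use a smoother interleaving of the same idea):

```python
import itertools

# Invented capacities: requests/sec each server can actually sustain
capacity = {"new-1": 1000, "new-2": 1000, "legacy-1": 250}

# Plain round robin ignores capacity entirely:
plain = itertools.cycle(capacity)  # new-1, new-2, legacy-1, new-1, ...

# Weighted round robin repeats each server in proportion to its capacity,
# so legacy-1 sees a quarter of the traffic the newer machines do.
unit = min(capacity.values())
weighted = itertools.cycle(
    [name for name, cap in capacity.items() for _ in range(cap // unit)]
)

sent = [next(weighted) for _ in range(5000)]
```

With these weights the rotation is nine slots long, so legacy-1 receives one request for every four that each newer machine gets.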

When the Power Fails: How WAL Guarantees Your Data Survives Every Crash

In the late 1970s, Jim Gray and his colleagues at IBM Research were working on transaction processing systems that needed to guarantee data integrity even when power failed mid-operation. Gray’s solution was elegant in its simplicity: never write data to the main store until you’ve first written it to a log. This principle, formalized in his 1981 paper “The Transaction Concept: Virtues and Limitations,” became known as Write-Ahead Logging, and decades later, it remains the foundation of every major database system. ...

11 min · 2257 words
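The write-ahead rule fits in a few lines. This Python sketch (a hypothetical one-JSON-line-per-operation log format, not any real database's) makes the ordering explicit: the operation is fsynced to the log before the main store is touched, and recovery is just replaying the log.

```python
import json
import os

class TinyKV:
    """Toy key-value store following the write-ahead rule."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.store = {}   # the "main store" (in memory for this sketch)
        self._replay()    # crash recovery: re-apply every logged operation

    def _replay(self):
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    op = json.loads(line)
                    self.store[op["key"]] = op["value"]

    def put(self, key, value):
        # Step 1: the operation must be durable in the log first
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force it to disk before proceeding
        # Step 2: only now is the main store updated
        self.store[key] = value
```

Simulating a crash is just discarding the in-memory object and constructing a new one from the same log file: every acknowledged write survives.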

When One Slow Service Took Down an Entire Region: The Circuit Breaker Pattern Explained

On September 20, 2015, Amazon DynamoDB in US-East-1 went dark for over four hours. The root cause wasn’t a hardware failure or a cyberattack—it was a feedback loop. Storage servers couldn’t retrieve their partition assignments from a metadata service, so they retried. The metadata service became overwhelmed. More timeouts. More retries. More overload. Engineers eventually had to firewall the metadata service from storage servers entirely, effectively taking DynamoDB offline to break the cycle. ...

14 min · 2971 words
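The circuit breaker pattern breaks exactly this kind of retry feedback loop: after enough consecutive failures, callers fail fast instead of piling more load onto the struggling dependency. A minimal Python sketch (the thresholds and class shape are invented; libraries add half-open probing policies, metrics, and per-endpoint state):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast, sending zero load to the sick dependency
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: let one probe request through
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```

The key property is that while the circuit is open, the downstream service receives no traffic at all, which is what gives it room to recover.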

Why One Second Brought Down Cloudflare DNS: The Hidden Complexity of Time

At midnight UTC on January 1, 2017, deep inside Cloudflare’s custom RRDNS software, a number went negative when it should have always been at least zero. This single value caused DNS resolutions to fail across Cloudflare’s global network. The culprit? A leap second—one extra tick of the clock that most people never noticed. The bug revealed a fundamental truth that every programmer eventually learns the hard way: time is not what you think it is. It doesn’t flow uniformly forward. It jumps, skips, and occasionally rewinds. And if your code assumes otherwise, it will break in ways that are nearly impossible to predict. ...

9 min · 1899 words
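The defensive pattern is the same in most languages: never subtract wall-clock readings to get a duration, because the wall clock can step. A Python sketch of the safe version (the workload being timed is a placeholder):

```python
import time

# Wall-clock time can jump backwards (NTP corrections, leap-second handling,
# manual resets), so "end - start" on time.time() can come out negative --
# the same class of bug as the RRDNS failure. time.monotonic() is guaranteed
# never to go backwards, which is the property duration math actually needs.
start = time.monotonic()
work = sum(range(10_000))          # stand-in for the operation being timed
elapsed = time.monotonic() - start

assert elapsed >= 0  # guaranteed; the equivalent time.time() math is not
```

Go later baked the same fix into the language runtime by carrying a monotonic reading inside time.Time values.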

When Pakistan Accidentally Took Down YouTube: The Fragile Trust Model of BGP

On February 24, 2008, at 18:47 UTC, Pakistan Telecom (AS17557) started announcing a more specific route to YouTube’s IP prefix: 208.65.153.0/24. Within minutes, YouTube traffic from around the world was being redirected to Pakistan. The Pakistan government had ordered the ISP to block YouTube domestically, but a configuration error caused the route to propagate globally through PCCW Global (AS3491), their upstream provider. YouTube engineers responded approximately 80 minutes later by announcing even more specific routes (/25 prefixes) to reclaim their traffic. By 21:01 UTC, the hijack was over. But for nearly two hours, a single misconfiguration in one country had effectively stolen one of the world’s most popular websites. ...

13 min · 2626 words
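Both the hijack and the counter-attack rest on one rule: routers prefer the most specific (longest) matching prefix. A longest-prefix-match sketch using Python's ipaddress module; the /22 is YouTube's covering announcement of the era, but the three-entry route table is a simplification of what routers actually saw.

```python
import ipaddress

routes = {
    ipaddress.ip_network("208.65.152.0/22"): "YouTube (AS36561)",
    ipaddress.ip_network("208.65.153.0/24"): "Pakistan Telecom (AS17557)",
    ipaddress.ip_network("208.65.153.0/25"): "YouTube (AS36561)",
}

def best_route(addr):
    """Pick the matching prefix with the longest mask, as BGP routers do."""
    addr = ipaddress.ip_address(addr)
    matches = [net for net in routes if addr in net]
    return routes[max(matches, key=lambda net: net.prefixlen)]
```

With all three routes present, the /25 beats the hijacked /24, which in turn beats the legitimate /22: specificity, not ownership, decides where packets go.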

Why Semantic Search Fails: The Hidden Geometry of Vector Embeddings

In 2013, Tomas Mikolov and his team at Google published a paper that would fundamentally change how machines understand language. They showed that by training a simple neural network to predict surrounding words, you could learn vector representations where “king” minus “man” plus “woman” approximately equals “queen.” This was the birth of modern word embeddings—a technique that compresses the meaning of words into dense numerical vectors. A decade later, embeddings have become the backbone of virtually every AI application involving text. They power semantic search, recommendation systems, and the retrieval component of RAG (Retrieval-Augmented Generation) architectures. But as organizations deploy these systems at scale, many discover an uncomfortable truth: semantic search often fails in ways that are hard to predict and even harder to debug. ...

11 min · 2169 words
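The analogy arithmetic is pure vector geometry. A toy sketch with invented 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, and these numbers are fabricated just to make the geometry visible):

```python
import math

# Toy "embeddings": royalty lives on axis 0, gender splits axes 1 and 2
vec = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: the standard nearness measure for embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, then find the nearest word by cosine similarity
target = [k - m + w for k, m, w in zip(vec["king"], vec["man"], vec["woman"])]
nearest = max(vec, key=lambda word: cosine(vec[word], target))
```

In these four lines lies both the power and the failure mode the article explores: "nearest by cosine" is all a semantic search system has, and nothing guarantees that nearest means relevant.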

Why Your Database Writes Are Slow: The B+ Tree Problem LSM Trees Were Built to Solve

When you insert a row into a database, what actually happens to that data? If you’re using a traditional relational database, the answer involves random disk I/O, page splits, and a fundamental mismatch between how applications write data and how storage media work best. In 1996, Patrick O’Neil and his colleagues at UMass Boston and Digital Equipment Corporation identified this problem and proposed a solution that would eventually power some of the world’s largest databases. ...

13 min · 2715 words
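The core LSM move can be sketched in a few lines: absorb writes in an in-memory table, then flush it as a sorted, immutable run written sequentially, never updating pages in place. This is a toy in-memory model, not a storage engine; real systems add a WAL, compaction, and Bloom filters over the runs.

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # each flush appends one sorted, immutable run

    def put(self, key, value):
        self.memtable[key] = value  # writes are absorbed in memory
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sorted run, written with purely sequential I/O
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run shadows older ones
            for k, v in run:
                if k == key:
                    return v
        return None
```

Note the asymmetry the article's title points at: writes never touch old data, but reads may have to consult every run, which is exactly the cost compaction and Bloom filters exist to control.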

How Bloom Filters Store 100 Million Items in 120 MB While Never Missing a Match

In 1970, Burton Howard Bloom faced a problem that would feel familiar to any modern software engineer working with large datasets. He needed to check whether words required special hyphenation rules, but storing 500,000 dictionary entries in memory was prohibitively expensive. His solution—a data structure that uses dramatically less space than any traditional approach—became one of the most widely deployed probabilistic data structures in computing history. The insight was radical: what if you could trade certainty for space? A Bloom filter will never tell you an item is absent when it’s actually present (no false negatives), but it might occasionally claim an item exists when it doesn’t (false positives). For many applications, this trade-off is not just acceptable—it’s transformative. ...

6 min · 1225 words
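The headline numbers follow from the standard sizing formulas: 120 MB is roughly 10^9 bits, about 10 bits per item for 100 million items, which with around 7 hash functions yields a false-positive rate near 1%. A minimal sketch of the structure itself, using re-seeded SHA-256 where real implementations use faster non-cryptographic hashes:

```python
import hashlib

class BloomFilter:
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item):
        # k independent positions, derived by re-seeding the hash with i
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means definitely absent; True means only "probably present"
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```

Because add() only ever sets bits, an inserted item can never see one of its bits unset later, which is the structural reason false negatives are impossible.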

Why SSH Doesn't Just Use Public Keys: The Hidden Architecture of Secure Shell

Every day, millions of developers type ssh user@server without a second thought. The connection establishes, the shell appears, and work begins. But beneath that familiar prompt lies one of the most elegant cryptographic protocols ever designed—a multi-layered system that somehow manages to be both simple enough for daily use and sophisticated enough to withstand decades of scrutiny. The irony is striking: most people assume SSH “just uses public key cryptography.” After all, that’s what the ~/.ssh/id_rsa file is for, right? The reality is far more nuanced. SSH uses public keys for exactly one purpose—authentication—and a completely different mechanism for everything else. Understanding this distinction reveals why SSH has remained the gold standard for remote access since 1995. ...

12 min · 2351 words

How Git Actually Stores Your Code: The Hidden Architecture Behind Every Commit

On April 7, 2005, Linus Torvalds made the first commit to a new version control system. He had started coding it just four days earlier, on April 3rd, after the proprietary tool he had been using for Linux kernel development became unavailable. The kernel community needed something fast, distributed, and capable of handling thousands of contributors. What Torvalds built in those frantic days wasn’t just another version control system—it was a content-addressable filesystem disguised as one. ...

10 min · 1985 words
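Content-addressing is concrete enough to sketch: a blob's id is the SHA-1 of a small header plus the file's bytes, and that id doubles as the storage key. The header format and hashing below follow Git's actual blob scheme; the dict is a stand-in for the .git/objects directory, where Git writes the zlib-compressed bytes under a path derived from the hex id.

```python
import hashlib
import zlib

objects = {}  # stand-in for .git/objects

def write_blob(content: bytes) -> str:
    # Git's blob format: b"blob <size>\x00" followed by the raw content
    store = b"blob " + str(len(content)).encode() + b"\x00" + content
    oid = hashlib.sha1(store).hexdigest()  # the object id IS the address
    objects[oid] = zlib.compress(store)
    return oid

def read_blob(oid: str) -> bytes:
    header, _, content = zlib.decompress(objects[oid]).partition(b"\x00")
    return content
```

Identical content always hashes to the identical id, so storing the same file twice costs nothing: that deduplication-by-construction is what "content-addressable filesystem" means.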

Why Message Queues Are Harder Than They Look: The Hidden Complexity of "Simple" Asynchronous Communication

A financial services company migrated their payment processing to a message queue architecture. The design seemed straightforward: producers publish payment requests, workers consume and process them. Six months later, they discovered their customers had been double-charged for approximately 3% of transactions. The queue was working exactly as configured—the problem was that “working” meant something different than they expected. Message queues occupy a strange position in distributed systems. They appear deceptively simple on the surface: put message in, get message out. But beneath that simplicity lies a maze of trade-offs involving durability, ordering, delivery guarantees, and failure modes. Understanding these trade-offs isn’t academic—it’s the difference between a reliable system and one that silently corrupts data. ...

11 min · 2218 words
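The double-charge failure mode above is the classic consequence of at-least-once delivery: a worker crashes after charging but before acknowledging, and the broker redelivers. The standard defense is an idempotent consumer that remembers which message ids it has already handled. A Python sketch (the message shape and names are invented; in production the seen-ids set is a unique-constrained database column, checked in the same transaction as the charge):

```python
charged = []
processed_ids = set()  # in production: durable, transactional storage

def handle_payment(message):
    if message["id"] in processed_ids:
        return  # duplicate delivery: acknowledge and drop
    charged.append(message["amount"])  # stand-in for the real charge
    processed_ids.add(message["id"])

# First delivery, then the broker redelivers after a worker timeout:
handle_payment({"id": 7, "amount": 100})
handle_payment({"id": 7, "amount": 100})
```

The customer is charged once no matter how many times the message arrives, which turns at-least-once delivery into effectively-once processing.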