When 30% of CPU Time Disappears Into JSON.parse(): The Hidden Cost of Human-Readable Serialization

A Go service at a high-traffic company began experiencing mysterious CPU spikes. The flamegraphs revealed something unexpected: 30-40% of CPU time was spent inside json.Marshal and json.Unmarshal. No database queries were slow. No algorithms were inefficient. The serialization layer alone was consuming more than a third of the computational budget. This isn’t an anomaly. At scale, the choice of serialization format becomes a first-order performance concern. The difference between JSON and binary formats isn’t a few percentage points—it’s often 5-7x in throughput and 2-3x in payload size. ...

8 min · 1551 words
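The 2-3x payload claim is easy to reproduce with nothing but the standard library. A minimal sketch (the record layout and field names here are hypothetical, not any real service's schema) comparing JSON against fixed-width binary packing:

```python
import json
import struct

# Hypothetical records: (user_id, timestamp, score), repeated many times.
records = [(i, 1700000000 + i, i * 0.5) for i in range(1000)]

# Human-readable: field names and decimal digits repeated in every record.
json_payload = json.dumps(
    [{"user_id": u, "timestamp": t, "score": s} for u, t, s in records]
).encode("utf-8")

# Binary: fixed-width fields, no names, no digit expansion.
# "<qqd" = little-endian int64, int64, float64 -> 24 bytes per record.
binary_payload = b"".join(struct.pack("<qqd", u, t, s) for u, t, s in records)

print(len(json_payload), len(binary_payload))  # binary is 24000 bytes; JSON over 2x larger
```

The size gap comes entirely from repeated field names and decimal encoding; the throughput gap in real benchmarks additionally reflects parsing and allocation costs that this sketch does not measure.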

When a 29-Character String Takes 60 Seconds: The Hidden Complexity of Regex Backtracking

On July 20, 2016, Stack Overflow went offline for 34 minutes. The culprit wasn’t a database failure, a network outage, or a cyberattack. It was a regular expression—a tool developers use every day without a second thought. The pattern ^[\s\u200c]+|[\s\u200c]+$ was used to trim whitespace from user-submitted content. When a post containing approximately 20,000 consecutive whitespace characters appeared on the homepage, the regex engine entered a computational spiral that consumed 100% CPU across multiple web servers. ...

7 min · 1476 words
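Python's backtracking regex engine (like the .NET engine Stack Overflow was running) exhibits the same quadratic blow-up. A minimal sketch of the worst case, using a simplified `\s+$` without the `\u200c` character class:

```python
import re
import time

# The trimming idea behind the outage pattern: a run of whitespace
# anchored at the end of the string.
trailing_ws = re.compile(r"\s+$")

def time_search(n: int) -> float:
    # n whitespace characters followed by one non-whitespace character,
    # so the pattern can never match -- the pathological case.
    s = " " * n + "x"
    start = time.perf_counter()
    trailing_ws.search(s)
    return time.perf_counter() - start

# From each starting position the engine matches spaces up to the final
# "x", fails the $ anchor, and backtracks one space at a time: O(n^2).
for n in (2000, 4000, 8000):
    print(n, round(time_search(n), 4))
```

Doubling the input roughly quadruples the time. Engines built on finite automata (Go's RE2, for instance) run this in linear time, which is why they forbid backtracking constructs.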

From 8-Second Pauses to Sub-Millisecond: The 60-Year Evolution of Garbage Collection

In 1959, John McCarthy was building Lisp at MIT when he encountered a problem that would define decades of programming language design. Programs in Lisp created and destroyed linked structures constantly—lists within lists, functions returning functions, recursive structures that no programmer could feasibly track manually. McCarthy’s solution was to make memory management automatic. He called it “garbage collection,” dedicating just over a page in his seminal paper to describe a mark-and-sweep algorithm that would free programmers from the burden of explicit deallocation. ...

13 min · 2766 words
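McCarthy's mark-and-sweep fits in a few lines. A toy sketch (the object graph and names are invented for illustration) showing the two phases over a four-object heap:

```python
class Obj:
    def __init__(self, name):
        self.name = name
        self.refs = []       # outgoing references to other objects
        self.marked = False

def mark(obj):
    # Mark phase: depth-first walk from a root, flagging every reachable object.
    if obj.marked:
        return
    obj.marked = True
    for ref in obj.refs:
        mark(ref)

def sweep(heap):
    # Sweep phase: anything left unmarked is garbage; clear marks for the next cycle.
    live = [o for o in heap if o.marked]
    for o in live:
        o.marked = False
    return live

# Toy heap: root -> a -> b, while c is unreachable (even though c -> b).
root, a, b, c = Obj("root"), Obj("a"), Obj("b"), Obj("c")
root.refs = [a]
a.refs = [b]
c.refs = [b]
heap = [root, a, b, c]

mark(root)
heap = sweep(heap)
print([o.name for o in heap])   # -> ['root', 'a', 'b']: c is collected
```

Note that reachability, not reference counting, decides liveness here: c holds a reference to b, but since nothing reaches c from the root, c itself is swept.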

When Two Cores See Different Realities: The Cache Coherence Problem MESI Was Built to Solve

In 1984, researchers at the University of Illinois published a paper that would quietly shape every multicore processor built since. Mark Papamarcos and Janak Patel proposed a solution to a problem that didn’t seem urgent at the time—how to keep data consistent when multiple processors each have their own cache. Today, with CPUs packing dozens of cores, their invention runs on billions of devices, silently orchestrating a dance of state transitions every time one core writes to memory and another needs to read it. ...

10 min · 1998 words
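The "dance of state transitions" can be sketched as a lookup table. A simplified MESI model (event names are illustrative, and real protocols handle many more cases, including writebacks and bus arbitration):

```python
# Simplified MESI transitions for one cache line, keyed by
# (current_state, event). "read"/"write" are local CPU accesses;
# "bus_read"/"bus_write" are snooped accesses from another core.
MESI = {
    ("I", "read_shared"):  "S",   # read miss, another cache holds the line
    ("I", "read_alone"):   "E",   # read miss, no other copies exist
    ("I", "write"):        "M",   # write miss: invalidate others, take ownership
    ("E", "write"):        "M",   # silent upgrade: no bus traffic needed
    ("E", "bus_read"):     "S",   # another core reads: downgrade to shared
    ("S", "write"):        "M",   # must broadcast an invalidation first
    ("S", "bus_write"):    "I",   # another core writes: our copy is stale
    ("M", "bus_read"):     "S",   # write back dirty data, then share
    ("M", "bus_write"):    "I",   # write back dirty data, then invalidate
}

def step(state, event):
    # States self-loop on events with no entry (e.g. M + local write stays M).
    return MESI.get((state, event), state)

# Core 0 writes a line both cores hold in Shared; core 1's copy is invalidated.
core0, core1 = "S", "S"
core0 = step(core0, "write")      # S -> M
core1 = step(core1, "bus_write")  # S -> I
print(core0, core1)               # -> M I
```

The E state is the Illinois paper's key addition over earlier MSI designs: a core that knows it holds the only copy can upgrade to M without any bus traffic.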

How WebAssembly Actually Runs in Your Browser: From Stack Machine to Machine Code

In April 2015, Luke Wagner made the first commits to a new repository called WebAssembly/design, adding a high-level design document for what would become the fourth language of the web. The project emerged from a convergence of efforts: Mozilla’s asm.js experiment had demonstrated that a strictly-typed subset of JavaScript could approach native speeds, while Google’s PNaCl and Microsoft’s efforts in this space had explored similar territory. What none of these projects achieved was cross-browser consensus. WebAssembly was designed from the start as a collaborative effort, with formal semantics written in parallel with its specification. ...

10 min · 2099 words

When Zero-Copy Isn't Zero: The Hidden Copies in Your "Efficient" Code

A file sits on disk. Your application reads it and sends it over the network. Simple enough—but behind this mundane operation hides one of computing’s most persistent performance bottlenecks. In a traditional I/O path, that single file traverses through four distinct memory copies before reaching the network interface. The kernel reads data from disk into a kernel buffer via DMA. The read() system call copies it to user space. The write() system call copies it back to a kernel socket buffer. Finally, DMA transfers it to the NIC. Each copy consumes CPU cycles, memory bandwidth, and cache space. ...

8 min · 1585 words
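The two paths can be contrasted directly in Python, where `socket.sendfile()` delegates to `os.sendfile()` when the platform supports it. A minimal sketch over a local socket pair (helper names are illustrative):

```python
import os
import socket
import tempfile
import threading

# Traditional path: read() copies kernel buffer -> user buffer, then
# sendall() copies user buffer -> kernel socket buffer. Two CPU copies.
def copy_via_userspace(path, sock):
    with open(path, "rb") as f:
        while chunk := f.read(65536):
            sock.sendall(chunk)

# Zero-copy path: socket.sendfile() uses os.sendfile() where available,
# so the payload moves kernel -> kernel and never enters user space
# (Python transparently falls back to a send loop elsewhere).
def copy_via_sendfile(path, sock):
    with open(path, "rb") as f:
        sock.sendfile(f)

payload = os.urandom(256 * 1024)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

a, b = socket.socketpair()
received = bytearray()

def drain():  # reader thread keeps the socket buffer from filling up
    while chunk := b.recv(65536):
        received.extend(chunk)

reader = threading.Thread(target=drain)
reader.start()
copy_via_sendfile(tmp.name, a)
a.close()          # signals EOF to the reader
reader.join()
b.close()
os.unlink(tmp.name)
assert bytes(received) == payload
```

Both helpers produce identical bytes on the wire; the difference is invisible in the result and shows up only in CPU time and memory bandwidth under load.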

When 99% Cache Hit Ratio Means Nothing: The Metrics You're Not Watching

A major e-commerce platform celebrated when their cache hit ratio hit 99.2%. Their dashboard showed beautiful green charts. Three days later, their database collapsed during a flash sale. The cache hit ratio never dropped below 98%. What went wrong? The team optimized for the wrong metric. While their cache served 99% of requests from memory, the 1% that missed were the most expensive queries—complex aggregations, joins across multiple tables, and expensive computations. A cache hit ratio tells you how often you avoid work, not how much work you’re avoiding. ...

9 min · 1714 words
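A cost-weighted view of misses makes the failure mode concrete. The traffic mix below is hypothetical, not the platform's actual data, but it reproduces the shape of the problem:

```python
# Hypothetical request mix: (count, was it a cache hit?, backend cost in ms on miss).
requests = [
    (99_200, True,    0.0),   # cached: ~zero backend work
    (600,    False,   5.0),   # cheap point lookups that missed
    (200,    False, 400.0),   # complex aggregations that missed
]

total = sum(c for c, _, _ in requests)
hits = sum(c for c, hit, _ in requests if hit)
hit_ratio = hits / total

backend_ms = sum(c * cost for c, hit, cost in requests if not hit)
aggregation_ms = 200 * 400.0

print(f"hit ratio: {hit_ratio:.1%}")                            # 99.2%
print(f"backend load from 0.2% of requests: "
      f"{aggregation_ms / backend_ms:.0%}")                     # ~96%
```

The dashboard metric (99.2% hits) and the metric that matters (96% of backend time coming from 0.2% of requests) diverge completely: a cost-weighted miss rate, not the hit ratio, predicts whether the database survives a traffic spike.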

From HTML to Pixels: The 100-Millisecond Journey Through the Browser Rendering Pipeline

In 1993, when the first graphical web browser displayed a simple HTML document, the rendering process was straightforward: parse markup, apply basic styles, display text. Today’s browsers execute a far more complex sequence involving multiple intermediate representations, GPU acceleration, and sophisticated optimization strategies. Understanding this pipeline explains why some pages render in under 100 milliseconds while others struggle to maintain 60 frames per second during animations. The browser rendering pipeline consists of five primary stages: constructing the Document Object Model (DOM), building the CSS Object Model (CSSOM), creating the render tree, calculating layout, and painting pixels to the screen. Each stage transforms data from one representation to another, and bottlenecks in any stage cascade through the entire process. ...

8 min · 1552 words
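As a rough mental model only (every name below is invented, and real engines are vastly more complex), the last three stages can be sketched as successive tree transformations:

```python
# Toy pipeline: DOM + CSSOM -> render tree -> layout -> paint commands.
dom = {"tag": "body", "children": [
    {"tag": "h1", "text": "Title", "children": []},
    {"tag": "p", "text": "hidden", "children": []},
    {"tag": "p", "text": "Visible text", "children": []},
]}

cssom = {"body": {"height": 0}, "h1": {"height": 40}, "p": {"height": 20}}
hidden = {("p", "hidden")}        # stand-in for a display: none rule

def render_tree(node):
    # Stage 3: drop nodes that generate no boxes (display: none).
    if (node["tag"], node.get("text")) in hidden:
        return None
    kids = [render_tree(c) for c in node["children"]]
    return {**node, "children": [k for k in kids if k is not None]}

def layout(node, y=0):
    # Stage 4: assign geometry top-down; here, simple stacked y offsets.
    node["y"] = y
    h = cssom[node["tag"]]["height"]
    for child in node["children"]:
        h += layout(child, y + h)
    node["height"] = h
    return h

def paint(node, ops):
    # Stage 5: flatten the laid-out tree into draw commands.
    if node.get("text"):
        ops.append(f"drawText({node['text']!r} at y={node['y']})")
    for child in node["children"]:
        paint(child, ops)
    return ops

tree = render_tree(dom)
layout(tree)
print(paint(tree, []))   # two draw commands; the hidden node never reaches layout
```

The cascade effect the excerpt describes is visible even here: removing a node in stage 3 changes every y offset computed in stage 4, which changes every command emitted in stage 5.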

Why malloc Is Not Just malloc: The Hidden Architecture of Memory Allocators

When a C program calls malloc(1024), what actually happens? The programmer might assume the operating system finds 1024 bytes of free memory and returns a pointer. The reality is far more complex. Modern memory allocators are sophisticated pieces of software that manage virtual memory, minimize fragmentation, optimize for multi-core CPUs, and make trade-offs between speed and memory efficiency that can affect application performance by orders of magnitude. The default allocator on Linux systems—ptmalloc, part of glibc—has evolved over decades. Facebook replaced it with jemalloc. Google developed tcmalloc. Microsoft created mimalloc. Each makes different architectural choices that matter for different workloads. Understanding these choices explains why switching allocators can speed up a database by 30% or reduce memory consumption by half. ...

11 min · 2232 words
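The size-class structure shared by tcmalloc- and jemalloc-style allocators can be sketched as one free list per rounded size (the layout below is illustrative, not any real allocator's):

```python
import bisect

# Requests are rounded up to a small set of size classes; large requests
# (beyond the last class) would go to the OS directly in a real allocator.
SIZE_CLASSES = [16, 32, 48, 64, 128, 256, 512, 1024, 2048, 4096]

class Allocator:
    def __init__(self):
        # One free list per size class; real allocators keep these
        # per-thread to avoid lock contention on multi-core CPUs.
        self.free_lists = {c: [] for c in SIZE_CLASSES}
        self.next_addr = 0x1000   # pretend heap cursor

    def malloc(self, size):
        # Round the request up to the nearest size class...
        cls = SIZE_CLASSES[bisect.bisect_left(SIZE_CLASSES, size)]
        # ...and reuse a freed block of that class if one exists.
        if self.free_lists[cls]:
            return self.free_lists[cls].pop()
        addr = self.next_addr
        self.next_addr += cls
        return addr

    def free(self, addr, size):
        # Freed blocks go back on their class's list: a push, not a search.
        cls = SIZE_CLASSES[bisect.bisect_left(SIZE_CLASSES, size)]
        self.free_lists[cls].append(addr)

alloc = Allocator()
p = alloc.malloc(1000)       # rounds up to the 1024-byte class
alloc.free(p, 1000)
q = alloc.malloc(1024)       # immediately reuses the same block
print(hex(p), p == q)        # -> 0x1000 True
```

This is why malloc(1024) is usually a list pop, not an OS call: the trade-off is internal fragmentation (malloc(1000) wastes 24 bytes here) in exchange for constant-time allocation and free.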

Why Your Database Connection Pool of 100 Is Killing Performance

The Oracle Real-World Performance group published a demonstration that should have changed how every developer thinks about connection pools. They took a system struggling with ~100ms average response times and reduced those times to ~2ms—a 50x improvement. They didn’t add hardware. They didn’t rewrite queries. They reduced the connection pool size from 2048 connections down to 96. Most developers configure connection pools based on intuition: more users means more connections, right? A typical production configuration sets the pool to 100, 200, or even 500 connections “just to be safe.” This intuition is precisely backwards. The correct question isn’t how to make your pool bigger—it’s how small you can make it while still handling your load. ...

11 min · 2155 words
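One standard way to reason about "how small can the pool be" is Little's law: concurrency = arrival rate × service time. The numbers below are hypothetical, not from the Oracle demonstration:

```python
# Little's law: L = lambda * W, i.e. the number of connections actually
# busy equals query throughput times per-query service time.
def connections_needed(queries_per_sec: float, avg_query_sec: float) -> float:
    return queries_per_sec * avg_query_sec

# 10,000 queries/s at 2 ms each keeps only ~20 connections busy.
print(connections_needed(10_000, 0.002))   # -> 20.0

# At 100 ms each (a contended system), the same load appears to need 1000.
# But much of that 100 ms *is* contention from the oversized pool: shrinking
# the pool shortens queries, which in turn shrinks the required concurrency.
print(connections_needed(10_000, 0.100))   # -> 1000.0
```

The feedback loop runs in the counterintuitive direction: a smaller pool reduces context switching and lock contention on the database, service times fall, and the small pool turns out to be sufficient, which is exactly the shape of the 2048-to-96 result.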