When Zero-Copy Isn't Zero: The Hidden Copies in Your "Efficient" Code

A file sits on disk. Your application reads it and sends it over the network. Simple enough—but behind this mundane operation hides one of computing's most persistent performance bottlenecks. In a traditional I/O path, that file's data passes through four distinct memory copies before reaching the network interface. The kernel reads data from disk into a kernel buffer via DMA. The read() system call copies it to user space. The write() system call copies it back to a kernel socket buffer. Finally, DMA transfers it to the NIC. Each copy consumes CPU cycles, memory bandwidth, and cache space. ...

8 min · 1585 words

Why malloc Is Not Just malloc: The Hidden Architecture of Memory Allocators

When a C program calls malloc(1024), what actually happens? The programmer might assume the operating system finds 1024 bytes of free memory and returns a pointer. The reality is far more complex. Modern memory allocators are sophisticated pieces of software that manage virtual memory, minimize fragmentation, optimize for multi-core CPUs, and make trade-offs between speed and memory efficiency that can affect application performance by orders of magnitude. The default allocator on Linux systems—ptmalloc, part of glibc—has evolved over decades. Facebook replaced it with jemalloc. Google developed tcmalloc. Microsoft created mimalloc. Each makes different architectural choices that matter for different workloads. Understanding these choices explains why switching allocators can speed up a database by 30% or reduce memory consumption by half. ...

11 min · 2232 words