When a C program calls malloc(1024), what actually happens? The programmer might assume the operating system finds 1024 bytes of free memory and returns a pointer. The reality is far more complex. Modern memory allocators are sophisticated pieces of software that manage virtual memory, minimize fragmentation, optimize for multi-core CPUs, and make trade-offs between speed and memory efficiency that can affect application performance by orders of magnitude.
The default allocator on Linux systems—ptmalloc, part of glibc—has evolved over decades. Facebook replaced it with jemalloc. Google developed tcmalloc. Microsoft created mimalloc. Each makes different architectural choices that matter for different workloads. Understanding these choices explains why switching allocators can speed up a database by 30% or reduce memory consumption by half.
The Fundamental Problem: Memory Fragmentation
Before diving into allocator implementations, it’s essential to understand what they’re trying to solve. Memory fragmentation comes in two forms.
Internal fragmentation occurs when the allocator provides more memory than requested. If you request 17 bytes and the allocator rounds up to 32, those 15 extra bytes are wasted inside your allocation. This seems minor for a single allocation, but millions of such allocations compound into significant waste.
External fragmentation is more insidious. Imagine a 1MB heap where you allocate 1000 objects of 1KB each, then free every other object. You have 500KB free, but it’s scattered across 500 non-contiguous 1KB holes. A request for 10KB will fail despite having 500KB available—the free space isn’t contiguous.
Allocators employ various strategies to combat both forms of fragmentation. Size classes round allocations to fixed sizes, accepting internal fragmentation to prevent external fragmentation. Coalescing merges adjacent free blocks. But these strategies add overhead and complexity.
From sbrk to mmap: How Allocators Get Memory
At the lowest level, allocators request memory from the operating system through two mechanisms.
The traditional approach is sbrk, which expands the heap—a contiguous region of virtual memory that grows upward from the end of the program’s data segment. sbrk(4096) increases the heap by 4KB and returns a pointer to the start of the new region. This is simple and works well for single-threaded programs with moderate memory needs.
Modern allocators prefer mmap (memory map) for several reasons. mmap can allocate memory anywhere in the virtual address space, not just contiguously after the heap. A freed mmap'd region can be returned to the operating system immediately via munmap, whereas sbrk'd memory can only be released when the free block sits at the very top of the heap. Most importantly, mmap provides natural isolation between different regions of memory.
ptmalloc uses a threshold (M_MMAP_THRESHOLD, 128KB by default) to decide: smaller allocations come from the heap via sbrk, larger allocations get their own mmap'd region. jemalloc carves the address space into 4MB chunks and manages them independently. tcmalloc requests large runs of pages from the OS and subdivides them internally into spans.
ptmalloc: The Linux Default
ptmalloc (pthreads malloc) has been glibc’s allocator since 2006. Its architecture reflects the constraints of its era: multi-threading was becoming common, but multi-core systems were still emerging.
Arenas and Lock Contention
The key innovation in ptmalloc was multiple arenas. In the original dlmalloc (Doug Lea’s malloc), all threads shared a single heap protected by one mutex. Under heavy contention, threads spent more time waiting for the lock than allocating memory.
ptmalloc’s solution: give each thread its own arena. When a thread first calls malloc, it’s assigned an arena (round-robin by default, up to a limit). All subsequent allocations from that thread use that arena. Since arenas are independent, threads don’t contend for the same lock.
But there’s a catch. When Thread A allocates memory and Thread B frees it, that memory must return to Thread A’s arena. Thread B must acquire Thread A’s arena lock to perform the free. Cross-thread deallocation creates contention that arenas were meant to avoid. For workloads with producer-consumer patterns, this can negate much of the benefit.
The Bin Hierarchy
Within each arena, ptmalloc organizes free chunks into “bins”—linked lists of available memory blocks.
Fast bins hold the smallest recently-freed chunks (up to ~160 bytes on 64-bit systems). These are singly-linked lists optimized for speed. When you free a small chunk, it goes to the head of the appropriate fast bin. When you allocate, ptmalloc checks the fast bin first. No splitting, no coalescing—just push and pop.
Tcache (thread cache) was added in glibc 2.26 to further reduce arena lock contention. Each thread has a small cache of recently-freed chunks per size class. For most allocations under ~1KB, the thread never touches the arena at all—it serves everything from its tcache.
Small bins hold chunks of fixed sizes, spaced 16 bytes apart on 64-bit systems. Each bin contains chunks of exactly one size, making allocation a simple matter of taking the first chunk from the appropriate bin.
Large bins hold bigger chunks, sorted by size. When a request comes for a size that doesn’t have its own bin, the allocator searches for the smallest chunk that satisfies the request.
Unsorted bin is a holding area for chunks that were just freed but haven’t yet been sorted into the appropriate bin. This acts as a cache—recently freed chunks are likely to be reallocated soon.

Image source: Meta Engineering - Scalable memory allocation using jemalloc
The allocation flow checks bins in order: tcache → fast bins → small bins → unsorted bin → large bins → top chunk (the heap’s free space) → system request. Each step is slower but handles larger or less common cases.
jemalloc: Facebook’s Answer to Fragmentation
When Facebook examined their server memory usage, they found a disturbing pattern: applications with predictable allocation patterns were consuming far more memory than their active data warranted. The culprit was fragmentation—ptmalloc’s layout algorithms weren’t suited to their workloads.
jemalloc, originally developed by Jason Evans for FreeBSD, takes a fundamentally different approach to layout.
Arenas with Thread Caching
Like ptmalloc, jemalloc uses multiple arenas to reduce lock contention. But jemalloc goes further with thread-local caches (tcaches). Each thread maintains its own cache of recently-freed objects per size class. Allocation and deallocation of small objects happens entirely in the tcache with zero synchronization.
The tcache has limits. When a size class’s cache exceeds its quota, the oldest objects are gradually returned to the arena. This “garbage collection” happens incrementally during allocation calls, spreading the cost over time rather than causing sudden pauses.
Size Classes and Runs
jemalloc divides allocations into small, large, and huge categories. Small allocations (up to a page-size-dependent threshold, 14KiB with 4KiB pages) are served from "runs": contiguous regions of memory carved from 4MB chunks. Each run serves objects of exactly one size class.
The clever part: jemalloc maintains red-black trees of non-full runs for each size class. When allocating, it chooses the lowest-address non-full run. This preference for low addresses dramatically reduces fragmentation because objects of similar ages and sizes cluster together in memory.

Image source: Meta Engineering - Scalable memory allocation using jemalloc
Facebook’s benchmarks showed jemalloc outperforming ptmalloc by 10-30% in throughput while using significantly less memory for fragmented workloads.
tcmalloc: Google’s Per-CPU Architecture
Google’s tcmalloc introduced an innovation that seems obvious in hindsight: instead of per-thread caches, use per-CPU caches.
The Problem with Per-Thread Caches
Thread counts have exploded. A modern server might have 128 hardware threads or more. Per-thread caches scale linearly: 128 threads × 256KB per cache = 32MB of cached memory just for the allocators. If many threads are idle, their caches hold memory that could serve active threads.
Per-CPU Mode and Restartable Sequences
tcmalloc’s solution assigns a cache to each CPU core rather than each thread. When Thread A running on CPU 0 allocates memory, it uses CPU 0’s cache. If Thread A migrates to CPU 1, it now uses CPU 1’s cache. The caches are sized for the number of cores, not threads.
This requires a clever mechanism. When a thread allocates, it must atomically update its CPU’s cache without taking a lock. tcmalloc uses “restartable sequences”—a Linux kernel feature that restarts an instruction sequence if the thread is preempted mid-execution. The allocation code runs without locks, but if the kernel context-switches the thread, the entire operation restarts.
```c
// Simplified per-CPU allocation pseudocode
void* malloc_per_cpu(size_t size) {
    // Runs inside a restartable sequence: if the kernel preempts or
    // migrates this thread mid-sequence, execution restarts from the
    // top, so get_current_cpu() and the cache access always agree.
    int cpu = get_current_cpu();
    void* ptr = cpu_cache[cpu].allocate(size);
    if (ptr) return ptr;
    // Slow path: refill this CPU's cache from the central free list
    return refill_and_allocate(cpu, size);
}
```
Spans and the PageHeap
tcmalloc organizes memory into “spans”—contiguous runs of pages (typically 8KB pages, though this is configurable). A span either holds one large object or is carved into many small objects of the same size class.
The PageHeap manages free spans as an array of linked lists: the k-th entry holds free spans of k pages. For spans larger than 256 pages, a single list holds them all.

Image source: Google TCMalloc Design Documentation
When the front-end (per-CPU cache) needs memory, it requests from the middle-end (transfer cache and central free list). If those are exhausted, the back-end (PageHeap) allocates more spans from the operating system.
mimalloc: Free List Sharding
Microsoft’s mimalloc, released in 2019, introduced a different approach to the same problems. Its key insight: contention happens when multiple threads try to manipulate the same free list.
Page-Local Free Lists
Instead of a single free list per size class per thread, mimalloc maintains multiple free lists per “page” (a 64KB region in mimalloc’s terminology). Each page serves objects of one size class, and each page has its own free list.
When a thread allocates, it takes from the current page’s free list. When the thread frees an object, it returns it to that object’s page—not necessarily the current page. This “free list sharding” dramatically reduces contention because operations on different pages don’t interfere.
Deferred Freeing
A particularly clever feature is deferred freeing for cross-thread deallocation. When Thread A frees an object allocated by Thread B, Thread A doesn't manipulate Thread B's regular free list. Instead, it atomically pushes the object onto a separate "thread free" list belonging to the object's page, which Thread B absorbs during a later allocation.
This eliminates the cross-thread lock acquisition that plagues ptmalloc. Thread B never has to wait for Thread A; it simply processes deferred frees when convenient.
Research benchmarks show mimalloc outperforming both jemalloc and tcmalloc on many workloads, with 7-14% improvements on Redis and similar applications.
Performance Characteristics in Practice
Microbenchmarks tell only part of the story. Real applications have complex allocation patterns that interact with allocator design in subtle ways.
Throughput vs. Latency: tcmalloc often achieves the highest throughput for multi-threaded workloads, but its per-CPU caches can introduce latency variance. jemalloc tends to have more consistent latency. For latency-sensitive applications like trading systems, this matters more than raw throughput.
Memory Overhead: ptmalloc with multiple arenas can use 2-3× the memory of jemalloc for fragmented workloads. The tcache and fast bins hold memory that the application considers freed. For memory-constrained containers, this overhead directly limits application capacity.
Startup Time: Applications that allocate heavily during initialization (common in machine learning and data processing frameworks) are sensitive to allocator initialization. jemalloc’s arena setup and tcmalloc’s per-CPU cache initialization add overhead. mimalloc’s simpler design often results in faster startup.
Fragmentation Resistance: jemalloc’s low-address preference consistently produces lower fragmentation than ptmalloc for long-running services. An application that runs for months without restarting can see memory usage grow without bound under ptmalloc while remaining stable under jemalloc.
Choosing an Allocator
The choice depends on workload characteristics:
| Workload Type | Recommended Allocator | Rationale |
|---|---|---|
| Multi-threaded server with producer-consumer patterns | tcmalloc | Per-CPU caches minimize contention |
| Long-running service with varied allocation sizes | jemalloc | Superior fragmentation resistance |
| Memory-constrained container | jemalloc or mimalloc | Lower memory overhead |
| Latency-sensitive application | mimalloc | Consistent latency with less variance |
| Single-threaded or low-thread-count application | ptmalloc | glibc default, no extra dependencies |
Switching allocators is typically straightforward: link against the alternative library at compile time or use LD_PRELOAD to inject it at runtime for existing binaries. But the impact on production systems should be measured—the wrong allocator can make performance worse, not better.
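For example, assuming jemalloc is installed (the library name and path vary by distribution):

```shell
# Option 1: link against the alternative allocator at build time
cc -o server server.c -ljemalloc

# Option 2: inject into an existing binary at runtime via the dynamic
# loader (path shown is typical for Debian/Ubuntu x86-64)
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./server

# An empty LD_PRELOAD is harmless, so the override is easy to toggle
# when comparing allocators under the same workload
LD_PRELOAD="" ./server
```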
The Hidden Layer
Memory allocators are the hidden layer between every program and the operating system. They make trade-offs invisible to developers who call malloc without considering what happens inside. Understanding these trade-offs—between throughput and memory efficiency, between latency and fragmentation—enables better system design.
The next time you see a program that uses more memory than expected, or a service that slows down over time, or a multi-threaded application that doesn’t scale past 8 cores, consider the allocator. The problem might not be in your code—it might be in the layer you didn’t know existed.
References
- Evans, J. (2006). "A Scalable Concurrent malloc(3) Implementation for FreeBSD." BSDCan 2006. https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
- Facebook Engineering. (2011). "Scalable memory allocation using jemalloc." https://engineering.fb.com/2011/01/03/core-infra/scalable-memory-allocation-using-jemalloc/
- Google. (2024). "TCMalloc Design Documentation." https://google.github.io/tcmalloc/design.html
- Leijen, D., Zorn, B., & de Moura, L. (2019). "Mimalloc: Free List Sharding in Action." APLAS 2019. https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action/
- glibc Wiki. (2022). "MallocInternals." https://sourceware.org/glibc/wiki/MallocInternals
- Azeria Labs. (2018). "Heap Exploitation Part 2: Understanding the Glibc Heap Implementation." https://azeria-labs.com/heap-exploitation-part-2-glibc-heap-free-bins/
- Axura. (2024). "Ptmalloc – The GNU Allocator: A Deep Gothrough on How Malloc & Free Work." https://4xura.com/binex/ptmalloc-the-gnu-allocator-a-deep-gothrough-on-how-malloc-free-works/
- Frosner, D. (2025). "libmalloc, jemalloc, tcmalloc, mimalloc - Exploring Different Memory Allocators." https://dev.to/frosnerd/libmalloc-jemalloc-tcmalloc-mimalloc-exploring-different-memory-allocators-4lp3
- Berger, E. D., McKinley, K. S., Blumofe, R. D., & Wilson, P. R. (2000). "Hoard: A Scalable Memory Allocator for Multithreaded Applications." ASPLOS-IX.
- IT Hare. (2018). "Testing Memory Allocators: ptmalloc2 vs tcmalloc vs hoard vs jemalloc." http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/
- Lockless Inc. "Memory Allocator Benchmarks." https://locklessinc.com/benchmarks_allocator.shtml
- Wikipedia. "Fragmentation (computing)." https://en.wikipedia.org/wiki/Fragmentation_(computing)