In March 2013, Solomon Hykes demonstrated a new tool at PyCon that would fundamentally change how software gets deployed. He ran a simple command: docker run -i -t ubuntu /bin/bash. Within seconds, a complete Ubuntu environment appeared, ready to accept commands. The audience saw what looked like a lightweight virtual machine. What they were actually witnessing was something far more elegant: a single Linux process, wrapped in kernel features that had been maturing for over a decade.
That process—the container—didn’t need a hypervisor, a separate kernel, or a full operating system image. It needed only the Linux kernel’s built-in isolation primitives. Understanding these primitives reveals why containers start in milliseconds while virtual machines take minutes, and why you can run thousands of containers on a host where dozens of VMs would exhaust resources.
From chroot to Containers: A Thirty-Year Journey
The conceptual ancestor of containers appeared in 1979, during the development of Unix Version 7. The chroot system call changed the root directory for a process and its children, effectively isolating them from the rest of the filesystem. A process inside a chroot environment could only see files within its designated directory tree.
But chroot was limited. It only isolated the filesystem view. The process still shared the same process ID space, network stack, and user identities as the host. A malicious process could still see other processes, access network resources, and potentially escape its directory prison through various mechanisms.
The real breakthrough came in two parts. First, namespaces arrived gradually between 2002 and 2013, each type addressing a specific isolation need. Second, cgroups (Control Groups) were developed at Google in the mid-2000s and merged into the Linux kernel in 2008. Together, these two kernel features—namespaces for isolation, cgroups for resource control—form the foundation of modern containers.
Namespaces: The Illusion of Solitude
Namespaces are a Linux kernel feature that partitions kernel resources so that one set of processes sees one set of resources while another set of processes sees a different set. Two processes can refer to a resource by the same identifier, say PID 1 or the hostname, yet each namespace resolves that identifier to a distinct underlying resource.
Think of it like this: two processes in different PID namespaces can both have PID 1. Each thinks it’s the init process. Each sees only processes within its own namespace. They’re both right, from their perspective.
The Linux kernel currently implements eight namespace types, with seven commonly used in containers:
PID Namespace: Process Isolation
The PID namespace isolates process ID numbers. Inside a container, processes start numbering from 1—the init process position. This matters because many applications expect to run as PID 1 or assume certain behaviors about process hierarchy.
When you run a container, the first process inside gets PID 1. That process can only see other processes in its namespace. It cannot see or signal processes in other containers or on the host.
The mechanics involve the clone() system call with the CLONE_NEWPID flag. When a process is created with this flag, it and its children enter a new PID namespace. The kernel maintains a mapping: inside the namespace, a process might have PID 5, while externally it has PID 45231. The namespace provides the translation.
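You can observe this PID mapping without creating a namespace yourself: since Linux 4.1, the NSpid field in /proc/[pid]/status lists a process's PID in each nested PID namespace, outermost first. A minimal sketch:

```python
# Read the NSpid line from /proc/self/status. It lists this process's
# PID once per nested PID namespace, outermost first. Inside a container
# there would be two values (e.g. the host PID and 1); outside any nested
# namespace there is a single value.
import os

def nspid_chain(pid="self"):
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                return [int(p) for p in line.split()[1:]]
    return []

pids = nspid_chain()
print(pids)
# The last entry is always the PID in the process's own namespace,
# which is what getpid() returns.
assert pids[-1] == os.getpid()
```

Run inside a container (with the host's /proc visible), the same read shows both the external and the container-local PID side by side.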
Network Namespace: Isolated Network Stacks
Network namespaces provide isolated network stacks—separate network devices, IP addresses, routing tables, and firewall rules. This is why each container can bind to port 80 without conflict.
When a container starts, the runtime creates a new network namespace. It then creates a virtual Ethernet pair (veth pair)—two virtual network interfaces connected by a virtual cable. One end stays in the host’s network namespace (often attached to a bridge like docker0), and the other end moves into the container’s namespace.
```
 Host Network Namespace           Container Network Namespace
┌─────────────────────┐          ┌─────────────────────┐
│   docker0 bridge    │          │                     │
│     172.17.0.1      │          │                     │
│         │           │          │                     │
│     veth12345       │◄────────►│        eth0         │
│                     │   veth   │     172.17.0.2      │
│                     │   pair   │                     │
└─────────────────────┘          └─────────────────────┘
```
This architecture explains why container networking has overhead. Each packet traverses the bridge, through the veth pair, into the container’s namespace. The isolation is complete but not free.
Mount Namespace: Filesystem Isolation
Mount namespaces isolate the set of filesystem mount points seen by a group of processes. This is chroot on steroids. A process in a mount namespace can have a completely different view of the filesystem tree.
Crucially, mount namespaces support propagation types that control how mount events spread between namespaces. A “private” mount in one namespace doesn’t affect others. A “shared” mount propagates changes. This enables the overlay filesystem magic that makes container images efficient.
UTS Namespace: Hostname Isolation
The UTS (Unix Time-sharing System) namespace isolates the hostname and domain name. This seems trivial but matters for applications that use hostname for identification or clustering. Each container can have its own hostname without affecting others.
IPC Namespace: Inter-Process Communication Isolation
The IPC namespace isolates System V IPC objects and POSIX message queues. Processes in different IPC namespaces cannot communicate through shared memory segments, semaphores, or message queues unless explicitly connected through other mechanisms.
This prevents a container from interfering with another container’s shared memory operations—a critical isolation for databases and other applications that use IPC heavily.
User Namespace: UID/GID Mapping
User namespaces are perhaps the most powerful. They map user and group IDs inside a namespace to different IDs outside. A process can be root (UID 0) inside a container while being an unprivileged user (UID 100000) on the host.
This enables rootless containers—containers that don’t require elevated privileges on the host. The process thinks it has root access, but any file operations are mapped to its unprivileged host UID.
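The mapping itself lives in /proc/&lt;pid&gt;/uid_map, three numbers per line: the start of the in-namespace range, the start of the host range, and the range length. For a container mapped as described above, the file might read (values illustrative):

```
$ cat /proc/self/uid_map
         0     100000      65536
```

That line means UIDs 0 through 65535 inside the namespace correspond to host UIDs 100000 through 165535; a gid_map file does the same for groups.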
Cgroup Namespace: Cgroup View Isolation
The cgroup namespace isolates the view of the cgroup hierarchy. A process in a cgroup namespace sees its current cgroup as the root of the hierarchy. This prevents information leakage about the container orchestration system’s cgroup organization.
Creating Namespaces: The Kernel API
The namespace API consists of three system calls and a set of /proc files:
- clone(): creates a new process in new namespace(s), specified via CLONE_NEW* flags
- unshare(): moves the calling process into new namespace(s)
- setns(): joins an existing namespace
Every process has entries in /proc/PID/ns/ representing its namespaces:
```
$ ls -l /proc/$$/ns
lrwxrwxrwx 1 user user 0 Jan 8 12:00 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 user user 0 Jan 8 12:00 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 user user 0 Jan 8 12:00 net -> 'net:[4026531956]'
lrwxrwxrwx 1 user user 0 Jan 8 12:00 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 user user 0 Jan 8 12:00 user -> 'user:[4026531837]'
lrwxrwxrwx 1 user user 0 Jan 8 12:00 uts -> 'uts:[4026531838]'
```
The numbers in brackets are inode numbers. Two processes in the same namespace will show the same inode number. This provides a reliable way to check if processes share a namespace.
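The check can be scripted by reading those symlinks directly. A small sketch: a forked child inherits all of its parent's namespaces, so its links resolve to the same `type:[inode]` strings.

```python
# Compare namespace identity between two processes by reading the
# /proc/<pid>/ns/* symlinks. A forked child inherits every namespace,
# so its link targets match the parent's exactly.
import os

def ns_id(pid, ns_type):
    # Returns e.g. 'uts:[4026531838]' for the given process and type.
    return os.readlink(f"/proc/{pid}/ns/{ns_type}")

parent_uts = ns_id(os.getpid(), "uts")
child = os.fork()
if child == 0:
    # In the child: same UTS namespace, therefore same inode.
    same = ns_id(os.getpid(), "uts") == parent_uts
    os._exit(0 if same else 1)
_, status = os.waitpid(child, 0)
assert os.waitstatus_to_exitcode(status) == 0
print("parent and child share", parent_uts)
```

A container runtime performing the opposite check, confirming two processes are in *different* namespaces, just looks for unequal inode numbers.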
Cgroups: The Resource Accountant
While namespaces provide isolation, cgroups provide resource control. A process might be isolated from other processes, but without cgroups, it could still consume all available CPU, memory, or I/O bandwidth.
Cgroups organize processes into hierarchical groups and apply resource limits to those groups. The limits apply collectively to all processes in the group.
The Cgroup Controllers
Cgroups work through controllers (also called subsystems), each managing a specific resource type:
- cpu: Limits CPU time allocation using shares and quotas
- memory: Limits memory usage and triggers OOM kills when exceeded
- blkio (v1) / io (v2): Limits block I/O bandwidth
- pids: Limits the number of processes that can be created
- cpuset: Binds processes to specific CPU cores
- devices: Controls access to device nodes
- freezer: Suspends and resumes processes
When you run docker run --memory=512m --cpus=2 nginx, Docker creates cgroups and configures the memory and cpu controllers. The container’s processes are added to those cgroups, and the kernel enforces the limits.
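The translation from those flags to cgroup v2 file contents is mechanical: --cpus becomes a "quota period" pair in cpu.max, and --memory becomes a byte count in memory.max. A rough sketch of the arithmetic (helper names are mine; real runtimes handle many more options and units):

```python
# Translate docker-style resource flags into the strings a runtime
# would write into cgroup v2 control files.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}

def memory_max(spec):
    # '512m' -> '536870912' (bytes, as written to memory.max)
    if spec[-1].lower() in UNITS:
        return str(int(spec[:-1]) * UNITS[spec[-1].lower()])
    return str(int(spec))

def cpu_max(cpus, period_us=100_000):
    # 2 CPUs with the default 100ms period -> '200000 100000'
    return f"{int(cpus * period_us)} {period_us}"

print(memory_max("512m"))  # 536870912
print(cpu_max(2))          # 200000 100000
```

These are exactly the values you can read back from /sys/fs/cgroup/.../memory.max and cpu.max for a running container.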
Cgroups v1 vs v2
Cgroups went through a significant redesign. Cgroups v1 had separate hierarchies for each controller—you could mount the cpu controller in one hierarchy and memory in another. This flexibility created complexity and inconsistency.
Cgroups v2, merged in Linux 4.5 and now the default in most distributions, provides a unified hierarchy. All controllers attach to the same cgroup tree, simplifying management and improving consistency.
```mermaid
graph TD
    A[cgroup root] --> B[user.slice]
    A --> C[system.slice]
    A --> D[machine.slice]
    B --> E[user-1000.slice]
    C --> F[docker.service]
    F --> G[container-abc123]
    G --> H["cpu.max: 200000 100000"]
    G --> I["memory.max: 536870912"]
```
The unified hierarchy means a process’s resource limits are visible in one place under /sys/fs/cgroup/. Setting a memory limit on a container’s cgroup automatically affects all descendant cgroups.
How CPU Limits Actually Work
The cpu controller uses two main parameters: cpu.shares (v1) or cpu.weight (v2) for relative allocation, and cpu.cfs_quota_us / cpu.cfs_period_us for absolute limits.
Relative shares work like this: if container A has 512 shares and container B has 1024 shares, and both are competing for CPU, container B gets twice as much CPU time. But if only container A is running, it can use all available CPU.
Absolute quotas work differently: a quota of 100,000 microseconds per 100,000-microsecond period caps the container at one CPU's worth of time. The quota is a ceiling, not a guarantee: the container may use less, but never more. This is what happens when you specify --cpus=1.
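Under contention, each group's slice of the CPU is simply its weight divided by the sum of all competing weights. A quick illustration of that arithmetic:

```python
# Relative CPU allocation under contention: each group's fraction is
# weight / sum(weights). With shares 512 vs 1024, B gets twice A's time.
def cpu_fraction(weights):
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

alloc = cpu_fraction({"A": 512, "B": 1024})
print(alloc)  # A gets one third, B gets two thirds
assert abs(alloc["B"] - 2 * alloc["A"]) < 1e-9
```

If B goes idle, the model collapses to a single weight and A's fraction becomes 1.0, which is exactly the "only container A is running" case described above.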
Memory Limits and OOM Killer
The memory controller tracks anonymous memory, file cache, and kernel memory. When a container exceeds its memory limit, the kernel invokes the OOM (Out of Memory) killer to terminate processes.
You can configure what happens: by default the kernel's OOM killer terminates a process inside the cgroup, and cgroup v1 additionally allowed disabling the kill so that offending processes block until memory is freed. But fundamentally, memory limits are hard walls: processes cannot allocate beyond them.
OverlayFS: The Layered Filesystem
Containers need filesystem isolation, but copying a complete filesystem for each container would be wasteful. A 1GB base image run as 100 containers shouldn’t consume 100GB.
OverlayFS provides the solution through union mounting. It combines multiple directories into a single unified view:
- Lower directories: Read-only layers, typically the image layers
- Upper directory: Writable layer for container changes
- Work directory: Required for internal operations
- Merged directory: The unified view presented to the container
```mermaid
graph BT
    A[Lower Layer 3<br/>Base Image] --> D[Merged View]
    B[Lower Layer 2<br/>Application] --> D
    C[Lower Layer 1<br/>Config] --> D
    E[Upper Layer<br/>Container Changes] --> D
    D --> F[Container Process]
```
When a container reads a file, OverlayFS checks the upper layer first. If the file exists there, it returns that version. If not, it checks lower layers from top to bottom, returning the first match.
When a container writes to an existing file, OverlayFS performs copy-on-write. It copies the file from a lower layer to the upper layer, then modifies the copy. The original remains untouched, shared with any other containers using the same image.
When a container deletes a file, OverlayFS doesn’t actually remove it from lower layers. Instead, it creates a “whiteout” marker in the upper layer—a character device with device number 0/0. This signals that the file doesn’t exist in the merged view.
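These three rules, upper-first lookup, copy-on-write, and whiteouts, can be sketched as a toy model over plain dicts (WHITEOUT is my own sentinel; the real kernel uses a 0/0 character device):

```python
# Toy model of OverlayFS lookup semantics: the upper (writable) layer is
# checked first, then lower layers top-to-bottom; a whiteout entry in the
# upper layer hides a file that still exists below.
WHITEOUT = object()  # stands in for the kernel's 0/0 character device

def lookup(path, upper, lowers):
    if path in upper:
        return None if upper[path] is WHITEOUT else upper[path]
    for layer in lowers:            # topmost lower layer first
        if path in layer:
            return layer[path]
    return None

def write(path, data, upper):
    upper[path] = data              # copy-on-write: lowers stay untouched

def delete(path, upper):
    upper[path] = WHITEOUT          # whiteout marker, not a real removal

lowers = [{"/etc/nginx.conf": "v1"}, {"/bin/sh": "shell"}]
upper = {}
assert lookup("/etc/nginx.conf", upper, lowers) == "v1"
write("/etc/nginx.conf", "v2", upper)
assert lookup("/etc/nginx.conf", upper, lowers) == "v2"
assert lowers[0]["/etc/nginx.conf"] == "v1"  # shared original intact
delete("/bin/sh", upper)
assert lookup("/bin/sh", upper, lowers) is None
```

The last assertion is the whiteout rule in action: /bin/sh still exists in the lower layer, shared by every other container using the image, but this container's merged view no longer shows it.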
Image Layers in Practice
A container image is a stack of read-only layers, each representing a filesystem change. When you build an image:
```dockerfile
# Layer 1: Base OS (~72MB)
FROM ubuntu:22.04
# Layer 2: Package lists
RUN apt-get update
# Layer 3: Nginx (~60MB)
RUN apt-get install -y nginx
# Layer 4: Application files
COPY ./app /app
# Metadata only, no layer
CMD ["nginx"]
```
Each RUN and COPY instruction creates a new layer. These layers are stored separately and can be shared. If you build another image from the same ubuntu:22.04 base, it reuses the existing Layer 1.
This is why pulling an image often shows “Layer already exists”—you’re downloading only the layers you don’t have. And why modifying one layer doesn’t affect others—changes are isolated.
The Container Runtime Stack
Docker isn’t a single monolithic program. It’s a stack of components, each with a specific responsibility:
Docker Daemon (dockerd)
The Docker daemon is the long-running background process that manages containers. It exposes a REST API, accepts commands from the Docker CLI, and coordinates all container operations. It handles image management, networking, volumes, and higher-level orchestration concerns.
containerd
containerd is an industry-standard container runtime that manages the complete container lifecycle: image transfer, container execution, and supervision. Docker delegates actual container management to containerd, which provides a clean API for container operations without the higher-level concerns.
containerd pulls images, creates containers, and manages their lifecycle. But it doesn’t actually create the low-level container primitives—it delegates that to lower-level runtimes.
runc
runc is the reference implementation of the OCI (Open Container Initiative) runtime specification. It does the actual work of creating containers using Linux kernel features.
When containerd needs to start a container, it:
- Prepares an OCI bundle: a root filesystem and a config.json describing the container
- Invokes runc with this bundle
- runc calls clone() with namespace flags, sets up cgroups, and configures the container
- The container process starts
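The config.json is where the pieces from earlier sections come together: it names the root filesystem, the process to run, and the namespaces to create. A heavily trimmed fragment in the OCI runtime-spec format (field values are illustrative):

```json
{
  "ociVersion": "1.0.2",
  "process": { "args": ["/bin/sh"], "cwd": "/" },
  "root": { "path": "rootfs", "readonly": false },
  "linux": {
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "mount" },
      { "type": "uts" },
      { "type": "ipc" }
    ]
  }
}
```

Each entry in the namespaces array becomes a CLONE_NEW* flag when runc creates the container process; the full spec also carries cgroup limits, mounts, and capability sets.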
runc then exits. A shim process (containerd-shim) keeps the container running and handles I/O. This architecture means containers survive daemon restarts—the shim is the actual parent process.
The OCI Specifications
The Open Container Initiative defines two key specifications:
- Runtime Specification: How to run a container filesystem bundle (the format runc implements)
- Image Specification: How to package container images (the format registries use)
These standards ensure interoperability. An image built with Docker can run with Podman, containerd, or any OCI-compliant runtime. The specification defines the contract; different tools implement it differently but produce compatible results.
Containers vs Virtual Machines: The Architecture Difference
The fundamental difference between containers and VMs is what gets virtualized:
| Aspect | Container | Virtual Machine |
|---|---|---|
| Kernel | Shared with host | Separate kernel per VM |
| Isolation | Namespace/cgroup boundaries | Hardware-level virtualization |
| Overhead | 1-2% | 5-20% |
| Startup | Milliseconds to seconds | Seconds to minutes |
| Memory | Shared page cache | Separate memory allocation |
| Storage | Shared image layers | Separate disk images |
A virtual machine hypervisor virtualizes hardware. Each VM gets virtual CPU, memory, disk, and network devices. A guest operating system runs on these virtual devices, complete with its own kernel. This provides strong isolation but at significant cost.
A container runtime virtualizes operating system resources. The kernel is shared; only process-visible resources are virtualized. This is lighter weight because:
- No guest kernel boot sequence
- No separate memory allocation for kernel data structures
- No hardware emulation overhead
- Shared page cache for identical files
Performance measurements consistently show containers have minimal overhead compared to bare metal, while VMs show measurable degradation. A 2014 IBM study (Felter et al.) found Docker containers had near-native CPU and memory performance, while KVM VMs showed 10-15% overhead for CPU-intensive workloads and higher for I/O-intensive workloads.
Security: The Isolation Reality
Container security relies on multiple layers of kernel-enforced isolation:
Capabilities
Linux capabilities break root’s privileges into distinct units. Instead of giving a container full root power, Docker drops most capabilities by default, keeping only what’s needed. A container might have CAP_NET_BIND_SERVICE (bind to ports < 1024) but lack CAP_SYS_ADMIN (load kernel modules).
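The effective capability set is visible as a hex bitmask in the CapEff line of /proc/self/status, one bit per capability (CAP_NET_BIND_SERVICE is bit 10, CAP_SYS_ADMIN is bit 21). A small sketch for inspecting it:

```python
# Decode the CapEff bitmask from /proc/self/status. Each capability is
# one bit: an ordinary unprivileged process has none set, full root has
# all of them, and a default Docker container keeps a small whitelist.
CAP_NET_BIND_SERVICE = 10
CAP_SYS_ADMIN = 21

def effective_caps():
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    return 0

caps = effective_caps()
print(f"CapEff = {caps:#018x}")
print("CAP_NET_BIND_SERVICE:", bool(caps >> CAP_NET_BIND_SERVICE & 1))
print("CAP_SYS_ADMIN:", bool(caps >> CAP_SYS_ADMIN & 1))
```

Running this inside a default container versus on the host makes the dropped capabilities directly visible in the bitmask.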
Seccomp
Seccomp (Secure Computing Mode) filters system calls. Docker’s default seccomp profile blocks about 44 syscalls considered dangerous or unnecessary for typical containers. A compromised container cannot use syscalls like reboot, mount, or keyctl.
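A seccomp profile is a JSON document mapping syscall names to actions. A trimmed, illustrative fragment in the same format Docker uses (the real default profile allows several hundred syscalls):

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {
      "names": ["read", "write", "open", "close", "execve"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```

With SCMP_ACT_ERRNO as the default action, any syscall not explicitly allowed fails with an error instead of executing, which is how calls like reboot and mount are neutralized.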
AppArmor / SELinux
Mandatory Access Control systems provide an additional layer. Even if a container process escapes namespace isolation, MAC policies restrict what files and resources it can access.
The Kernel Attack Surface
The shared kernel is both a strength and a vulnerability. A container escape exploit in the kernel affects all containers on that host. This is why container isolation is sometimes called “soft isolation”—it depends on kernel correctness.
Kernel vulnerabilities like Dirty Pipe (CVE-2022-0847) and Container Escape via runc (CVE-2019-5736) have demonstrated that container isolation can be bypassed. For workloads requiring strong security boundaries, VMs or hardware-based confidential computing may be more appropriate.
Performance Implications
Understanding container internals reveals performance considerations:
Network overhead: Container networking involves bridge traversal and veth pairs. For high-throughput applications, consider host networking (--net=host) or SR-IOV for near-native performance.
Filesystem overhead: OverlayFS adds overhead for operations like stat() (which must check multiple layers) and first-time file writes (copy-on-write). For I/O-intensive workloads, consider volume mounts that bypass OverlayFS.
Memory limits: Hard memory limits trigger OOM kills. Monitor memory usage and set limits with headroom. Consider memory reservation (soft limit) alongside memory limit (hard limit).
CPU scheduling: Relative shares (cpu.shares) only matter during contention. For consistent performance, use CPU quotas or dedicated cores (cpuset).
Summary
Containers are not magic. They’re the deliberate combination of Linux kernel features that evolved over decades:
- Namespaces provide isolation—the illusion that a process has its own private resources
- Cgroups provide resource control—ensuring fair allocation and preventing runaway consumption
- OverlayFS provides efficient storage—sharing common data while isolating changes
- The OCI stack provides standardization—enabling interoperability across tools
When you run docker run, you’re not launching a mini-VM. You’re creating a process with carefully configured namespace membership, cgroup assignment, and an overlay filesystem. The process thinks it’s alone on a system, but it’s actually sharing a kernel with potentially thousands of other processes doing the same thing.
This shared-kernel architecture explains both containers’ efficiency and their limitations. Understanding it helps you debug container issues, optimize performance, and make informed security decisions. The kernel primitives aren’t new—what changed was packaging them into tools that made containerization accessible. The magic isn’t in the container; it’s in the kernel features that have been there all along.
References
- Kerrisk, M. (2013). “Namespaces in operation, part 2: the namespaces API”. LWN.net. https://lwn.net/Articles/531381/
- Linux Kernel Documentation. “Control Group v2”. https://docs.kernel.org/admin-guide/cgroup-v2.html
- Evans, J. (2019). “How containers work: overlayfs”. https://jvns.ca/blog/2019/11/18/how-containers-work--overlayfs/
- Red Hat. “The 7 most used Linux namespaces”. https://www.redhat.com/en/blog/7-linux-namespaces
- OCI - Open Container Initiative. “Runtime Specification”. https://github.com/opencontainers/runtime-spec
- Evans, J. (2014). "Understanding Docker". https://jvns.ca/blog/2014/09/19/understanding-docker/
- Kubernetes Documentation. “Restrict a Container’s Syscalls with seccomp”. https://kubernetes.io/docs/tutorials/security/seccomp/
- Felter, W. et al. (2014). “An Updated Performance Comparison of Virtual Machines and Linux Containers”. IBM Research. https://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
- Docker Documentation. “Storage drivers”. https://docs.docker.com/storage/storagedriver/
- Wikipedia. “Linux namespaces”. https://en.wikipedia.org/wiki/Linux_namespaces
- Wikipedia. “Cgroups”. https://en.wikipedia.org/wiki/Cgroups
- Aqua Security. “A Brief History of Containers: From the 1970s Till Now”. https://www.aquasec.com/blog/a-brief-history-of-containers-from-1970s-chroot-to-docker-2016/