Raw 4K video at 60 frames per second contains roughly 6 gigabits of data every second (8-bit 4:2:0), hundreds of times more than a typical home internet connection can carry. Yet streaming platforms deliver that same content at 15-25 megabits per second, and you barely notice the difference. This 200-400x reduction isn’t magic. It’s mathematics applied with ruthless efficiency.
The techniques that make this possible have evolved over three decades, from the H.261 videoconferencing standard in 1988 to today’s AV1 and H.266/VVC codecs. Each generation has squeezed out additional compression while maintaining perceptual quality, but the fundamental principles remain unchanged: exploit redundancy in space and time, discard information humans can’t perceive, and encode the remainder as efficiently as possible.
What Makes Video Compressible
Video compression works because video data is profoundly redundant in two dimensions.
Spatial redundancy exists within each frame. Adjacent pixels in an image rarely differ dramatically—a blue sky transitions gradually, a face contains large regions of similar skin tone. If you know the value of one pixel, you can often predict its neighbors with reasonable accuracy.
Temporal redundancy exists between frames. A typical video changes remarkably little from one frame to the next. A person walking across a static background moves perhaps 1% of the pixels between consecutive frames. The remaining 99% could simply be copied from the previous frame.
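This temporal redundancy is easy to see numerically. A minimal NumPy sketch, using two synthetic 64×64 frames (a gradient background with a small moving object; all values invented for illustration), that differences consecutive frames:

```python
import numpy as np

# Two synthetic 64x64 grayscale frames: a static gradient background with a
# small bright "object" that moves 3 pixels to the right between frames.
background = np.tile(np.arange(64, dtype=np.int16), (64, 1))

frame1 = background.copy()
frame1[10:18, 10:18] = 255          # object at position A
frame2 = background.copy()
frame2[10:18, 13:21] = 255          # object shifted 3 pixels right

# The residual (what a P-frame must encode, before motion compensation
# even helps) is zero wherever nothing changed.
residual = frame2 - frame1
changed = np.count_nonzero(residual)
print(f"{changed}/{residual.size} pixels changed "
      f"({100 * changed / residual.size:.1f}%)")   # 48/4096 (1.2%)
```

Only the pixels the object vacated and newly covered differ; everything else can be copied from the previous frame.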
A landmark study by the Video Coding Experts Group quantified this: in typical video content, over 90% of the information in one frame can be predicted from neighboring frames. This is the foundation upon which all modern video compression is built.
The Frequency Domain: Where Compression Begins
The first major compression step transforms pixel data from the spatial domain to the frequency domain using the Discrete Cosine Transform (DCT). This mathematical operation, first proposed by Nasir Ahmed in 1972, has become the single most important tool in image and video compression.
The DCT takes an 8×8 block of pixels and produces an 8×8 block of coefficients. Each coefficient represents the amplitude of a specific frequency pattern. The top-left coefficient (the DC coefficient) represents the average brightness of the block. The other 63 coefficients (AC coefficients) represent how the block varies from that average at different frequencies.
Here’s why this matters: natural images concentrate most of their energy in low frequencies. A face contains broad regions of skin tone with gradual shading changes—low frequency content. The fine details—eyelashes, skin pores—contribute high frequencies but with much lower amplitude.
After applying DCT, typical blocks have large values in the top-left corner and values near zero elsewhere. This concentration of energy in a few coefficients is what makes compression possible.
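This energy compaction can be checked directly. A self-contained sketch (no codec libraries; the 8×8 orthonormal DCT-II matrix is built from its textbook definition, and the smooth gradient block is invented):

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, built from the textbook definition."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)        # DC row gets the 1/sqrt(2) normalization
    return c

C = dct_matrix()

# A smooth 8x8 block: a gentle horizontal gradient, as in sky or skin.
block = np.tile(np.linspace(100, 120, 8), (8, 1))

coeffs = C @ block @ C.T           # separable 2-D DCT: rows, then columns

energy = coeffs ** 2
frac = energy[:2, :2].sum() / energy.sum()
print(f"{100 * frac:.3f}% of the energy sits in the top-left 2x2 coefficients")
```

For this gradient block, over 99.9% of the energy lands in the top-left coefficients; the remaining 60 coefficients are nearly zero.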
*Image source: Wikipedia - Discrete Cosine Transform*
Quantization: The Lossy Step
The DCT itself is lossless—apply the inverse DCT and you recover the exact original data. Compression happens in the next step: quantization.
Quantization divides each DCT coefficient by a value from a quantization matrix and rounds to the nearest integer. Higher values in the matrix mean more aggressive division and more information loss. The quantization matrix is designed to zero out high-frequency coefficients aggressively while preserving low-frequency information more carefully.
This exploits a quirk of human vision: we’re much more sensitive to errors in low-frequency content than high-frequency content. A slight shift in average brightness is immediately noticeable. The same magnitude of error in fine texture details goes unnoticed.
The quantization parameter (QP) controls this tradeoff globally. In H.264 and H.265, the quantization step size doubles for every increase of 6 in QP, so raising QP by 6 typically cuts bitrate roughly in half while increasing distortion correspondingly. This exponential relationship gives encoders fine-grained control over the rate-distortion tradeoff.
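A sketch of the quantize/dequantize round trip. It uses the standard JPEG luminance matrix (ITU-T T.81, Annex K) since video codecs apply the same divide-and-round idea with their own matrices; the coefficient values are invented to mimic a smooth block:

```python
import numpy as np

# Standard JPEG luminance quantization matrix (ITU-T T.81, Annex K).
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

# Invented DCT coefficients mimicking a smooth block: a large DC value with
# AC amplitudes that fall off quickly toward high frequencies.
i, j = np.indices((8, 8))
coeffs = 800.0 / (1.0 + i + j) ** 2
coeffs[0, 0] = 1000.0

quantized = np.round(coeffs / Q).astype(int)   # the lossy step
dequantized = quantized * Q                    # what the decoder reconstructs

zeros = np.count_nonzero(quantized == 0)
print(f"{zeros}/64 coefficients quantized to zero")
```

Most of the block becomes zeros, concentrated in the high-frequency corner; the entropy coder then stores those long zero runs almost for free.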
```mermaid
graph LR
A[8×8 Pixel Block] --> B[DCT Transform]
B --> C[Frequency Coefficients]
C --> D[Quantization]
D --> E[Mostly Zero Coefficients]
E --> F[Entropy Coding]
F --> G[Compressed Bitstream]
style D fill:#ffcccc
style E fill:#ccffcc
```
Frame Types: The GOP Structure
Not all frames are compressed the same way. Modern codecs use three fundamental frame types arranged in a Group of Pictures (GOP).
I-frames (Intra-coded pictures) are compressed using only information within the frame itself—no reference to other frames. They’re essentially standalone images, compressed using spatial prediction and the DCT-quantization-entropy coding pipeline. I-frames are the largest but can be decoded independently.
P-frames (Predicted pictures) use temporal prediction from previously decoded frames. For each block, the encoder searches for a matching region in a previous frame and encodes only the difference (residual) between the current block and its prediction. P-frames typically require 50% less data than I-frames.
B-frames (Bi-directional pictures) can use both past and future frames as references. This bi-directional prediction captures more information about motion, allowing even greater compression—B-frames often need only 25% of an I-frame’s data. The tradeoff is increased complexity: decoding a B-frame requires waiting for future frames to be decoded first.
*Image source: OTTVerse*
A typical GOP structure might be IBBPBBPBBPBBP—a single I-frame followed by four mini-GOPs, each with two B-frames and one P-frame. The GOP length determines the interval between I-frames, which affects both compression efficiency and seeking performance. Longer GOPs improve compression but make seeking slower since you must decode from the nearest I-frame.
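Given the rough size ratios above, the bitrate implied by a GOP structure is simple arithmetic. A sketch with illustrative frame sizes (invented, not measured from any encoder):

```python
# Illustrative frame sizes: an I-frame at 400 kilobits, with P ~50% and
# B ~25% of that, per the ratios described above.
sizes = {"I": 400_000, "P": 200_000, "B": 100_000}

gop = "IBBPBBPBBPBBP"                      # the example GOP from the text
gop_bits = sum(sizes[f] for f in gop)

fps = 30
bitrate = gop_bits * fps / len(gop)        # average bits per second

print(f"{gop_bits} bits per 13-frame GOP, "
      f"~{bitrate / 1e6:.2f} Mbps at {fps} fps")   # -> ~4.62 Mbps
```

Recoding the same 13 frames as all-I would cost 5.2 Mb per GOP instead of 2.0 Mb, which is why long GOPs compress so much better.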
Motion Estimation: Finding What Moves Where
Motion estimation is the most computationally expensive part of video encoding—often accounting for 60-80% of encoding time. For each block in a P-frame or B-frame, the encoder must find the best matching region in one or more reference frames.
The naive approach—exhaustive search—checks every possible position in the reference frame. For a 16×16 block in a 1920×1080 frame, that’s over two million comparisons per block. Modern encoders use intelligent search algorithms that reduce this to hundreds of comparisons.
The diamond search pattern starts at the co-located position (same coordinates as the current block) and checks nine positions arranged in a diamond. If the best match is at the center, the search terminates. Otherwise, it shifts the center to the best position and repeats. This exploits the observation that most motion vectors are small—objects rarely jump across the frame between consecutive frames.
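A toy implementation of that idea, simplified from what real encoders do (they combine large- and small-diamond stages, early-termination thresholds, and rate cost). The reference frame here is a synthetic ramp image chosen so every pixel is unique and the search demonstrably converges:

```python
import numpy as np

def sad(block, ref, y, x):
    """Sum of absolute differences between block and the ref region at (y, x)."""
    h, w = block.shape
    return int(np.abs(block - ref[y:y + h, x:x + w]).sum())

def diamond_search(block, ref, y0, x0, max_iters=64):
    """Small-diamond search starting at the co-located position (y0, x0)."""
    h, w = block.shape
    best_y, best_x = y0, x0
    best_cost = sad(block, ref, y0, x0)
    for _ in range(max_iters):
        moved = False
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # diamond points
            y, x = best_y + dy, best_x + dx
            if 0 <= y <= ref.shape[0] - h and 0 <= x <= ref.shape[1] - w:
                cost = sad(block, ref, y, x)
                if cost < best_cost:
                    best_cost, best_y, best_x, moved = cost, y, x, True
        if not moved:          # best match is at the center: terminate
            break
    return (best_y - y0, best_x - x0), best_cost

# Synthetic reference frame: every pixel unique, so the SAD surface slopes
# toward the single exact match.
ref = np.arange(64 * 64, dtype=np.int64).reshape(64, 64)
cur_block = ref[19:35, 25:41]                       # 16x16 patch at (19, 25)

mv, cost = diamond_search(cur_block, ref, 16, 20)   # search from (16, 20)
print(mv, cost)                                     # -> (3, 5) 0
```

The search walks downhill to the true offset in a handful of iterations, checking a few dozen candidates instead of two million.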
The hexagonal search extends this idea with a larger pattern, useful for videos with significant motion. Modern encoders often combine multiple patterns adaptively based on content characteristics.
Sub-pixel motion estimation further refines predictions. A block might not move an integer number of pixels—it could shift by 2.3 pixels horizontally and 1.7 pixels vertically. The encoder interpolates reference frames at fractional positions to find better matches. H.264 supports quarter-pixel precision; H.265 extends this with more sophisticated interpolation filters.
From H.264/AVC to H.265/HEVC: A Generation Leap
The transition from H.264 (standardized in 2003) to H.265/HEVC (2013) illustrates how codec design evolves to push compression efficiency further.
Block structure: H.264 uses macroblocks fixed at 16×16 pixels. H.265 introduces Coding Tree Units (CTUs) up to 64×64 pixels, which can be recursively subdivided using a quadtree structure. This adapts block size to content—large blocks for smooth regions, small blocks for complex textures or moving edges.
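The quadtree idea can be sketched in a few lines. This toy partitioner splits on raw pixel variance with a fixed threshold, a stand-in for the rate-distortion cost a real encoder would use (the threshold and minimum size are illustrative):

```python
import numpy as np

def partition(block, y, x, size, min_size=8, threshold=100.0, leaves=None):
    """Variance-driven quadtree split of a square block (toy CTU partitioner)."""
    if leaves is None:
        leaves = []
    if size > min_size and block.var() > threshold:
        half = size // 2
        for dy in (0, half):            # recurse into the four quadrants
            for dx in (0, half):
                partition(block[dy:dy + half, dx:dx + half],
                          y + dy, x + dx, half, min_size, threshold, leaves)
    else:
        leaves.append((y, x, size))     # (top-left corner, block size)
    return leaves

# A 64x64 CTU that is flat everywhere except a noisy texture in the
# top-left 16x16 corner.
rng = np.random.default_rng(2)
ctu = np.full((64, 64), 128.0)
ctu[:16, :16] = rng.integers(0, 256, (16, 16))

leaves = partition(ctu, 0, 0, 64)
print(len(leaves), "leaf blocks")   # large blocks where flat, small where busy
```

The flat regions stay as a few large blocks while the textured corner is recursively subdivided, exactly the adaptation the CTU structure enables.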
Intra prediction: H.264 supports 9 intra prediction modes (8 directional plus DC) for 4×4 blocks. H.265 increases this to 35 modes, capturing more complex spatial patterns. Planar and DC modes handle gradual gradients, while 33 angular modes predict edges at various orientations.
Inter prediction: H.265 introduces merge mode, which signals a motion vector prediction directly from neighboring blocks without transmitting a motion vector difference. Advanced motion vector prediction (AMVP) provides more accurate motion vector predictions from spatial and temporal neighbors.
Transform: H.265 supports multiple transform sizes (4×4, 8×8, 16×16, 32×32) chosen adaptively for each prediction unit. Larger transforms are more efficient for smooth regions; smaller transforms handle complex details better.
In-loop filtering: H.265 adds the Sample Adaptive Offset (SAO) filter after the deblocking filter. SAO reduces banding artifacts by adding offsets to reconstructed samples based on classification into edge or band categories.
The result: H.265 achieves roughly 50% bitrate reduction compared to H.264 at equivalent quality. A 4K stream that requires 18-20 Mbps with H.264 needs only 7-10 Mbps with H.265.
AV1 and H.266/VVC: The Current Frontier
AV1, developed by the Alliance for Open Media (including Google, Netflix, Amazon, and others), offers an open-source, royalty-free alternative to HEVC. Benchmarks from Moscow State University show AV1 outperforming HEVC by approximately 28% in compression efficiency, particularly at lower bitrates and higher resolutions.
AV1 introduces several innovations:
Superblock structure: Similar to HEVC’s CTU, but supports 128×128 superblocks with more flexible partitioning patterns including 4:1 rectangular partitions.
Enhanced intra prediction: 56 directional modes plus palette mode for screen content (text, graphics) where traditional prediction fails.
Constrained directional enhancement filter (CDEF): A replacement for HEVC’s deblocking and SAO filters, applying directional filtering to reduce artifacts while preserving edges.
Loop restoration filter: Applies Wiener filtering or self-guided restoration to selected regions, reducing ringing and blocking artifacts.
H.266/VVC (Versatile Video Coding), finalized in 2020, targets another 30-50% improvement over HEVC. It pushes complexity even higher—roughly 10x the encoding complexity of HEVC. Key innovations include:
Coding unit partitions: Beyond quadtree splitting, VVC supports binary and ternary splits, allowing more flexible adaptation to content structure.
Intra prediction: Expands to 65 angular modes plus planar and DC (67 in total), supplemented by matrix-weighted intra prediction (MIP), which predicts a block by multiplying its boundary samples with trained matrices.
Affine motion compensation: Models motion as translation plus scaling and rotation, capturing complex motion like zooming or rotation that simple translation vectors cannot.
Adaptive loop filter: Applies Wiener filtering adaptively based on local characteristics, providing up to 5% additional compression improvement.
The Complexity-Quality Tradeoff
Each generation of codecs has improved compression efficiency by roughly 50% while increasing computational complexity by 2-10x. This creates a fundamental tension: better compression requires more processing power.
For live streaming, this complexity matters. Real-time encoding of 4K 60fps content with H.265 requires substantial hardware acceleration. Software encoding might achieve 2-3 frames per second on a high-end CPU—far from real-time. Hardware encoders (dedicated silicon in GPUs or specialized chips) sacrifice some compression efficiency for speed, achieving real-time encoding with 5-15% quality penalty compared to software encoding.
This explains why different use cases favor different approaches:
- Live streaming: Hardware encoding (dedicated GPU/ASIC encoders) for real-time performance
- Video on demand: Software encoding for maximum compression efficiency
- Archival: Two-pass encoding for optimal rate control
- Video conferencing: Hardware encoding with low-latency presets
Chroma Subsampling: Trading Color for Bandwidth
Human vision is more sensitive to luminance (brightness) than chrominance (color) information. Video compression exploits this through chroma subsampling—the practice of encoding color information at lower resolution than brightness information.
In the 4:2:0 scheme, the chroma channels are sampled at half resolution both horizontally and vertically: each 2×2 block of luma samples shares a single chroma sample per color channel. This cuts chroma data by 75% (and total raw data in half) with virtually no perceptible impact for most content.
For 4K video (3840×2160), 4:2:0 subsampling means the color channels are stored at 1920×1080—half the horizontal and vertical resolution. The luminance channel remains full resolution since that’s what our eyes are most sensitive to.
| Subsampling | Color Resolution | Data Reduction | Typical Use |
|---|---|---|---|
| 4:4:4 | Full | None | Professional editing |
| 4:2:2 | Half horizontal | 33% | Broadcast production |
| 4:2:0 | Half both axes | 50% | Streaming, consumer video |
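The table's reductions fall straight out of the sample counts. A short sketch that computes raw frame sizes for each scheme and shows one common way to produce 4:2:0 chroma (2×2 averaging; real pipelines may use other downsampling filters):

```python
import numpy as np

def frame_bytes(width, height, subsampling, bit_depth=8):
    """Raw bytes per frame for the common chroma subsampling schemes."""
    luma = width * height
    chroma_plane = {"4:4:4": luma, "4:2:2": luma // 2, "4:2:0": luma // 4}
    samples = luma + 2 * chroma_plane[subsampling]   # Y plus two chroma planes
    return samples * bit_depth // 8

for scheme in ("4:4:4", "4:2:2", "4:2:0"):
    mb = frame_bytes(3840, 2160, scheme) / 1e6
    print(f"{scheme}: {mb:.1f} MB per raw 4K frame")
# -> 24.9, 16.6, and 12.4 MB: the 33% and 50% reductions from the table

# Producing 4:2:0 chroma is just 2x2 averaging of each full-resolution
# chroma plane (one common choice of downsampling filter):
chroma = np.arange(16, dtype=np.float64).reshape(4, 4)
subsampled = chroma.reshape(2, 2, 2, 2).mean(axis=(1, 3))
print(subsampled.shape)   # (2, 2): half resolution in both dimensions
```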
Quality Metrics: Measuring What Matters
Traditional quality metrics like PSNR (Peak Signal-to-Noise Ratio) measure pixel-level accuracy. But PSNR correlates poorly with perceived quality—a video can have high PSNR yet look terrible if artifacts fall in visually sensitive regions.
SSIM (Structural Similarity Index) improved on this by measuring structural changes rather than absolute pixel differences. SSIM correlates better with subjective quality but still misses some perceptually important factors.
VMAF (Video Multimethod Assessment Fusion), developed by Netflix, combines multiple quality metrics using machine learning trained on subjective quality scores. VMAF correlates significantly better with human perception and has become the de facto standard for codec comparisons.
| Metric | What It Measures | Correlation with Perception |
|---|---|---|
| PSNR | Pixel-level error | Poor |
| SSIM | Structural similarity | Moderate |
| VMAF | Perceptual quality (ML-based) | Strong |
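PSNR's blind spot is easy to demonstrate. This sketch constructs two distortions with identical mean squared error (the image values and error magnitudes are invented so the arithmetic works out exactly): imperceptible noise spread over every pixel versus a clearly visible patch in one corner.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in decibels between two images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(3)
ref = rng.integers(20, 236, (64, 64))   # mid-range values, so nothing clips

# Two distortions engineered to have the same mean squared error (4.0):
# (1) +-2 noise spread over every pixel, (2) a +16 shift confined to one
# 8x8 corner. PSNR cannot tell them apart, yet the corner patch is far
# more visible.
spread = ref + rng.choice([-2, 2], size=ref.shape)
concentrated = ref.copy()
concentrated[:8, :8] += 16

print(f"spread noise:  {psnr(ref, spread):.1f} dB")
print(f"corner patch:  {psnr(ref, concentrated):.1f} dB")
# both report the same ~42.1 dB
```

Metrics like SSIM and VMAF penalize the localized structural change much more heavily, which is why they track perception better.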
Why This Matters
The economics of video are staggering. Netflix viewers alone collectively watch on the order of a billion hours of video per week. A 1% improvement in compression efficiency translates to petabytes of bandwidth savings—millions of dollars in CDN costs annually.
But compression isn’t just about economics. It’s about access. A farmer in rural India with a 2 Mbps connection can stream educational content that would require 10+ Mbps without modern codecs. Emergency communications over constrained satellite links become feasible. The ability to deliver video at progressively lower bitrates has democratized access to video content globally.
The next time you watch a 4K video stream smoothly over your home connection, remember: behind that seamless experience lies decades of mathematical innovation, careful engineering tradeoffs, and an intimate understanding of how human vision works. This extreme compression isn’t a miracle—it’s the accumulated wisdom of an entire field pushing against fundamental limits.
References
- Ahmed, N., Natarajan, T., & Rao, K. R. (1974). Discrete Cosine Transform. IEEE Transactions on Computers.
- Wiegand, T., et al. (2003). Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology.
- Sullivan, G. J., et al. (2012). Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for Video Technology.
- Bossen, F., et al. (2021). VVC in a Nutshell. IEEE Transactions on Circuits and Systems for Video Technology.
- Netflix Technology Blog. (2016). Toward A Practical Perceptual Video Quality Metric.
- Alliance for Open Media. (2018). AV1 Bitstream & Decoding Process Specification.
- Richardson, I. E. (2010). The H.264 Advanced Video Coding Standard. Wiley.
- ISO/IEC. (2020). High Efficiency Video Coding (HEVC) - ISO/IEC 23008-2.
- Bitmovin. (2020). State of Compression: Testing H.266/VVC vs H.265/HEVC.
- Elecard. (2018). Video Encoding: Inter Prediction in HEVC.