A 70-billion parameter model requires 140 GB of GPU memory in FP16. A consumer RTX 4090 has 24 GB. This arithmetic gap defined the boundary between “enterprise AI” and “what you can run at home” until quantization mathematics cracked the code.

The counterintuitive reality: reducing precision from 16 bits to 4 bits—a 75% compression—often preserves over 95% of model quality. Not through magic, but through a profound understanding of how neural networks encode information.

The Quantization Equation: From FP16 to INT4

At its core, quantization maps continuous floating-point values onto a discrete integer grid. For a weight matrix $W$ with scale $s$ and zero-point $z$, the integer code $q$ and the reconstruction $\hat{W}$ follow:

$$q = \text{clamp}\left(\text{round}\left(\frac{W}{s} + z\right),\, q_{\min},\, q_{\max}\right), \qquad \hat{W} = (q - z) \cdot s$$

Where $s$ is the scale factor and $z$ is the zero-point offset. The deceptively simple formula hides immense complexity: choosing $s$ and $z$ optimally determines whether your model degrades gracefully or catastrophically.

The quantization error for uniform quantization follows:

$$\text{SQNR} \approx 6.02n + 1.76 \text{ dB}$$

Where $n$ is the number of bits. This classic signal processing result suggests 4-bit quantization should introduce significant noise. Yet LLMs thrive. Why?
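Plugging the two bit-widths into this rule of thumb makes the gap concrete:

```python
# Worked numbers for the SQNR rule of thumb: SQNR ≈ 6.02·n + 1.76 dB
def sqnr_db(n_bits):
    return 6.02 * n_bits + 1.76

print(f"4-bit:  {sqnr_db(4):.2f} dB")   # 25.84 dB
print(f"16-bit: {sqnr_db(16):.2f} dB")  # 98.08 dB
```

A 72 dB gap should be devastating; the rest of this section explains why, for LLM weights, it mostly is not.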

# The fundamental quantization operation
import torch

def quantize(tensor, n_bits=4):
    # Find the dynamic range
    min_val, max_val = tensor.min(), tensor.max()
    
    # Calculate scale and zero-point for an unsigned n-bit grid
    qmin, qmax = 0, 2**n_bits - 1
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = qmin - min_val / scale
    
    # Quantize; dequantize later with (quantized - zero_point) * scale
    quantized = torch.clamp(
        torch.round(tensor / scale + zero_point),
        qmin, qmax
    )
    return quantized, scale, zero_point

The answer lies in weight distribution. LLM weights follow approximately Gaussian distributions centered near zero. Most information concentrates in a narrow band, with extreme values representing a tiny fraction. Per-tensor quantization wastes precision on outliers while under-representing the dense center.
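A quick sketch makes the point: for roughly Gaussian weights, values beyond three standard deviations are vanishingly rare, yet they are exactly what sets the per-tensor quantization range.

```python
import torch

# Sketch: Gaussian-like weights concentrate almost all mass in a narrow band,
# yet the rare extremes dictate the per-tensor quantization scale
torch.manual_seed(0)
w = torch.randn(1_000_000) * 0.02  # toy "weight tensor"

outlier_fraction = (w.abs() > 3 * w.std()).float().mean().item()
print(f"|w| > 3 sigma: {outlier_fraction:.4%} of weights")  # around 0.27%
```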

The Granularity Hierarchy: Per-Tensor to Per-Channel

Modern quantization employs progressively finer granularity:

| Granularity | Parameters | Memory Overhead | Quality |
|-------------|------------|-----------------|---------|
| Per-tensor | 1 scale per layer | Negligible | Poor |
| Per-channel | 1 scale per output channel | ~0.01% | Good |
| Per-group | 1 scale per 128 weights | ~0.5% | Excellent |
| Per-vector | Learned codebooks | 1-2% | Near-lossless |

Per-group quantization, introduced in GPTQ and refined in AWQ, partitions weights into small groups (typically 64-128 elements), each with independent scaling. This local adaptation captures the varying dynamic ranges across a weight matrix.

# Per-group quantization splits the weight matrix
# Group size g = 128 is typical for 4-bit quantization
def group_quantize(weight, group_size=128, n_bits=4):
    out_features, in_features = weight.shape
    groups = weight.reshape(-1, group_size)
    
    # Independent symmetric scaling per group
    qmax = 2**(n_bits - 1) - 1
    scales = groups.abs().max(dim=-1).values / qmax
    quantized = torch.round(groups / scales.unsqueeze(-1)).clamp(-qmax, qmax)
    
    return quantized, scales

The overhead? For a 7B model with group size 128, that is roughly 55 million groups; storing one FP16 scale per group adds about 110 MB, under 1% of the FP16 model size, in exchange for substantial quality gains.
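The arithmetic is easy to check (assuming one FP16 scale per group):

```python
# Back-of-envelope check of the per-group scale overhead
n_params = 7_000_000_000
group_size = 128
scale_bytes = 2  # one FP16 scale per group

n_groups = n_params // group_size
overhead_mb = n_groups * scale_bytes / 1e6
print(f"{n_groups:,} groups -> ~{overhead_mb:.0f} MB of scales")
```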

GPTQ: The Hessian-Guided Revolution

In late 2022, researchers from IST Austria and ETH Zurich published a paper that became an industry standard. GPTQ (post-training quantization for GPT-class models) builds on Optimal Brain Quantization (OBQ), but with a crucial insight: quantizing every row's weights in the same fixed column order, rather than OBQ's greedy per-row ordering, lets the error-compensation updates be batched efficiently across the whole matrix.

The algorithm minimizes layer-wise output error:

$$\arg\min_{\hat{W}} \|WX - \hat{W}X\|^2$$

GPTQ maintains a Hessian inverse $H^{-1} = (XX^T)^{-1}$ updated after each column is quantized. When quantizing weight $w_q$ to its grid value $\hat{w}_q$, the error propagates to the remaining weights:

$$\delta_F = -\frac{w_q - \hat{w}_q}{[H^{-1}]_{qq}} \cdot H^{-1}_{:,q}$$

This recursive compensation means early quantization errors get absorbed by later adjustments—a mathematical sleight of hand that preserves output fidelity.

# Simplified GPTQ core loop (column by column, left to right)
def gptq_layer(W, H_inv, quantizer):
    n_cols = W.shape[1]
    Q = torch.zeros_like(W)
    
    for i in range(n_cols):
        # Quantize current column
        Q[:, i] = quantizer.quantize(W[:, i])
        
        # Scaled quantization error for this column
        error = (W[:, i] - Q[:, i]) / H_inv[i, i]
        
        # Propagate error to the remaining, not-yet-quantized columns
        W[:, i + 1:] -= error.unsqueeze(1) * H_inv[i, i + 1:].unsqueeze(0)
    
    # In practice H_inv comes from a Cholesky decomposition computed once up front
    return Q

The practical impact: a LLaMA-70B model quantized to 4-bit with GPTQ achieves 99.2% of its original MMLU score while requiring only 35 GB VRAM—a 75% reduction.

AWQ: Protecting the Vital 1%

MIT’s Song Han lab observed something peculiar in 2023: protecting merely 1% of weights at full precision preserves nearly all model quality. The insight launched Activation-aware Weight Quantization (AWQ).

The key observation: weight importance correlates with activation magnitude. A weight connected to high-activation channels disproportionately influences output. AWQ identifies these “salient weights” by running a calibration dataset through the model and recording activation magnitudes.

$$s_j = \max_i |X_{ij}|$$

The scale factor $s_j$ for channel $j$ derives from its maximum activation magnitude (the full method raises it to a power $\alpha$ chosen by grid search). Channels with higher activations are scaled up before rounding, giving them finer effective precision.

# AWQ salient weight identification
def compute_importance(activations):
    # activations: [batch, seq_len, hidden_dim] -> per-channel max magnitude
    return activations.abs().amax(dim=(0, 1))

def awq_quantize(W, activation_scales, group_size=128):
    # Scale up salient input channels so they survive rounding
    W_scaled = W * activation_scales.unsqueeze(0)
    
    # Group-wise quantization of the scaled weights
    Q, scales = group_quantize(W_scaled, group_size)
    
    # At inference, activations are divided by activation_scales to compensate
    return Q, scales, activation_scales

AWQ’s advantage over GPTQ: with no per-weight error propagation, quantization is faster (seconds versus minutes) at comparable quality. The trade-off? AWQ leans on a representative calibration dataset to identify salient channels, while GPTQ tolerates even random calibration data.

| Method | Calibration | Speed | Perplexity (Llama-2-7B) |
|--------|-------------|-------|-------------------------|
| FP16 baseline | None | n/a | 5.47 |
| RTN (round-to-nearest) | None | Instant | 12.30 |
| GPTQ-4bit | Random | Slow | 5.52 |
| AWQ-4bit | Calibration | Fast | 5.50 |
| GGUF Q4_K_M | None | Fast | 5.55 |

GGUF and K-Quants: The Consumer Revolution

While GPTQ and AWQ target GPU inference, the llama.cpp project democratized LLMs for CPU deployment. Its GGUF format introduced “K-quants”—a family of mixed-precision quantization schemes that balance size, speed, and quality.

The innovation: not all tensors deserve equal precision. Some projections (notably the attention value and FFN down-projections) are especially sensitive to rounding error, while the rest tolerate aggressive quantization. K-quants exploit this heterogeneity.

The Q4_K_M scheme applies:

  • 4-bit quantization to most weights
  • 6-bit quantization to the most sensitive tensors (attention value and FFN down-projections)
  • Mixed group sizes (32, 64, 128) based on tensor importance

# K-quant block structure
# Super-block of 256 weights → subdivided into sub-blocks
class KQuantBlock:
    def __init__(self, weights, n_bits=4):
        # Super-block: 256 weights, one high-precision super-scale
        self.super_scale = weights.abs().max()
        
        # Sub-blocks: 8 sub-blocks of 32 weights, low-bit scales relative to the super-scale
        self.sub_scales = weights.reshape(8, 32).abs().max(dim=-1).values / self.super_scale
        
        # Quantize with hierarchical scaling
        self.quantized = self._quantize_hierarchical(weights)

The result: Q4_K_M achieves perplexity within 2% of FP16 at roughly a third of the FP16 model size. For local inference on Apple Silicon, K-quants with Metal acceleration achieve 30+ tokens/second on a MacBook Pro—making local LLMs practical for everyday use.

FP8: The Hardware-Accelerated Frontier

NVIDIA’s Hopper architecture (H100/H200) introduced native FP8 support—a watershed moment for quantization. Unlike INT8, FP8 preserves the dynamic range of floating-point while halving memory bandwidth.

FP8 comes in two formats:

| Format | Exponent | Mantissa | Max Magnitude | Use Case |
|--------|----------|----------|---------------|----------|
| E4M3 | 4 bits | 3 bits | ±448 | Weights, activations |
| E5M2 | 5 bits | 2 bits | ±57,344 | Gradients |

The E4M3 format provides sufficient precision for inference; E5M2 handles gradient computation during training. Combined with Tensor Core acceleration, FP8 enables 2× throughput improvement over FP16.

# FP8 quantization with the E4M3 format (dtype requires a recent PyTorch)
def fp8_e4m3_quantize(tensor):
    # E4M3: 4 exponent bits, 3 mantissa bits
    # Max magnitude 448; the "fn" variant has no inf encoding
    max_val = 448.0
    
    # Clamp to the representable range
    tensor_clamped = torch.clamp(tensor, -max_val, max_val)
    
    # Cast to FP8; matmuls on this dtype are hardware-native on H100+
    return tensor_clamped.to(torch.float8_e4m3fn)

DeepSeek-V3’s training leveraged FP8 extensively, achieving 2× training efficiency while maintaining model quality. The implication: FP8 isn’t just an inference optimization—it’s reshaping how we train trillion-parameter models.

The Outlier Problem: Why Quantization Fails

Not all quantization stories end happily. In 2022, researchers discovered that LLMs develop “activation outliers”—channels with magnitudes 100× larger than typical. These outliers break naive quantization.

The mechanism: during training, certain attention heads develop outsized influence. Their outputs exhibit massive dynamic range, concentrated in specific feature dimensions. Per-tensor quantization, forced to accommodate these outliers, uses a scale factor that crushes precision for the remaining 99% of activations.

$$\text{Scale} = \frac{\max(|\text{activations}|)}{2^n - 1}$$

If one channel reaches 100 while others hover around 1, the quantization step becomes 100/15 ≈ 6.7 for 4-bit—obliterating fine distinctions in the majority of channels.
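A toy demonstration of this effect (symmetric signed 4-bit here, for simplicity): the same quantizer applied with and without a single outlier channel shows the error on the typical values exploding.

```python
import torch

# One outlier channel wrecks per-tensor quantization for everything else
def quant_dequant(t, n_bits=4):
    # symmetric per-tensor quantization
    qmax = 2 ** (n_bits - 1) - 1
    scale = t.abs().max() / qmax
    return torch.round(t / scale).clamp(-qmax, qmax) * scale

torch.manual_seed(0)
x = torch.randn(1000)   # typical activations, magnitude around 1
x_out = x.clone()
x_out[0] = 100.0        # one outlier channel

# Mean absolute error on the typical (non-outlier) values
err_clean = (quant_dequant(x[1:]) - x[1:]).abs().mean()
err_outlier = (quant_dequant(x_out)[1:] - x_out[1:]).abs().mean()
print(err_clean.item(), err_outlier.item())  # error grows several-fold with the outlier
```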

SmoothQuant (2022) proposed an elegant solution: migrate quantization difficulty from activations to weights. Since weights are static, they can be pre-processed:

$$\tilde{X} = X \,\text{diag}(s)^{-1}, \quad \tilde{W} = \text{diag}(s)\, W$$

The smoothing factor $s_j = \max(|X_j|)^\alpha / \max(|W_j|)^{1-\alpha}$ balances activation and weight magnitudes. With $\alpha \approx 0.5$, SmoothQuant enables INT8 quantization with near-zero quality loss.

# SmoothQuant activation smoothing
def smooth_layer(weight, activations, alpha=0.5, eps=1e-8):
    # weight: [out_features, in_features]; activations: [tokens, in_features]
    activation_scales = activations.abs().max(dim=0).values
    weight_scales = weight.abs().max(dim=0).values
    
    # Per-input-channel smoothing factor balancing the two magnitudes
    smooth_scale = (activation_scales ** alpha) / (weight_scales ** (1 - alpha) + eps)
    
    # Migrate quantization difficulty: weights absorb the scale, activations shed it
    smoothed_weight = weight * smooth_scale.unsqueeze(0)
    smoothed_activation_scale = activation_scales / smooth_scale
    
    return smoothed_weight, smoothed_activation_scale

KV Cache Quantization: The Long-Context Memory Saver

For long-context inference, the KV cache dominates memory consumption: a 70B-class model serving a 128K-token context can spend tens of gigabytes on KV storage alone, with the exact figure depending on batch size and attention layout (MHA versus GQA). Quantization offers a 2-4× reduction.

The challenge: KV cache values are dynamic, generated during inference. Static calibration methods don’t apply. The solution? Per-token scaling with FP8.

# KV cache FP8 quantization (dtype requires a recent PyTorch)
class QuantizedKVCache:
    def __init__(self, n_heads, head_dim, max_seq_len):
        # FP8 storage for K and V
        self.k_cache = torch.zeros(max_seq_len, n_heads, head_dim, dtype=torch.float8_e4m3fn)
        self.v_cache = torch.zeros(max_seq_len, n_heads, head_dim, dtype=torch.float8_e4m3fn)
        
        # Per-token scale factors
        self.k_scales = torch.zeros(max_seq_len)
        self.v_scales = torch.zeros(max_seq_len)
    
    def update(self, keys, values, positions):
        # keys, values: [n_tokens, n_heads, head_dim]
        # Dynamic per-token quantization: one scale across all heads and dims
        self.k_scales[positions] = keys.abs().amax(dim=(-2, -1))
        self.v_scales[positions] = values.abs().amax(dim=(-2, -1))
        
        k_s = self.k_scales[positions].view(-1, 1, 1)
        v_s = self.v_scales[positions].view(-1, 1, 1)
        self.k_cache[positions] = (keys / k_s).to(torch.float8_e4m3fn)
        self.v_cache[positions] = (values / v_s).to(torch.float8_e4m3fn)

vLLM’s FP8 KV cache implementation achieves 2× memory reduction with perplexity degradation under 1%. For production systems serving long contexts, this translates to 2× concurrent requests or 2× context length per request.
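The savings are straightforward to estimate with a sizing helper; the dimensions below are illustrative (a hypothetical 7B-class GQA configuration), not any specific model:

```python
# Rough KV-cache sizing sketch; dimensions are illustrative, not model-specific
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V tensors, per layer, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

fp16 = kv_cache_gb(32, 8, 128, 32_768, 2)  # FP16 cache
fp8 = kv_cache_gb(32, 8, 128, 32_768, 1)   # FP8 cache: exactly half
print(f"FP16: {fp16:.1f} GB, FP8: {fp8:.1f} GB")
```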

Extreme Quantization: The 2-Bit Frontier

What happens below 4 bits? The perplexity gap widens dramatically—unless you use specialized algorithms.

AQLM (Additive Quantization for LLMs) introduced codebook-based compression:

$$w \approx \sum_{i=1}^{k} c_{i, j_i}$$

Each weight group becomes a sum of $k$ codebook vectors; only the codebook indices are stored per group. In AQLM's 2-bit configuration, for instance, a 16-bit index into a $2^{16}$-entry codebook is shared across a group of 8 weights, amortizing to 2 bits per weight while the codebook vectors themselves remain full-precision.

# Simplified AQLM additive quantization
# Each codebook entry is a vector covering a whole group of weights
class AQLMQuantizer:
    def __init__(self, n_codebooks=2, codebook_size=16, group_size=8):
        self.codebooks = [
            nn.Parameter(torch.randn(codebook_size, group_size))
            for _ in range(n_codebooks)
        ]
        self.indices = None  # [n_codebooks, n_groups], learned during calibration
    
    def dequantize(self):
        # Each group is reconstructed as the sum of one entry per codebook
        reconstructed = sum(
            self.codebooks[i][self.indices[i]]
            for i in range(len(self.codebooks))
        )
        return reconstructed

QuIP# improves on AQLM with randomized Hadamard transforms that “incohere” weight distributions, making them more amenable to uniform quantization. The result: 2-bit QuIP# models achieve perplexity within 15% of FP16—still useful for many applications.
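To see why such rotations help, here is a minimal fast Walsh-Hadamard transform: an orthogonal map that spreads a single outlier's energy evenly across all coordinates (QuIP#'s full incoherence processing also applies random sign flips, omitted in this sketch):

```python
import torch

# Minimal fast Walsh-Hadamard transform: an orthogonal rotation that spreads
# a single outlier's energy evenly across all coordinates
def fwht(x):
    n = x.shape[-1]  # must be a power of two
    y, h = x.clone(), 1
    while h < n:
        # Butterfly step: pair blocks h apart, emit (a + b, a - b)
        y = y.view(-1, n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        y = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return y.reshape(-1, n) / n ** 0.5  # orthonormal scaling

x = torch.zeros(1, 8)
x[0, 3] = 8.0       # one extreme "outlier" coordinate
print(fwht(x))      # every entry now has magnitude 8 / sqrt(8)
```

After the rotation, every coordinate has the same magnitude, so uniform quantization wastes no range on outliers; the inverse transform (the same map) recovers the original basis.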

The Production Deployment Matrix

Choosing a quantization strategy requires balancing model quality, inference speed, hardware compatibility, and operational complexity.

| Framework | Best For | Quantization | Memory | Speed |
|-----------|----------|--------------|--------|-------|
| vLLM | High-throughput serving | GPTQ, AWQ, FP8 | Moderate | Highest |
| TensorRT-LLM | NVIDIA optimization | INT8, FP8 | Low | Very high |
| llama.cpp | CPU/Apple Silicon | GGUF K-quants | Lowest | Moderate |
| MLX | Apple Silicon | INT4, FP16 | Low | High |

For a 70B model on consumer hardware:

RTX 4090 (24 GB):
├── GPTQ-4bit: ✅ 35 GB → fits with offloading
├── AWQ-4bit:  ✅ 35 GB → fits with offloading
├── GGUF Q4_K_M: ✅ partial GPU offload via llama.cpp
└── FP8: ❌ Requires H100

MacBook Pro M3 Max (128 GB unified):
├── GGUF Q4_K_M: ✅ Full GPU acceleration
├── MLX INT4: ✅ Fast inference
└── vLLM: ❌ Not optimized for Metal

The Quality-Speed-Size Trilemma

Quantization forces choices. The fundamental trade-offs:

Perplexity vs. Compression:

  • FP16: 5.47 baseline
  • INT8: 5.48 (+0.2%)
  • INT4 GPTQ: 5.52 (+0.9%)
  • INT4 GGUF: 5.55 (+1.5%)
  • INT2 AQLM: 6.30 (+15%)

Speed vs. Quality:

  • AWQ: Fastest 4-bit, slight quality edge over GPTQ
  • GPTQ: Slower quantization, excellent quality
  • FP8: Hardware-accelerated on H100, limited compatibility

Memory vs. Complexity:

  • Per-tensor: Simplest, worst quality
  • Per-group: Best balance for 4-bit
  • Codebook (AQLM): Best for 2-bit, complex implementation

The emerging consensus: 4-bit per-group quantization (GPTQ or AWQ) offers the optimal trade-off for most production deployments. Reserve 2-bit for edge deployment where memory constraints dominate, and 8-bit for quality-critical applications.

The Future: Quantization-Aware Training

Post-training quantization treats the model as a black box. Quantization-aware training (QAT) integrates quantization into the training loop, allowing the model to adapt to reduced precision.

The technique: during training, forward passes use fake quantization:

$$\tilde{W} = W + \text{sg}\left(\text{round}(W/s) \cdot s - W\right)$$

Here $\text{sg}(\cdot)$ denotes stop-gradient (detach in PyTorch). The forward pass propagates quantized values, while the backward pass treats the rounding as the identity (the straight-through estimator), so gradients reach $W$ unchanged. The model learns weight distributions resilient to quantization.
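A minimal straight-through fake-quantization sketch in PyTorch:

```python
import torch

# Straight-through fake quantization: quantized forward, identity backward
def fake_quantize(w, scale, n_bits=4):
    qmax = 2 ** (n_bits - 1) - 1
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    # Forward: w_q; backward: gradient flows to w as if nothing happened
    return w + (w_q - w).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quantize(w, scale=0.1).sum().backward()
print(w.grad)  # all ones: the rounding is invisible to the gradient
```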

Recent work (e.g., from Unsloth) demonstrates that QAT can reduce the 4-bit perplexity gap to under 0.5%—essentially closing the quality gap. The cost? Training time increases by 20-30%, and the method requires access to training infrastructure.

For practitioners: PTQ remains the default choice. QAT becomes necessary only when:

  1. Quality degradation exceeds acceptable thresholds
  2. Target hardware has severe memory constraints
  3. Model architecture is quantization-hostile (e.g., Mamba with its state dependencies)

Quantization in Practice: Decision Framework

Start: What's your constraint?
│
├─ Memory-bound (consumer GPU)?
│   ├─ NVIDIA GPU → GPTQ-4bit or AWQ-4bit
│   ├─ AMD GPU → GGUF via llama.cpp
│   └─ Apple Silicon → GGUF Q4_K_M or MLX
│
├─ Throughput-bound (production serving)?
│   ├─ H100/H200 → FP8 via vLLM or TensorRT-LLM
│   ├─ A100 → INT8 via vLLM
│   └─ Consumer GPU → AWQ-4bit via vLLM
│
├─ Quality-critical?
│   ├─ INT8 via GPTQ-AWQ hybrid
│   └─ Consider QAT if PTQ fails
│
└─ Edge deployment (CPU/mobile)?
    ├─ GGUF Q4_K_M for laptops
    └─ AQLM/QuIP# 2-bit for extreme constraints

The mathematics of quantization reveals a profound truth about neural networks: they’re surprisingly robust to precision reduction. The information encoded in a 70-billion parameter model doesn’t require 16-bit floating point for every weight. Most parameters contribute marginally to output quality; protecting the vital few—whether through Hessian-aware compensation, activation-guided scaling, or codebook decomposition—preserves the model’s essence while enabling deployment scenarios previously impossible.

The next time you run a 70B model on your laptop, remember: you’re not running a degraded approximation. You’re running the same mathematical function, encoded more efficiently, made possible by a decade of quantization research that transformed “impossible” into routine.