A 70-billion parameter model requires 140 GB of GPU memory in FP16. A consumer RTX 4090 has 24 GB. This arithmetic gap defined the boundary between “enterprise AI” and “what you can run at home” until quantization mathematics cracked the code.
The counterintuitive reality: reducing precision from 16 bits to 4 bits—a 75% compression—often preserves over 95% of model quality. Not through magic, but through a profound understanding of how neural networks encode information.
The Quantization Equation: From FP16 to INT4
At its core, quantization maps continuous floating-point values to discrete integer representations. For a weight matrix $W$, the quantized representation $\hat{W}$ follows:
$$\hat{W} = \text{round}\left(\frac{W - z}{s}\right) \cdot s + z$$

where $s$ is the scale factor and $z$ is the zero-point offset. The deceptively simple formula hides immense complexity: choosing $s$ and $z$ optimally determines whether your model degrades gracefully or catastrophically.
The quantization error for uniform quantization follows:
$$\text{SQNR} \approx 6.02n + 1.76 \text{ dB}$$

where $n$ is the number of bits. At $n = 4$ this classic signal processing result predicts only about 26 dB, suggesting 4-bit quantization should introduce significant noise. Yet LLMs thrive. Why?
```python
import torch

# The fundamental quantization operation
def quantize(tensor, n_bits=4):
    # Find range
    min_val, max_val = tensor.min(), tensor.max()
    # Calculate scale and zero-point
    qmin, qmax = 0, 2**n_bits - 1
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = torch.round(qmin - min_val / scale)
    # Map floats onto the integer grid [qmin, qmax]
    quantized = torch.clamp(
        torch.round(tensor / scale + zero_point),
        qmin, qmax
    )
    return quantized, scale, zero_point
```
The answer lies in weight distribution. LLM weights follow approximately Gaussian distributions centered near zero. Most information concentrates in a narrow band, with extreme values representing a tiny fraction. Per-tensor quantization wastes precision on outliers while under-representing the dense center.
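This concentration effect is easy to see numerically. The sketch below is illustrative only (synthetic Gaussian "weights", helper names are mine): it compares 4-bit quantization error when the range is set by the absolute maximum versus by a clipped 3-sigma band.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000)  # synthetic weights, roughly Gaussian

def quant_mse(w, max_val, n_bits=4):
    # Symmetric uniform quantization clipped to [-max_val, max_val]
    qmax = 2 ** (n_bits - 1) - 1
    scale = max_val / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax) * scale
    return np.mean((w - q) ** 2)

mse_absmax = quant_mse(w, np.abs(w).max())  # range dictated by the extremes
mse_clipped = quant_mse(w, 3.0)             # range covering the dense 3-sigma band
print(mse_absmax, mse_clipped)  # clipping the rare tails lowers overall error
```

Even though clipping discards the tails entirely, the finer step inside the dense band wins: exactly the trade that per-channel and per-group schemes exploit locally.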
The Granularity Hierarchy: Per-Tensor to Per-Channel
Modern quantization employs progressively finer granularity:
| Granularity | Parameters | Memory Overhead | Quality |
|---|---|---|---|
| Per-tensor | 1 scale per tensor | Negligible | Poor |
| Per-channel | 1 scale per output channel | 0.01% | Good |
| Per-group | 1 scale per 128 weights | 0.5% | Excellent |
| Per-vector | Learned codebooks | 1-2% | Near-lossless |
Per-group quantization, introduced in GPTQ and refined in AWQ, partitions weights into small groups (typically 64-128 elements), each with independent scaling. This local adaptation captures the varying dynamic ranges across a weight matrix.
```python
import torch

# Per-group quantization splits each row of the weight matrix into
# groups; group size 128 is typical for 4-bit quantization
def group_quantize(weight, group_size=128, n_bits=4):
    # Assumes in_features is divisible by group_size
    groups = weight.reshape(-1, group_size)
    # Independent symmetric scaling per group
    qmax = 2 ** (n_bits - 1) - 1
    scales = groups.abs().max(dim=-1).values / qmax
    quantized = torch.clamp(torch.round(groups / scales.unsqueeze(-1)), -qmax - 1, qmax)
    return quantized, scales
```
The overhead? For a 7B model with group size 128, that is roughly 55 million scales, about 110 MB in FP16 (under 1% of the FP16 model's footprint): a small memory tax for substantial quality gains.
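The overhead arithmetic is easy to sanity-check for any configuration. The helper below is a back-of-envelope sketch (the function name is mine; `scale_bytes=2` assumes FP16 scale storage):

```python
def scale_overhead_mb(n_params, group_size=128, scale_bytes=2):
    # One scale per group of weights
    n_groups = n_params // group_size
    return n_groups * scale_bytes / 1e6

print(scale_overhead_mb(7_000_000_000))                 # FP16 scales
print(scale_overhead_mb(7_000_000_000, scale_bytes=1))  # FP8 scales halve it
```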
GPTQ: The Hessian-Guided Revolution
In late 2022, researchers from IST Austria and ETH Zurich published the paper that became an industry standard. GPTQ (post-training quantization for generative pre-trained transformers) builds on Optimal Brain Quantization (OBQ), with a crucial simplification: quantizing weights column by column in a fixed order, folding each rounding error into the not-yet-quantized weights, is nearly as accurate as OBQ's greedy per-weight ordering and orders of magnitude faster.
The algorithm minimizes layer-wise output error:
$$\arg\min_{\hat{W}} \|WX - \hat{W}X\|^2$$

GPTQ maintains a Hessian inverse $H^{-1} = (XX^T)^{-1}$ updated after each weight quantization. When quantizing weight $w_q$ to $\hat{w}_q$, the error propagates to the remaining weights:
$$\delta_F = -\frac{w_q - \hat{w}_q}{[H^{-1}]_{qq}} \cdot H^{-1}_{:,q}$$

This recursive compensation means early quantization errors get absorbed by later adjustments—a mathematical sleight of hand that preserves output fidelity.
```python
# Simplified GPTQ core loop (production implementations batch these
# updates and work from a Cholesky factorization of H^-1)
def gptq_layer(W, H_inv, quantizer):
    for i in reversed(range(W.shape[1])):
        # Quantize current column
        w_q = quantizer.quantize(W[:, i])
        # Normalized quantization error for this column
        error = (W[:, i] - w_q) / H_inv[i, i]
        W[:, i] = w_q
        # Fold the error into the not-yet-quantized columns
        W[:, :i] -= error.unsqueeze(1) * H_inv[i, :i].unsqueeze(0)
    return W
```
The practical impact: a LLaMA-70B model quantized to 4-bit with GPTQ achieves 99.2% of its original MMLU score while requiring only 35 GB VRAM—a 75% reduction.
AWQ: Protecting the Vital 1%
MIT’s Song Han lab observed something peculiar in 2023: protecting merely 1% of weights at full precision preserves nearly all model quality. The insight launched Activation-aware Weight Quantization (AWQ).
The key observation: weight importance correlates with activation magnitude. A weight connected to high-activation channels disproportionately influences output. AWQ identifies these “salient weights” by running a calibration dataset through the model and recording activation magnitudes.
$$s_j = \max_i |X_{ij}|$$

The scale factor $s_j$ for channel $j$ derives from the maximum activation magnitude. Channels with higher activations receive proportionally larger quantization regions.
```python
import torch

# AWQ salient weight identification
def compute_importance(activations):
    # activations: [batch, seq_len, hidden_dim] -> per-channel max magnitude
    return activations.abs().amax(dim=(0, 1))

def awq_quantize(W, activation_scales, group_size=128):
    # Scale up salient input channels so they get finer relative precision
    W_scaled = W * activation_scales.unsqueeze(0)
    # Group-wise quantization (group_quantize as defined above)
    Q, scales = group_quantize(W_scaled, group_size)
    # The activation scales are divided back out during dequantization
    return Q, scales, activation_scales
```
AWQ’s advantage over GPTQ: no per-weight error propagation means faster quantization (seconds vs. minutes) and comparable quality. The trade-off? AWQ’s quality hinges on a representative calibration set; GPTQ is less sensitive to what calibration data it sees.
| Method | Calibration | Speed | Perplexity (Llama-2-7B) |
|---|---|---|---|
| FP16 | — | — | 5.47 |
| RTN (round-to-nearest) | None | Instant | 12.30 |
| GPTQ-4bit | Random | Slow | 5.52 |
| AWQ-4bit | Calibration | Fast | 5.50 |
| GGUF Q4_K_M | None | Fast | 5.55 |
GGUF and K-Quants: The Consumer Revolution
While GPTQ and AWQ target GPU inference, the llama.cpp project democratized LLMs for CPU deployment. Its GGUF format introduced “K-quants”—a family of mixed-precision quantization schemes that balance size, speed, and quality.
The innovation: not all tensors deserve equal precision. Some tensors (notably attention value projections and FFN down-projections) are more sensitive to quantization than others, and K-quants exploit this heterogeneity.
The Q4_K_M scheme applies:
- 4-bit quantization to most weights
- 6-bit quantization to the most sensitive tensors (portions of the attention value and FFN down-projection weights)
- Hierarchical super-blocks of 256 weights, subdivided into sub-blocks of 32 with their own quantized scales
```python
import torch

# K-quant block structure: a super-block of 256 weights is
# subdivided into 8 sub-blocks of 32, each with its own scale
class KQuantBlock:
    def __init__(self, weights, n_bits=4):
        # Super-block: one scale per 256 weights
        self.super_scale = weights.abs().max()
        # Sub-blocks: 32 weights each, 8 sub-blocks
        self.sub_scales = weights.reshape(8, 32).abs().max(dim=-1).values
        # Quantize with hierarchical scaling
        self.quantized = self._quantize_hierarchical(weights, n_bits)

    def _quantize_hierarchical(self, weights, n_bits):
        # In the real format the sub-scales are themselves quantized
        # relative to super_scale; this sketch keeps them in FP
        qmax = 2 ** (n_bits - 1) - 1
        scales = self.sub_scales.clamp(min=1e-8) / qmax
        return torch.round(weights.reshape(8, 32) / scales.unsqueeze(-1))
```
The result: Q4_K_M achieves perplexity within 2% of FP16 while shrinking the model to under a third of its FP16 size (roughly 4.85 bits per weight). For inference on Apple Silicon, K-quants with Metal acceleration achieve 30+ tokens/second on a MacBook Pro—making local LLMs practical for everyday use.
FP8: The Hardware-Accelerated Frontier
NVIDIA’s Hopper architecture (H100/H200) introduced native FP8 support—a watershed moment for quantization. Unlike INT8, FP8 preserves the dynamic range of floating-point while halving memory bandwidth.
FP8 comes in two formats:
| Format | Exponent | Mantissa | Range | Use Case |
|---|---|---|---|---|
| E4M3 | 4 bits | 3 bits | ±448 | Weights, activations |
| E5M2 | 5 bits | 2 bits | ±57,344 | Gradients |
The E4M3 format provides sufficient precision for inference; E5M2 handles gradient computation during training. Combined with Tensor Core acceleration, FP8 enables 2× throughput improvement over FP16.
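Where does ±448 come from? A minimal decoder for the E4M3 "fn" variant makes the bit layout concrete. This is an illustrative sketch of the format, not a library API:

```python
def decode_e4m3(bits):
    # Decode an 8-bit E4M3 (fn variant) pattern: 1 sign, 4 exponent, 3 mantissa bits
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> 3) & 0b1111
    man = bits & 0b111
    if exp == 0b1111 and man == 0b111:
        return float("nan")  # the fn variant reserves only NaN, no infinities
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6  # subnormals
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)  # exponent bias = 7

print(decode_e4m3(0b0_1111_110))  # largest finite value: 1.75 * 2^8 = 448.0
```

Because the all-ones exponent is still usable for finite values (only one pattern is NaN), E4M3 squeezes an extra binade of range out of its 8 bits.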
```python
import torch

# FP8 quantization to the E4M3 format (PyTorch 2.1+ float8 dtypes)
def fp8_e4m3_quantize(tensor):
    # E4M3: 4 exponent bits, 3 mantissa bits
    # Range: [-448, 448]; the "fn" variant has NaN but no infinities
    max_val = 448.0
    # Clamp to representable range
    tensor_clamped = torch.clamp(tensor, -max_val, max_val)
    # Cast to FP8; matmuls on this dtype hit Tensor Cores on H100+
    return tensor_clamped.to(torch.float8_e4m3fn)
```
DeepSeek-V3’s training leveraged FP8 extensively, achieving 2× training efficiency while maintaining model quality. The implication: FP8 isn’t just an inference optimization—it’s reshaping how we train trillion-parameter models.
The Outlier Problem: Why Quantization Fails
Not all quantization stories end happily. In 2022, researchers discovered that LLMs develop “activation outliers”—channels with magnitudes 100× larger than typical. These outliers break naive quantization.
The mechanism: during training, certain attention heads develop outsized influence. Their outputs exhibit massive dynamic range, concentrated in specific feature dimensions. Per-tensor quantization, forced to accommodate these outliers, uses a scale factor that crushes precision for the remaining 99% of activations.
$$\text{Scale} = \frac{\max(|\text{activations}|)}{2^n - 1}$$

If one channel reaches 100 while others hover around 1, the quantization step becomes 100/15 ≈ 6.7 for 4-bit—obliterating fine distinctions in the majority of channels.
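The effect is easy to reproduce with synthetic activations (illustrative numbers only, not measurements from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
typical = rng.standard_normal(1000)        # ordinary channels, |a| around 1
with_outlier = np.append(typical, 100.0)   # one outlier channel at 100

n_bits = 4
step_typical = np.abs(typical).max() / (2**n_bits - 1)
step_outlier = np.abs(with_outlier).max() / (2**n_bits - 1)
print(step_typical, step_outlier)  # the outlier inflates the step by over an order of magnitude
```

With the outlier present, nearly every ordinary activation rounds into the one or two quantization bins nearest zero.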
SmoothQuant (2022) proposed an elegant solution: migrate quantization difficulty from activations to weights. Since weights are static, they can be pre-processed:
$$\tilde{X} = \text{diag}(s)^{-1} X, \quad \tilde{W} = \text{diag}(s) W$$

The smoothing factor $s_j = \max(|X_j|)^\alpha / \max(|W_j|)^{1-\alpha}$ balances activation and weight magnitudes. With $\alpha \approx 0.5$, SmoothQuant enables INT8 quantization with near-zero quality loss.
```python
import torch

# SmoothQuant: migrate quantization difficulty from activations to weights
def smooth_layer(weight, activations, alpha=0.5):
    # Per-input-channel magnitude statistics
    activation_scales = activations.abs().amax(dim=0)
    weight_scales = weight.abs().amax(dim=0)
    smooth_scale = (activation_scales ** alpha) / (weight_scales ** (1 - alpha))
    # Weights absorb the scale; activations are divided by it at runtime
    smoothed_weight = weight * smooth_scale.unsqueeze(0)
    smoothed_activation_scale = activation_scales / smooth_scale
    return smoothed_weight, smoothed_activation_scale
```
KV Cache Quantization: The Long-Context Memory Saver
For long-context inference, the KV cache dominates memory consumption. A 70B model processing 128K tokens requires approximately 80 GB just for KV storage. Quantization offers a 2-4× reduction.
The challenge: KV cache values are dynamic, generated during inference. Static calibration methods don’t apply. The solution? Per-token scaling with FP8.
```python
import torch

# FP8 KV cache with per-token scales (PyTorch 2.1+ float8 dtype)
class QuantizedKVCache:
    E4M3_MAX = 448.0

    def __init__(self, n_heads, head_dim, max_seq_len):
        shape = (max_seq_len, n_heads, head_dim)
        self.k_cache = torch.zeros(shape, dtype=torch.float8_e4m3fn)
        self.v_cache = torch.zeros(shape, dtype=torch.float8_e4m3fn)
        # One scale factor per cached token
        self.k_scales = torch.ones(max_seq_len)
        self.v_scales = torch.ones(max_seq_len)

    def update(self, keys, values, positions):
        # keys/values: [n_new_tokens, n_heads, head_dim]
        # Dynamic per-token quantization into the E4M3 range
        k_s = keys.abs().amax(dim=(-2, -1)).clamp(min=1e-8) / self.E4M3_MAX
        v_s = values.abs().amax(dim=(-2, -1)).clamp(min=1e-8) / self.E4M3_MAX
        self.k_scales[positions], self.v_scales[positions] = k_s, v_s
        self.k_cache[positions] = (keys / k_s.view(-1, 1, 1)).to(torch.float8_e4m3fn)
        self.v_cache[positions] = (values / v_s.view(-1, 1, 1)).to(torch.float8_e4m3fn)
```
vLLM’s FP8 KV cache implementation achieves 2× memory reduction with perplexity degradation under 1%. For production systems serving long contexts, this translates to 2× concurrent requests or 2× context length per request.
Extreme Quantization: The 2-Bit Frontier
What happens below 4 bits? The perplexity gap widens dramatically—unless you use specialized algorithms.
AQLM (Additive Quantization for LLMs) introduced codebook-based compression:
$$w \approx \sum_{i=1}^{k} c_{i, j_i}$$

Each weight (in practice, each small group of weights) is reconstructed as a sum of $k$ codebook entries. In AQLM's 2-bit configuration, a group of 8 weights shares a 16-bit index into a $2^{16}$-entry vector codebook: 2 bits of index storage per weight, while the codebook entries themselves stay at full precision, giving far more expressive reconstruction than a uniform 2-bit grid.
```python
import torch
import torch.nn as nn

# Simplified AQLM-style additive quantization (scalar codebooks for
# illustration; AQLM proper uses vector codebooks over weight groups)
class AQLMQuantizer(nn.Module):
    def __init__(self, n_codebooks=2, codebook_size=16):
        super().__init__()
        self.codebooks = nn.ParameterList(
            nn.Parameter(torch.randn(codebook_size)) for _ in range(n_codebooks)
        )
        self.indices = None  # learned during calibration

    def dequantize(self):
        # Each weight is the sum of one entry from every codebook
        return sum(cb[idx] for cb, idx in zip(self.codebooks, self.indices))
```
QuIP# improves on AQLM with randomized Hadamard transforms that “incohere” weight distributions, making them more amenable to uniform quantization. The result: 2-bit QuIP# models achieve perplexity within 15% of FP16—still useful for many applications.
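The incoherence idea can be sketched in a few lines: an orthonormal Hadamard rotation combined with random sign flips spreads one outlier's energy across every coordinate, flattening the distribution the quantizer sees. The code below is a toy illustration of the principle, not QuIP#'s actual pipeline:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal

rng = np.random.default_rng(0)
w = rng.standard_normal(256)
w[7] = 25.0                                # a single extreme weight
signs = rng.choice([-1.0, 1.0], size=256)  # random sign flips
w_rot = hadamard(256) @ (signs * w)        # randomized Hadamard transform

# The rotation preserves the norm but slashes the peak magnitude
print(np.abs(w).max(), np.abs(w_rot).max())
```

Because the transform is orthogonal, it can be inverted exactly after dequantization; the quantizer simply operates in a basis where the weights look far more uniform.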
The Production Deployment Matrix
Choosing a quantization strategy requires balancing model quality, inference speed, hardware compatibility, and operational complexity.
| Framework | Best For | Quantization | Memory | Speed |
|---|---|---|---|---|
| vLLM | High-throughput serving | GPTQ, AWQ, FP8 | Moderate | Highest |
| TensorRT-LLM | NVIDIA optimization | INT8, FP8 | Low | Very High |
| llama.cpp | CPU/Apple Silicon | GGUF K-quants | Lowest | Moderate |
| MLX | Apple Silicon | INT4, FP16 | Low | High |
For a 70B model on consumer hardware:
```
RTX 4090 (24 GB):
├── GPTQ-4bit: ⚠️ ~35 GB → needs CPU offloading
├── AWQ-4bit: ⚠️ ~35 GB → needs CPU offloading
├── GGUF Q4_K_M: ✅ partial GPU offload via llama.cpp
└── FP8: ❌ serving stacks target Hopper-class GPUs

MacBook Pro M3 Max (128 GB unified):
├── GGUF Q4_K_M: ✅ full Metal GPU acceleration
├── MLX INT4: ✅ fast inference
└── vLLM: ❌ not optimized for Metal
```
The Quality-Speed-Size Trilemma
Quantization forces choices. The fundamental trade-offs:
Perplexity vs. Compression:
- FP16: 5.47 baseline
- INT8: 5.48 (+0.2%)
- INT4 GPTQ: 5.52 (+0.9%)
- INT4 GGUF: 5.55 (+1.5%)
- INT2 AQLM: 6.30 (+15%)
Speed vs. Quality:
- AWQ: Fastest 4-bit, slight quality edge over GPTQ
- GPTQ: Slower quantization, excellent quality
- FP8: Hardware-accelerated on H100, limited compatibility
Memory vs. Complexity:
- Per-tensor: Simplest, worst quality
- Per-group: Best balance for 4-bit
- Codebook (AQLM): Best for 2-bit, complex implementation
The emerging consensus: 4-bit per-group quantization (GPTQ or AWQ) offers the optimal trade-off for most production deployments. Reserve 2-bit for edge deployment where memory constraints dominate, and 8-bit for quality-critical applications.
The Future: Quantization-Aware Training
Post-training quantization treats the model as a black box. Quantization-aware training (QAT) integrates quantization into the training loop, allowing the model to adapt to reduced precision.
The technique: during training, forward passes use fake quantization:
$$\tilde{W} = W + \text{detach}\left(\text{round}(W/s) \cdot s - W\right)$$

The detach stops gradients at the non-differentiable rounding: the forward pass sees quantized values, while the backward pass treats quantization as the identity (the straight-through estimator). The model learns weight distributions resilient to quantization.
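A minimal PyTorch sketch of this fake-quantization step, written as a custom autograd function (illustrative; real QAT frameworks add per-channel scales, observers, and range learning):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # Forward: quantize-dequantize onto the integer grid
        return torch.round(w / scale) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Backward: straight-through estimator passes gradients unchanged
        return grad_output, None

w = torch.randn(4, requires_grad=True)
y = FakeQuantSTE.apply(w, 0.25)
y.sum().backward()
print(w.grad)  # all ones: the rounding is invisible to the gradient
```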
Recent work (e.g. from Unsloth) reports that QAT can reduce the 4-bit perplexity gap to under 0.5%—essentially closing the quality gap. The cost? Training time increases 20-30%, and it requires access to training infrastructure.
For practitioners: PTQ remains the default choice. QAT becomes necessary only when:
- Quality degradation exceeds acceptable thresholds
- Target hardware has severe memory constraints
- Model architecture is quantization-hostile (e.g., Mamba with its state dependencies)
Quantization in Practice: Decision Framework
```
Start: What's your constraint?
│
├─ Memory-bound (consumer GPU)?
│  ├─ NVIDIA GPU → GPTQ-4bit or AWQ-4bit
│  ├─ AMD GPU → GGUF via llama.cpp
│  └─ Apple Silicon → GGUF Q4_K_M or MLX
│
├─ Throughput-bound (production serving)?
│  ├─ H100/H200 → FP8 via vLLM or TensorRT-LLM
│  ├─ A100 → INT8 via vLLM
│  └─ Consumer GPU → AWQ-4bit via vLLM
│
├─ Quality-critical?
│  ├─ INT8 via GPTQ-AWQ hybrid
│  └─ Consider QAT if PTQ fails
│
└─ Edge deployment (CPU/mobile)?
   ├─ GGUF Q4_K_M for laptops
   └─ AQLM/QuIP# 2-bit for extreme constraints
```
The mathematics of quantization reveals a profound truth about neural networks: they’re surprisingly robust to precision reduction. The information encoded in a 70-billion parameter model doesn’t require 16-bit floating point for every weight. Most parameters contribute marginally to output quality; protecting the vital few—whether through Hessian-aware compensation, activation-guided scaling, or codebook decomposition—preserves the model’s essence while enabling deployment scenarios previously impossible.
The next time you run a 70B model on your laptop, remember: you’re not running a crude knock-off. You’re running a carefully constructed approximation of the same mathematical function, accurate to within a few percent and encoded far more efficiently, made possible by a decade of quantization research that transformed “impossible” into routine.