How Flash Attention Revolutionized LLM Training: The IO-Aware Algorithm Behind Modern Long-Context Models

In 2022, training a transformer with a 16K context length meant either provisioning massive GPU memory or accepting severe approximations. Standard attention’s memory grows quadratically with sequence length: at a 32K context, a single head’s attention score matrix holds about a billion entries, over 4GB in FP32, before counting heads or batch size. Then Flash Attention arrived, reducing attention’s memory footprint from $O(N^2)$ to $O(N)$ while computing exact attention, not an approximation. This breakthrough helped pave the way for 128K-token context windows like GPT-4 Turbo’s, Llama’s extended sequences, and virtually every modern long-context LLM. The key insight wasn’t algorithmic cleverness alone. It was understanding that on modern GPUs, memory bandwidth, not compute, is the bottleneck. ...

10 min · 1924 words
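
The quadratic growth described in the Flash Attention teaser can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the article: the helper name `attention_matrix_bytes` and its parameters are hypothetical, and it counts only the N × N score matrix that standard attention materializes.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def attention_matrix_bytes(seq_len: int,
                           bytes_per_element: int = 2,
                           num_heads: int = 1,
                           batch_size: int = 1) -> int:
    """Bytes needed to materialize the full seq_len x seq_len score matrix.

    Hypothetical helper for illustration; it ignores Q, K, V, and outputs,
    which grow only linearly with sequence length.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for n in (4096, 16384, 32768):
        fp16 = attention_matrix_bytes(n, bytes_per_element=2) / GIB
        fp32 = attention_matrix_bytes(n, bytes_per_element=4) / GIB
        print(f"N={n:>6}: {fp16:6.3f} GiB (FP16), {fp32:6.3f} GiB (FP32) per head")
```

Doubling the sequence length quadruples the result, which is the scaling that IO-aware tiling avoids by never storing the full matrix in GPU memory.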

When the Path Matters More Than the Answer: How Process Reward Models Transform LLM Reasoning

A math student solves a complex integration problem. Her final answer is correct, but halfway through, she made a sign error that accidentally canceled out in the next step. The teacher gives full marks—after all, the answer is right. But should it count? This scenario captures the fundamental flaw in how we’ve traditionally evaluated Large Language Model (LLM) reasoning: Outcome Reward Models (ORMs) only check the final destination, ignoring whether the path was sound. Process Reward Models (PRMs) represent a paradigm shift—verifying every step of reasoning, catching those hidden errors that coincidentally produce correct answers, and enabling the test-time scaling that powers reasoning models like OpenAI’s o1 and DeepSeek-R1. ...

7 min · 1473 words

How 4 Bits Preserves 99% Quality: The Mathematics Behind LLM Quantization

A 70-billion-parameter model requires 140 GB of GPU memory in FP16. A consumer RTX 4090 has 24 GB. This arithmetic gap defined the boundary between “enterprise AI” and “what you can run at home” until quantization mathematics cracked the code. The counterintuitive reality: reducing precision from 16 bits to 4 bits, a 75% cut in weight storage, often preserves over 95% of model quality. Not through magic, but through a profound understanding of how neural networks encode information. ...

11 min · 2201 words
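
The memory figures in the quantization teaser follow from parameters × bits ÷ 8. The sketch below reproduces that arithmetic under a simplifying assumption: it counts raw weight storage only and ignores the per-group scales and zero-points that real 4-bit formats add. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Raw weight storage in decimal gigabytes (1 GB = 1e9 bytes).

    Hypothetical helper; real quantized checkpoints are somewhat larger
    because they also store quantization metadata.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 70e9  # the 70-billion-parameter model from the teaser
    baseline = weight_memory_gb(params, 16)
    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        saved = 100 * (1 - size / baseline)
        print(f"{bits:>2}-bit: {size:6.1f} GB ({saved:4.1f}% smaller than FP16)")
```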

The Hidden Memory Tax: Why Your 80GB GPU Still Can't Handle Long-Context LLMs

In March 2024, a team of researchers attempted to deploy a 70-billion parameter language model on a single NVIDIA H100 GPU with 80GB of VRAM. The model weights alone consumed approximately 140GB in FP16—already exceeding their hardware capacity. But even after applying 4-bit quantization to squeeze the weights down to ~40GB, the system still ran out of memory when processing contexts beyond 8,000 tokens. The culprit wasn’t the model size. It was something far more insidious: the KV cache. ...

9 min · 1846 words
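
To see why the KV cache in the last teaser can dominate memory, it helps to estimate its size directly: two tensors (keys and values) per layer, each of shape KV heads × head dimension per token. This is a rough sketch with an assumed configuration (80 layers, 128-dimensional heads, FP16 cache), chosen to resemble a typical 70B-class model. It is not the setup from the anecdote, and `kv_cache_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3  # bytes per gibibyte


def kv_cache_bytes(seq_len: int,
                   num_layers: int,
                   num_kv_heads: int,
                   head_dim: int,
                   batch_size: int = 1,
                   bytes_per_element: int = 2) -> int:
    """Bytes for the key and value caches across all layers.

    The leading factor of 2 accounts for storing both K and V.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


if __name__ == "__main__":
    # Assumed, illustrative configuration (not taken from the article).
    layers, head_dim = 80, 128
    for label, kv_heads in (("multi-head, 64 KV heads", 64),
                            ("grouped-query, 8 KV heads", 8)):
        for tokens in (8_000, 32_000, 128_000):
            size = kv_cache_bytes(tokens, layers, kv_heads, head_dim) / GIB
            print(f"{label:>26} | {tokens:>7} tokens | {size:7.2f} GiB")
```

Under these assumptions the cache grows linearly with both context length and batch size, so serving several long requests at once multiplies the figures above.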