How Flash Attention Revolutionized LLM Training: The IO-Aware Algorithm Behind Modern Long-Context Models

In 2022, training a transformer with 16K context length required either massive GPU memory or accepting severe approximations. Standard attention’s memory grew quadratically with sequence length—a 32K context demanded over 4GB just for intermediate attention matrices. Then Flash Attention arrived, reducing memory from $O(N^2)$ to $O(N)$ while computing exact attention, not an approximation. This breakthrough enabled GPT-4’s 128K context window, Llama’s extended sequences, and virtually every modern long-context LLM. The key insight wasn’t algorithmic cleverness alone—it was understanding that on modern GPUs, memory bandwidth, not compute, is the bottleneck. ...
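The "over 4GB" figure follows directly from the quadratic scaling: standard attention materializes an $N \times N$ score matrix. A back-of-envelope sketch (fp32 and a single head/layer assumed for illustration; not taken from the article):

```python
def attn_matrix_bytes(seq_len: int, dtype_bytes: int = 4) -> int:
    """Memory for one materialized N x N attention score matrix.

    Standard attention computes softmax(QK^T / sqrt(d)), which stores
    seq_len * seq_len scores before the softmax and the weighted sum.
    """
    return seq_len * seq_len * dtype_bytes

# At 32K context, a single fp32 score matrix already exceeds 4 GB:
gb = attn_matrix_bytes(32_768) / 1e9  # ~4.29 GB
```

Flash Attention avoids ever materializing this matrix by computing attention in tiles that fit in on-chip SRAM, which is why its extra memory is $O(N)$ rather than $O(N^2)$.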

10 min · 1924 words

When 90% of Your KV Cache Doesn't Matter: The Mathematics Behind Intelligent Token Eviction

A 70B parameter model with a 128K context window needs approximately 40 GB of GPU memory just for the KV cache. That’s before counting model weights, activations, or any other overhead. The KV cache grows linearly with sequence length, creating a fundamental barrier to long-context inference that no amount of GPU memory can solve. The breakthrough came from an unexpected observation: most tokens in your KV cache contribute almost nothing to the final output. Researchers discovered that intelligently evicting 90% of cached tokens often results in negligible accuracy loss. This isn’t compression through quantization—it’s compression through understanding which tokens actually matter. ...
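The ~40 GB figure is easy to reproduce. A minimal sketch, assuming Llama-2-70B-like dimensions (80 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16) — these specifics are illustrative assumptions, not stated in the article:

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 80,      # assumed: Llama-2-70B-like depth
    n_kv_heads: int = 8,     # assumed: GQA key/value heads
    head_dim: int = 128,     # assumed head dimension
    dtype_bytes: int = 2,    # fp16/bf16
) -> int:
    """KV cache size: 2 tensors (K and V) per layer,
    each of shape [n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# 128K tokens -> 40 GiB of cache, linear in seq_len
gib = kv_cache_bytes(131_072) / 2**30  # 40.0
```

Doubling the context doubles this number, which is the linear wall the article refers to; evicting 90% of cached tokens would shrink the same cache to ~4 GiB.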

7 min · 1331 words