Flash Attention

In 2022, training a transformer with 16K context length required either massive GPU memory or accepting severe approximations. Standard attention’s memory grew quadratically with sequence length—a 32K context demanded over 4GB just for intermediate attention matrices. Then Flash Attention arrived, reducing memory from $O(N^2)$ to $O(N)$ while computing exact attention, not an approximation. This breakthrough enabled GPT-4’s 128K context window, Llama’s extended sequences, and virtually every modern long-context LLM. The key insight wasn’t algorithmic cleverness alone—it was understanding that on modern GPUs, memory bandwidth, not compute, is the bottleneck. ...

Flash Attention

How Flash Attention Revolutionized LLM Training: The IO-Aware Algorithm Behind Modern Long-Context Models

How Ring Attention Breaks the Memory Barrier: Enabling Million-Token Contexts Through Distributed Computation