Linear Attention

The long-context problem has haunted Transformer architectures since their inception. Self-attention’s O(n²) complexity is well known, but the real tragedy lies deeper: even modern RNNs like Mamba, despite their linear complexity, plateau beyond 16K tokens of context. They simply cannot compress enough information into their fixed-size hidden states. What if the hidden state wasn’t a fixed-size bottleneck, but a model whose capacity could grow through learning, even at test time? This is the radical proposition of Test-Time Training (TTT), introduced by Stanford researchers in July 2024 and extended to production-ready systems by NVIDIA and Stanford in December 2025. The results are striking: TTT-Linear matches Transformer performance while maintaining RNN efficiency, and the latest TTT-E2E achieves 2.7x faster inference than full attention at 128K context length. ...

Every time you increase a Transformer’s context window from 4K to 128K tokens, you’re asking the attention mechanism to compute a score matrix with 1,024 times as many entries: the sequence is 32 times longer, and 32² = 1,024. The O(n²) complexity isn’t a bug. It is fundamental to how self-attention works: every token must attend to every other token, creating a quadratic relationship that makes long-context models prohibitively expensive. Mamba, introduced by Albert Gu and Tri Dao in December 2023, doesn’t just optimize around this constraint. It eliminates it entirely, replacing attention with selective state space models that scale linearly, O(n), in sequence length while matching Transformer quality. A Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size. The key insight? Making the model’s memory mechanism input-dependent, letting it choose what to remember and what to forget. ...
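The 4K-to-128K arithmetic above is easy to check. The snippet below is a minimal sketch, assuming "K" means 1,024 tokens and counting only the entries of a single n × n attention score matrix (one head, one layer); the helper name `attention_matrix_entries` is hypothetical and chosen for illustration.

```python
def attention_matrix_entries(n_tokens: int) -> int:
    """Number of pairwise scores in one full n x n self-attention matrix."""
    return n_tokens * n_tokens


# Assumption: "K" is read as 1,024 tokens.
short_ctx = 4 * 1024    # 4K tokens
long_ctx = 128 * 1024   # 128K tokens

# Growth in sequence length, which is what a linear-time O(n) model pays.
length_ratio = long_ctx // short_ctx

# Growth in attention matrix entries, which is what O(n^2) attention pays.
quadratic_ratio = (
    attention_matrix_entries(long_ctx) // attention_matrix_entries(short_ctx)
)

print(length_ratio)     # 32
print(quadratic_ratio)  # 1024
```

The same jump in context length costs a linear-time model a factor of 32 and full attention a factor of 1,024. That gap is what the comparison between Mamba and Transformers turns on.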


When the Hidden State Becomes the Model: How Test-Time Training Rewrites the Rules of Sequence Modeling

How Mamba Broke the O(n²) Barrier: The Mathematics Behind Linear-Time Sequence Modeling