When the Hidden State Becomes the Model: How Test-Time Training Rewrites the Rules of Sequence Modeling

The long-context problem has haunted Transformer architectures since their inception. Self-attention’s $O(n^2)$ complexity is well known, but the deeper problem is that even modern RNNs like Mamba, despite their linear complexity, see their performance plateau beyond 16K tokens of context. They simply cannot compress enough information into their fixed-size hidden states. What if the hidden state wasn’t a fixed-size bottleneck, but a model whose capacity could grow through learning, even at test time? This is the radical proposition of Test-Time Training (TTT), introduced by Stanford researchers in July 2024 and extended to production-ready systems by NVIDIA and Stanford in December 2025. The results are striking: TTT-Linear matches Transformer performance while maintaining RNN efficiency, and the latest TTT-E2E achieves 2.7x faster inference than full attention at 128K context length. ...

9 min · 1743 words
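
As a rough illustration of the idea in the teaser above, the following toy sketch treats the hidden state as a small model: a weight matrix `W` that takes one gradient step per token. The reconstruction loss, the learning rate `lr`, and the helper names `ttt_step` and `run_ttt` are assumptions made for this example, not the formulation used in the TTT papers.

```python
# Toy illustration only: the "hidden state" is a d x d weight matrix W that
# takes one gradient step per token on an assumed loss ||W x - x||^2.

def ttt_step(W, x, lr=0.1):
    """Return (updated W, loss before the update) for one token x."""
    d = len(x)
    pred = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
    err = [pred[i] - x[i] for i in range(d)]
    loss = sum(e * e for e in err)
    # Gradient of ||W x - x||^2 with respect to W is 2 * err * x^T.
    new_W = [[W[i][j] - lr * 2.0 * err[i] * x[j] for j in range(d)]
             for i in range(d)]
    return new_W, loss


def run_ttt(tokens, lr=0.1):
    """Scan a sequence, updating W at every token; return W and the losses."""
    d = len(tokens[0])
    W = [[0.0] * d for _ in range(d)]  # state size depends on d, not on length
    losses = []
    for x in tokens:
        W, loss = ttt_step(W, x, lr)
        losses.append(loss)
    return W, losses


# The same unit-length token seen five times.
tokens = [[1.0, 0.0, 0.0]] * 5
W, losses = run_ttt(tokens)
```

In this sketch the state stays d x d however long the sequence gets, and the loss on a repeated token shrinks as `W` learns to reconstruct it.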

How Mamba Broke the O(n²) Barrier: The Mathematics Behind Linear-Time Sequence Modeling

Every time you increase a Transformer’s context window from 4K to 128K tokens, you’re asking the attention mechanism to compute a matrix 1,024 times larger: the sequence is 32 times longer, and the attention matrix grows with the square of its length. The O(n²) complexity isn’t a bug; it’s fundamental to how self-attention works. Every token must attend to every other token, creating a quadratic relationship that makes long-context models prohibitively expensive. Mamba, introduced by Albert Gu and Tri Dao in December 2023, doesn’t just optimize around this constraint. It eliminates it entirely, replacing attention with selective state space models that scale linearly, O(n), in sequence length while matching Transformer quality. A Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size. The key insight? Making the model’s memory mechanism input-dependent, letting it choose what to remember and what to forget. ...

8 min · 1495 words
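
The 4K-to-128K comparison in the Mamba teaser above can be checked with a few lines of arithmetic. This is a minimal sketch that counts only the entries of a single n x n attention score matrix; the function name is illustrative, and real attention cost also depends on heads, layers, and hidden size.

```python
def attention_matrix_entries(n_tokens: int) -> int:
    """Number of pairwise scores in one full n x n self-attention matrix."""
    return n_tokens * n_tokens


short_ctx = 4 * 1024     # 4K tokens
long_ctx = 128 * 1024    # 128K tokens

# Growth in sequence length, which is also the growth of an O(n) scan.
length_ratio = long_ctx // short_ctx

# Growth in attention matrix size, which scales as O(n^2).
matrix_ratio = (attention_matrix_entries(long_ctx)
                // attention_matrix_entries(short_ctx))

print(f"sequence length grows {length_ratio}x")
print(f"attention matrix grows {matrix_ratio}x")
```

A linear-time model pays for the 32x increase in length only once, while full attention pays for it squared.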

How Mixture of Experts Scales to Trillion Parameters: The Sparse Architecture Revolution Behind Modern LLMs

When DeepSeek-V3 was released in December 2024, it achieved something remarkable: a 671-billion-parameter model that activates only 37 billion parameters per token. This isn’t a magic trick; it’s the power of Mixture of Experts (MoE), an architectural paradigm that has quietly become the backbone of nearly every frontier large language model. The math is compelling. At roughly two FLOPs per parameter, a dense 671B model would require approximately 1,342 GFLOPs (about 1.3 TFLOPs) per token during inference. DeepSeek-V3 achieves comparable performance with roughly 74 GFLOPs, an 18x reduction in compute. This isn’t incremental optimization; it’s a fundamental rethinking of how neural networks scale. ...

9 min · 1822 words
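
The compute figures in the Mixture of Experts teaser above follow from a common rule of thumb: about two FLOPs (one multiply, one add) per active parameter per token in a forward pass. The sketch below assumes that rule and ignores attention over the context, so the results are approximations; under it, the teaser's 1,342 and 74 come out in billions of FLOPs per token. The constant and function name are illustrative.

```python
FLOPS_PER_PARAM = 2  # assumed rule of thumb: one multiply plus one add per weight


def gflops_per_token(active_params: float) -> float:
    """Approximate forward-pass GFLOPs per token for the active parameters."""
    return FLOPS_PER_PARAM * active_params / 1e9


dense_gflops = gflops_per_token(671e9)  # every parameter used for every token
moe_gflops = gflops_per_token(37e9)     # only the activated parameters
reduction = dense_gflops / moe_gflops

print(f"dense: {dense_gflops:.0f} GFLOPs per token")
print(f"MoE:   {moe_gflops:.0f} GFLOPs per token")
print(f"reduction: {reduction:.1f}x")
```

The ratio depends only on the share of parameters that are active, 671B versus 37B, which is why it holds whichever unit the per-token cost is quoted in.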