When 10% Attention Beats 100%: The Mathematics Behind Sparse LLM Inference

The quadratic complexity of self-attention has haunted transformer architecture since its inception. As context windows expanded from 2K to 1M tokens, the O(N²) attention computation transformed from an annoyance into an existential bottleneck. Yet a counterintuitive discovery emerged in 2025-2026: computing only 5-20% of attention weights can match or exceed full attention performance. This isn’t compression with acceptable loss—it’s the revelation that transformers have been computing billions of unnecessary operations. The mathematics behind this phenomenon, and the engineering that exploits it, represents one of the most significant advances in LLM efficiency. ...

10 min · 2056 words
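The core phenomenon the excerpt describes can be sketched in a few lines. This is an illustrative toy, not any specific paper's algorithm: per-query top-k sparse attention keeps only the k highest-scoring keys for each query, masks the rest to negative infinity before the softmax, and compares the result against full attention. All names and parameter values here are assumptions for the sketch.

```python
import numpy as np

def full_attention(Q, K, V):
    # Standard scaled dot-product attention.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def topk_attention(Q, K, V, k):
    # Keep only each query's k highest-scoring keys; mask the rest.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    kth = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= kth, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, d = 256, 64
Q, K, V = rng.normal(size=(3, N, d))
out_full = full_attention(Q, K, V)
out_sparse = topk_attention(Q, K, V, k=N // 10)  # attend to only 10% of keys
err = np.linalg.norm(out_full - out_sparse) / np.linalg.norm(out_full)
print(f"relative error with 10% of keys: {err:.3f}")
```

On random Gaussian inputs the attention distribution is fairly diffuse, so the gap here overstates what happens in trained models, where attention mass concentrates on a few keys and top-k masking discards very little.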

When AI Learns to Remember: How Google's Titans Architecture Solved the Long-Term Memory Problem

The Transformer architecture revolutionized machine learning with its attention mechanism, enabling models to capture dependencies across entire sequences. Yet despite their dominance, Transformers suffer from a fundamental limitation: they have amnesia. Every token beyond the context window vanishes into oblivion, and even within that window, the quadratic complexity of attention makes scaling prohibitively expensive. In December 2024, Google Research introduced Titans, a new family of architectures that fundamentally rethinks how neural networks handle memory. The breakthrough isn’t just another efficiency trick—it’s a paradigm shift that treats memory itself as a learnable neural network, updated in real-time during inference through gradient descent on a surprise-based objective. ...

8 min · 1691 words
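The "memory updated by gradient descent on a surprise objective" idea from the excerpt above can be sketched under heavy simplifying assumptions: here the memory is a single linear map (the Titans paper uses a small MLP), surprise is the reconstruction error between what the memory predicts for a key and the observed value, and the learning rate, momentum, and decay values are purely illustrative.

```python
import numpy as np

class NeuralMemory:
    """Toy test-time memory: parameters updated online by gradient descent."""

    def __init__(self, dim, lr=0.01, decay=0.01, momentum=0.9):
        self.M = np.zeros((dim, dim))   # memory parameters (linear map)
        self.S = np.zeros((dim, dim))   # momentum buffer ("past surprise")
        self.lr, self.decay, self.momentum = lr, decay, momentum

    def surprise(self, k, v):
        # Squared reconstruction error: how surprising is (k, v)
        # given what the memory already stores?
        err = self.M @ k - v
        return float(err @ err)

    def write(self, k, v):
        # One gradient-descent step on the surprise objective,
        # with momentum; the decay term models forgetting.
        grad = np.outer(self.M @ k - v, k)           # d(loss)/dM
        self.S = self.momentum * self.S - self.lr * grad
        self.M = (1 - self.decay) * self.M + self.S

    def read(self, k):
        return self.M @ k

rng = np.random.default_rng(1)
dim = 16
mem = NeuralMemory(dim)
k, v = rng.normal(size=dim), rng.normal(size=dim)
before = mem.surprise(k, v)
for _ in range(50):
    mem.write(k, v)
after = mem.surprise(k, v)  # repeated writes make the pair unsurprising
print(f"surprise before: {before:.3f}, after: {after:.6f}")
```

The key property this illustrates is that the update happens at inference time, with no backpropagation through the outer model: each incoming (key, value) pair triggers a local gradient step, so strongly surprising inputs produce large memory updates while familiar ones barely change it.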