How Mamba Broke the O(n²) Barrier: The Mathematics Behind Linear-Time Sequence Modeling

Every time you increase a Transformer’s context window from 4K to 128K tokens, you’re asking the attention mechanism to compute a matrix 1,024 times larger: the sequence length grows 32x, and attention cost grows with its square. The O(n²) complexity isn’t a bug—it’s fundamental to how self-attention works. Every token must attend to every other token, creating a quadratic relationship that makes long-context models prohibitively expensive. Mamba, introduced by Albert Gu and Tri Dao in December 2023, doesn’t just optimize around this constraint. It eliminates it entirely, replacing attention with selective state space models that scale linearly, O(n), in sequence length while matching Transformer quality. A Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size. The key insight? Making the model’s memory mechanism input-dependent—letting it choose what to remember and what to forget. ...

8 min · 1495 words
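The input-dependent memory idea can be sketched in a few lines of NumPy. This is a toy one-channel selective scan, not Mamba’s actual fused kernel: the shapes, the softplus step size, and the random parameters are all illustrative assumptions. What it shows is the core mechanism, a per-token state update whose discretization step Δ is computed from the input, so cost is O(1) per token and O(n) overall.

```python
import numpy as np

# Toy selective state-space scan (illustrative; real Mamba uses
# learned projections, many channels, and a parallel scan kernel).
rng = np.random.default_rng(0)

d_state = 4   # state dimension N
seq_len = 8   # sequence length L

A = -np.abs(rng.standard_normal(d_state))  # negative -> stable dynamics
B = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)
x = rng.standard_normal(seq_len)           # one input channel

def selective_scan(x, A, B, C):
    h = np.zeros(d_state)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        # Selectivity: the step size depends on the input, so the
        # model can hold state (small delta) or flush it (large delta).
        delta = np.log1p(np.exp(x_t))   # softplus keeps delta > 0
        A_bar = np.exp(delta * A)       # zero-order-hold discretization
        B_bar = delta * B
        h = A_bar * h + B_bar * x_t     # O(1) recurrent update per token
        y[t] = C @ h
    return y

y = selective_scan(x, A, B, C)
print(y.shape)  # one output per token; total work is linear in seq_len
```

Contrast with attention: there is no n×n matrix anywhere, only a fixed-size state `h` carried across the sequence.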

How Mixture of Experts Scales to Trillion Parameters: The Sparse Architecture Revolution Behind Modern LLMs

When DeepSeek-V3 was released in December 2024, it achieved something remarkable: a 671-billion-parameter model that activates only 37 billion parameters per token. This isn’t a magic trick—it’s the power of Mixture of Experts (MoE), an architectural paradigm that has quietly become the backbone of nearly every frontier large language model. The math is compelling. At roughly 2 FLOPs per active parameter, a dense 671B model would require approximately 1,342 GFLOPs per token during inference. DeepSeek-V3 achieves comparable performance with roughly 74 GFLOPs—an 18x reduction in compute. This isn’t incremental optimization; it’s a fundamental rethinking of how neural networks scale. ...

9 min · 1822 words
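The compute claim above is back-of-envelope arithmetic worth checking. Assuming the common approximation of ~2 FLOPs per active parameter for a forward pass (an assumption, not a DeepSeek-published figure), the numbers work out like this:

```python
# Per-token forward-pass compute, using the rough 2-FLOPs-per-
# active-parameter approximation (an assumed heuristic).
dense_params = 671e9    # a hypothetical dense model of the same total size
active_params = 37e9    # parameters DeepSeek-V3 activates per token

flops_dense = 2 * dense_params    # ~1,342 GFLOPs per token
flops_sparse = 2 * active_params  # ~74 GFLOPs per token

print(f"dense:  {flops_dense / 1e9:.0f} GFLOPs/token")
print(f"sparse: {flops_sparse / 1e9:.0f} GFLOPs/token")
print(f"ratio:  {flops_dense / flops_sparse:.1f}x")
```

The ratio is just 671/37 ≈ 18.1, since the per-parameter cost cancels—the savings come entirely from activating fewer parameters, not from cheaper arithmetic.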