How Mamba Broke the O(n²) Barrier: The Mathematics Behind Linear-Time Sequence Modeling

Every time you increase a Transformer’s context window from 4K to 128K tokens, you’re asking the attention mechanism to compute a matrix 1,024 times larger: the sequence length grows 32x, and attention cost grows with its square. The O(n²) complexity isn’t a bug—it’s fundamental to how self-attention works. Every token must attend to every other token, creating a quadratic relationship that makes long-context models prohibitively expensive. Mamba, introduced by Albert Gu and Tri Dao in December 2023, doesn’t just optimize around this constraint. It eliminates it entirely, replacing attention with selective state space models that scale linearly, O(n), in sequence length while matching Transformer quality. A Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size. The key insight? Making the model’s memory mechanism input-dependent—letting it choose what to remember and what to forget. ...

8 min · 1495 words
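The input-dependent memory idea can be sketched in a few lines of NumPy. This is a toy one-channel selective scan, not Mamba’s actual fused kernel: the shapes, the softplus step size, and the random parameters are all illustrative assumptions. What it shows is the core mechanism, a per-token state update whose discretization step Δ is computed from the input, so cost is O(1) per token and O(n) overall.

```python
import numpy as np

# Toy selective state-space scan (illustrative; real Mamba uses
# learned projections, many channels, and a parallel scan kernel).
rng = np.random.default_rng(0)

d_state = 4   # state dimension N
seq_len = 8   # sequence length L

A = -np.abs(rng.standard_normal(d_state))  # negative -> stable dynamics
B = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)
x = rng.standard_normal(seq_len)           # one input channel

def selective_scan(x, A, B, C):
    h = np.zeros(d_state)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        # Selectivity: the step size depends on the input, so the
        # model can hold state (small delta) or flush it (large delta).
        delta = np.log1p(np.exp(x_t))   # softplus keeps delta > 0
        A_bar = np.exp(delta * A)       # zero-order-hold discretization
        B_bar = delta * B
        h = A_bar * h + B_bar * x_t     # O(1) recurrent update per token
        y[t] = C @ h
    return y

y = selective_scan(x, A, B, C)
print(y.shape)  # one output per token; total work is linear in seq_len
```

Contrast with attention: there is no n×n matrix anywhere, only a fixed-size state `h` carried across the sequence.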

How Mixture of Experts Scales to Trillion Parameters: The Sparse Architecture Revolution Behind Modern LLMs

When DeepSeek-V3 was released in December 2024, it achieved something remarkable: a 671-billion-parameter model that activates only 37 billion parameters per token. This isn’t a magic trick—it’s the power of Mixture of Experts (MoE), an architectural paradigm that has quietly become the backbone of nearly every frontier large language model. The math is compelling. At roughly 2 FLOPs per active parameter, a dense 671B model would require approximately 1,342 GFLOPs per token during inference. DeepSeek-V3 achieves comparable performance with roughly 74 GFLOPs—an 18x reduction in compute. This isn’t incremental optimization; it’s a fundamental rethinking of how neural networks scale. ...

9 min · 1822 words
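The compute claim above is back-of-envelope arithmetic worth checking. Assuming the common approximation of ~2 FLOPs per active parameter for a forward pass (an assumption, not a DeepSeek-published figure), the numbers work out like this:

```python
# Per-token forward-pass compute, using the rough 2-FLOPs-per-
# active-parameter approximation (an assumed heuristic).
dense_params = 671e9    # a hypothetical dense model of the same total size
active_params = 37e9    # parameters DeepSeek-V3 activates per token

flops_dense = 2 * dense_params    # ~1,342 GFLOPs per token
flops_sparse = 2 * active_params  # ~74 GFLOPs per token

print(f"dense:  {flops_dense / 1e9:.0f} GFLOPs/token")
print(f"sparse: {flops_sparse / 1e9:.0f} GFLOPs/token")
print(f"ratio:  {flops_dense / flops_sparse:.1f}x")
```

The ratio is just 671/37 ≈ 18.1, since the per-parameter cost cancels—the savings come entirely from activating fewer parameters, not from cheaper arithmetic.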