Every time you increase a Transformer’s context window from 4K to 128K tokens, you’re asking the attention mechanism to compute a matrix 1,024 times larger. The O(n²) complexity isn’t a bug—it’s fundamental to how self-attention works. Every token must attend to every other token, creating a quadratic relationship that makes long-context models prohibitively expensive.
Mamba, introduced by Albert Gu and Tri Dao in December 2023, doesn’t just optimize around this constraint. It eliminates it entirely, replacing attention with selective state space models that scale linearly O(n) while matching Transformer quality. A Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size. The key insight? Making the model’s memory mechanism input-dependent—letting it choose what to remember and what to forget.
The State Space Foundation
State space models originate from control theory, describing how a system’s state evolves over time. The continuous-time formulation is deceptively simple:
$$h'(t) = Ah(t) + Bx(t)$$

$$y(t) = Ch(t) + Dx(t)$$
Here, $x(t)$ is the input signal, $h(t)$ is the hidden state, and $y(t)$ is the output. The matrices $A$, $B$, $C$, and $D$ define the system’s dynamics. For deep learning, we discretize this continuous system:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t$$

$$y_t = C h_t$$
The discretization parameter $\Delta$ (delta) controls how finely we sample the continuous signal. This recurrence looks like an RNN—each state depends only on the previous state and current input, enabling O(1) memory per step during inference.
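The zero-order-hold rule used by S4 and Mamba (for a diagonal $A$: $\bar{A} = \exp(\Delta A)$, $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$) can be sketched in a few lines. The matrices and $\Delta$ below are made-up toy values, not trained parameters:

```python
import numpy as np

def discretize_zoh(A, B, delta):
    """Zero-order-hold discretization for a diagonal A (stored as a vector).

    A_bar = exp(delta * A)
    B_bar = (exp(delta * A) - 1) / A * B   (elementwise, diagonal case)
    """
    A_bar = np.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    return A_bar, B_bar

def ssm_step(h, x_t, A_bar, B_bar, C):
    """One recurrent step: h_t = A_bar * h_{t-1} + B_bar * x_t, y_t = C . h_t."""
    h = A_bar * h + B_bar * x_t
    return h, C @ h

A = np.array([-1.0, -2.0, -4.0])   # diagonal state matrix (negative = stable decay)
B = np.array([1.0, 1.0, 1.0])
C = np.array([0.5, 0.3, 0.2])
A_bar, B_bar = discretize_zoh(A, B, delta=0.1)

h = np.zeros(3)
for x_t in [1.0, 0.0, 0.0]:        # impulse input: watch the state decay
    h, y = ssm_step(h, x_t, A_bar, B_bar, C)
```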
But here’s the problem: standard SSMs are time-invariant. The matrices $A$, $B$, $C$ are fixed regardless of input content. This makes them computationally efficient (you can use convolutions via FFT), but it also means they can’t perform content-based reasoning. An SSM processes the word “apple” the same way whether it appears in a fruit recipe or a tech company’s earnings report.
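The convolutional view can be checked numerically: unrolling a time-invariant SSM gives the kernel $K_t = C\bar{A}^t\bar{B}$, so the recurrence and a causal convolution produce identical outputs (values below are arbitrary):

```python
import numpy as np

def ssm_recurrent(x, A_bar, B_bar, C):
    """Run the LTI SSM step by step as a recurrence."""
    h = np.zeros_like(A_bar)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolutional(x, A_bar, B_bar, C):
    """Same LTI SSM as a causal convolution with kernel K_t = C * A_bar^t * B_bar."""
    L = len(x)
    K = np.array([C @ (A_bar ** t * B_bar) for t in range(L)])
    return np.array([np.dot(K[: t + 1][::-1], x[: t + 1]) for t in range(L)])

A_bar = np.array([0.9, 0.8])
B_bar = np.array([1.0, 0.5])
C = np.array([0.3, 0.7])
x = np.array([1.0, 2.0, -1.0, 0.5])

y_rec = ssm_recurrent(x, A_bar, B_bar, C)
y_conv = ssm_convolutional(x, A_bar, B_bar, C)
# both views compute the same outputs; S4 evaluates the conv via FFT for speed
```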
The HiPPO Matrix and Long-Range Memory
Before Mamba, Albert Gu’s foundational work on the HiPPO (High-order Polynomial Projection Operators) matrix showed how to initialize the $A$ matrix for long-range memory. The key insight: specific mathematical structures in $A$ allow the state to optimally compress the input history.
The HiPPO matrix enables SSMs to maintain information across extremely long sequences—a weakness of traditional RNNs. This led to the S4 (Structured State Spaces) model in 2021, which achieved Transformer-quality performance on long-sequence benchmarks like Long Range Arena. But S4 still couldn’t match Transformers on language modeling because it lacked content-awareness.
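As a sketch, one common variant (HiPPO-LegS, following the closed form in the HiPPO paper, negated so the state decays stably) can be constructed directly:

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS matrix (negated for use as the SSM's A).

    Entries (before negation): sqrt(2n+1)*sqrt(2k+1) below the diagonal,
    n+1 on the diagonal, 0 above it.
    """
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = n + 1
    return -A

A = hippo_legs(4)
# Lower-triangular with a negative diagonal -(n+1): higher-order state
# components decay faster, which is what lets the state compress history.
```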
Mamba’s Selection Mechanism
Mamba’s breakthrough is making the SSM parameters input-dependent. Specifically, the $\Delta$ (step size), $B$ (input projection), and $C$ (output projection) become functions of the current input:
```python
import numpy as np

# Simplified selective SSM computation (sequential reference, one input channel)
def selective_ssm(x, delta, B, C, A):
    # delta, B, C are now input-dependent (computed from x upstream)
    # A remains a learned diagonal matrix, stored as a vector of shape (state_dim,)
    seq_len, state_dim = B.shape
    h = np.zeros(state_dim)
    outputs = []
    for t in range(seq_len):
        # Discretize: A_bar = exp(delta * A) (zero-order hold), B_bar = delta * B (Euler)
        A_bar = np.exp(delta[t] * A)
        B_bar = delta[t] * B[t]
        # Recurrent update
        h = A_bar * h + B_bar * x[t]
        y = C[t] @ h
        outputs.append(y)
    return outputs
```
This selectivity is powerful. A large $\Delta$ means “pay attention to this token”—the state updates aggressively. A small $\Delta$ means “ignore this”—the state persists unchanged. When processing filler words like “um” or “uh,” the model can learn to suppress them. When encountering critical information, it can amplify the update.
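A toy calculation (made-up values, not a trained model's) makes the gating concrete:

```python
import numpy as np

A = -1.0                                   # one (negative) diagonal entry of A
small, large = 0.01, 5.0                   # two hypothetical per-token step sizes

keep_small = np.exp(small * A)             # ~0.990: old state survives almost intact
keep_large = np.exp(large * A)             # ~0.007: old state is almost fully erased
inject_small = small * 1.0                 # Euler-style B_bar = delta * B: tiny input
inject_large = large * 1.0                 # large delta also injects the input strongly
```

A small $\Delta$ keeps ~99% of the old state and barely registers the current token; a large $\Delta$ wipes the state and writes the current token in, which is exactly the "attend vs. ignore" behavior described above.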
The selection mechanism enables Mamba to perform the kind of content-based reasoning that makes attention so effective, while maintaining the linear complexity of recurrent models.
The Hardware-Aware Algorithm
Here’s the engineering challenge: selective SSMs can’t use convolutions. Since the parameters vary per timestep, the efficient FFT-based convolution trick from S4 doesn’t apply. Naively running the recurrence on GPU is slow—it doesn’t leverage parallelism.
Mamba’s solution is a hardware-aware parallel scan algorithm. The key insight is that the recurrence can be computed in parallel using a prefix scan (also known as a parallel prefix sum). Instead of processing tokens sequentially, the algorithm:
- Chunks the computation into blocks that fit in GPU SRAM
- Recomputes intermediate states during the backward pass instead of storing them
- Fuses kernels to minimize memory I/O
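The parallelism comes from the fact that the affine recurrence $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t$ composes associatively, so a prefix scan can evaluate it in $O(\log n)$ depth. A minimal sketch (scalar state; a recursive scan stands in for the fused CUDA kernel):

```python
def combine(e1, e2):
    """Associative operator for h -> a*h + b maps, e1 applied first:
    composing h -> a1*h + b1 then h -> a2*h + b2 gives h -> (a2*a1)*h + (a2*b1 + b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return a1 * a2, a2 * b1 + b2

def prefix_scan(elems):
    """Inclusive prefix scan via recursive pairwise combination (O(log n) depth)."""
    if len(elems) == 1:
        return elems
    # combine adjacent pairs, scan the half-length sequence, then fill in the rest
    pairs = [combine(elems[i], elems[i + 1]) for i in range(0, len(elems) - 1, 2)]
    scanned = prefix_scan(pairs)
    out = []
    for i, e in enumerate(elems):
        if i % 2 == 1:
            out.append(scanned[i // 2])
        elif i == 0:
            out.append(e)
        else:
            out.append(combine(scanned[i // 2 - 1], e))
    return out

# sequential reference, starting from h_0 = 0
a = [0.9, 0.5, 0.8, 0.7]           # input-dependent decay A_bar_t per step
b = [1.0, 2.0, -1.0, 0.5]          # B_bar_t * x_t per step
h, seq = 0.0, []
for at, bt in zip(a, b):
    h = at * h + bt
    seq.append(h)

# the b-component of each scanned prefix is exactly h_t
par = [bt for _, bt in prefix_scan(list(zip(a, b)))]
```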
The memory savings are dramatic. Standard approaches require O(BLDN) memory for batch size B, sequence length L, model dimension D, and state dimension N. Mamba’s implementation reduces this to O(BLD) by recomputing states during backpropagation, trading compute for memory—a favorable trade on modern GPUs with massive compute but limited memory bandwidth.
This is why Mamba achieves 5× higher inference throughput than Transformers. During inference, it processes one token at a time with constant memory, no KV-cache growing linearly with sequence length.
Mamba-2 and Structured State Space Duality
In May 2024, Dao and Gu released Mamba-2, which further refined the architecture through the Structured State Space Duality (SSD) framework. The key theoretical insight: selective SSMs with scalar-times-identity $A$ matrices are mathematically equivalent to a specific form of attention.
Consider the SSM output computation. For scalar-structured $A$, we can write:
$$M = L \circ CB^\top$$

where $L$ is a lower-triangular mask with exponentially decaying weights based on the $A$ values. This looks strikingly like attention:

$$\text{Attention}(Q, K, V) = \text{softmax}(QK^\top) \cdot V$$

In fact, if all $a_t = 1$, then $L$ becomes the standard causal mask and the SSM computation is exactly causal linear attention, with $(C, B, X)$ playing the roles of $(Q, K, V)$.
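This duality can be checked numerically on a toy case with scalar $a_t$ (random values; the mask entry is $L_{t,s} = \prod_{k=s+1}^{t} a_k$):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 5, 3
a = rng.uniform(0.5, 1.0, size=T)       # scalar decay a_t per step
B = rng.normal(size=(T, N))             # input-dependent B_t
C = rng.normal(size=(T, N))             # input-dependent C_t
x = rng.normal(size=T)                  # scalar input per step

# 1) recurrent view: h_t = a_t * h_{t-1} + B_t * x_t ; y_t = C_t . h_t
h = np.zeros(N)
y_rec = np.empty(T)
for t in range(T):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# 2) matrix view: y = (L o C B^T) x, with L[t, s] = prod_{k=s+1..t} a[k] for s <= t
L = np.zeros((T, T))
for t in range(T):
    for s in range(t + 1):
        L[t, s] = np.prod(a[s + 1 : t + 1])    # empty product = 1 on the diagonal
M = L * (C @ B.T)
y_mat = M @ x
# the two agree: the SSM is a masked "attention" with (C, B) playing (Q, K)
```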
This duality is more than theoretical elegance. It enables a new SSD algorithm that:
- Uses matrix multiplications during training (leveraging GPU tensor cores)
- Switches to recurrent mode during inference (constant memory)
- Achieves the same FLOP count as SSMs while being 2-3× faster to train
Mamba-2 also increases the state dimension from N=16 in Mamba-1 to N=64-256, dramatically improving model capacity without the computational penalty that would plague attention-based models.
| Metric | Attention | SSM | SSD (Mamba-2) |
|---|---|---|---|
| State size | O(T) | O(N) | O(N) |
| Training FLOPs | O(T²N) | O(TN²) | O(TN²) |
| Inference FLOPs | O(TN) | O(N²) | O(N²) |
| Memory | O(T²) | O(TN²) | O(TN) |
| Matrix multiplications | ✓✓ | ✗✗ | ✓✓ |
The Hybrid Architecture: Jamba
The industry has already embraced hybrid approaches. AI21 Labs’ Jamba, released in March 2024, combines Transformer attention layers with Mamba SSM layers in a mixture-of-experts architecture. The 1.5 variant offers:
- 256K context window with 12B active parameters (52B total)
- 2.5× faster inference than comparable Transformer models
- Hybrid layers: alternating between Mamba and attention blocks (roughly 1:7 ratio)
The hybrid design captures the best of both worlds: attention’s strong in-context learning and copying abilities, plus Mamba’s efficient long-context handling. Jamba-1.5-Large achieves competitive MMLU scores while processing 256K tokens at speeds that would make a pure Transformer choke.
This hybrid approach is becoming the de facto standard. NVIDIA’s NeMo framework, Meta’s research, and numerous open-source projects are exploring similar combinations.
What Mamba Struggles With
Mamba isn’t a universal replacement. Research has identified specific weaknesses:
The Copying Problem: Pure Mamba models struggle with exact token copying. A 2024 paper from Harvard researchers showed that while Transformers excel at copying sequences verbatim (useful for tasks like code generation), Mamba’s compressed state representation can lose precise positional information.
In-Context Learning: On few-shot benchmarks like MMLU 5-shot, pure Mamba underperforms Transformers. The 5-shot MMLU score for Mamba-2 is around 29.2% compared to 46.3% for equivalently-sized Transformers. Hybrid models close this gap significantly.
Training Instability: The selective scan’s input-dependent parameters can create training dynamics that are harder to tune than attention’s more stable gradients. Techniques like gradient clipping and careful initialization become more critical.
Ecosystem Maturity: FlashAttention, vLLM, and other Transformer optimization tools represent years of engineering investment. Mamba’s infrastructure is newer and less battle-tested.
Real-World Performance
On concrete benchmarks, Mamba’s advantages shine for long sequences:
- Language Modeling: Mamba-3B matches the perplexity of a 6B Transformer on The Pile dataset
- DNA Modeling: On the HG38 human genome benchmark (sequences up to 1M tokens), Mamba significantly outperforms Transformers
- Audio Processing: Raw audio generation tasks show 10× speedups for long audio sequences
- Inference Throughput: ~300 tokens/second for Mamba-7B on consumer GPUs, with memory usage independent of sequence length
The inference advantage compounds at scale. A model processing 100K tokens with a Transformer needs a KV-cache of roughly 200GB for a 70B model. Mamba needs just the model weights and a fixed-size state—orders of magnitude less memory.
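A back-of-envelope version of that arithmetic, assuming Llama-70B-like dimensions (80 layers, hidden size 8192, fp16, no grouped-query attention; GQA would shrink the cache several-fold):

```python
# KV-cache size for a 70B-class dense Transformer at 100K tokens
layers, d_model, bytes_per_val = 80, 8192, 2      # assumed dims, fp16
tokens = 100_000

kv_per_token = 2 * layers * d_model * bytes_per_val   # keys + values, all layers
cache_bytes = kv_per_token * tokens
print(f"{cache_bytes / 1e9:.0f} GB")                  # ~262 GB without GQA

# Mamba instead keeps a fixed-size SSM state per layer, independent of sequence
# length (Mamba-1 defaults: N=16, expand factor 2; small conv buffer ignored)
state_dim, d_inner = 16, 2 * d_model
mamba_state = layers * d_inner * state_dim * bytes_per_val
print(f"{mamba_state / 1e6:.0f} MB")                  # fixed, regardless of tokens
```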
The Road Ahead
Mamba represents a fundamental shift in how we think about sequence modeling. The O(n²) barrier wasn’t a law of nature—it was a property of a specific architectural choice. By revisiting first principles from control theory and adding input-dependent selectivity, Gu and Dao demonstrated that linear-time sequence modeling can match Transformer quality.
The research directions are now clear:
- Better hybrid designs: Finding optimal ratios of attention-to-SSM layers for different tasks
- Improved selectivity: Exploring richer input-dependent parameterizations
- Multimodal extensions: Adapting the selective scan for 2D (images) and 3D (video) data
- Training stability: Developing better optimization techniques for SSMs
The era of Transformer dominance isn’t over—attention remains remarkably effective for many tasks. But Mamba proved that we’re not stuck with quadratic complexity forever. For applications demanding million-token contexts, real-time inference, or memory-constrained deployment, state space models have opened a door that was previously nailed shut.