Beyond Bolt-On Vision: How Native Multimodal Models Are Rewriting the Architecture of AI

For years, the dominant approach to multimodal AI followed a simple recipe: take a pre-trained vision encoder (CLIP, SigLIP), bolt it onto a pre-trained LLM through an adapter layer, and fine-tune the connection. This “late-fusion” paradigm powered everything from GPT-4V to LLaVA, delivering impressive results with remarkable sample efficiency. But a fundamental question lingered: was this architectural shortcut an inherent advantage, or merely a convenient workaround? The answer arrived in 2025 with a paradigm shift that’s rewriting the rules of multimodal AI. Native multimodal models—trained from scratch on all modalities simultaneously—are proving that early-fusion architectures don’t just match late-fusion approaches; they exceed them in efficiency, scalability, and ultimately, capability. ...
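The early-fusion idea can be pictured as a single token sequence: image patches and text tokens are embedded into a shared space and processed by one backbone from the first layer on. A minimal NumPy sketch, with all shapes and embedders invented for illustration (the transformer backbone itself is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Hypothetical embedders; in a native model both are trained jointly from scratch.
text_emb = rng.normal(size=(1000, d))        # token-id -> embedding table
patch_proj = rng.normal(size=(48, d))        # flattened 4x4x3 patch -> embedding

def embed_text(token_ids):
    return text_emb[token_ids]

def embed_image(img):                        # img: (H, W, 3), H and W divisible by 4
    H, W, _ = img.shape
    patches = img.reshape(H // 4, 4, W // 4, 4, 3).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, 48)        # one row per 4x4 patch
    return patches @ patch_proj

# Early fusion: one interleaved sequence, handled by a single backbone,
# rather than a frozen vision tower bolted on through an adapter.
seq = np.concatenate([
    embed_text(np.array([5, 17, 256])),      # 3 text tokens
    embed_image(rng.normal(size=(8, 8, 3))), # 4 image patches
    embed_text(np.array([42])),              # 1 more text token
])
print(seq.shape)  # (8, 16)
```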

9 min · 1796 words

When AI Learns to Remember: How Google's Titans Architecture Solved the Long-Term Memory Problem

The Transformer architecture revolutionized machine learning with its attention mechanism, enabling models to capture dependencies across entire sequences. Yet despite their dominance, Transformers suffer from a fundamental limitation: they have amnesia. Every token beyond the context window vanishes into oblivion, and even within that window, the quadratic complexity of attention makes scaling prohibitively expensive. In December 2024, Google Research introduced Titans, a new family of architectures that fundamentally rethinks how neural networks handle memory. The breakthrough isn’t just another efficiency trick—it’s a paradigm shift that treats memory itself as a learnable neural network, updated in real-time during inference through gradient descent on a surprise-based objective. ...
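The core mechanism can be sketched in a few lines. This is a deliberately simplified toy, assuming a linear memory map in place of the paper's MLP and dropping its momentum and weight-decay terms; every name and shape here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 8, 0.05
M = np.zeros((d, d))               # memory module: here a linear map, standing in
                                   # for the small neural network the paper describes

def memory_step(M, k, v, lr):
    """One inference-time update: gradient descent on the 'surprise'
    objective 0.5 * ||M k - v||^2."""
    surprise = M @ k - v           # large when the association is unexpected
    return M - lr * np.outer(surprise, k)

# A stream of key/value associations arriving at inference time;
# the toy pattern to memorize is "value = reversed key".
for _ in range(300):
    k = rng.normal(size=d)
    M = memory_step(M, k, k[::-1], lr)

# The memory has absorbed the association with no pre-training of M at all.
print(np.max(np.abs(M - np.eye(d)[::-1])))  # small residual error
```

The point of the sketch is the unusual control flow: the parameters of `M` change during inference, driven by how surprising each new input is.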

8 min · 1691 words

When the Hidden State Becomes the Model: How Test-Time Training Rewrites the Rules of Sequence Modeling

The long-context problem has haunted transformer architectures since their inception. While self-attention’s $O(n^2)$ complexity is well-known, the real tragedy lies deeper: even modern RNNs like Mamba, despite their linear complexity, plateau after 16K tokens. They simply cannot compress enough information into their fixed-size hidden states. What if the hidden state wasn’t a fixed-size bottleneck, but a model that could grow in capacity through learning—even at test time? This is the radical proposition of Test-Time Training (TTT), introduced by Stanford researchers in July 2024 and extended to production-ready systems by NVIDIA and Stanford in December 2025. The results are striking: TTT-Linear matches Transformer performance while maintaining RNN efficiency, and the latest TTT-E2E achieves 2.7x faster inference than full attention at 128K context length. ...
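A stripped-down sketch of the TTT-Linear idea, assuming a plain reconstruction loss and omitting the learned projections and mini-batch updates of the actual method (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, eta = 4, 0.1
W = np.zeros((d, d))                 # the hidden state IS a weight matrix

def ttt_step(W, x, eta):
    """One token of processing: take a gradient step on a self-supervised
    loss for this token, then emit the output with the updated inner model."""
    grad = np.outer(W @ x - x, x)    # grad of 0.5 * ||W x - x||^2 w.r.t. W
    W = W - eta * grad               # the "state update" is a training step
    return W, W @ x                  # output for this token

for _ in range(100):
    x = rng.normal(size=d)
    W, z = ttt_step(W, x, eta)

# The inner model has learned the (trivial) reconstruction task online.
print(np.max(np.abs(W - np.eye(d))))  # small
```

Compression capacity now scales with what the inner model can learn, not with a fixed state dimension, which is the sense in which the hidden state "becomes the model."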

9 min · 1743 words

How Mamba Broke the O(n²) Barrier: The Mathematics Behind Linear-Time Sequence Modeling

Every time you increase a Transformer’s context window from 4K to 128K tokens, you’re asking the attention mechanism to compute a matrix 1,024 times larger. The O(n²) complexity isn’t a bug—it’s fundamental to how self-attention works. Every token must attend to every other token, creating a quadratic relationship that makes long-context models prohibitively expensive. Mamba, introduced by Albert Gu and Tri Dao in December 2023, doesn’t just optimize around this constraint. It eliminates it entirely, replacing attention with selective state space models that scale linearly, O(n), while matching Transformer quality. A Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size. The key insight? Making the model’s memory mechanism input-dependent—letting it choose what to remember and what to forget. ...
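A toy version of the selective scan conveys the structure: one O(1) state update per token, with the discretization chosen per input. This is a heavy simplification (the per-token input is folded into the write term, and all weights are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, T = 2, 4, 6
# Hypothetical projection weights; a real Mamba block learns these.
W_delta = rng.normal(size=d_model)        # step size depends on the input
W_B = rng.normal(size=(d_state, d_model))
W_C = rng.normal(size=(d_state, d_model))
A = -np.abs(rng.normal(size=d_state))     # stable (negative) state dynamics

def selective_scan(xs):
    """Linear-time recurrence: the gates deciding what to remember and
    what to forget are recomputed from each input."""
    h = np.zeros(d_state)
    ys = []
    for x in xs:                              # O(n) total, vs O(n^2) attention
        delta = np.log1p(np.exp(W_delta @ x)) # softplus: input-dependent step
        A_bar = np.exp(delta * A)             # decay in (0, 1): the forget gate
        h = A_bar * h + delta * (W_B @ x)     # selective state update
        ys.append((W_C @ x) @ h)              # input-dependent readout
    return np.array(ys)

xs = rng.normal(size=(T, d_model))
print(selective_scan(xs).shape)  # (6,)
```

Because `A_bar` and the write term depend on `x`, the state can hold on to salient tokens and rapidly forget filler, which a fixed (input-independent) recurrence cannot do.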

8 min · 1495 words

How Mixture of Experts Scales to Trillion Parameters: The Sparse Architecture Revolution Behind Modern LLMs

When DeepSeek-V3 was released in December 2024, it achieved something remarkable: a 671-billion-parameter model that activates only 37 billion parameters per token. This isn’t a magic trick—it’s the power of Mixture of Experts (MoE), an architectural paradigm that has quietly become the backbone of nearly every frontier large language model. The math is compelling. A dense 671B model would require approximately 1,342 GFLOPs per token during inference. DeepSeek-V3 achieves comparable performance with roughly 74 GFLOPs—an 18x reduction in compute. This isn’t incremental optimization; it’s a fundamental rethinking of how neural networks scale. ...
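The arithmetic behind those numbers follows from the standard back-of-envelope estimate of roughly 2 FLOPs per active parameter per token for a forward pass:

```python
# Rough FLOPs-per-token comparison using the common ~2 * active_params estimate.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense = flops_per_token(671e9)   # a hypothetical dense 671B model
moe = flops_per_token(37e9)      # DeepSeek-V3: 37B parameters active per token

print(f"dense: {dense / 1e9:.0f} GFLOPs/token")  # 1342
print(f"MoE:   {moe / 1e9:.0f} GFLOPs/token")    # 74
print(f"ratio: {dense / moe:.1f}x")              # 18.1x
```

The compute ratio is just total parameters over active parameters (671/37), which is why sparse activation, not hardware, is doing the heavy lifting here.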

9 min · 1822 words