How Recursive Language Models Break the Context Ceiling: Processing 10M+ Tokens Without Expanding the Window

The race for larger context windows has defined LLM development for years. From GPT-4’s 128K tokens to Gemini’s 1M and beyond, the assumption has been simple: more context equals better performance. But a January 2026 paper from MIT CSAIL challenges this assumption entirely. Recursive Language Models (RLMs) don’t expand the context window—they render it irrelevant by treating prompts as external environments that models can programmatically explore, decompose, and recursively process. ...

7 min · 1468 words

When Not Every Token Deserves the Same Compute: How Mixture-of-Depths Rewrites Transformer Efficiency

Every transformer you’ve ever used treats every token with the same computational respect. Whether processing “the” or untangling complex mathematical reasoning, the model devotes identical FLOPs to each position in the sequence. This uniform allocation isn’t a design choice—it’s a constraint baked into the transformer architecture from its inception. In April 2024, researchers from Google DeepMind, McGill University, and Mila demonstrated that this constraint is not only unnecessary but actively wasteful. Their proposed Mixture-of-Depths (MoD) framework reveals a startling truth: transformers can learn to dynamically allocate compute across tokens, achieving 50% faster inference with equivalent performance. ...

6 min · 1152 words

When the Hidden State Becomes the Model: How Test-Time Training Rewrites the Rules of Sequence Modeling

The long-context problem has haunted transformer architectures since their inception. While self-attention’s $O(n^2)$ complexity is well-known, the real tragedy lies deeper: even modern RNNs like Mamba, despite their linear complexity, plateau after 16K tokens. They simply cannot compress enough information into their fixed-size hidden states. What if the hidden state wasn’t a fixed-size bottleneck, but a model that could grow in capacity through learning—even at test time? This is the radical proposition of Test-Time Training (TTT), introduced by Stanford researchers in July 2024 and extended to production-ready systems by NVIDIA and Stanford in December 2025. The results are striking: TTT-Linear matches Transformer performance while maintaining RNN efficiency, and the latest TTT-E2E achieves 2.7x faster inference than full attention at 128K context length. ...

9 min · 1743 words

From Naive to Production-Ready: The Complete Architecture of Modern RAG Systems

When you ask ChatGPT about your company’s internal documents, it hallucinates. When you ask about events after its training cutoff, it fabricates. These aren’t bugs—they’re fundamental limitations of parametric knowledge encoded in model weights. Retrieval-Augmented Generation (RAG) emerged as the solution, but naive implementations fail spectacularly. This deep dive explores how to architect RAG systems that actually work. The Knowledge Encoding Problem Large Language Models encode knowledge in two ways: parametric (weights) and non-parametric (external data). Parametric knowledge is fast but frozen at training time, prone to hallucination, and impossible to update without retraining. Non-parametric knowledge—RAG’s domain—solves all three problems at the cost of latency and complexity. ...

10 min · 2008 words

When 1+1>2: How Model Merging Creates Superhuman LLMs Without Training

The Open LLM Leaderboard tells a surprising story: many top-performing models aren’t trained at all. They’re merged. A 7B parameter model, created by strategically blending weights from existing fine-tuned models, can outperform models 10x its size. This isn’t alchemy—it’s mathematics. Model merging represents a paradigm shift in how we think about model development. Instead of investing millions in GPU hours for training, practitioners are discovering that the collective intelligence embedded in existing open-source models can be combined to create something greater than the sum of its parts. The technique requires no gradients, no backward passes, and no training data. Just arithmetic operations on weight tensors. ...

10 min · 1940 words

How Speculative Decoding Achieves 3x Faster LLM Inference Without Losing Quality: The Mathematics Behind Draft-Verify Acceleration

The sequential nature of autoregressive language models creates a fundamental bottleneck: generating each token requires a full forward pass through billions of parameters. A 70B parameter model processing a single token must load roughly 140GB of weights from memory (FP16), and memory bandwidth—not compute—becomes the limiting factor. This is why a 70B model might generate only 20-30 tokens per second on an H100, despite the GPU being capable of orders of magnitude more computation. ...

4 min · 737 words