Inference Scaling

For years, the path to better LLMs seemed straightforward: more parameters, more training data, more compute. The scaling laws articulated by Kaplan et al. and refined by Chinchilla painted a clear picture—performance improved predictably with model size. Then OpenAI released o1, and suddenly the rules changed. A model that “thinks longer” at inference time was solving problems that eluded models 10x its size. The breakthrough wasn’t just engineering—it was a fundamental shift in how we think about compute allocation. The question flipped from “how big should we train?” to “how long should we let it think?” ...

Inference Scaling

How Recursive Language Models Break the Context Ceiling: Processing 10M+ Tokens Without Expanding the Window

When a 1B Model Beats a 405B Giant: How Test-Time Compute Is Rewriting the Rules of LLM Scaling