LLaDA: When Diffusion Models Challenge the Autoregressive Paradigm

For years, the AI community operated under a seemingly unshakeable assumption: the remarkable capabilities of large language models, from in-context learning to instruction following, inherently depend on autoregressive architectures. GPT, LLaMA, Claude, and virtually every other dominant LLM share the same fundamental design: predict the next token given all previous tokens. But what if this assumption were wrong? In February 2025, a paper from researchers at Renmin University of China challenged this paradigm with striking empirical evidence. LLaDA (Large Language Diffusion with mAsking), an 8B-parameter model trained entirely from scratch with a masked diffusion process, achieved performance competitive with LLaMA3 8B across diverse benchmarks. More remarkably, it solved problems that have plagued autoregressive models for years, most prominently the reversal curse. This isn't merely an architectural curiosity; it's a fundamental re-examination of how language models can learn and reason. ...

9 min · 1871 words
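
To make "diffusion with masking" concrete, here is a minimal sketch (not the authors' code) of the training objective the LLaDA paper describes: sample a masking ratio t, corrupt the sequence by masking tokens at that rate, and train the model to recover the originals. The `model` interface, `MASK_ID`, and the per-token normalization are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical id for a reserved [MASK] token

def masked_diffusion_loss(model, x0):
    """One training step of a masked-diffusion objective in the spirit of
    LLaDA: draw a masking ratio t ~ U(0, 1], independently replace each
    token with [MASK] at probability t, then train the model to predict the
    original tokens at the masked positions, reweighted by 1/t."""
    B, L = x0.shape
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)  # masking ratio per sequence
    masked = torch.rand(B, L, device=x0.device) < t         # positions to corrupt
    xt = torch.where(masked, torch.full_like(x0, MASK_ID), x0)
    logits = model(xt)                    # (B, L, V); all positions at once, no causal mask
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, x0.unsqueeze(-1)).squeeze(-1)  # log p of true tokens
    nll = (-token_logp * masked.float()).sum(dim=1) / t.squeeze(1)  # 1/t reweighting
    return nll.mean() / L                 # per-token normalization, for readability
```

Unlike next-token prediction, the loss is computed only at masked positions, and the model attends bidirectionally, which is one intuition for why such models sidestep left-to-right failure modes like the reversal curse.
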

When a 1B Model Beats a 405B Giant: How Test-Time Compute Is Rewriting the Rules of LLM Scaling

For years, the path to better LLMs seemed straightforward: more parameters, more training data, more compute. The scaling laws articulated by Kaplan et al. and refined by Chinchilla painted a clear picture—performance improved predictably with model size. Then OpenAI released o1, and suddenly the rules changed. A model that “thinks longer” at inference time was solving problems that eluded models 10x its size. The breakthrough wasn’t just engineering—it was a fundamental shift in how we think about compute allocation. The question flipped from “how big should we train?” to “how long should we let it think?” ...

9 min · 1722 words
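
The simplest concrete forms of "letting it think longer" are sampling-based, sketched below under stated assumptions: `generate`, `extract_answer`, and `verifier_score` are hypothetical stand-ins for a sampled LLM call, an answer parser, and a reward model, not any specific paper's API.

```python
from collections import Counter

def majority_vote(generate, extract_answer, prompt, n=16):
    """Self-consistency: sample n reasoning chains and return the most
    common final answer. Accuracy rises with n, i.e. with test-time compute."""
    answers = [extract_answer(generate(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(generate, verifier_score, prompt, n=16):
    """Best-of-n: sample n candidate solutions and keep the one a verifier
    (e.g. a reward model) scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=verifier_score)
```

Both strategies trade a fixed model size for a tunable inference budget n, which is the compute-allocation flip the post describes.
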