LLaDA: When Diffusion Models Challenge the Autoregressive Paradigm

For years, the AI community operated under a seemingly unshakeable assumption: the remarkable capabilities of large language models, from in-context learning to instruction following, inherently depend on autoregressive architectures. GPT, LLaMA, Claude, and virtually every other dominant LLM share the same fundamental design: predict the next token given all previous tokens. But what if this assumption were wrong? In February 2025, a paper from researchers at Renmin University of China challenged this paradigm with striking empirical evidence. LLaDA (Large Language Diffusion with mAsking), an 8B-parameter model trained entirely from scratch using diffusion processes, achieved performance competitive with LLaMA3 8B across diverse benchmarks. More remarkably, it addressed problems that have plagued autoregressive models for years, the reversal curse being the most prominent. This isn’t merely an architectural curiosity; it’s a fundamental re-examination of how language models can learn and reason. ...

9 min · 1871 words

Why Backpropagation Trains Neural Networks 10 Million Times Faster: The Mathematics Behind Deep Learning

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper in Nature that would transform artificial intelligence. The paper, “Learning representations by back-propagating errors,” demonstrated that a mathematical technique from the 1970s could train neural networks orders of magnitude faster than existing methods. The speedup wasn’t incremental: it was the difference between a model taking a week to train and taking 200,000 years. But backpropagation wasn’t invented in 1986. Its modern form was first published in 1970 by Finnish master’s student Seppo Linnainmaa, whose method is now known as “reverse mode automatic differentiation.” Even earlier, Henry J. Kelley derived the foundational concepts in 1960 for optimal flight path calculations. What the 1986 paper achieved wasn’t invention but recognition. The authors demonstrated that this obscure numerical technique was exactly what neural networks needed. ...

9 min · 1712 words