Transformer Architecture

For years, the AI community operated under a seemingly unshakeable assumption: the remarkable capabilities of large language models, from in-context learning to instruction following, inherently depend on autoregressive architectures. GPT, LLaMA, Claude, and virtually every other dominant LLM share the same fundamental design: predict the next token given all previous tokens. But what if this assumption were wrong? In February 2025, a paper from researchers at Renmin University of China challenged this paradigm with striking empirical evidence. LLaDA (Large Language Diffusion with mAsking), an 8B-parameter model trained entirely from scratch with a masked diffusion process, achieved performance competitive with LLaMA3 8B across diverse benchmarks. More remarkably, it addressed problems that have plagued autoregressive models for years, the reversal curse being the most prominent. This isn’t merely an architectural curiosity; it’s a fundamental re-examination of how language models can learn and reason. ...


LLaDA: When Diffusion Models Challenge the Autoregressive Paradigm

How Ring Attention Breaks the Memory Barrier: Enabling Million-Token Contexts Through Distributed Computation