Serial vs Parallel: The Engineering Trade-offs Behind Inference-Time Compute Scaling

When OpenAI’s o1 model spent unprecedented computational resources during inference, the AI community witnessed a paradigm shift: models could now trade thinking time for intelligence. But the real engineering challenge isn’t whether to scale inference compute—it’s how to scale it optimally. The choice between serial thinking (longer chains) and parallel thinking (more branches) fundamentally changes the cost-performance curve, and getting it wrong can mean burning 4x more compute for identical results. ...

8 min · 1530 words
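A minimal sketch of the trade-off the post teases, under a fixed token budget: spend the whole budget on one long chain (serial), or split it across independent branches and combine them by majority vote (parallel). The `toy_generate` stand-in, the budget-splitting logic, and the vote-based combiner are illustrative assumptions, not the post's code; any real model call could be slotted in.

```python
# Sketch only: contrasts serial vs parallel inference-time scaling
# under one fixed token budget. `generate` is a hypothetical stand-in
# for a single model call.

from collections import Counter
from typing import Callable, List


def serial_answer(generate: Callable[[str, int], str], prompt: str, budget: int) -> str:
    """Serial scaling: spend the entire budget on one long chain of thought."""
    return generate(prompt, budget)


def parallel_answer(generate: Callable[[str, int], str], prompt: str,
                    budget: int, branches: int) -> str:
    """Parallel scaling: split the budget across independent branches,
    then pick the most common final answer (majority vote / self-consistency)."""
    per_branch = budget // branches
    answers: List[str] = [generate(prompt, per_branch) for _ in range(branches)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    import random

    # Toy stand-in model: long budgets always answer well, short ones are noisy.
    def toy_generate(prompt: str, max_tokens: int) -> str:
        return "42" if max_tokens >= 512 or random.random() < 0.6 else "41"

    budget = 2048
    print("serial:  ", serial_answer(toy_generate, "What is 6*7?", budget))
    print("parallel:", parallel_answer(toy_generate, "What is 6*7?", budget, branches=8))
```

Both calls consume the same nominal budget; which one wins depends on how noisy individual chains are, which is exactly the cost-performance question the post takes up.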

When Not Every Token Deserves the Same Compute: How Mixture-of-Depths Rewrites Transformer Efficiency

Every transformer you’ve ever used treats every token with the same computational respect. Whether processing “the” or untangling complex mathematical reasoning, the model devotes identical FLOPs to each position in the sequence. This uniform allocation isn’t a design choice—it’s a constraint baked into the transformer architecture from its inception. In April 2024, researchers from Google DeepMind, McGill University, and Mila demonstrated that this constraint is not only unnecessary but actively wasteful. Their proposed Mixture-of-Depths (MoD) framework reveals a startling truth: transformers can learn to dynamically allocate compute across tokens, achieving 50% faster inference with equivalent performance. ...

6 min · 1152 words
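A minimal sketch of the mechanism the excerpt describes, assuming an expert-choice style router: a learned scorer ranks every token, only the top fraction pass through the block's attention and MLP, and the rest ride the residual stream unchanged. The `MoDBlock` class, the 50% capacity default, and the sigmoid gating are illustrative assumptions, not the DeepMind implementation.

```python
# Sketch only: a Mixture-of-Depths style block that routes a fixed
# fraction of tokens through the expensive sub-block and lets the rest skip it.

import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    def __init__(self, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)              # scalar score per token
        self.block = nn.TransformerEncoderLayer(         # stand-in for attention + MLP
            d_model, nhead=4, batch_first=True)
        self.capacity = capacity                         # fraction of tokens processed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        k = max(1, int(seq_len * self.capacity))
        scores = self.router(x).squeeze(-1)              # (batch, seq_len)
        topk = scores.topk(k, dim=-1).indices            # tokens that receive compute

        out = x.clone()                                  # skipped tokens pass through untouched
        for b in range(batch):                           # gather/scatter kept deliberately simple
            chosen = x[b, topk[b]].unsqueeze(0)          # (1, k, d_model)
            # gate the update by the router score so the routing decision stays differentiable
            gate = torch.sigmoid(scores[b, topk[b]]).unsqueeze(-1)
            update = self.block(chosen).squeeze(0) - x[b, topk[b]]
            out[b, topk[b]] = x[b, topk[b]] + gate * update
        return out


if __name__ == "__main__":
    layer = MoDBlock(d_model=64, capacity=0.5)
    tokens = torch.randn(2, 16, 64)
    print(layer(tokens).shape)  # torch.Size([2, 16, 64]) -- only half the tokens got compute
```

With capacity at 0.5, each block does roughly half the attention and MLP work per step, which is the intuition behind the "50% faster inference" framing in the teaser.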