Serial vs Parallel: The Engineering Trade-offs Behind Inference-Time Compute Scaling

When OpenAI’s o1 model spent unprecedented computational resources during inference, the AI community witnessed a paradigm shift: models could now trade thinking time for intelligence. But the real engineering challenge isn’t whether to scale inference compute—it’s how to scale it optimally. The choice between serial thinking (longer chains) and parallel thinking (more branches) fundamentally changes the cost-performance curve, and getting it wrong can mean burning 4x more compute for identical results. ...

8 min · 1530 words
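A minimal sketch of the trade-off the post teases, under a fixed token budget: spend the whole budget on one long chain (serial), or split it across independent branches and combine them by majority vote (parallel). The `toy_generate` stand-in, the budget-splitting logic, and the vote-based combiner are illustrative assumptions, not the post's code; any real model call could be slotted in.

```python
# Sketch only: contrasts serial vs parallel inference-time scaling
# under one fixed token budget. `generate` is a hypothetical stand-in
# for a single model call.

from collections import Counter
from typing import Callable, List


def serial_answer(generate: Callable[[str, int], str], prompt: str, budget: int) -> str:
    """Serial scaling: spend the entire budget on one long chain of thought."""
    return generate(prompt, budget)


def parallel_answer(generate: Callable[[str, int], str], prompt: str,
                    budget: int, branches: int) -> str:
    """Parallel scaling: split the budget across independent branches,
    then pick the most common final answer (majority vote / self-consistency)."""
    per_branch = budget // branches
    answers: List[str] = [generate(prompt, per_branch) for _ in range(branches)]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    import random

    # Toy stand-in model: long budgets always answer well, short ones are noisy.
    def toy_generate(prompt: str, max_tokens: int) -> str:
        return "42" if max_tokens >= 512 or random.random() < 0.6 else "41"

    budget = 2048
    print("serial:  ", serial_answer(toy_generate, "What is 6*7?", budget))
    print("parallel:", parallel_answer(toy_generate, "What is 6*7?", budget, branches=8))
```

Both calls consume the same nominal budget; which one wins depends on how noisy individual chains are, which is exactly the cost-performance question the post takes up.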

When Not Every Token Deserves the Same Compute: How Mixture-of-Depths Rewrites Transformer Efficiency

Every transformer you’ve ever used treats every token with the same computational respect. Whether processing “the” or untangling complex mathematical reasoning, the model devotes identical FLOPs to each position in the sequence. This uniform allocation isn’t a design choice—it’s a constraint baked into the transformer architecture from its inception. In April 2024, researchers from Google DeepMind, McGill University, and Mila demonstrated that this constraint is not only unnecessary but actively wasteful. Their proposed Mixture-of-Depths (MoD) framework reveals a startling truth: transformers can learn to dynamically allocate compute across tokens, achieving 50% faster inference with equivalent performance. ...

6 min · 1152 words
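A minimal sketch of the mechanism the excerpt describes, assuming an expert-choice style router: a learned scorer ranks every token, only the top fraction pass through the block's attention and MLP, and the rest ride the residual stream unchanged. The `MoDBlock` class, the 50% capacity default, and the sigmoid gating are illustrative assumptions, not the DeepMind implementation.

```python
# Sketch only: a Mixture-of-Depths style block that routes a fixed
# fraction of tokens through the expensive sub-block and lets the rest skip it.

import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    def __init__(self, d_model: int, capacity: float = 0.5):
        super().__init__()
        self.router = nn.Linear(d_model, 1)              # scalar score per token
        self.block = nn.TransformerEncoderLayer(         # stand-in for attention + MLP
            d_model, nhead=4, batch_first=True)
        self.capacity = capacity                         # fraction of tokens processed

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        k = max(1, int(seq_len * self.capacity))
        scores = self.router(x).squeeze(-1)              # (batch, seq_len)
        topk = scores.topk(k, dim=-1).indices            # tokens that receive compute

        out = x.clone()                                  # skipped tokens pass through untouched
        for b in range(batch):                           # gather/scatter kept deliberately simple
            chosen = x[b, topk[b]].unsqueeze(0)          # (1, k, d_model)
            # gate the update by the router score so the routing decision stays differentiable
            gate = torch.sigmoid(scores[b, topk[b]]).unsqueeze(-1)
            update = self.block(chosen).squeeze(0) - x[b, topk[b]]
            out[b, topk[b]] = x[b, topk[b]] + gate * update
        return out


if __name__ == "__main__":
    layer = MoDBlock(d_model=64, capacity=0.5)
    tokens = torch.randn(2, 16, 64)
    print(layer(tokens).shape)  # torch.Size([2, 16, 64]) -- only half the tokens got compute
```

With capacity at 0.5, each block does roughly half the attention and MLP work per step, which is the intuition behind the "50% faster inference" framing in the teaser.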