When the Answer Lies at the End of a Branch: The Complete Architecture of Inference-Time Search Methods for LLM Reasoning

The emergence of reasoning models like DeepSeek-R1, OpenAI’s o3, and Google’s Gemini thinking mode has fundamentally shifted how we think about LLM inference. These models don’t just generate—they search. The question is no longer “what should the model output?” but “how should the model search for the answer?” This shift from generation to search has spawned an entire taxonomy of inference-time algorithms, each with distinct trade-offs between computational cost and output quality. Understanding these methods—their mathematical foundations, implementation details, and practical performance—is essential for anyone deploying reasoning models in production. ...

5 min · 932 words

Serial vs Parallel: The Engineering Trade-offs Behind Inference-Time Compute Scaling

When OpenAI’s o1 model spent unprecedented computational resources during inference, the AI community witnessed a paradigm shift: models could now trade thinking time for intelligence. But the real engineering challenge isn’t whether to scale inference compute; it’s how to scale it optimally. The choice between serial thinking (longer chains) and parallel thinking (more branches) fundamentally changes the cost-performance curve, and getting it wrong can mean burning 4x the compute for identical results. ...

8 min · 1530 words

When the Path Matters More Than the Answer: How Process Reward Models Transform LLM Reasoning

A math student solves a complex integration problem. Her final answer is correct, but halfway through, she made a sign error that accidentally canceled out in the next step. The teacher gives full marks—after all, the answer is right. But should it count? This scenario captures the fundamental flaw in how we’ve traditionally evaluated Large Language Model (LLM) reasoning: Outcome Reward Models (ORMs) only check the final destination, ignoring whether the path was sound. Process Reward Models (PRMs) represent a paradigm shift—verifying every step of reasoning, catching those hidden errors that coincidentally produce correct answers, and enabling the test-time scaling that powers reasoning models like OpenAI’s o1 and DeepSeek-R1. ...

7 min · 1473 words

When a 1B Model Beats a 405B Giant: How Test-Time Compute Is Rewriting the Rules of LLM Scaling

For years, the path to better LLMs seemed straightforward: more parameters, more training data, more compute. The scaling laws articulated by Kaplan et al. and refined by Hoffmann et al. in the Chinchilla paper painted a clear picture: performance improved predictably with model size. Then OpenAI released o1, and suddenly the rules changed. A model that “thinks longer” at inference time was solving problems that eluded models 10x its size. The breakthrough wasn’t just engineering; it was a fundamental shift in how we think about compute allocation. The question flipped from “how big should we train?” to “how long should we let it think?” ...

9 min · 1722 words