When the Path Matters More Than the Answer: How Process Reward Models Transform LLM Reasoning

A math student solves a complex integration problem. Her final answer is correct, but halfway through, she made a sign error that accidentally canceled out in the next step. The teacher gives full marks—after all, the answer is right. But should it count? This scenario captures the fundamental flaw in how we’ve traditionally evaluated Large Language Model (LLM) reasoning: Outcome Reward Models (ORMs) only check the final destination, ignoring whether the path was sound. Process Reward Models (PRMs) represent a paradigm shift—verifying every step of reasoning, catching those hidden errors that coincidentally produce correct answers, and enabling the test-time scaling that powers reasoning models like OpenAI’s o1 and DeepSeek-R1. ...

7 min · 1473 words

When a 1B Model Beats a 405B Giant: How Test-Time Compute Is Rewriting the Rules of LLM Scaling

For years, the path to better LLMs seemed straightforward: more parameters, more training data, more compute. The scaling laws articulated by Kaplan et al. and refined by Chinchilla painted a clear picture—performance improved predictably with model size. Then OpenAI released o1, and suddenly the rules changed. A model that “thinks longer” at inference time was solving problems that eluded models 10x its size. The breakthrough wasn’t just engineering—it was a fundamental shift in how we think about compute allocation. The question flipped from “how big should we train?” to “how long should we let it think?” ...

9 min · 1722 words