A math student solves a complex integration problem. Her final answer is correct, but halfway through, she made a sign error that accidentally canceled out in the next step. The teacher gives full marks—after all, the answer is right. But should it count?
This scenario captures a fundamental flaw in how Large Language Model (LLM) reasoning has traditionally been evaluated: Outcome Reward Models (ORMs) check only the final destination, ignoring whether the path was sound. Process Reward Models (PRMs) represent a paradigm shift: they verify every step of reasoning, catch hidden errors that coincidentally produce correct answers, and enable the test-time scaling strategies associated with reasoning models like OpenAI's o1 and DeepSeek-R1.
The Verification Problem: Why Final Answers Lie
Traditional reward models operate on a simple principle: generate a solution, check if the final answer matches the ground truth, assign reward accordingly. This approach, known as outcome supervision, works well for tasks with deterministic outcomes but catastrophically fails for multi-step reasoning.
Consider a model solving the equation $2x + 6 = 14$. A correct solution path would be:
Step 1: Subtract 6 from both sides → 2x = 8
Step 2: Divide by 2 → x = 4
But imagine the model produces:
Step 1: Subtract 6 from both sides → 2x = 8
Step 2: Divide by 4 → x = 2 [ERROR]
Step 3: Multiply by 2 → x = 4 [ERROR]
An outcome reward model sees $x = 4$ and assigns maximum reward. The model learned nothing about its reasoning mistakes—in fact, it may have reinforced incorrect problem-solving strategies. Research from OpenAI’s “Let’s Verify Step by Step” paper (Lightman et al., 2023) demonstrates that up to 18% of “correct” solutions on the MATH benchmark contain reasoning errors masked by lucky cancellations.
How Process Reward Models Work
PRMs introduce granular, step-level verification. Instead of a single reward at the end, PRMs assign a score $r_i$ to each reasoning step $s_i$:
$$R_{\text{total}} = \sum_{i=1}^{n} r_i \cdot w_i$$

where $w_i$ is an optional per-step weighting factor. This formulation enables several critical capabilities:
Fine-Grained Error Detection: Each step receives independent evaluation. The model learns exactly where reasoning diverges from correctness.
Dense Feedback Signal: Unlike sparse outcome rewards that provide signal only at trajectory termination, PRMs offer continuous feedback throughout the reasoning process.
Search Guidance: During inference, PRMs can guide tree search algorithms (beam search, MCTS) toward promising reasoning paths, dramatically improving test-time compute efficiency.
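The weighted sum above is trivial to compute once the PRM has scored each step; a minimal sketch (function and variable names are illustrative, not from any particular library):

```python
def total_reward(step_scores, weights=None):
    """Aggregate per-step PRM scores r_i into a trajectory-level reward.

    weights defaults to uniform (w_i = 1). Some PRM work aggregates with
    min or product instead of sum; the sum here matches the formula above.
    """
    if weights is None:
        weights = [1.0] * len(step_scores)
    return sum(r * w for r, w in zip(step_scores, weights))

# A low score on step 2 drags the total down and flags where to look:
total_reward([0.9, 0.2, 0.8])
```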
The architecture of a PRM typically mirrors the base language model with a classification head added on top. For each token position corresponding to a step boundary, the model predicts correctness probability:
$$P(\text{correct} \mid s_{\leq i}, \text{question}) = \sigma(W \cdot h_i + b)$$

where $h_i$ is the hidden state at step boundary $i$, and $\sigma$ is the sigmoid function.
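In code, that head is just a linear layer followed by a sigmoid. A pure-Python sketch of the forward computation (in a real PRM this would be a learned layer over the transformer's hidden states):

```python
import math

def step_correct_prob(h_i, W, b):
    """P(correct | s_<=i, question) = sigma(W . h_i + b).

    h_i: hidden state at step boundary i (a plain list of floats here);
    W, b: the classification head's learned weights and bias.
    """
    logit = sum(w * x for w, x in zip(W, h_i)) + b
    return 1.0 / (1.0 + math.exp(-logit))
```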
The Data Bottleneck: From Manual Annotation to Automatic Generation
Training PRMs requires step-level correctness labels—a far more expensive proposition than outcome labels. OpenAI’s PRM800K dataset (Lightman et al., 2023) contains 800,000 step-level labels across 75,000 solutions to 12,000 MATH problems. Each label required human annotators to judge whether a reasoning step was mathematically sound.
The cost is staggering: PRM800K required an estimated 2,000+ annotator hours. This bottleneck sparked innovation in automatic annotation methods.
Math-Shepherd (Wang et al., 2023) pioneered automatic process annotation through a clever insight: if completions sampled from a given step tend to reach the correct final answer, that step is likely correct. The algorithm:
- For each step $s_i$, sample multiple completions
- Measure the probability of reaching correct final answers from step $i$
- Use this “completion correctness rate” as a proxy for step correctness
This approach generated 1.5 million process supervision annotations automatically, training PRMs that approached human-labeled performance.
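Under the hood, the Math-Shepherd proxy is just Monte Carlo rollouts. A sketch, where `complete` and `is_correct` are hypothetical stand-ins for the generator and the final-answer grader:

```python
def auto_label_step(question, prefix, step, complete, is_correct, n=8):
    """Label a step by its completion correctness rate (Math-Shepherd style).

    Roll out n solutions continuing from this step; the fraction that reach
    a correct final answer serves as a soft correctness score in [0, 1].
    """
    hits = sum(
        1
        for _ in range(n)
        if is_correct(question, complete(question, prefix + [step]))
    )
    return hits / n  # 1.0 = every rollout succeeded; 0.0 = none did
```

A hard 0/1 label can then be derived by thresholding the soft score (e.g. treating any nonzero rate as "correct", as in Math-Shepherd's hard-estimation variant).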
ThinkPRM (Khalifa et al., 2025) took a different approach: instead of discriminative classification, use generative verification. The model produces a chain-of-thought explanation for why each step is correct or incorrect:
```
Step: "Divide both sides by x"
Verification: "This step has a critical flaw. When we divide by x,
we assume x ≠ 0. The equation could have x = 0 as a solution,
which would be lost. This step is INCORRECT."
```
Remarkably, ThinkPRM achieves competitive performance with only 8,000 process labels—100× fewer than PRM800K—by leveraging the model’s own reasoning capabilities for verification.
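Because the verifier's judgment arrives as free text, using it programmatically means parsing out the verdict. A sketch, assuming the verifier is prompted to end its critique with CORRECT or INCORRECT (the actual ThinkPRM output format may differ):

```python
def parse_verdict(verification: str):
    """Extract a binary step judgment from a generative verifier's critique.

    Checks INCORRECT before CORRECT, since the former contains the latter
    as a substring.
    """
    text = verification.upper()
    if "INCORRECT" in text:
        return False
    if "CORRECT" in text:
        return True
    return None  # no explicit verdict; caller may re-prompt the verifier
```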
PRM at Inference: Scaling Test-Time Compute
The true power of PRMs emerges during inference. They enable sophisticated test-time scaling strategies that trade computation for accuracy.
Best-of-N Sampling: Generate $N$ candidate solutions, score each with the PRM, select the highest-scoring one. Research shows that PRM-guided Best-of-N achieves 2-3× better accuracy than outcome-based selection at equivalent compute budgets.
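Best-of-N is the simplest of these strategies to implement; a sketch with hypothetical `generate` and `prm_score` callables standing in for the generator and the PRM:

```python
def best_of_n(question, generate, prm_score, n=16):
    """Sample n full solutions and keep the one the PRM scores highest.

    prm_score would typically aggregate per-step scores (sum, min, or
    last-step probability, depending on the PRM).
    """
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda sol: prm_score(question, sol))
```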
Beam Search with PRM: Maintain $k$ reasoning paths in parallel, expand each by one step, use PRM scores to prune and select the top $k$ for the next iteration. This approach combines the exploration of tree search with PRM’s step-level guidance:
```python
def prm_beam_search(question, model, prm, beam_width=4, max_steps=10):
    beams = [([], 0.0)]  # (steps, cumulative_score)
    for _ in range(max_steps):
        candidates = []
        for steps, score in beams:
            # Generate candidate continuations for this partial solution
            next_steps = model.generate_next_steps(question, steps, n=beam_width)
            for step in next_steps:
                step_score = prm.score_step(question, steps, step)
                candidates.append((steps + [step], score + step_score))
        # Keep the top-k highest-scoring beams
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]
    return beams[0][0]  # Return the highest-scoring solution's steps
```
MCTS Integration: Monte Carlo Tree Search treats reasoning as a game tree, with PRM providing node evaluation. Each node represents a partial solution, and PRM scores guide the search toward promising branches while exploring alternatives:
```mermaid
graph TD
    A[Question] --> B[Step 1: Expand]
    B --> C[Step 2: Simplify]
    B --> D[Step 2: Substitute]
    C --> E[Step 3: Solve]
    C --> F[Step 3: Factor]
    D --> G[Step 3: Integrate]
    style A fill:#e1f5fe
    style E fill:#c8e6c9
    style F fill:#ffcdd2
    style G fill:#c8e6c9
```
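The selection rule inside such a tree search is typically UCT, with PRM scores supplying the value estimates. A sketch of the selection step (exactly how the PRM score enters, as a prior, a value, or both, varies across papers):

```python
import math

def uct_select(children, c=1.4):
    """Return the index of the child node to expand next.

    Each child is (visit_count, total_prm_value). Unvisited children are
    tried first; otherwise we balance average PRM value (exploitation)
    against an exploration bonus for rarely visited branches.
    """
    total_visits = sum(v for v, _ in children)
    def uct(child):
        visits, value = child
        if visits == 0:
            return float("inf")
        return value / visits + c * math.sqrt(math.log(total_visits) / visits)
    return max(range(len(children)), key=lambda i: uct(children[i]))
```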
Benchmarks and Real-World Performance
Several benchmarks have emerged to evaluate PRM capabilities:
ProcessBench (Zheng et al., 2024) contains 3,200 mathematical reasoning problems with step-level error annotations. Unlike MATH or GSM8K, ProcessBench directly measures a PRM’s ability to identify which specific step contains an error. Top PRMs achieve ~70% F1 score on this benchmark—impressive but revealing substantial room for improvement.
PRMBench focuses on fine-grained error detection across error types: calculation errors, logical errors, and missing steps. Results show PRMs excel at catching calculation errors (>85% accuracy) but struggle with subtle logical flaws (<60% accuracy).
The performance gap between PRM-guided and unguided inference is substantial:
| Model | MATH (no PRM) | MATH (PRM Best-of-16) | Improvement (points) |
|---|---|---|---|
| Llama-3-70B | 45.2% | 58.7% | +13.5% |
| DeepSeek-V2 | 51.8% | 67.3% | +15.5% |
| Qwen-2.5-72B | 48.9% | 63.1% | +14.2% |
Beyond Mathematics: PRMs in Medicine and Code
The PRM paradigm extends beyond mathematical reasoning:
Med-PRM (Shi et al., 2025) applies process verification to medical diagnosis. Each diagnostic step is verified against established medical guidelines retrieved via RAG. The system achieved 12% improvement on clinical reasoning benchmarks by catching reasoning steps that contradicted established medical knowledge.
Code Verification: PRMs can verify each step of code generation—checking variable scopes, type consistency, and logical flow before execution. Early experiments show 23% reduction in runtime errors for generated code.
Challenges and Limitations
PRMs aren’t without significant challenges:
Annotation Noise: Math-Shepherd’s automatic annotation introduces noise—the proxy of “completion success rate” doesn’t perfectly correlate with step correctness. Research shows ~15% label disagreement between automatic and human annotation.
Out-of-Distribution Generalization: PRMs trained on algebra problems struggle with geometry or number theory. The step-level patterns they learn don’t always transfer across mathematical domains.
Computational Overhead: Scoring every step during inference adds latency. For real-time applications, the 2-5× slowdown from PRM evaluation may be unacceptable.
The Step Definition Problem: What constitutes a “step”? Different datasets and papers use different granularities, making fair comparison difficult. Some define steps by line breaks, others by logical operations, still others by human judgment.
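Even segmentation itself is a design decision. A sketch of two common (and incompatible) conventions:

```python
def split_steps(solution, granularity="line"):
    """Segment a solution into steps, by single line or by blank-line block.

    The choice changes both the training labels and inference-time scoring,
    which is one reason cross-paper PRM comparisons are tricky.
    """
    if granularity == "line":
        parts = solution.splitlines()
    elif granularity == "block":
        parts = solution.split("\n\n")
    else:
        raise ValueError(f"unknown granularity: {granularity}")
    return [p.strip() for p in parts if p.strip()]
```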
The Road Ahead: From Verification to True Reasoning
PRMs represent a crucial piece in the puzzle of LLM reasoning, but they're not the complete picture, and the most advanced reasoning models use them selectively. OpenAI has not disclosed o1's training recipe, and DeepSeek-R1's report explicitly chose outcome-based RL over PRMs, citing reward hacking and annotation cost as obstacles. Still, the feedback loop PRMs enable is compelling: the PRM identifies errors, RL trains the model to avoid them, the improved model generates cleaner reasoning chains, and the PRM in turn becomes more effective.
The next frontier involves self-improving verification: PRMs that learn from their own mistakes, updating their step-level judgments based on downstream outcomes. Early research on iterative PRM training shows promise, with each iteration improving both the generator and verifier.
Process Reward Models have transformed how we think about LLM reasoning verification. By focusing on the path rather than just the destination, they’ve enabled the test-time scaling that powers today’s reasoning models. The journey from outcome supervision to process supervision mirrors a deeper truth about intelligence itself: in complex reasoning, how you arrive at an answer matters as much as the answer itself.
The student who accidentally canceled errors to reach the right answer? She might pass the test, but she won’t become a mathematician. PRMs ensure LLMs learn the right lessons from every problem they solve—not just which answers are correct, but which reasoning paths reliably lead there.