The emergence of reasoning models like DeepSeek-R1, OpenAI’s o3, and Google’s Gemini thinking mode has fundamentally shifted how we think about LLM inference. These models don’t just generate—they search. The question is no longer “what should the model output?” but “how should the model search for the answer?”
This shift from generation to search has spawned an entire taxonomy of inference-time algorithms, each with distinct trade-offs between computational cost and output quality. Understanding these methods—their mathematical foundations, implementation details, and practical performance—is essential for anyone deploying reasoning models in production.
## The Fundamental Trade-off: Quality vs. Compute
Inference-time scaling represents a paradigm shift from the traditional approach of building larger models. Instead of increasing parameter counts, we allocate more computation during inference to improve output quality.
The scaling law for inference-time compute follows a power-law relationship:
$$ \text{Quality}(C) = \alpha \cdot C^{\beta} $$

where $C$ is the total FLOPs spent during inference, and $\alpha$, $\beta$ are task-dependent constants. This relationship has been empirically validated across multiple reasoning benchmarks, showing that a smaller model with more inference compute can match or exceed the performance of a larger model with less compute.
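To make the shape of this law concrete, here is a small sketch with hypothetical constants (the $\alpha$, $\beta$ values below are purely illustrative; real values must be fit empirically per task):

```python
# Hypothetical constants for illustration only; alpha and beta are
# task-dependent and must be fit empirically per benchmark.
ALPHA, BETA = 10.0, 0.3

def quality(compute_flops):
    """Quality(C) = alpha * C^beta, the assumed power law."""
    return ALPHA * compute_flops ** BETA

def compute_for_quality(target):
    """Invert the power law: FLOPs needed to reach a target quality."""
    return (target / ALPHA) ** (1.0 / BETA)

# Doubling compute multiplies quality by only 2^beta (about 1.23 here),
# so each additional quality point costs disproportionately more FLOPs.
gain_per_doubling = quality(2e12) / quality(1e12)
```

The inverse function makes the budgeting question explicit: given a quality target, how much inference compute must we spend?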
But here’s the critical insight: not all inference-time strategies are created equal. A naive approach of generating $N$ responses and selecting the best (Best-of-N) scales linearly with $N$, but smarter algorithms can achieve the same quality with significantly less computation.
## Best-of-N Sampling: The Baseline That Started Everything
Best-of-N (BoN) sampling is the simplest inference-time optimization: generate $N$ candidate responses, score each with a reward model, and return the highest-scoring one.
The algorithm is straightforward:
```python
def best_of_n(prompt, model, reward_model, n):
    candidates = []
    for _ in range(n):
        response = model.generate(prompt)
        score = reward_model.score(prompt, response)
        candidates.append((response, score))
    # Return the highest-scoring response
    return max(candidates, key=lambda x: x[1])[0]
```
The expected improvement from BoN follows from order statistics. If individual response qualities are drawn from distribution $F(q)$, the quality of the best of $N$ samples is:
$$ \mathbb{E}[Q_{\max}] = \int_{-\infty}^{\infty} q \cdot N \cdot F(q)^{N-1} \cdot f(q) \, dq $$

This shows diminishing returns: doubling $N$ doesn’t double quality. More critically, BoN requires $N$ times the computation of a single generation, making it prohibitively expensive for large-scale deployment.
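The diminishing returns are easy to see numerically. A quick Monte Carlo sketch, assuming standard-normal response qualities purely for illustration:

```python
import random
import statistics

def expected_best_of_n(n, trials=20_000, seed=0):
    """Monte Carlo estimate of E[max of n i.i.d. N(0,1) qualities]."""
    rng = random.Random(seed)
    return statistics.fmean(
        max(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)
    )

# Going from 1 to 4 samples buys far more expected quality than going
# from 16 to 64, even though both are 4x increases in compute.
e1, e4 = expected_best_of_n(1), expected_best_of_n(4)
e16, e64 = expected_best_of_n(16), expected_best_of_n(64)
```

For a Gaussian, the expected maximum grows only like $\sqrt{2 \ln N}$, which is the order-statistics result above in miniature.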
Recent benchmarks on GSM8K show that BoN-16 with GPT-3.5 achieves approximately 78% accuracy, while BoN-64 reaches 82%: a 4-point gain at 4× the cost. The question becomes: can we achieve similar gains more efficiently?
## Optimal Stopping Theory: When Mathematics Meets Inference
A 2025 breakthrough from researchers at UC Berkeley and CMU reframed inference-time optimization through the lens of optimal stopping theory, specifically the Pandora’s Box problem.
The Pandora’s Box Formulation: Each LLM generation is analogous to opening a “box” that reveals a random reward (response quality). The cost is the computational expense of generation. The challenge: decide when to stop generating and accept the best response seen so far.
The key insight is that when the reward distribution is unknown, we can use a UCB-style (Upper Confidence Bound) algorithm that learns stopping thresholds on the fly.
The algorithm maintains:
- Running estimates of reward statistics: $\hat{\mu}_t$, $\hat{\sigma}_t$
- An adaptive threshold $\tau_t$ based on expected future gains
- A stopping criterion that balances exploration vs. exploitation
The threshold is computed using a Bradley-Terry inspired transformation to normalize rewards across different prompts:
$$ \tau_t = \hat{\mu}_t + \beta \cdot \hat{\sigma}_t \cdot \sqrt{\frac{\log t}{t}} $$

where $\beta$ is a tunable parameter controlling exploration.
Results: On AlpacaFarm and HH-RLHF datasets, this optimal stopping approach achieves the same performance as BoN while requiring 15-35% fewer generations. The theoretical analysis proves that the algorithm’s competitive ratio (ratio of achieved reward to optimal reward) converges to within a constant factor of the theoretical optimum.
The elegance of this approach lies in its adaptivity: it generates more responses for difficult prompts where quality variance is high, and fewer for easy prompts where early termination is optimal.
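A minimal sketch of such an adaptive stopping loop, using the threshold rule above. This is an illustration of the idea, not the authors' exact algorithm; `sample_fn` is a stand-in for one generate-and-score call:

```python
import math

def adaptive_best_of_n(sample_fn, beta=1.0, min_samples=2, max_samples=64):
    """Sample until the best reward seen clears the adaptive threshold
    tau_t = mu_t + beta * sigma_t * sqrt(log t / t).

    sample_fn() returns a (response, reward) pair. Returns the best
    response, its reward, and the number of samples actually drawn.
    """
    rewards, best, best_r = [], None, float("-inf")
    for t in range(1, max_samples + 1):
        response, r = sample_fn()
        rewards.append(r)
        if r > best_r:
            best, best_r = response, r
        if t < min_samples:
            continue  # need a few samples before the statistics are usable
        mu = sum(rewards) / t
        sigma = (sum((x - mu) ** 2 for x in rewards) / t) ** 0.5
        tau = mu + beta * sigma * math.sqrt(math.log(t) / t)
        if best_r >= tau:  # best seen already clears the threshold: stop
            break
    return best, best_r, len(rewards)
```

Note the adaptivity: a prompt whose early samples are tightly clustered produces a low $\hat{\sigma}_t$, so the threshold drops and sampling stops early.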
## Tree-of-Thoughts: Reasoning as Search
Tree-of-Thoughts (ToT), introduced by Yao et al. in 2023, represents a fundamental shift from linear generation to explicit search. Instead of generating a single chain of reasoning, ToT maintains a tree of reasoning states and explores multiple paths.
The Architecture:
- Thought Decomposition: Break complex problems into intermediate reasoning steps
- State Evaluation: Use a value function or LLM self-evaluation to score partial solutions
- Search Strategy: Use BFS, DFS, or beam search to explore the tree
```python
class TreeOfThoughts:
    def search(self, problem, max_depth, beam_width):
        frontier = [ThoughtNode(problem, "", 0)]
        for depth in range(max_depth):
            next_frontier = []
            for node in frontier:
                # Generate possible next thoughts
                thoughts = self.generate_thoughts(node, k=beam_width)
                for thought in thoughts:
                    # Evaluate the thought (value function or self-evaluation)
                    score = self.evaluate_thought(node, thought)
                    if score > self.threshold:
                        next_frontier.append(ThoughtNode(problem, thought, score))
            # Keep the top beam_width nodes, highest score first
            frontier = sorted(next_frontier, key=lambda x: x.score,
                              reverse=True)[:beam_width]
        return self.extract_solution(frontier)
```
ToT excels in tasks requiring backtracking, such as mathematical reasoning and strategic planning. On the Game of 24 benchmark, GPT-4 with ToT achieves a 74% success rate compared to 4% with standard prompting, an 18.5× improvement.
However, ToT’s computational cost grows exponentially with tree depth. The number of LLM calls is:
$$ \text{Calls} = \sum_{d=0}^{D} b^d = \frac{b^{D+1} - 1}{b - 1} $$

where $b$ is the branching factor and $D$ is the maximum depth. This quickly becomes intractable for deep reasoning tasks.
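The geometric-series formula translates directly into code, and plugging in modest parameters shows how fast the budget explodes:

```python
def tot_llm_calls(branching_factor, max_depth):
    """Total nodes in a full b-ary thought tree of depth D:
    sum_{d=0}^{D} b^d = (b^(D+1) - 1) / (b - 1)."""
    b, d = branching_factor, max_depth
    return (b ** (d + 1) - 1) // (b - 1)

# b=5, D=3 -> 156 LLM calls; deepening to D=6 -> 19,531 calls.
shallow = tot_llm_calls(5, 3)
deep = tot_llm_calls(5, 6)
```

Tripling the depth here inflates the call count by two orders of magnitude, which is why beam-width pruning in the code above is essential in practice.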
## TreeBoN: Bridging Tree Search and Sampling
TreeBoN, presented at EMNLP 2025, combines the best of both worlds: the quality improvements of tree search with the efficiency of sampling.
Key Innovation: Instead of generating complete responses and then searching, TreeBoN performs speculative tree search during generation. It maintains multiple partial responses in parallel, branching at strategic points and pruning low-quality paths.
The algorithm leverages token-level rewards from Direct Preference Optimization (DPO):
$$ r_{\text{token}}(x_t \mid x_{<t}) = \beta \log \frac{\pi_{\theta}(x_t \mid x_{<t})}{\pi_{\text{ref}}(x_t \mid x_{<t})} $$

where $\pi_{\theta}$ is the policy model, $\pi_{\text{ref}}$ is the reference model, and $\beta$ is the DPO temperature.

The TreeBoN Workflow:
- Initialize with $N$ root nodes (similar to BoN)
- Iterate:
  - For each active node, generate the next token
  - Compute its token-level reward
  - Branch high-reward nodes (create multiple continuations)
  - Prune low-reward nodes (free up compute)
- Terminate when all paths reach end-of-sequence or the budget is exhausted
- Return the highest-scoring complete response
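The workflow can be sketched as a branch-and-prune loop. Here `step_fn` and `reward_fn` are placeholders for the generator and the DPO-derived token reward; this is a toy illustration of the control flow, not the paper's implementation:

```python
def treebon_sketch(prompt, step_fn, reward_fn, n_active=4,
                   branch_factor=2, max_steps=16):
    """Keep n_active partial responses alive, branch each with up to
    branch_factor continuations, and prune by reward.

    step_fn(path) -> list of candidate next chunks ([] at end of
    sequence); reward_fn(path) -> scalar reward for a partial path.
    """
    active = [prompt] * n_active          # N root nodes, as in BoN
    finished = []
    for _ in range(max_steps):            # compute budget
        candidates = []
        for path in active:
            steps = step_fn(path)
            if not steps:                 # path reached end-of-sequence
                finished.append(path)
                continue
            for step in steps[:branch_factor]:   # branch
                candidates.append(path + step)
        if not candidates:
            break                         # every live path has finished
        # Prune: keep only the n_active highest-reward partial paths
        candidates.sort(key=reward_fn, reverse=True)
        active = candidates[:n_active]
    else:
        finished.extend(active)           # budget exhausted mid-path
    return max(finished, key=reward_fn)
```

The total generation work stays bounded by `n_active * max_steps` chunks regardless of how often paths branch, which is the sense in which TreeBoN matches BoN's compute budget.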
Performance: On TutorEval, TreeBoN achieves a 65% win rate against baseline BoN with identical computational cost. On GSM8K, it improves accuracy from 78% (BoN-16) to 83% with the same FLOPs.
The key advantage is that TreeBoN concentrates computation on promising paths rather than wasting it on dead ends. Early pruning decisions prevent investing compute in responses that would score poorly anyway.
## Monte Carlo Tree Search: AlphaGo’s Legacy in Language
MCTS, the algorithm behind AlphaGo’s historic victory, has found new life in LLM reasoning. The connection is natural: both games and reasoning involve sequential decision-making under uncertainty.
MCTS for LLMs adapts the four classic phases:

1. **Selection**: Traverse the tree using UCB:

   $$\text{UCB}(n) = \frac{Q(n)}{N(n)} + c \sqrt{\frac{\ln N(p)}{N(n)}}$$

   where $n$ is the current node, $p$ is its parent, and $c$ is the exploration constant.

2. **Expansion**: Generate possible next thoughts/steps

3. **Simulation**: Complete the reasoning chain (rollout)

4. **Backpropagation**: Update value estimates up the tree
Progressive Widening: A critical optimization for LLMs. Instead of expanding all possible tokens at each node, progressively widen the branching factor as visitation counts increase:
$$ b(n) = \min\left(b_{\max}, \left\lfloor c \cdot N(n)^\alpha \right\rfloor \right) $$

This prevents the combinatorial explosion while ensuring adequate exploration.
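Both the UCB rule and progressive widening reduce to one-liners; a sketch with illustrative default constants:

```python
import math

def ucb_score(q_total, n_node, n_parent, c_explore=1.4):
    """Selection score: Q(n)/N(n) + c * sqrt(ln N(p) / N(n))."""
    return q_total / n_node + c_explore * math.sqrt(math.log(n_parent) / n_node)

def allowed_children(n_visits, b_max=8, c=1.0, alpha=0.5):
    """Progressive widening: b(n) = min(b_max, floor(c * N(n)^alpha))."""
    return min(b_max, math.floor(c * n_visits ** alpha))
```

With $\alpha = 0.5$, a node must be visited 4 times before it may hold 2 children and 100 times before hitting the cap, so expensive LLM expansions are deferred until a subtree has proven its worth.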
ReST-MCTS*, a recent ICML 2025 contribution, combines MCTS with self-training. The tree search is used to generate high-quality reasoning traces, which then train the model—creating a reinforcement loop that continuously improves both the model and the search policy.
Results on MATH benchmark show MCTS-based methods achieving 58% accuracy compared to 42% for standard prompting, but at 10-50× the computational cost.
## Reward-Guided Speculative Decoding: The Efficiency Frontier
Reward-Guided Speculative Decoding (RSD), introduced by Salesforce AI Research in early 2025, represents the current frontier in efficient inference-time reasoning.
The Core Idea: Combine a fast draft model with a powerful target model, using a Process Reward Model (PRM) to guide when to accept draft outputs vs. invoke the full model.
Architecture:
- Draft Model ($M_D$): A small, fast model (e.g., 1B-7B parameters)
- Target Model ($M_T$): A large, capable model (e.g., 70B+ parameters)
- Process Reward Model: Evaluates intermediate reasoning steps
The algorithm:
```python
def rsd_decode(prompt, draft_model, target_model, prm,
               low_threshold, high_threshold):
    tokens = []
    while not tokens or tokens[-1] != EOS_TOKEN:
        # Draft phase: the small model proposes k candidate tokens
        draft_tokens = draft_model.generate(prompt + tokens, k=8)
        for i, token in enumerate(draft_tokens):
            # Evaluate the partial reasoning with the process reward model
            reward = prm.score(prompt + tokens + draft_tokens[:i + 1])
            if reward < low_threshold:
                # Low reward: fall back to the target model
                verified = target_model.verify(prompt + tokens, draft_tokens[:i])
                tokens.extend(verified)
                break
            elif reward > high_threshold:
                # High reward: accept the draft token as-is
                tokens.append(token)
            else:
                # Medium reward: rejection-sample between draft and target
                tokens.append(speculative_sample(token, draft_model, target_model))
    return tokens
```
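The `speculative_sample` call above is left abstract. A minimal version using the standard speculative-sampling acceptance rule, assuming for this sketch that both models expose next-token distributions as token-to-probability dicts (an interface assumption, not part of RSD):

```python
import random

def speculative_sample(token, draft_probs, target_probs, rng=random):
    """Accept the draft token with probability min(1, p_target/p_draft);
    on rejection, resample from the renormalized residual
    max(0, p_target - p_draft). This preserves the target distribution
    exactly (the standard speculative-sampling correctness guarantee).
    """
    accept_prob = min(1.0, target_probs[token] / draft_probs[token])
    if rng.random() < accept_prob:
        return token
    # Rejected: sample from the residual distribution instead
    residual = {t: max(0.0, p - draft_probs.get(t, 0.0))
                for t, p in target_probs.items()}
    tokens, weights = zip(*residual.items())
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Tokens the target model likes at least as much as the draft model are always accepted; only tokens the draft model over-proposes are ever resampled.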
Theoretical Foundation: RSD optimizes a threshold-based mixture strategy that achieves provably optimal trade-offs between compute and quality. The key insight is that not every token needs the target model’s scrutiny—most can be handled by the draft model, with the target model focusing on critical decision points.
Performance: On Olympiad-level reasoning tasks, RSD achieves up to 4.4× fewer FLOPs compared to target-only decoding while maintaining or exceeding accuracy. On MATH-500, it improves accuracy by +3.5 points over parallel decoding methods.
## Comparative Analysis: The Performance Landscape
| Method | Compute Scaling | GSM8K | MATH | AlpacaFarm Win Rate | Implementation Complexity |
|---|---|---|---|---|---|
| Greedy | $O(1)$ | 52% | 24% | 28% | ★☆☆☆☆ |
| Best-of-N | $O(N)$ | 82% (N=64) | 48% (N=64) | 54% (N=64) | ★★☆☆☆ |
| Optimal Stopping | $O(N_{\text{adaptive}})$ | 80% | 46% | 52% | ★★★☆☆ |
| Tree-of-Thoughts | $O(b^D)$ | 85% | 52% | 58% | ★★★★☆ |
| TreeBoN | $O(N \cdot d_{\text{avg}})$ | 83% | 51% | 65% | ★★★★☆ |
| MCTS | $O(V \cdot D)$ | 88% | 58% | 62% | ★★★★★ |
| RSD | $O(N \cdot r_{\text{draft}})$ | 86% | 55% | 60% | ★★★★☆ |
Key Observations:
- Best bang for buck: Optimal Stopping offers 80-90% of BoN’s quality at 65-85% of the cost
- Maximum quality: MCTS achieves the highest scores but at 10-50× the cost of greedy
- Best efficiency frontier: RSD provides near-MCTS quality at 4× lower compute
- Best win rate: TreeBoN leads on alignment tasks with 65% win rate
## Production Considerations
Latency vs. Quality Trade-offs:
Real-time applications (chatbots, coding assistants) typically operate under strict latency budgets (< 2 seconds for first token). This constrains the search space:
```python
# Latency-aware search configuration
def configure_search(latency_budget_ms, task_difficulty):
    if latency_budget_ms < 500:
        return {"method": "optimal_stopping", "max_samples": 4}
    elif latency_budget_ms < 2000:
        if task_difficulty > 0.7:
            return {"method": "TreeBoN", "budget": 16}
        else:
            return {"method": "BoN", "n": 8}
    else:
        return {"method": "MCTS", "simulations": 100}
```
Compute Budgeting:
For batch processing (evaluations, content generation), the key metric is FLOPs per quality point. Analysis shows:
- Simple tasks (GSM8K): Optimal Stopping provides 2.1× FLOPs efficiency over BoN
- Complex tasks (MATH): MCTS provides best quality but RSD offers 4.4× better efficiency
- Alignment tasks (HH-RLHF): TreeBoN achieves best win rates per FLOP
Infrastructure Requirements:
| Method | Memory Overhead | Parallelization | Reward Model Required |
|---|---|---|---|
| Best-of-N | $N \times$ output cache | Perfect | Yes |
| Optimal Stopping | Minimal | Perfect | Yes |
| ToT | $b^D$ state cache | Limited | Yes (value function) |
| TreeBoN | $N \times d$ partial cache | Good | Yes (token-level) |
| MCTS | Tree structure | Poor | Yes (simulation) |
| RSD | Draft + target states | Excellent | Yes (PRM) |
## The Road Ahead: Open Problems and Emerging Directions
Adaptive Search Depth: Current methods use fixed search parameters. Future systems should dynamically adjust depth and branching based on problem difficulty estimates.
Multi-Model Ensembles: Combining multiple models with different capabilities (reasoning specialists, code specialists) within a single search process shows promise but introduces coordination challenges.
Learned Search Policies: Instead of generic search algorithms, training neural networks to predict optimal search actions could dramatically improve efficiency. Early work on “Neural Chain-of-Thought Search” suggests 2-3× speedups.
Memory-Efficient Search: Current tree-based methods require significant memory for state caching. Innovations in incremental computation and state compression could enable deeper searches on limited hardware.
Verification-Guided Search: For mathematical reasoning, integrating formal verification tools (Lean, Coq) as terminal evaluators could ensure correctness while pruning the search space.
## The Verdict
Inference-time search has transformed from a research curiosity to a production necessity. The choice of method depends critically on:
- Latency requirements: Optimal Stopping for real-time, MCTS for offline
- Compute budget: RSD for efficiency-critical, BoN for simplicity
- Task type: TreeBoN for alignment, MCTS for mathematical reasoning
- Infrastructure: ToT for stateful, BoN for stateless deployments
The mathematics is clear: intelligent search algorithms can achieve 4× efficiency improvements over naive sampling. The engineering challenge is implementing these algorithms robustly at scale—a challenge that will define the next generation of LLM deployment infrastructure.
The future isn’t about bigger models generating better answers—it’s about smarter models searching more efficiently. And that search begins at the end of a branch.