When OpenAI’s o1 model spent unprecedented computational resources during inference, the AI community witnessed a paradigm shift: models could now trade thinking time for intelligence. But the real engineering challenge isn’t whether to scale inference compute—it’s how to scale it optimally. The choice between serial thinking (longer chains) and parallel thinking (more branches) fundamentally changes the cost-performance curve, and getting it wrong can mean burning 4x more compute for identical results.

The Two Axes of Inference Scaling

Test-time compute scaling operates along two independent axes. Serial scaling extends the depth of reasoning—generating longer chain-of-thought traces, adding self-correction steps, or allowing the model to “think harder” about a single solution path. Parallel scaling expands the breadth—generating multiple independent solutions and selecting the best through voting or verification.

The mathematical relationship is straightforward but consequential. For serial scaling with token budget $T$ and per-token cost $c$:

$$\text{Serial Cost} = c \cdot T$$

For parallel scaling with $N$ independent samples:

$$\text{Parallel Cost} = c \cdot \sum_{i=1}^{N} T_i$$

The critical insight from Google DeepMind’s compute-optimal scaling research is that these costs don’t translate linearly to performance. A naive best-of-N baseline might require 4x the compute to match a compute-optimal strategy that adaptively chooses between serial and parallel approaches based on problem difficulty.

Serial Scaling: The Art of Extended Thinking

Budget Forcing: Controlling the Thinker

Claude 3.7 Sonnet’s extended thinking feature represents the most user-facing implementation of serial scaling. The model can be instructed to think for a specified token budget—a minimum of 1,024 tokens with Anthropic recommending incremental increases to find optimal stopping points.

The s1 paper introduced a more aggressive control mechanism called budget forcing. When the model attempts to conclude prematurely, the system appends “Wait” tokens to force continued reasoning. Conversely, when thinking exceeds the allocated budget, a “Final Answer:” delimiter terminates generation. This creates a controllable dial for trading accuracy against latency:

def budget_forcing(model, prompt, max_tokens, min_tokens=1024):
    """Sketch of s1-style budget forcing; the model methods are illustrative."""
    response = model.generate(prompt)
    
    # Below the minimum budget: append "Wait" to force continued reasoning
    while len(response.thinking_tokens) < min_tokens:
        response = model.continue_generation(response, append="Wait")
    
    # Over the maximum budget: inject the delimiter that ends thinking
    if len(response.thinking_tokens) > max_tokens:
        response = model.force_stop(response, append="Final Answer:")
    
    return response.final_answer

The empirical results show logarithmic accuracy gains: each doubling of thinking tokens yields diminishing but measurable improvements. On mathematical benchmarks, Claude 3.7 Sonnet’s accuracy scales predictably with the allocated thinking budget, though the marginal gains flatten significantly beyond 16k tokens for most problems.

The Underthinking Problem

A January 2025 paper identified a critical failure mode in reasoning models: underthinking. Models frequently switch between different reasoning approaches without sufficiently exploring any single path. The researchers found that incorrect answers often contain more thought transitions than correct ones—models abandon promising approaches too quickly.

The proposed solution, Thought Switching Penalty (TIP), modifies the logits of thought-switching tokens during decoding:

$$\text{logit}'_i = \text{logit}_i - \lambda \cdot \mathbb{1}_{\text{switch}}(i)$$

where $\mathbb{1}_{\text{switch}}(i)$ indicates that token $i$ triggers a thought transition. This penalty discourages premature path abandonment without requiring model retraining, improving accuracy by 2-5% on challenging mathematical benchmarks.
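The penalty amounts to a simple decoding-time logits transform. A minimal sketch, assuming a known set of switch-token ids (e.g. the ids for "Alternatively" or "Wait") and an illustrative value of $\lambda$; neither matches the paper's exact configuration:

```python
def thought_switch_penalty(logits, switch_token_ids, lam=3.0):
    """Apply a TIP-style penalty: subtract lambda from the logits of tokens
    that open a thought transition. `switch_token_ids` and `lam` are
    illustrative assumptions, not values from the paper."""
    return [
        logit - lam if i in switch_token_ids else logit
        for i, logit in enumerate(logits)
    ]
```

In a real decoder this would run once per step before sampling, leaving all non-switch tokens untouched.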

Chain-of-Draft: Efficient Serial Reasoning

Not all serial scaling requires verbose outputs. The Chain-of-Draft (CoD) paradigm recognizes that humans reason using minimal annotations, not full explanations. By prompting models to generate only essential intermediate information—typically 5-10 words per step—CoD achieves comparable accuracy to standard chain-of-thought while using only 7.6% of the tokens.

The trade-off is interpretability: CoD traces are less useful for understanding model reasoning but dramatically reduce inference costs for applications that only need correct answers.
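As a concrete illustration, a CoD-style setup might pair a terse-step prompt with an answer parser. The prompt wording below is an assumption; the "####" separator is a common answer-delimiter convention, used here illustratively:

```python
# Hypothetical CoD system prompt; exact wording is an assumption.
COD_PROMPT = (
    "Think step by step, but keep each step to a minimal draft of at most "
    "five words. Output the final answer after the separator ####."
)

def parse_cod_answer(completion: str) -> str:
    """Extract the final answer from a CoD completion, discarding the
    terse intermediate drafts before the '####' separator."""
    _drafts, _, answer = completion.partition("####")
    return answer.strip()
```

The drafts stay available for debugging, but downstream code consumes only the parsed answer.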

Parallel Scaling: Branching and Selection

Best-of-N with Outcome Reward Models

The simplest parallel strategy generates $N$ independent solutions and selects the highest-scoring one using an Outcome Reward Model (ORM). The performance gain follows:

$$P(\text{correct}) = 1 - (1 - p)^N$$

where $p$ is the single-sample accuracy. However, this assumes perfect verification—the ORM itself introduces error. Real-world gains are lower, and the approach requires generating $N$ full solutions before any can be evaluated.
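The strategy and its ceiling can be sketched in a few lines; `generate` and `score` below are hypothetical stand-ins for a sampling call and an ORM, not a specific API:

```python
def best_of_n(problem, generate, score, n=8):
    """Generate n independent candidate solutions and return the one the
    verifier scores highest. `generate` and `score` are assumed callables."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=score)

def best_of_n_ceiling(p, n):
    """Upper bound on best-of-N accuracy with a *perfect* verifier:
    1 - (1 - p)^N, i.e. the chance at least one sample is correct."""
    return 1 - (1 - p) ** n
```

For example, a model with 40% single-sample accuracy has a best-of-4 ceiling of about 87%; an imperfect ORM lands somewhere below that.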

Self-Consistency and Majority Voting

When answers are verifiable (mathematical results, multiple choice), majority voting provides a verifier-free alternative. Self-consistency generates multiple reasoning paths with diverse samples (temperature > 0) and selects the most frequent answer.

This works because reasoning errors tend to be uncorrelated across different sampling paths—the correct answer emerges through consensus. The approach is particularly effective for problems where wrong answers rarely converge on the same incorrect result.
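Operationally, self-consistency reduces to a majority vote over extracted final answers. A minimal sketch, where `sample` is a hypothetical function returning one final answer per temperature-sampled reasoning path:

```python
from collections import Counter

def self_consistency(problem, sample, n=16):
    """Sample n reasoning paths (temperature > 0) and return the most
    frequent final answer plus its consensus share. `sample` is a
    hypothetical stand-in for one full generate-and-extract pass."""
    answers = [sample(problem) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n  # the share doubles as a crude confidence signal
```

The consensus share is a useful byproduct: a low share signals exactly the systematic-error regime where majority voting is unreliable.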

Process Reward Models: Step-Level Verification

Process Reward Models (PRMs) represent a more sophisticated approach to parallel scaling. Instead of evaluating only final outputs, PRMs score each intermediate reasoning step. This enables:

  1. Early termination: Abandon unpromising branches before completion
  2. Tree search: Explore high-value paths through beam search or MCTS
  3. Error localization: Identify exactly where reasoning fails

OpenAI’s “Let’s Verify Step by Step” paper demonstrated that PRMs trained on process supervision achieve 78.2% accuracy on MATH problems, significantly outperforming outcome-supervised verifiers. The key insight is that PRMs learn to recognize sound reasoning patterns, not just correct final answers.

The ThinkPRM approach extends this further by using a generative process reward model that produces its own verification chain-of-thought. Instead of outputting a scalar score, the PRM generates a detailed critique of each reasoning step, improving calibration on edge cases where binary scoring fails.
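The early-termination and tree-search uses of a PRM can be sketched as a step-level beam search. `propose_step` and `prm_score` below are hypothetical stand-ins for a step sampler and a trained PRM; the beam width and depth are illustrative:

```python
def prm_beam_search(problem, propose_step, prm_score, beam_width=4, max_depth=8):
    """Expand partial reasoning traces step by step, keeping only the
    traces the PRM scores highest. Low-scoring branches are dropped
    before completion (early termination)."""
    beams = [([], 0.0)]  # (list of steps, PRM score)
    for _ in range(max_depth):
        candidates = []
        for steps, _ in beams:
            for step in propose_step(problem, steps):
                trace = steps + [step]
                candidates.append((trace, prm_score(problem, trace)))
        # Prune: keep only the top-scoring partial traces.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]  # highest-scoring complete trace
```

Swapping the pruning rule for UCT-style selection would turn the same skeleton into MCTS; the PRM plays the role of the value function either way.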

Compute-Optimal Scaling: When to Think Deeper vs. Wider

The Google DeepMind paper on optimal test-time scaling provides the crucial framework: the effectiveness of serial vs. parallel scaling depends on problem difficulty.

For problems where the base model already has a non-trivial success probability, parallel scaling often outperforms serial scaling per FLOP. When the model struggles significantly, serial scaling, which allows deeper exploration within a single chain, becomes more effective.

The compute-optimal strategy adaptively allocates resources based on an estimated problem difficulty:

def compute_optimal_scaling(problem, model, compute_budget):
    # Estimate difficulty from a cheap initial sample
    initial = model.generate(problem, max_tokens=512)
    difficulty = estimate_difficulty(initial)
    
    if difficulty < 0.3:  # Easy problem
        return initial  # No scaling needed
    
    elif difficulty < 0.7:  # Medium difficulty
        # Parallel scaling: multiple samples + PRM selection
        return best_of_n_with_prm(problem, model, n=8)
    
    else:  # Hard problem
        # Serial scaling: spend the full budget on extended thinking
        return extended_thinking(problem, model, budget=compute_budget)

The research shows this adaptive approach improves efficiency by more than 4x compared to naive best-of-N across all difficulties.

Production Deployment Considerations

Latency vs. Accuracy Trade-offs

Serial scaling introduces variable latency—the model takes longer on harder problems. This creates challenges for real-time applications with strict SLAs. A practical approach uses budget caps with graceful degradation:

def bounded_thinking(problem, model, max_latency_ms=5000):
    start = time.monotonic()  # requires `import time`
    budget = 1024  # Initial thinking budget
    result = None
    
    while (time.monotonic() - start) * 1000 < max_latency_ms:
        result = model.generate_with_budget(problem, budget)
        if result.confidence > THRESHOLD:
            return result
        budget = min(budget * 2, 16000)  # Double budget, cap at 16k
    
    return result  # Return best available (None if no attempt completed)

Cost Optimization Through Difficulty Prediction

Not every query benefits from extended thinking. Simple factual questions waste compute on unnecessary reasoning. A difficulty classifier—trained on query features or using a fast preliminary pass—can route queries appropriately:

Query Type                 Recommended Strategy                     Typical Cost
Simple factual             No scaling                               1x baseline
Multi-step reasoning       Serial (2-4k tokens)                     2-3x baseline
Complex problem-solving    Serial + Parallel                        10-50x baseline
Code generation            Parallel with execution verification     5-20x baseline

Claude 3.7’s Dual Approach

Anthropic’s extended thinking documentation reveals a two-tier approach. For standard usage, serial scaling with controllable token budgets allows developers to set precise thinking limits. For maximum performance (reported as Claude 3.7’s “high compute” numbers on GPQA), they combine:

  1. Serial scaling: Up to 64k thinking tokens per sample
  2. Parallel scaling: 256 independent samples
  3. Learned scoring: A separate model selects the best solution

This combination achieved 84.8% on GPQA, including 96.5% on physics questions—demonstrating that the two approaches are complementary, not mutually exclusive.

The Verification Bottleneck

A critical constraint on parallel scaling is verification quality. Best-of-N assumes a reliable verifier, but:

  • Outcome Reward Models can be fooled by plausible-looking wrong answers
  • Majority voting fails when errors are systematic rather than random
  • Process Reward Models require expensive training data with step-level annotations

The s1 researchers found their budget-forcing approach outperformed majority voting on their benchmark—suggesting that weak verification makes parallel scaling less efficient than simply thinking longer. But this trade-off is problem-dependent: on code generation, execution verification provides near-perfect signals, making parallel scaling highly effective.
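The code-generation case can be sketched as filtering parallel samples against executable test cases; `run` below is a hypothetical sandboxed executor, not a specific library:

```python
def execution_verified_best_of_n(candidates, test_cases, run):
    """Return the first candidate program that passes every test case.

    `run(candidate, test_input)` is an assumed sandboxed executor that
    returns the candidate's output. Execution gives a near-perfect
    verification signal, which is why parallel scaling works so well here.
    """
    for candidate in candidates:
        if all(run(candidate, x) == expected for x, expected in test_cases):
            return candidate  # passed all tests
    return None  # no sample survived verification
```

The same loop with a learned ORM in place of `run` inherits the ORM's error rate, which is the verification bottleneck in miniature.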

Forward Look: Adaptive Compute Allocation

The frontier of inference scaling research focuses on dynamic compute allocation. Models are being trained not just to reason, but to estimate their own uncertainty and adjust thinking depth accordingly. The “Learning to Reason from Feedback at Test-Time” paper introduces OpTune, a trainable optimizer that updates model weights based on mistakes—a step toward models that improve during deployment.

The engineering challenge shifts from “how much compute to allocate” to “when to stop thinking.” Current models lack reliable internal confidence signals. Extended thinking traces often contain errors that don’t propagate to final answers, making early stopping difficult. Progress on this front—better calibrated uncertainty, learned stopping criteria, or verifiable intermediate results—will determine how efficiently the next generation of reasoning models can scale.

Key Takeaways

  • Serial scaling (longer thinking) and parallel scaling (more branches) offer different cost-performance curves
  • Budget forcing provides fine-grained control over serial thinking depth
  • The “underthinking” problem—premature path switching—can be mitigated through thought switching penalties
  • Compute-optimal scaling adapts strategy based on problem difficulty, improving efficiency 4x over naive approaches
  • Verification quality is the bottleneck for parallel scaling; weak verifiers make serial approaches more efficient
  • Production deployment requires difficulty-based routing, budget caps, and latency management