The sequential nature of autoregressive language models creates a fundamental bottleneck: generating each token requires a full forward pass through billions of parameters. A 70B parameter model processing a single token must load roughly 140GB of weights from memory (FP16), and memory bandwidth—not compute—becomes the limiting factor. This is why a 70B model might generate only 20-30 tokens per second on an H100, despite the GPU being capable of orders of magnitude more computation.
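The arithmetic behind that number is worth making explicit. A quick sketch, assuming roughly 3.35 TB/s of HBM bandwidth (the published figure for an H100 SXM; exact numbers vary by SKU):

```python
weights_gb = 140.0          # 70B parameters * 2 bytes (FP16)
bandwidth_gb_s = 3350.0     # ~H100 SXM HBM3 bandwidth (assumed)

# Every decoded token must stream all weights from HBM once,
# so bandwidth alone bounds single-stream decode speed:
max_tokens_per_s = bandwidth_gb_s / weights_gb
print(round(max_tokens_per_s, 1))  # 23.9
```

Real deployments land somewhat below this bound once KV-cache reads and kernel overheads are counted, which is consistent with the 20-30 tokens per second figure above.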
Speculative decoding flips this paradigm. Instead of generating one token per forward pass, it generates multiple tokens in parallel by speculating ahead and verifying in bulk. The result? 2-6x speedup with zero quality degradation—every output is mathematically identical to standard autoregressive generation.
The Core Insight: Verification Is Cheaper Than Generation
The key observation behind speculative decoding is counterintuitive: verifying multiple tokens in parallel is nearly as fast as generating a single token. When a large language model processes a sequence, computing the probability distribution for the next token at positions 1, 2, 3, 4, and 5 simultaneously costs roughly the same as computing just position 1.
This happens because transformer inference is dominated by the attention mechanism’s memory bandwidth requirements. Loading the KV cache and model weights dominates runtime, while the actual computation for multiple positions adds minimal overhead. A forward pass that scores 5 candidate tokens costs perhaps 1.1x the time of scoring 1 token—not 5x.
The speculative decoding pipeline exploits this asymmetry:
- Draft: A smaller, faster model generates $\gamma$ candidate tokens autoregressively
- Verify: The target model evaluates all $\gamma$ positions in a single forward pass
- Accept/Reject: Accept tokens from left to right until the first rejection
The beauty lies in the guarantee: if implemented correctly, the output distribution is identical to sampling directly from the target model. No approximation, no quality trade-off.
The Mathematics of Correctness: Rejection Sampling
The critical challenge is deciding which draft tokens to accept while preserving the target model’s distribution. Naively accepting tokens when both models agree would bias outputs toward the intersection of their distributions. Instead, speculative decoding uses a principled rejection sampling scheme.
The Acceptance Criterion
Let $p(x)$ denote the target model’s probability distribution over the next token, and $q(x)$ denote the draft model’s distribution. When the draft proposes token $x$, the acceptance probability is:
$$\alpha(x) = \min\left(1, \frac{p(x)}{q(x)}\right)$$

This formula has elegant intuition:
- When $p(x) > q(x)$: The target model favors this token more than the draft predicted. Accept with probability 1—the draft “underproposed” this token.
- When $p(x) < q(x)$: The draft overestimated this token’s probability. Accept with probability $\frac{p(x)}{q(x)}$ to correct for the draft’s enthusiasm.
Why This Preserves the Target Distribution
The acceptance criterion works because it captures exactly the probability mass where the two distributions overlap. When we accept with probability $\alpha(x)$, the probability of generating token $x$ through acceptance is:
$$P(\text{accept } x) = q(x) \cdot \alpha(x) = q(x) \cdot \min\left(1, \frac{p(x)}{q(x)}\right) = \min(p(x), q(x))$$

But this doesn’t equal $p(x)$ yet—we need to handle rejections.
The Residual Distribution
When a draft token is rejected, we don’t simply resample from the draft. Instead, we sample from a carefully constructed residual distribution that “fills in” the missing probability mass:
$$r(x) = \frac{\max(0, p(x) - q(x))}{\sum_{x'} \max(0, p(x') - q(x'))}$$

The residual captures exactly what rejection sampling missed: tokens where $p(x) > q(x)$. Geometrically, if you overlay the target distribution’s histogram on top of the draft distribution, the residual represents portions of the target that “stick out above” the draft.
The complete algorithm guarantees correctness. A token $x$ can be generated through two paths:
- Acceptance path: Draft $x$ with probability $q(x)$, accept with probability $\alpha(x)$
- Rejection path: Reject and sample from residual with probability $r(x)$
The combined probability equals $p(x)$ exactly—this is mathematically proven, not approximate.
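Summing the two paths makes the proof a two-line calculation. Let $\beta = \sum_{x'} \min(p(x'), q(x'))$ be the total acceptance probability, so a rejection occurs with probability $1 - \beta = \sum_{x'} \max(0, p(x') - q(x'))$, which is exactly the residual's normalizing constant. Then:

$$P(x) = \underbrace{\min(p(x), q(x))}_{\text{acceptance path}} + \underbrace{(1 - \beta)\, r(x)}_{\text{rejection path}} = \min(p(x), q(x)) + \max(0,\, p(x) - q(x)) = p(x)$$

The last step uses the identity $\min(a, b) + \max(0, a - b) = a$ for any reals $a, b$.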
```python
import numpy as np

def speculative_sample(p_target, q_draft, draft_token, rng=None):
    """
    p_target: target model probability distribution (vocab_size,)
    q_draft: draft model probability distribution (vocab_size,)
    draft_token: the token proposed by the draft model
    Returns (token, accepted).
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Acceptance probability: min(1, p/q)
    alpha = min(1.0, p_target[draft_token] / q_draft[draft_token])
    # Uniform random draw for the acceptance test
    if rng.random() < alpha:
        return draft_token, True
    # Rejected: sample from the normalized residual max(0, p - q)
    residual = np.maximum(0.0, p_target - q_draft)
    residual = residual / residual.sum()
    return rng.choice(len(residual), p=residual), False
```
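A quick Monte Carlo check of the guarantee on a toy three-token vocabulary: even with a badly mismatched draft, the empirical output frequencies should match $p$, not $q$. This sketch inlines the accept/reject step so it runs standalone:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # toy target distribution
q = np.array([0.2, 0.5, 0.3])   # toy (mismatched) draft distribution

trials = 100_000
counts = np.zeros(3)
for _ in range(trials):
    x = rng.choice(3, p=q)                      # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with min(1, p/q)
        counts[x] += 1
    else:                                       # reject: sample the residual
        r = np.maximum(0.0, p - q)
        counts[rng.choice(3, p=r / r.sum())] += 1

print(np.round(counts / trials, 2))  # close to [0.5, 0.3, 0.2]
```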
Extending to Multiple Tokens
In practice, the draft model proposes a sequence of $\gamma$ tokens. Verification proceeds left-to-right because language models produce conditional distributions: each token’s probability depends on all preceding tokens.
For position $i$ with draft token $x_i$:
$$\alpha(x_i) = \min\left(1, \frac{p(x_i \mid x_{<i})}{q(x_i \mid x_{<i})}\right)$$

where $x_{<i}$ denotes the accepted prefix followed by draft tokens $x_1, \ldots, x_{i-1}$.

If position $i$ rejects, we sample from the residual distribution and discard all subsequent draft tokens. This is essential: tokens at positions $i+1, i+2, \ldots$ were conditioned on $x_i$, so once we sample a different token, the entire subsequent sequence becomes invalid.

This creates a subtle efficiency consideration: even though all positions are verified in a single forward pass, we can only use tokens up to the first rejection. However, the algorithm guarantees at least one valid token per iteration—even if the very first draft token is rejected, we still get one token from the residual distribution.
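The per-position logic extends the single-token routine into a loop. In this sketch, `target_probs` and `draft_probs` are assumed to be lists of per-position distributions, both conditioned on the accepted prefix followed by the earlier draft tokens (the names are illustrative):

```python
import numpy as np

def verify_drafts(target_probs, draft_probs, draft_tokens, rng=None):
    """Accept draft tokens left to right; on the first rejection, sample from
    the residual distribution and discard the remaining drafts."""
    rng = rng if rng is not None else np.random.default_rng()
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                 # kept: distribution preserved
            continue
        residual = np.maximum(0.0, p - q)        # rejected: correct via residual
        accepted.append(int(rng.choice(len(p), p=residual / residual.sum())))
        break                                    # later drafts were conditioned on tok
    return accepted
```

A full implementation adds one detail: when all $\gamma$ drafts are accepted, a bonus token is sampled from the target's distribution at position $\gamma + 1$, which comes free from the same verification pass.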
Expected Speedup Analysis
The speedup from speculative decoding depends on three factors:
- Acceptance probability ($\alpha$): How often draft tokens match the target distribution
- Draft length ($\gamma$): How many tokens we speculate ahead
- Cost ratio ($c$): How much faster the draft model is than the target model
Token Acceptance Model
Let $\alpha$ denote the average acceptance probability. The expected number of tokens generated per iteration follows a truncated geometric distribution:
$$E[\text{tokens}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}$$

This counts the accepted draft tokens plus the one token each iteration is guaranteed to produce. When $\alpha$ is close to 1, we accept almost all drafts, generating nearly $\gamma + 1$ tokens per iteration. When $\alpha$ is low, most iterations produce just 1 token.

The Speedup Formula

Let $c$ be the cost ratio—the time for one target model forward pass divided by the time for one draft model forward pass. For a 70B target with a 7B draft, $c \approx 10$.

Without speculative decoding, generating $n$ tokens takes time $n \cdot t_{\text{target}}$.

With speculative decoding, each iteration costs $\gamma \cdot t_{\text{draft}} + t_{\text{target}}$ ($\gamma$ autoregressive draft steps plus one verification pass) and produces $E[\text{tokens}]$ tokens on average.

The speedup ratio:

$$S = \frac{E[\text{tokens}]}{1 + \frac{\gamma}{c}} = \frac{1 - \alpha^{\gamma + 1}}{(1 - \alpha)\left(1 + \frac{\gamma}{c}\right)}$$

When the draft model is very fast ($c \to \infty$), this simplifies to:

$$S_{\max} = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}$$

Optimal Draft Length
The optimal draft length $\gamma^*$ balances generating more candidates against the diminishing probability of accepting them all. Taking the derivative and setting to zero yields:
$$\gamma^* \approx \frac{-\ln(c-1)}{\ln(\alpha)}$$

For typical values ($\alpha = 0.7$, $c = 10$), the optimal draft length falls between 4 and 8 tokens. Longer drafts waste computation because the probability of accepting the entire draft decays geometrically with its length.
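Under the i.i.d.-acceptance model the optimum can also be read off numerically. This sketch evaluates expected tokens per iteration, $(1 - \alpha^{\gamma+1})/(1 - \alpha)$, against the per-iteration cost for $\alpha = 0.7$, $c = 10$:

```python
alpha, c = 0.7, 10.0

def speedup(gamma):
    # expected tokens per iteration (accepted drafts + the guaranteed extra token)
    e_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost = 1 + gamma / c        # one target pass plus gamma draft passes
    return e_tokens / cost

best = max(range(1, 11), key=speedup)
print(best, round(speedup(best), 2))  # 4 1.98
```

With these parameters the curve peaks around $\gamma = 4$ and falls off gently on either side, so the exact choice is forgiving.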
Evolution of Speculative Decoding Methods
The field has evolved rapidly since the seminal 2023 papers from Google Research and DeepMind. Modern methods fall into several categories, each with distinct trade-offs.
Vanilla Speculative Sampling
The original approach uses a smaller model from the same family as the target to serve as the draft. For a 70B target, you might use a 7B draft. The draft model runs autoregressively, generating $\gamma$ tokens, which the target then verifies.
Pros: Conceptually simple, works with any model pair
Cons: Requires maintaining a separate draft model; quality depends heavily on model similarity
Medusa: Multiple Decoding Heads
Medusa eliminates the separate draft model by attaching multiple prediction heads to the target model itself. The original LM head still predicts the next token; each added head looks one position further ahead—head 1 predicts the token after next, head 2 the one after that, and so on.

                     ┌─────────┐
                     │ LM Head │ → Token t+1
                     └─────────┘
                     ┌─────────┐
Target Model ──────► │ Head 1  │ → Token t+2
                     └─────────┘
                     ┌─────────┐
                     │ Head 2  │ → Token t+3
                     └─────────┘
                     ┌─────────┐
                     │ Head 3  │ → Token t+4
                     └─────────┘
The heads are trained with a combined loss that encourages accurate multi-token prediction. This approach requires no separate model but needs fine-tuning to add the heads.
Pros: No separate model, low overhead
Cons: Requires fine-tuning, lower acceptance rates than draft models
EAGLE: Feature-Level Autoregression
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) reuses the target model’s internal features rather than just the final logits. The key insight: the top-layer features before the LM head contain rich information about the next token’s distribution.
Instead of predicting tokens directly, EAGLE trains a lightweight draft model to predict features autoregressively. The target model’s LM head then converts these features to token predictions.
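A schematic of that feature-level loop, with illustrative shapes and a one-matrix "draft" standing in for EAGLE's actual trained predictor (everything here is an assumption for exposition, not EAGLE's architecture):

```python
import numpy as np

d, vocab = 8, 16
rng = np.random.default_rng(0)
lm_head = rng.normal(size=(vocab, d))    # the target's own LM head (frozen)
embed = rng.normal(size=(vocab, d))      # the target's token embeddings (frozen)
W = rng.normal(size=(d, 2 * d)) * 0.1    # lightweight draft: predicts the next feature

f = rng.normal(size=d)                   # target's top-layer feature at position t
draft_tokens = []
for _ in range(4):                       # draft 4 tokens autoregressively
    logits = lm_head @ f                 # target's head converts feature -> logits
    tok = int(np.argmax(logits))
    draft_tokens.append(tok)
    # autoregress in FEATURE space: next feature from (feature, token embedding)
    f = W @ np.concatenate([f, embed[tok]])
print(draft_tokens)
```

The point of the design: predicting the next *feature* is an easier regression target than predicting the next token's full distribution, and reusing the target's own LM head keeps the draft aligned with the target's vocabulary space.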
EAGLE-3 Improvements:
- Training-time testing: Simulates the multi-step drafting process during training, eliminating train-inference mismatch
- Multi-layer feature fusion: Instead of using only top-layer features, EAGLE-3 fuses low, middle, and high-level features from the target model
- Direct token prediction: Removes the feature prediction constraint, allowing the draft model to directly predict tokens
EAGLE-3 achieves up to 6.5x speedup—significantly outperforming earlier methods:
| Method | Vicuna 13B | LLaMA 3.1 8B | LLaMA 3.3 70B |
|---|---|---|---|
| Vanilla SpS | 1.92x | — | — |
| Medusa | 2.12x | — | — |
| EAGLE | 3.05x | 3.23x | 2.85x |
| EAGLE-2 | 4.22x | 3.23x | 2.85x |
| EAGLE-3 | 5.51x | 4.44x | 4.12x |
Self-Speculative Decoding: LayerSkip
What if you could use the same model as both draft and target? LayerSkip enables this through early exit and layer dropout during training.
The model is trained with layer dropout, teaching it to produce meaningful outputs even when skipping layers. During inference, you can exit early (e.g., after layer 20 of a 32-layer model) for drafting, then run the full model for verification.
Pros: No separate model, no extra memory
Cons: Requires special training, lower acceptance rates
N-Gram Speculative Decoding
The simplest approach requires no model at all. N-gram matching looks for repeated patterns in the generated text and uses those as drafts. If the model just generated “the quick brown fox”, and “the quick brown fox jumps” appeared earlier in the conversation, it proposes “jumps” as the next draft.
```python
def ngram_speculate(generated_tokens, n=4):
    """Propose the token that followed the current n-gram earlier in the text."""
    # Build an n-gram -> next-token index from the generated text
    ngram_index = {}
    for i in range(len(generated_tokens) - n):
        key = tuple(generated_tokens[i:i + n])
        if key not in ngram_index:
            ngram_index[key] = generated_tokens[i + n]  # token that followed this n-gram
    # Look up the n-gram ending at the current position
    current_key = tuple(generated_tokens[-n:])
    return ngram_index.get(current_key)
```
Pros: Zero memory overhead, works with any model
Cons: Only helps with repetitive text, low acceptance rates for creative content
Tree-Based Verification
A single draft sequence has limited parallelism—rejection at position 2 discards positions 3, 4, 5, etc. Tree-based methods generate multiple candidate continuations and verify them all in parallel.
Consider this draft tree:
            [the]
           /     \
       [cat]     [dog]
       /   \     /   \
   [sat] [ran] [ate] [bark]
Instead of one sequence, we have 4 possible paths. The target model verifies all positions simultaneously using tree attention, which masks attention to ensure each position only attends to its ancestors in the tree.
If “[the] → [cat] → [sat]” is accepted, we gain 3 tokens. If “[cat]” is rejected, we can still accept “[the] → [dog]” and its children.
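Tree attention reduces to an ancestor mask: position $j$ may attend to position $i$ only if $i$ lies on $j$'s path to the root. A sketch of building that mask from parent pointers for the tree above:

```python
import numpy as np

# Nodes: 0=[the], 1=[cat], 2=[dog], 3=[sat], 4=[ran], 5=[ate], 6=[bark]
parent = [-1, 0, 0, 1, 1, 2, 2]

n = len(parent)
mask = np.zeros((n, n), dtype=bool)
for j in range(n):
    i = j
    while i != -1:          # walk up to the root, unmasking each ancestor
        mask[j, i] = True
        i = parent[i]

# [sat] (node 3) attends to itself, [cat], and [the] -- never to [dog]'s branch
print(mask[3].astype(int))  # [1 1 0 1 0 0 0]
```

The resulting boolean matrix is applied as an additive attention mask, so one forward pass scores every root-to-leaf path simultaneously.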
EAGLE-2’s Dynamic Trees: Rather than using a fixed tree structure, EAGLE-2 estimates acceptance probabilities using the draft model’s confidence and dynamically constructs the tree to maximize expected accepted tokens.
When Speculative Decoding Works (And When It Doesn’t)
Ideal Scenarios
Speculative decoding shines in these conditions:
- Low to medium QPS: When the system isn’t saturated with concurrent requests, there’s compute headroom for speculation
- Memory-bandwidth bound: The target model is limited by memory bandwidth, not compute
- High acceptance rates: The draft model closely matches the target model’s distribution
| Scenario | Typical Speedup |
|---|---|
| Chat completion (low QPS) | 2.5-3.5x |
| Code generation | 3-6x |
| Summarization | 2-3x |
| Reasoning tasks | 2-4x |
The High-QPS Trap
At high queries per second (QPS), speculative decoding can actually reduce throughput. The reason: when the GPU is fully utilized with concurrent requests, adding draft model computation increases total work without increasing parallelism.
vLLM’s own benchmarks show this clearly:
| QPS | Without Spec Dec | With Spec Dec | Change |
|---|---|---|---|
| Low | 25 tok/s | 45 tok/s | +80% |
| Medium | 80 tok/s | 110 tok/s | +37% |
| High | 150 tok/s | 105 tok/s | -30% |
The crossover point depends on hardware, model size, and batch configuration. As a rule of thumb, speculative decoding helps when GPU utilization is below 70-80%.
Acceptance Rate Thresholds
The acceptance rate cliff is real: speculative decoding only helps when acceptance rates exceed ~60%. Below this threshold, the overhead of running the draft model and verifying frequent rejections exceeds the benefit.
| Acceptance Rate | Speedup |
|---|---|
| 90% | 3.5x |
| 80% | 2.5x |
| 70% | 1.8x |
| 60% | 1.3x |
| 50% | 0.9x (slower!) |
Factors affecting acceptance rates:
- Model family similarity: Same-family draft models (Llama 7B for Llama 70B) outperform cross-family pairs
- Temperature: Lower temperatures increase acceptance rates (more deterministic sampling)
- Task difficulty: Code and structured text have higher rates than creative writing
Production Deployment
vLLM Integration
vLLM supports multiple speculative decoding strategies out of the box:
```shell
# Using a draft model
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --speculative-model meta-llama/Llama-3.2-1B-Instruct \
  --num-speculative-tokens 5

# Using EAGLE-3 (the EAGLE checkpoint is passed as the speculative model)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --speculative-model yuhuili/EAGLE3-LLaMA3.3-Instruct-70B \
  --num-speculative-tokens 5

# N-gram speculation (no extra model)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --speculative-model [ngram] \
  --ngram-prompt-lookup-max 4
```

Exact flag names change across vLLM versions (newer releases configure speculation through `--speculative-config`), so check `vllm serve --help` for your installed version.
Memory Considerations
Speculative decoding adds memory overhead:
| Component | Memory Impact |
|---|---|
| Draft model weights | +2-14GB (a 1B-7B draft at FP16; less when quantized) |
| Draft KV cache | +0.5-2GB |
| Verification buffer | +0.5GB |
| Total overhead | ~3-16GB additional VRAM |
For memory-constrained deployments:
- Use quantized draft models (FP8/INT4)
- Consider self-speculative methods (LayerSkip)
- Use N-gram speculation when repetition is expected
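For the draft KV cache line above, a quick sizing sketch. The layer count and head dimensions are assumptions loosely modeled on a ~1B-parameter Llama-style draft, not exact specs for any released model:

```python
layers, kv_heads, head_dim = 16, 8, 64   # assumed ~1B draft config
bytes_per_el = 2                          # FP16
seq_len, batch = 4096, 8

per_token = 2 * layers * kv_heads * head_dim * bytes_per_el   # K and V planes
total_gb = per_token * seq_len * batch / 1e9
print(per_token // 1024, "KB/token ->", round(total_gb, 2), "GB")
```

About 32 KB per token lands the total around 1 GB for this configuration, squarely inside the 0.5-2GB range in the table.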
Hardware Recommendations
| Hardware | Recommendation |
|---|---|
| H100/H200 | Full speculative decoding with EAGLE-3 |
| A100 80GB | Speculative decoding with smaller drafts |
| A100 40GB | Quantized drafts or self-speculative |
| Consumer GPUs (24GB) | N-gram or self-speculative only |
The Future: Adaptive and Hybrid Approaches
The frontier of speculative decoding research focuses on:
- Adaptive speculation: Dynamically adjust draft length and model based on context and acceptance rates
- Hybrid methods: Combine multiple speculation strategies (e.g., EAGLE + N-gram)
- SpecOffload: Use speculative decoding during I/O waits in CPU offloading scenarios
- Mixture-of-Experts speculation: Specialized draft models for different task types
The Nightjar system (2025) introduces dynamic adaptive speculative decoding that automatically:
- Senses when speculation helps vs. hurts
- Adjusts draft length in real-time
- Falls back to standard generation when acceptance rates drop
```python
import numpy as np

class AdaptiveSpeculator:
    def __init__(self, target_model, draft_models):
        self.target = target_model
        self.drafts = draft_models          # e.g. {'small': ..., 'large': ...}
        self.acceptance_history = []        # per-token accept/reject outcomes (0 or 1)

    def should_speculate(self):
        if len(self.acceptance_history) < 100:
            return True                     # too little data; default to speculating
        recent_acceptance = np.mean(self.acceptance_history[-100:])
        return recent_acceptance > 0.55     # below this, overhead outweighs gains

    def select_draft_model(self):
        # Prefer the larger, more accurate draft when recent acceptance is high
        if np.mean(self.acceptance_history[-50:] or [0.0]) > 0.8:
            return self.drafts['large']     # more capable but slower
        return self.drafts['small']         # faster fallback
```
Key Takeaways
Speculative decoding represents one of the most impactful inference optimization techniques for LLMs. The key insights:
- Mathematically exact: The rejection sampling scheme guarantees identical output distribution to standard generation
- 2-6x speedup: Real-world deployments consistently achieve significant improvements
- Context matters: Works best for single-request latency optimization, not high-throughput batch inference
- Draft quality is critical: Acceptance rates above 60% are essential; model family matching matters
- Memory trade-off: Requires additional VRAM for draft model and verification buffers
The evolution from vanilla speculative sampling to EAGLE-3 demonstrates how architectural innovations—feature reuse, training-time testing, multi-layer fusion—can compound to deliver practical 5x+ speedups. As LLMs scale to hundreds of billions of parameters, speculative decoding isn’t just an optimization—it’s increasingly a necessity for viable deployment.