For years, the next-token prediction (NTP) paradigm has been the unquestioned foundation of large language model training. Given a sequence of tokens $x_{1:t}$, the model learns to maximize $P(x_{t+1} | x_{1:t})$. Simple, elegant, and remarkably effective—until you realize the fundamental inefficiency baked into this approach.

The problem is that transformers spend the same computational budget predicting filler words (“the”, “and”, “is”) as they do on information-carrying tokens (“quantum”, “entanglement”, “superposition”). Research from Apple and EPFL reveals that over 50% of English text consists of function words—linguistic glue that carries minimal semantic weight. Yet models trained on NTP treat every token with equal reverence, creating a massive computational inefficiency.

More damning evidence emerged from the “transplantation” experiments by Pal et al. (2023). When researchers extracted hidden states from a model processing “Madison Square Garden is located in…”—just before it predicted “New”—and injected these states into a completely unrelated context, the model spontaneously generated “Tell me something about New York City.” This showed that transformers encode future trajectories long before generating them. The information is already there; NTP simply fails to exploit it.

This revelation sparked a research renaissance. What if we explicitly trained models to predict multiple future tokens simultaneously? The resulting paradigm—Multi-Token Prediction (MTP)—is now deployed in DeepSeek-V3, validated by Apple’s research, and expanding into five distinct families of alternatives that challenge the NTP orthodoxy.

The Core Architecture: From One Head to Many

Meta FAIR’s seminal 2024 paper introduced the canonical MTP architecture. Instead of a single output head predicting $x_{t+1}$, the model employs $n$ independent heads, each responsible for a position-specific future token:

$$\mathcal{L}_{MTP} = -\frac{1}{n} \sum_{i=1}^{n} \log P(x_{t+i} | x_{1:t})$$

The architecture divides into two components:

Shared Trunk: A standard transformer backbone processes the input context $x_{1:t}$ into a dense representation $z_t$. This representation must encode information sufficient for all $n$ future predictions.

Independent Heads: Each head $h_i$ receives $z_t$ and predicts token $x_{t+i}$. The first head predicts $x_{t+1}$, the second predicts $x_{t+2}$, and so on.

# Conceptual MTP forward pass
def mtp_forward(input_ids, shared_trunk, heads, n=4):
    # Shared computation
    hidden_states = shared_trunk(input_ids)  # [batch, seq_len, hidden_dim]
    z_t = hidden_states[:, -1, :]  # Final hidden state
    
    # Multi-token prediction
    logits_per_head = []
    for i in range(n):
        head_logits = heads[i](z_t)  # Each head: [batch, vocab_size]
        logits_per_head.append(head_logits)
    
    return logits_per_head
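The per-head logits from a forward pass like the one above feed directly into the averaged loss $\mathcal{L}_{MTP}$. A minimal sketch of that loss computation (the function name `mtp_loss` is illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def mtp_loss(logits_per_head, future_tokens):
    """Average negative log-likelihood over the n heads, mirroring the
    L_MTP formula: logits_per_head is a list of [batch, vocab] tensors,
    future_tokens is [batch, n] holding x_{t+1}..x_{t+n}."""
    losses = [
        F.cross_entropy(logits, future_tokens[:, i])
        for i, logits in enumerate(logits_per_head)
    ]
    # Mean over heads implements the 1/n factor in the formula
    return torch.stack(losses).mean()
```

Each head contributes an ordinary cross-entropy term; the only difference from NTP is that $n$ of them are averaged per position.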

The Memory Challenge and Sequential Backpropagation

The naive implementation hits an immediate wall: memory. With vocabulary sizes of 32K-256K tokens, computing logits for $n$ heads simultaneously requires $O(n \cdot V)$ memory—prohibitive for training at scale.

Meta’s solution is elegant: sequential forward/backward passes. The shared trunk computes $z_t$ once, then each head activates sequentially:

  1. Compute logits for head 1, calculate loss, backpropagate gradients
  2. Immediately discard head 1’s logits from memory
  3. Repeat for heads 2, 3, …, n

Peak memory remains $O(V)$ instead of $O(n \cdot V)$, enabling MTP training at batch sizes comparable to standard NTP.
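The sequential scheme can be sketched as follows, assuming a PyTorch-style trunk and heads (all names are illustrative). The key detail is `retain_graph`, which keeps the trunk's autograd graph alive between head backward passes while each head's logits are freed as soon as its backward completes:

```python
import torch
import torch.nn.functional as F

def sequential_mtp_step(trunk, heads, input_ids, future_tokens):
    """Sequential forward/backward sketch: the trunk runs once, then each
    head's logits exist only for the duration of its own backward pass,
    keeping peak logit memory at O(V) instead of O(n * V)."""
    z_t = trunk(input_ids)[:, -1, :]               # [batch, hidden], shared
    mean_loss = 0.0
    for i, head in enumerate(heads):
        logits = head(z_t)                          # [batch, vocab], head i only
        loss = F.cross_entropy(logits, future_tokens[:, i]) / len(heads)
        # retain_graph keeps the trunk graph alive for the remaining heads;
        # this head's logits are released once backward() returns and the
        # local variable is overwritten on the next iteration.
        loss.backward(retain_graph=(i < len(heads) - 1))
        mean_loss += loss.item()
    return mean_loss
```

Gradients from every head accumulate into the shared trunk exactly as they would under a joint loss; only the peak activation footprint changes.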

Parallel vs. Causal Head Topology

A critical design decision: how should heads relate to each other?

Parallel Heads: Each head predicts independently from $z_t$, without seeing other heads’ outputs. The trunk must encode a globally useful representation.

Causal Heads: Head 2 receives head 1’s output as additional input, creating a “mini-autoregressive” chain.

Counter-intuitively, parallel heads outperform causal heads. The researchers hypothesize that causal topology allows the shared trunk to “get lazy”—delegating sequential reasoning to the heads. Forcing independent prediction compels the trunk to develop planning capabilities—exactly the property that benefits reasoning tasks.
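The causal topology can be sketched as a small feed-forward chain, assuming an illustrative `reembed` module that maps a predicted token id back into the hidden space (the parallel variant would simply call every head on $z_t$ directly):

```python
import torch
import torch.nn as nn

def causal_heads_forward(z_t, heads, reembed):
    """Causal-topology sketch: head i+1 sees the trunk state plus a
    re-embedding of head i's greedy prediction. `reembed` and the additive
    combination are illustrative assumptions, not the paper's exact design."""
    logits_per_head = []
    carry = z_t                                    # [batch, hidden]
    for head in heads:
        logits = head(carry)                       # [batch, vocab]
        logits_per_head.append(logits)
        carry = z_t + reembed(logits.argmax(-1))   # fold prediction into next input
    return logits_per_head
```

Because each head conditions on the previous head's output, the trunk can offload sequential work onto this chain, which is precisely the "laziness" the parallel topology avoids.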

The Scaling Law of Foresight

MTP’s benefits scale dramatically with model size. For models under 1.3B parameters, MTP offers negligible gains—sometimes regression. But as scale increases, the advantage becomes stark:

| Benchmark | NTP (13B) | MTP (13B) | Improvement |
|-----------|-----------|-----------|---------------|
| HumanEval | 26.0%     | 29.1%     | +12% relative |
| MBPP      | 26.0%     | 30.5%     | +17% relative |

The hypothesis: larger models can allocate more capacity to future planning, while smaller models struggle to balance immediate prediction with long-horizon lookahead.

DeepSeek-V3: Sequential MTP in Production

DeepSeek took MTP from research to production. Their technical report reveals a modified architecture: sequential MTP modules instead of parallel heads.

Each MTP module $k$ combines:

  • The model’s hidden state at depth $k-1$
  • The embedding of the future token
  • A projection matrix $M_k$ specific to depth $k$

The key formula:

$$h_t^{(k)} = \text{Transformer}_k(M_k \cdot [\text{RMSNorm}(h_t^{(k-1)}) \| \text{RMSNorm}(E(x_{t+k}))])$$

Where $E(x_{t+k})$ is the embedding of the $(t+k)$-th token, and $\|$ denotes concatenation.
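A self-contained sketch of one such module, following the normalize-concatenate-project-transform recipe in the formula (layer sizes, the minimal `RMSNorm`, and the choice of transformer block are illustrative, not DeepSeek's actual configuration):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm so the sketch stays self-contained."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class MTPModule(nn.Module):
    """One depth-k MTP module: RMSNorm both inputs, concatenate, project
    with M_k, then run a transformer block (Transformer_k in the formula)."""
    def __init__(self, hidden_dim, n_heads=4):
        super().__init__()
        self.norm_h = RMSNorm(hidden_dim)
        self.norm_e = RMSNorm(hidden_dim)
        self.proj = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)  # M_k
        self.block = nn.TransformerEncoderLayer(
            hidden_dim, nhead=n_heads, batch_first=True
        )

    def forward(self, h_prev, future_emb):
        # [RMSNorm(h^{(k-1)}) || RMSNorm(E(x_{t+k}))] -> M_k -> Transformer_k
        merged = torch.cat([self.norm_h(h_prev), self.norm_e(future_emb)], dim=-1)
        return self.block(self.proj(merged))
```

In the full system, modules at successive depths are chained, sharing the main model's embedding table and output head; the sketch above isolates a single depth.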

DeepSeek-V3 uses a single MTP module ($D=1$), predicting one additional token beyond the main model. During inference, this module can be discarded for standard operation, or retained for speculative decoding—achieving up to 1.8x speedup with zero accuracy loss.

The Inference Revolution: Self-Speculative Decoding

MTP’s most compelling benefit is the self-speculative decoding capability. Traditional speculative decoding requires a separate draft model—a smaller, faster LLM that proposes candidate tokens. The main model then verifies these proposals in a single forward pass.

MTP eliminates the need for a separate draft model. The prediction heads are the drafters. During inference:

  1. Heads 1-4 generate candidate tokens $x_{t+1}, x_{t+2}, x_{t+3}, x_{t+4}$
  2. The main model verifies all candidates simultaneously
  3. Accept valid tokens, reject and resample from the first mismatch

Since the heads share the trunk, their predictions are highly aligned with the main model. Results: 3x inference speedup with zero quality degradation.
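The accept/reject step can be sketched with greedy verification (a simplification: production systems typically use probabilistic acceptance rather than strict argmax matching, and all names here are illustrative):

```python
import torch

def verify_and_accept(draft_tokens, main_logits):
    """Greedy acceptance sketch: keep each drafted token while the main
    model's argmax agrees; at the first mismatch, substitute the main
    model's own token and stop.

    draft_tokens: [n] candidate tokens from the MTP heads
    main_logits:  [n, vocab] main-model logits at those positions"""
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        main_choice = main_logits[i].argmax().item()
        if main_choice == tok:
            accepted.append(tok)          # draft verified, keep it
        else:
            accepted.append(main_choice)  # resample from the main model
            break
    return accepted
```

Every accepted draft token saves one full forward pass; even a partial acceptance rate translates directly into wall-clock speedup.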

Apple’s Self-Distillation: Retrofitting Any LLM

Apple’s February 2026 paper addresses a practical limitation: what about existing pretrained models? Training from scratch with MTP is expensive.

Their solution: online self-distillation that converts any pretrained LLM into a multi-token predictor. The key insight is treating speed as a learning problem, not an architecture problem.

# Conceptual self-distillation training
import torch

def self_distill_step(model, batch, n_tokens=4):
    # Student generates n_tokens simultaneously
    student_logits = model.multi_token_forward(batch, n_tokens)

    # Teacher (the frozen original weights) generates tokens autoregressively
    with torch.no_grad():
        teacher_logits = []
        context = batch
        for i in range(n_tokens):
            next_logits = model.single_token_forward(context)  # [batch, vocab]
            teacher_logits.append(next_logits)
            # Append the greedy token; keepdim preserves the [batch, 1] shape
            context = torch.cat(
                [context, next_logits.argmax(-1, keepdim=True)], dim=-1
            )

    # Distillation loss between student and teacher distributions
    loss = distillation_loss(student_logits, teacher_logits)
    return loss

Apple reports 5x speedup on code/math tasks and 2.5x on general chat with under a 5% accuracy drop. Crucially, the converted model is a drop-in replacement for the original: no auxiliary verifier or specialized inference code is required.

Future Summary Prediction: Beyond Token-Level Prediction

MTP has a limitation: even with several heads, predicting individual tokens still captures only short-range dependencies. A 2025 paper from researchers at multiple institutions proposes Future Summary Prediction (FSP)—predicting a compact representation of the long-term future rather than specific tokens.

Two variants:

Handcrafted Summaries: Bag-of-words representation of future text. The model predicts which words will appear in the next $k$ tokens, ignoring order.

Learned Summaries: A reverse language model (trained right-to-left) produces embeddings of future context. The main model predicts these embeddings.

FSP targets the information content rather than surface form. On 3B and 8B parameter models, FSP outperforms both NTP and MTP on math, reasoning, and coding benchmarks—suggesting that capturing semantic essence matters more than predicting exact token sequences.
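The handcrafted bag-of-words variant can be sketched as a target-construction routine (the function name and exact construction are illustrative; the paper's details may differ):

```python
import torch

def bow_summary_target(token_ids, vocab_size, k):
    """Handcrafted FSP target sketch: for each position t, a multi-hot
    vector marking which vocabulary items occur among the next k tokens,
    with ordering discarded."""
    batch, seq_len = token_ids.shape
    target = torch.zeros(batch, seq_len, vocab_size)
    for t in range(seq_len):
        future = token_ids[:, t + 1 : t + 1 + k]    # up to k future tokens
        if future.numel() > 0:
            target[:, t].scatter_(-1, future, 1.0)  # mark presence, not order
    return target
```

A sigmoid output head trained with binary cross-entropy against such targets would then predict *which* words are coming rather than *when*, which is exactly the order-free compression FSP exploits.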

The Five Families of NTP Alternatives

A comprehensive survey (arXiv 2509.24435) categorizes alternatives to NTP into five families:

| Family | Core Idea | Representative Work |
|--------|-----------|---------------------|
| Multi-Token Prediction | Predict $n$ future tokens simultaneously | Meta MTP, DeepSeek-V3 |
| Plan-then-Generate | Create a high-level plan before decoding | Skeleton-of-Thought |
| Latent Reasoning | Shift autoregression to a continuous latent space | LaDiR, Coconut |
| Continuous Generation | Iterative refinement via diffusion | Diffusion-LM |
| Non-Transformer Architectures | Inherently non-autoregressive structures | Mamba, RWKV |

LaDiR: Latent Diffusion for Reasoning

LaDiR (Latent Diffusion Reasoner) exemplifies the latent reasoning family. A VAE encodes reasoning steps into “thought tokens”—compact latent representations. A diffusion model then denoises these representations with bidirectional attention, enabling holistic revision of the reasoning process.

Unlike autoregressive CoT, which cannot revise earlier steps, LaDiR can explore multiple reasoning trajectories in parallel and select the best—trading compute for reasoning quality at inference time.

The Trade-offs: When MTP Falls Short

MTP isn’t universally superior. Critical limitations:

Knowledge Regression: On fact-retrieval benchmarks (MMLU, TriviaQA, ARC), MTP underperforms NTP. The hypothesis: spreading the training signal across several future tokens dilutes the gradient on the single critical token; when the answer hinges on one word like “Paris,” NTP’s concentrated objective wins. For RAG systems and trivia bots, NTP may be preferable.

The “Goldilocks” Sensitivity: Performance is highly sensitive to $n$:

  • $n=2$: Negligible gain, insufficient lookahead incentive
  • $n=4$: Optimal performance across benchmarks
  • $n=8$: Rapid degradation as hidden states become overcrowded

Training Overhead: Despite the memory optimization, the head-by-head scheme still runs a forward/backward pass through each of the $n$ heads per batch, increasing training time by approximately 20-30%.

The Production Verdict

DeepSeek-V3’s deployment proves MTP is production-ready. The recipe:

  1. Train with MTP (n=4 for code-heavy models, n=2 for general-purpose)
  2. Keep MTP heads at inference for speculative decoding
  3. Monitor knowledge benchmarks—consider hybrid approaches for fact-heavy applications
  4. Tune n for your data distribution: longer reasoning chains benefit from larger n

The paradigm shift is clear: next-token prediction optimizes for fluency, but multi-token prediction optimizes for planning. As LLMs tackle increasingly complex reasoning tasks, the ability to “think ahead”—encoded in the training objective itself—may prove more valuable than raw prediction accuracy.