Every transformer you’ve ever used treats every token with the same computational respect. Whether processing “the” or untangling complex mathematical reasoning, the model devotes identical FLOPs to each position in the sequence. This uniform allocation isn’t a design choice—it’s a constraint baked into the transformer architecture from its inception.
In April 2024, researchers from Google DeepMind, McGill University, and Mila demonstrated that this constraint is not only unnecessary but actively wasteful. Their proposed Mixture-of-Depths (MoD) framework reveals a startling truth: transformers can learn to dynamically allocate compute across tokens, achieving 50% faster inference with equivalent performance.
The Uniform Compute Problem
To understand why MoD matters, consider what happens during a typical transformer forward pass. For a sequence of length $n$ with model dimension $d$ and $L$ layers, the computational cost scales as:
$$\text{FLOPs} = O(L \cdot n^2 \cdot d)$$

Every token participates in every layer’s self-attention and MLP computations. This uniformity creates a fundamental inefficiency: simple tokens that could be processed with minimal computation receive the same treatment as complex tokens requiring deep reasoning.
The analogy is compelling: imagine a teacher giving every student the same amount of individual attention regardless of whether they’re struggling with basic concepts or tackling advanced problems. The system works, but it’s far from optimal.
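To make the scaling above concrete, here is a back-of-the-envelope sketch (illustrative numbers, not figures from the MoD paper) of how the quadratic attention term responds when fewer tokens participate in a layer:

```python
# Rough per-model cost of the dominant attention term, O(L * n^2 * d).
# This is an illustration of the scaling law, not an exact FLOP count.
def attention_flops(n, d, L):
    # Attention score computation (n^2 * d) dominates for long sequences
    return L * n * n * d

full = attention_flops(n=4096, d=1024, L=24)
half = attention_flops(n=4096 // 2, d=1024, L=24)  # if only half the tokens took part
print(full / half)  # the quadratic term means halving tokens cuts this cost 4x
```

The quadratic dependence on $n$ is precisely why skipping tokens at a layer saves more than a proportional share of compute.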
Mixture-of-Depths: The Core Innovation
MoD introduces a radical departure from this uniformity. Instead of forcing every token through every layer, the model learns to route tokens dynamically. Some tokens pass through all layers; others skip certain layers entirely.
The key insight is that not all tokens require equal computational depth. In a language model processing a document, tokens like “the”, “a”, or common punctuation carry minimal semantic weight. They don’t need the same computational scrutiny as tokens involved in coreference resolution, logical inference, or domain-specific reasoning.
The Routing Mechanism
At each MoD-enabled layer, a router assigns a scalar weight to each token, indicating its “importance” for that layer:
$$r_t = \sigma(W_r \cdot h_t + b_r)$$

where $h_t$ is the hidden state for token $t$, $W_r$ is a learned projection, and $\sigma$ is the sigmoid function producing a score in $(0, 1)$.
The router then selects the top-$k$ tokens to participate in the layer’s computation:
$$\mathcal{S} = \text{top-k}(\{r_1, r_2, \ldots, r_n\})$$

Only tokens in set $\mathcal{S}$ pass through the self-attention and MLP operations. The remaining tokens bypass these computations via a residual connection, effectively skipping the layer.
Static Graph, Dynamic Behavior
A critical design choice distinguishes MoD from other conditional computation approaches. The capacity $k$ is fixed a priori, meaning:
- Tensor sizes remain predictable during compilation
- No dynamic graph construction overhead
- Compatible with existing hardware accelerators
This constraint ensures deployment practicality while still enabling context-sensitive compute allocation. The identities of the $k$ selected tokens are dynamic, but the number of tokens processed is static.
```python
import torch
import torch.nn as nn

# Simplified MoD layer concept
class MoDLayer(nn.Module):
    def __init__(self, dim, capacity_factor=0.5):
        super().__init__()
        self.router = nn.Linear(dim, 1)
        self.capacity = capacity_factor  # fraction of tokens to process
        # self.attention_and_mlp: the layer's usual transformer block
        # (self-attention + MLP), omitted here for brevity

    def forward(self, x):
        # x: [batch, seq_len, dim]
        batch_size, seq_len, dim = x.shape
        k = int(seq_len * self.capacity)

        # Compute routing scores
        scores = torch.sigmoid(self.router(x)).squeeze(-1)  # [batch, seq_len]

        # Select top-k tokens
        topk_values, topk_indices = torch.topk(scores, k, dim=-1)

        # Process only selected tokens
        selected_tokens = torch.gather(
            x, 1, topk_indices.unsqueeze(-1).expand(-1, -1, dim)
        )
        processed = self.attention_and_mlp(selected_tokens)

        # Scatter back; unselected tokens keep their input values (residual path)
        output = x.clone()
        output.scatter_(1, topk_indices.unsqueeze(-1).expand(-1, -1, dim), processed)
        return output
```
MoD vs. MoE: Understanding the Difference
Mixture-of-Depths is often confused with Mixture-of-Experts (MoE), but they address fundamentally different optimization axes:
| Aspect | Mixture-of-Experts | Mixture-of-Depths |
|---|---|---|
| Optimization Target | Parameter efficiency | Compute efficiency |
| Routing Granularity | Expert selection per token | Layer participation per token |
| Compute Budget | Variable (depends on activated experts) | Fixed (determined by capacity $k$) |
| Parameter Count | Increases (multiple experts) | Unchanged |
| Primary Benefit | Larger effective models | Faster inference |
MoE increases model capacity by adding experts, then activates a subset per token. MoD keeps the model identical but skips layers for certain tokens. They’re orthogonal techniques—MoD can be combined with MoE for compounded benefits.
The mathematical distinction is clear. In MoE, each token $t$ routes to top-$k$ experts from a pool of $E$ experts:
$$\text{MoE}(h_t) = \sum_{i \in \text{top-k}} g_i(h_t) \cdot E_i(h_t)$$

where $g_i$ is the gating function and $E_i$ is expert $i$. The compute varies based on which experts are selected.
In MoD, the question is binary per layer: does token $t$ participate in this layer at all? The total compute is predictable:
$$\text{FLOPs}_{\text{MoD}} = \sum_{l=1}^{L} k_l \cdot O(d^2 + n \cdot d)$$

where $k_l$ is the fixed capacity at layer $l$.
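The fixed-budget property can be sketched in a few lines. This is an illustrative approximation of the formula above (per-token layer cost modeled as $d^2 + n \cdot d$), with made-up model dimensions, not measurements from the paper:

```python
# MoD's total compute is known before any input arrives, because each
# per-layer capacity k_l is a constant chosen at design time.
def mod_flops(capacities, n, d):
    # Per selected token, approximate layer cost: MLP ~ d^2, attention ~ n*d
    return sum(k * (d**2 + n * d) for k in capacities)

n, d, L = 1024, 512, 12
dense = mod_flops([n] * L, n, d)        # every token in every layer
mod = mod_flops([n // 2] * L, n, d)     # capacity 0.5 at every layer
print(mod / dense)  # 0.5, regardless of what the input contains
```

Contrast this with MoE, where the realized compute depends on which experts the gating function activates at runtime.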
Performance Benchmarks: The Numbers
The empirical results from the original MoD paper are compelling:
Training Efficiency
- Models match baseline performance at iso-FLOP conditions
- Up to 50% reduction in FLOPs per forward pass
- Wall-clock training time remains competitive
Inference Speedup
- Up to 60% faster autoregressive sampling
- Consistent speedup across varying sequence lengths
- No quality degradation in generated outputs
Scaling Behavior
- Benefits increase with model depth
- Deeper models have more routing opportunities
- Capacity factor 0.5 (processing half the tokens) often achieves optimal trade-offs
The perplexity curves tell the story: MoD models trained with 50% capacity achieve nearly identical perplexity to full-capacity baselines, but each forward pass costs roughly half the compute.
Recent Developments: Attention-Based Routing
A December 2024 paper introduced A-MoD (Attention-based MoD), eliminating the need for a separate router network. Instead of learning routing weights from scratch, A-MoD derives token importance directly from attention maps:
$$r_t = \frac{1}{n} \sum_{j=1}^{n} A_{j,t}^{(l-1)}$$

where $A^{(l-1)}$ is the attention matrix from the previous layer, so $r_t$ averages the attention token $t$ receives across all queries. Tokens that receive more attention are deemed more important and selected for processing.
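A minimal sketch of this idea, assuming attention maps from the previous layer with shape `[batch, heads, queries, keys]` and softmax-normalized rows (shapes and the head-averaging choice are assumptions for illustration, not the paper's exact recipe):

```python
import torch

def attention_importance(attn, k):
    """Score each token by the attention it receives, then pick the top-k.

    attn: [batch, heads, n, n] attention maps from the previous layer.
    """
    # Column means: average over heads, then over queries, giving the mean
    # attention each key/token receives.
    scores = attn.mean(dim=1).mean(dim=1)           # [batch, n]
    topk_indices = torch.topk(scores, k, dim=-1).indices
    return scores, topk_indices

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)
scores, idx = attention_importance(attn, k=8)  # 8 most-attended tokens per sequence
```

No parameters are introduced: the scores are a pure function of attention maps the model already computes, which is what enables direct adaptation from pretrained transformers.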
This approach offers several advantages:
- Zero additional parameters for routing
- Faster convergence during transfer learning (up to 2×)
- Better performance on vision tasks (2% higher ImageNet accuracy)
- Direct adaptation from pretrained transformers
The insight is elegant: attention patterns already encode token importance. A token that other tokens attend to heavily is likely central to the sequence’s semantics.
Multimodal Applications: p-MoD and γ-MoD
The MoD principle extends naturally to multimodal large language models (MLLMs). Two recent variants demonstrate this:
p-MoD (Progressive MoD) applies a decaying capacity ratio across layers. Early layers use higher capacity (processing more tokens), while deeper layers become increasingly selective:
$$k_l = k_0 \cdot \gamma^l$$

where $\gamma < 1$ is the decay factor. This reflects the intuition that early layers extract general features requiring broad token coverage, while deeper layers perform specialized reasoning on key tokens.
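A p-MoD-style schedule is easy to sketch; the specific $k_0$ and $\gamma$ values below are illustrative choices, not numbers from the paper:

```python
# Decaying capacity schedule, k_l = k_0 * gamma^l, floored at one token
# so every layer processes at least something.
def capacity_schedule(k0, gamma, num_layers):
    return [max(1, int(k0 * gamma**l)) for l in range(num_layers)]

sched = capacity_schedule(k0=1024, gamma=0.9, num_layers=12)
print(sched[0], sched[-1])  # broad coverage early, increasingly selective late
```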
γ-MoD adapts MoD for vision-language models by implementing a shared router across modalities. The routing decisions jointly consider visual and textual tokens, enabling cross-modal importance learning.
Results on MLLM benchmarks show:
- 40-60% training FLOPs reduction
- Minimal accuracy drop on VQA and captioning tasks
- Faster inference without batch size reduction
The Mathematics of Token Importance
Why does dynamic routing work? The theoretical foundation lies in the observation that transformer representations exhibit varying information density across tokens.
In a transformer layer, the hidden state $h_t$ encodes contextual information. Not all tokens contribute equally to the loss function during training:
$$\mathcal{L} = -\sum_{t=1}^{n} \log P(x_t \mid x_{<t})$$

Tokens that are easy to predict contribute little to this loss, while hard-to-predict tokens dominate the gradient signal. MoD’s routing mechanism implicitly learns to identify these hard tokens. The router weights $W_r$ are trained end-to-end with the rest of the model, learning to route compute where it matters most.
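The uneven-contribution claim is easy to observe directly. This toy snippet (synthetic logits, not a trained model) computes the per-token terms of the loss above:

```python
import torch
import torch.nn.functional as F

# Per-token cross-entropy: reduction="none" keeps one loss value per position
# instead of averaging, exposing how unevenly tokens contribute.
logits = torch.randn(1, 6, 100)            # [batch, seq_len, vocab]
targets = torch.randint(0, 100, (1, 6))    # [batch, seq_len]
per_token = F.cross_entropy(
    logits.transpose(1, 2), targets, reduction="none"
)                                          # [batch, seq_len]
print(per_token)  # losses vary across positions: tokens are not equally hard
```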
Limitations and Challenges
MoD isn’t without trade-offs:
Training Complexity: The routing function adds another optimization variable. The router must learn meaningful importance scores, which can be unstable early in training. Techniques like router warmup (training the base model first, then adding routing) help mitigate this.
Load Balancing: Unlike MoE’s expert load balancing, MoD’s token selection doesn’t require explicit balancing losses—the fixed capacity $k$ naturally bounds the compute. However, uneven routing across layers can lead to suboptimal hardware utilization.
Inference Considerations: The scatter-gather operations for token routing add overhead. For very small batch sizes, this overhead can offset the FLOP savings. MoD shines in large-batch scenarios common in production serving.
Task Specificity: The optimal capacity factor varies by task. Simple classification tasks may tolerate lower capacity than complex reasoning tasks. Task-aware capacity tuning becomes a hyperparameter consideration.
Practical Implementation
For practitioners looking to adopt MoD, the key hyperparameters are:
- Capacity Factor: Typically 0.3-0.7. Lower values mean more speedup but risk quality degradation.
- Routing Frequency: Not every layer needs MoD. Applying it to every other layer or only deeper layers often works well.
- Router Architecture: The simplest router is a linear projection with sigmoid activation. More complex routers (multi-layer, attention-based) can improve selection quality.
- Training Strategy: Start training with capacity=1.0 (no routing), then gradually reduce to target capacity over the first 10-20% of training steps.
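The warmup strategy above can be sketched as a simple schedule; the 15% warmup fraction and 0.5 target here are illustrative, picked from the ranges the text mentions:

```python
# Anneal capacity from 1.0 (no routing) down to a target over the first
# warmup_frac of training steps, then hold it fixed.
def capacity_at_step(step, total_steps, target=0.5, warmup_frac=0.15):
    warmup_steps = int(total_steps * warmup_frac)
    if step >= warmup_steps:
        return target
    # Linear interpolation from 1.0 to target during warmup
    progress = step / warmup_steps
    return 1.0 + (target - 1.0) * progress

print(capacity_at_step(0, 10_000))       # 1.0: full capacity at the start
print(capacity_at_step(10_000, 10_000))  # 0.5: target capacity after warmup
```

At each step, `int(seq_len * capacity)` tokens would be routed through the layer, so the model eases into sparse routing instead of facing it from the first gradient update.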
Future Directions
The MoD paradigm opens several research avenues:
Adaptive Capacity: Instead of fixed $k$ per layer, learn to adjust capacity dynamically based on input complexity. A document with simple grammar might need lower capacity than one with technical jargon.
Hierarchical Routing: Combine token-level routing with sequence-level routing. Entire sequence segments could be identified for differential processing.
Cross-Layer Routing: Current MoD makes independent routing decisions per layer. Joint routing across layers could identify tokens that need consistent deep processing versus those that can skip multiple layers.
Integration with Other Efficiency Techniques: MoD combines naturally with quantization, pruning, and knowledge distillation. A quantized MoD model could achieve multiplicative efficiency gains.
Why MoD Matters for the AI Industry
The implications of MoD extend beyond academic interest:
Cost Reduction: For inference-heavy applications, a 50% speedup translates directly to halved compute costs. At scale, this is millions of dollars saved.
Environmental Impact: Lower compute means lower energy consumption. MoD contributes to the growing imperative of sustainable AI.
Edge Deployment: Faster inference enables deployment on resource-constrained devices. Models that were too slow for mobile might become viable with MoD.
Model Scaling: If each token costs less compute, we can train deeper models within the same budget. MoD effectively unlocks additional depth for free.
Mixture-of-Depths challenges a fundamental assumption of the transformer architecture. The uniform compute allocation that seemed necessary is revealed as an unnecessary constraint. By learning which tokens matter, transformers can think deeper about what’s important while skipping what isn’t.
The elegance of MoD lies in its simplicity: a learned scalar per token, a top-k selection, and suddenly transformers become dramatically more efficient. No architecture redesign, no exotic hardware requirements, just intelligent compute allocation.
As LLMs continue scaling toward trillion-parameter regimes, efficiency techniques like MoD transition from nice-to-have to essential. The future of AI isn’t just about building bigger models—it’s about building smarter ones that know where to spend their compute.