The Open LLM Leaderboard tells a surprising story: many top-performing models aren’t trained at all. They’re merged. A 7B parameter model, created by strategically blending weights from existing fine-tuned models, can outperform models 10x its size. This isn’t alchemy—it’s mathematics.

Model merging represents a paradigm shift in how we think about model development. Instead of investing millions in GPU hours for training, practitioners are discovering that the collective intelligence embedded in existing open-source models can be combined to create something greater than the sum of its parts. The technique requires no gradients, no backward passes, and no training data. Just arithmetic operations on weight tensors.

The Geometry of Weight Space

To understand why model merging works, we need to understand where fine-tuned models live in parameter space.

When you fine-tune a pre-trained model, you’re essentially moving from one point in weight space to another. The pre-trained model sits at $\theta_{pre}$, and after fine-tuning on task $t$, you arrive at $\theta_t$. The difference between these points—the task vector—encapsulates everything the model learned about that specific task:

$$\tau_t = \theta_t - \theta_{pre}$$

The remarkable discovery from the Task Arithmetic paper (Ilharco et al., 2022) is that these task vectors can be manipulated algebraically. Add two task vectors together, and you get a model that performs both tasks. Negate a task vector and subtract it, and you can selectively “unlearn” capabilities.
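These operations can be sketched with plain Python dicts of floats standing in for per-layer weight tensors. The helper names (`task_vector`, `apply_task_vectors`) and the toy values are illustrative, not from any library; real implementations apply the same arithmetic elementwise to tensors:

```python
def task_vector(theta_ft, theta_pre):
    """tau_t = theta_t - theta_pre, computed per parameter."""
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def apply_task_vectors(theta_pre, taus, lam=1.0):
    """theta_pre + lam * sum(taus); negate a tau before passing it to 'unlearn'."""
    merged = dict(theta_pre)
    for tau in taus:
        for k, v in tau.items():
            merged[k] += lam * v
    return merged

theta_pre = {"w1": 0.0, "w2": 1.0}
theta_a   = {"w1": 0.5, "w2": 1.0}   # fine-tuned on task A
theta_b   = {"w1": 0.0, "w2": 1.5}   # fine-tuned on task B

tau_a = task_vector(theta_a, theta_pre)
tau_b = task_vector(theta_b, theta_pre)

# Adding both vectors yields a model carrying both updates;
# adding a negated vector removes that task's update.
both = apply_task_vectors(theta_pre, [tau_a, tau_b])
unlearned = apply_task_vectors(theta_pre, [{k: -v for k, v in tau_a.items()}])
```

The same pattern extends to any number of task vectors, which is exactly what the merging methods below exploit.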

This works because of Linear Mode Connectivity (LMC)—a property observed in deep neural networks where independently fine-tuned models often lie in connected regions of parameter space with low-loss barriers between them. In simpler terms: interpolating along the path between two fine-tuned models keeps the loss low, so intermediate points retain the capabilities of both endpoints.

graph LR
    A[Pre-trained Model] --> B[Fine-tune Task A]
    A --> C[Fine-tune Task B]
    A --> D[Fine-tune Task C]
    B --> E[Task Vector τA]
    C --> F[Task Vector τB]
    D --> G[Task Vector τC]
    E --> H[Merged Model]
    F --> H
    G --> H
    H --> I[Multi-task Capabilities]

Model Soups: The Simplest Fusion

The most straightforward merging technique—Model Soups—simply averages weights from multiple fine-tuned models:

$$\theta_{merged} = \sum_{i=1}^{N} \alpha_i \theta_i$$

where $\alpha_i$ are weights summing to 1. This works surprisingly well when models are fine-tuned from the same base with different hyperparameters, acting as a form of regularization that smooths out random fluctuations in training.
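As a minimal sketch, treating each model's state dict as a dict of floats (a stand-in for per-layer tensors; names and values are toy examples):

```python
def soup(models, alphas):
    """Weighted average of state dicts; the mixing weights must sum to 1."""
    assert abs(sum(alphas) - 1.0) < 1e-9, "mixing weights must sum to 1"
    return {name: sum(a * m[name] for a, m in zip(alphas, models))
            for name in models[0]}

m1 = {"w": 2.0, "b": 0.0}
m2 = {"w": 4.0, "b": 1.0}
uniform = soup([m1, m2], [0.5, 0.5])  # simple average of the two models
```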

But Model Soups has a fundamental limitation: it assumes all parameters are equally important. In reality, fine-tuning primarily modifies a sparse subset of parameters—those most relevant to the target task. Averaging indiscriminately can dilute task-specific knowledge.

Task Arithmetic: Directional Updates

Task Arithmetic, introduced in late 2022, provides a more principled approach. Instead of averaging models directly, we work with task vectors:

$$\theta_{merged} = \theta_{pre} + \lambda \sum_{i=1}^{N} \alpha_i \tau_i$$

The scaling factor $\lambda$ controls the strength of the task contributions. This formulation enables several powerful operations:

  • Multi-task merging: Add task vectors to create a model that handles multiple tasks
  • Task negation: Subtract a task vector to reduce unwanted behaviors
  • Analogical reasoning: If A:B::C:D, then $\tau_D \approx \tau_B - \tau_A + \tau_C$

However, Task Arithmetic assumes task vectors align harmoniously. When merging more than a few models, this assumption breaks down catastrophically.

The Interference Problem

When merging multiple models, three types of interference emerge:

1. Redundant Parameters: Fine-tuning often changes many parameters by small amounts—noise rather than signal. These small changes accumulate across models, drowning out meaningful updates.

2. Sign Conflicts: Two models might assign opposite signs to the same parameter. One model increases weight $w_{ij}$, another decreases it. Averaging cancels both contributions.

3. Magnitude Imbalances: Some task vectors have much larger magnitudes than others, dominating the merge and overwhelming smaller contributions.

TIES-Merging (Yadav et al., 2023) addresses all three problems through a clever three-step algorithm.

TIES-Merging: Resolving Interference

Step 1: Trim

For each task vector $\tau_i$, retain only the top $k\%$ of parameters by magnitude, setting the rest to zero:

$$\hat{\tau}_i = \text{Trim}_k(\tau_i)$$

This removes redundant noise while preserving the most salient updates. Typical values of $k$ range from 20% to 80%; the TIES paper found that retaining just the top 20% of parameters often preserves most task performance.

Step 2: Elect Sign

For each parameter position $j$, compute the dominant sign across all trimmed task vectors:

$$s_j = \text{sign}\left(\sum_{i=1}^{N} \hat{\tau}_{i,j}\right)$$

This resolves sign conflicts democratically—the direction with more accumulated magnitude wins.

Step 3: Disjoint Merge

Only merge parameters whose signs agree with the elected sign. Parameters with conflicting signs are excluded from that position:

$$\tau_{merged,j} = \frac{1}{|S_j|} \sum_{i \in S_j} \hat{\tau}_{i,j}$$

where $S_j = \{i : \text{sign}(\hat{\tau}_{i,j}) = s_j\}$ is the set of models agreeing with the elected sign.

The result is a merged model that preserves the strongest, most consistent updates while eliminating interference.
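The three steps above can be sketched on toy task vectors, again using dicts of floats in place of tensors. This is a simplified illustration (per-vector trimming, sign ties broken toward positive), not the paper's reference implementation:

```python
def ties_merge(taus, k=0.3):
    """Minimal TIES sketch over dict-of-float task vectors."""
    # Step 1: Trim -- keep only the top-k fraction of entries by magnitude.
    trimmed = []
    for tau in taus:
        n_keep = max(1, int(k * len(tau)))
        top = sorted(tau, key=lambda name: abs(tau[name]), reverse=True)[:n_keep]
        trimmed.append({name: tau[name] for name in top})

    merged = {}
    for name in {name for tau in trimmed for name in tau}:
        vals = [tau.get(name, 0.0) for tau in trimmed]
        # Step 2: Elect sign -- the direction with more accumulated magnitude wins.
        elected = 1.0 if sum(vals) >= 0 else -1.0
        # Step 3: Disjoint merge -- average only values agreeing with the elected sign.
        agree = [v for v in vals if v != 0.0 and (v > 0) == (elected > 0)]
        merged[name] = sum(agree) / len(agree) if agree else 0.0
    return merged

tau_a = {"a": 1.0, "b": 0.1, "c": -0.4}
tau_b = {"a": -0.2, "b": 0.9, "c": -0.6}
merged = ties_merge([tau_a, tau_b], k=0.7)  # keeps top 2 of 3 entries per vector
```

Note how the conflicting signs on `a` do not cancel: trimming removes `tau_b`'s small negative update, and the surviving positive value passes through intact.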

DARE: The 90% Solution

DARE (Drop And REscale), introduced in the “Language Models are Super Mario” paper (Yu et al., 2023), takes an even more aggressive approach to sparsification.

The algorithm randomly drops a proportion $p$ of delta parameters (typically $p = 0.9$), then rescales the remaining parameters by $1/(1-p)$:

$$\hat{\Delta}_i = \frac{1}{1-p} \cdot m_i \odot \Delta_i$$

where $m_i \sim \text{Bernoulli}(1-p)$ is a random keep-mask: each entry is 1 with probability $1-p$ (kept) and 0 with probability $p$ (dropped).

This sounds counterintuitive—how can dropping 90% of changes preserve performance? The answer lies in the rescaling factor. By multiplying surviving parameters by 10 (when $p = 0.9$), the expected value of the output remains unchanged:

$$\mathbb{E}[\hat{\Delta}_i] = \frac{1}{1-p} \cdot (1-p) \cdot \Delta_i = \Delta_i$$

DARE’s insight is that fine-tuned models are massively overparameterized. Most parameter changes are redundant, and a sparse subset carries the essential knowledge. Random dropping acts as a regularizer, while rescaling preserves the signal-to-noise ratio.
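The drop-and-rescale step can be sketched on a dict-of-floats delta; the `dare` helper and fixed seed are illustrative choices, not library API:

```python
import random

def dare(delta, p=0.9, seed=0):
    """Drop each delta entry with probability p; rescale survivors by 1/(1-p)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    out = {}
    for name, value in delta.items():
        if rng.random() < p:
            out[name] = 0.0                # dropped
        else:
            out[name] = value / (1.0 - p)  # rescaled so E[out[name]] == value
    return out

# On a large toy delta of all-ones, roughly 10% of entries survive at 10x
# magnitude, so the mean stays near the original value of 1.0.
delta = {f"w{i}": 1.0 for i in range(10_000)}
sparse = dare(delta, p=0.9)
mean = sum(sparse.values()) / len(sparse)
```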

DARE is often combined with TIES (creating DARE-TIES), where DARE handles the sparsification and TIES resolves sign conflicts.

SLERP: Geometric Interpolation

Spherical Linear Interpolation (SLERP) takes a geometric view of weight space. Instead of averaging in Euclidean space, SLERP interpolates along a great circle on the unit sphere.

For two weight vectors $\mathbf{a}$ and $\mathbf{b}$ with interpolation parameter $t$:

$$\text{SLERP}(\mathbf{a}, \mathbf{b}; t) = \frac{\sin((1-t)\theta)}{\sin\theta}\mathbf{a} + \frac{\sin(t\theta)}{\sin\theta}\mathbf{b}$$

where $\theta = \arccos\left(\frac{\mathbf{a} \cdot \mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}\right)$ is the angle between the vectors (weight vectors are not unit length, so the dot product must be normalized before taking the arccosine).

SLERP preserves angular relationships between weight vectors, maintaining a constant angular velocity along the interpolation path. This matters when merging models with different scaling characteristics: Euclidean averaging can pull the result toward the model with larger weight magnitudes, while SLERP treats both models symmetrically.
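A small sketch of the formula for plain float lists (real merges apply this per weight tensor, typically after flattening; the lerp fallback for near-parallel vectors is a common numerical guard, not part of the formula itself):

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between two vectors of floats."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    dot = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    dot = max(-1.0, min(1.0, dot))       # guard against floating-point drift
    theta = math.acos(dot)
    if theta < 1e-6:                     # nearly parallel: fall back to lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * theta) / math.sin(theta)
    wb = math.sin(t * theta) / math.sin(theta)
    return [wa * x + wb * y for x, y in zip(a, b)]
```

For two orthogonal unit vectors and $t = 0.5$, the result lands on the arc midway between them rather than at the shorter Euclidean midpoint.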

Passthrough Merging: Building Frankensteins

All previous techniques blend weights within the same architecture. Passthrough merging (colloquially “Frankenmerging”) takes a different approach: it directly copies layers from different models to create a new architecture.

# MergeKit configuration for passthrough merging
slices:
  - sources:
    - model: mistral-7b-base
      layer_range: [0, 24]
  - sources:
    - model: mistral-7b-instruct
      layer_range: [24, 32]
merge_method: passthrough
dtype: float16

This creates a model where early layers come from one model and later layers from another. The resulting model can even have more layers than either source: SOLAR-10.7B was created by depth-upscaling Mistral 7B (duplicating and restacking layers, then continuing pre-training), producing a model that outperformed Mixtral 8x7B on several benchmarks.

Passthrough merging exploits the fact that transformer layers are somewhat modular. Early layers learn general representations; later layers specialize. By combining layers strategically, you can sometimes get the best of both worlds—though success depends heavily on architectural compatibility.
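The layer-slicing idea reduces to list concatenation if each model is represented as an ordered list of layer weights (strings here are placeholders for the actual tensors):

```python
def passthrough(slices):
    """Build a new layer stack by copying layer ranges from source models.
    Each slice is (layers, start, end), matching MergeKit's layer_range idea."""
    stacked = []
    for layers, start, end in slices:
        stacked.extend(layers[start:end])
    return stacked

base = [f"base_layer_{i}" for i in range(32)]
instruct = [f"instruct_layer_{i}" for i in range(32)]
# Mirrors the MergeKit config above: layers 0-23 from base, 24-31 from instruct.
franken = passthrough([(base, 0, 24), (instruct, 24, 32)])
```

Slices may also overlap or repeat, which is how depth-upscaled models end up with more layers than either parent.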

Evolutionary Model Merging: Automating Discovery

Sakana AI’s evolutionary model merge, published in Nature Machine Intelligence (2025), automates what was previously an art form requiring extensive experimentation.

The approach optimizes merging along two axes:

Parameter Space (PS): Evolution optimizes layer-wise density parameters and mixing weights. The CMA-ES algorithm searches for configurations that maximize benchmark performance.

Data Flow Space (DFS): Instead of blending weights, DFS merging optimizes the inference path. Tokens might flow through layer 5 of model A, then layer 7 of model B, then back to layer 8 of model A. This discovers non-obvious architectures that human designers wouldn’t consider.

The results are striking. An evolved Japanese math LLM, created by merging a Japanese language model with English math models, achieved state-of-the-art performance on Japanese benchmarks—outperforming 70B parameter models with just 7B parameters. The model wasn’t explicitly trained on Japanese math problems; the capability emerged from the merge.

Practical Implementation with MergeKit

MergeKit has become the standard tool for model merging, providing a unified interface for all major techniques:

# DARE-TIES configuration example
models:
  - model: NousResearch/Hermes-3-Llama-3.1-8B
    parameters:
      weight: 0.4
  - model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    parameters:
      weight: 0.6
merge_method: dare_ties
base_model: meta-llama/Llama-3.1-8B-Instruct
parameters:
  lambda: 0.8
  density: 0.7
dtype: float16

Key parameters:

  • density: Fraction of parameters to retain (higher = less pruning)
  • lambda: Scaling factor for the merged task vector
  • weight: Relative contribution of each model

When Merging Fails

Model merging isn’t a universal solution. Several conditions must align for success:

Architectural Compatibility: Models must share the same architecture. You can merge Llama-2-7B variants with each other, but not with Mistral-7B. Passthrough merging can work across architectures with compatible layer dimensions, but results are unpredictable.
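A quick pre-merge sanity check can catch such incompatibilities by comparing parameter names and tensor shapes. The shape dicts below are toy examples; real code would read them from the checkpoints:

```python
def check_mergeable(shapes_a, shapes_b):
    """Return parameter names that block an elementwise merge: missing in one
    model, or present in both with different tensor shapes."""
    problems = []
    for name in set(shapes_a) | set(shapes_b):
        if name not in shapes_a or name not in shapes_b:
            problems.append(name)         # parameter exists in only one model
        elif shapes_a[name] != shapes_b[name]:
            problems.append(name)         # same name, incompatible shape
    return sorted(problems)

model_a = {"embed": (32000, 4096), "attn.q": (4096, 4096)}
model_b = {"embed": (32000, 4096), "attn.q": (4096, 1024)}  # different head layout
```

An empty result means every parameter lines up and the weight-blending methods above can be applied directly.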

Shared Pre-training: Merging works best when models share a common pre-trained ancestor. The task vectors represent deviations from this shared baseline. Merging independently trained models—without a shared starting point—often produces incoherent results.

Task Complementarity: Models fine-tuned on similar tasks tend to merge poorly—their knowledge overlaps without adding new capabilities. The most successful merges combine models with complementary strengths: a coding model with a math model, or a language model with a reasoning model.

Scaling Limits: Recent research suggests that merging effectiveness decreases as the number of models grows. Each additional model introduces new interference patterns, and beyond a handful of models the accumulated conflicts can erase the gains of the merge.

The Economics of Merging

The appeal of model merging is fundamentally economic. Training a frontier LLM costs millions of dollars. Merging two existing models costs minutes of CPU time on a laptop.

This democratizes model development. A researcher with limited compute can create competitive models by strategically combining existing releases. The technique has enabled the creation of specialized models—Japanese LLMs, domain-specific chatbots, instruction-tuned variants—that would otherwise require prohibitive resources.

The approach also challenges the conventional wisdom that bigger is always better. A well-merged 7B model can outperform a naively trained 70B model on specific tasks. The question shifts from “how much compute do we need?” to “how effectively can we leverage existing knowledge?”

Trade-offs and Considerations

| Method | Best For | Limitations |
| --- | --- | --- |
| Model Soups | Same-task ensembles | Dilutes task-specific knowledge |
| Task Arithmetic | 2-3 models | Interference grows with model count |
| TIES-Merging | Multi-task merging | Requires careful density tuning |
| DARE | Large model counts | Random dropping can lose information |
| SLERP | Two-model merging | Doesn't scale to many models |
| Passthrough | Architecture innovation | Unpredictable behavior |
| Evolutionary | Automated optimization | Requires benchmark data for optimization |

Emerging Directions

The field is rapidly evolving. Recent developments include:

Model Stock: Efficient merging that approximates the optimal center of weight distribution using geometric properties, requiring only a few fine-tuned models.

DELLA-Merging: Magnitude-aware pruning that assigns higher survival probability to large-magnitude parameters, combining the benefits of TIES and DARE.

Cross-Modal Merging: Techniques for merging models across modalities—combining vision encoders with language models without explicit multimodal training.

Merging as Regularization: Using merging during training to improve generalization, rather than only for post-hoc model combination.


Model merging represents a fundamental insight: knowledge in neural networks is distributed and compositional. The weights of a fine-tuned model encode not just a point solution, but a direction in a high-dimensional space. By understanding and manipulating these directions, we can create models that inherit—and transcend—their ancestors.

The technique won’t replace training entirely. Pre-training and fine-tuning remain essential for creating the building blocks. But merging offers a powerful alternative for combining these blocks, enabling capabilities that would be expensive or impossible to achieve through training alone.

As the open-source ecosystem grows, the space of possible merges expands combinatorially. The question is no longer whether merging works, but how to navigate this vast space efficiently. Evolutionary methods, better theoretical understanding, and improved tooling are pointing toward a future where model creation becomes less about training from scratch and more about intelligent composition of existing knowledge.