For years, the dominant approach to multimodal AI followed a simple recipe: take a pre-trained vision encoder (CLIP, SigLIP), bolt it onto a pre-trained LLM through an adapter layer, and fine-tune the connection. This “late-fusion” paradigm powered everything from GPT-4V to LLaVA, delivering impressive results with remarkable sample efficiency. But a fundamental question lingered: was this architectural shortcut an inherent advantage, or merely a convenient workaround?
The answer arrived in 2025 with a paradigm shift that’s rewriting the rules of multimodal AI. Native multimodal models—trained from scratch on all modalities simultaneously—are proving that early-fusion architectures don’t just match late-fusion approaches; they exceed them in efficiency, scalability, and ultimately, capability.
The Bolt-On Problem: Why Late Fusion Hit a Ceiling
The late-fusion architecture that dominated 2023-2024 followed a predictable pattern. A frozen vision encoder processes images into embeddings, which are projected through an MLP adapter into the LLM’s embedding space. The LLM then treats these visual embeddings as “soft tokens” alongside text. This approach offered clear advantages: leverage existing pre-trained components, minimal compute investment, and rapid iteration.
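Concretely, the late-fusion wiring amounts to one projection and one concatenation. Here is a minimal numpy sketch; all dimensions are illustrative stand-ins, not any specific model's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen vision encoder output: 196 patch embeddings of dim 1024
# (a stand-in for CLIP/SigLIP features)
patch_embeds = rng.standard_normal((196, 1024))

# Two-layer MLP adapter projecting into the LLM's embedding space (dim 4096)
W1 = rng.standard_normal((1024, 4096)) * 0.02
W2 = rng.standard_normal((4096, 4096)) * 0.02

def adapter(x):
    h = np.maximum(x @ W1, 0.0)  # real adapters use GELU; ReLU keeps the sketch short
    return h @ W2

soft_tokens = adapter(patch_embeds)        # (196, 4096)

# Embeddings for an ordinary text prompt
text_tokens = rng.standard_normal((12, 4096))

# Late fusion: visual "soft tokens" are simply prepended to the text sequence
llm_input = np.concatenate([soft_tokens, text_tokens], axis=0)
```

Only `W1` and `W2` (and sometimes the LLM) are trained; the encoder's representations are fixed, which is exactly where the bottleneck described below comes from.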
But the architectural constraints ran deeper than most realized. Vision encoders like CLIP were trained with a specific objective—contrastive alignment between images and text captions—which fundamentally shaped their representations. When you freeze these encoders and connect them to LLMs, you inherit not just their capabilities but their limitations:
Representation Mismatch: CLIP’s contrastive training optimizes for global semantic similarity, not the fine-grained spatial reasoning needed for tasks like OCR, document understanding, or precise object localization. The encoder becomes a bottleneck that no amount of adapter fine-tuning can fully overcome.
Modality Isolation: In late-fusion models, visual and linguistic information only interact after passing through separate processing pipelines. The LLM never learns to reason about visual features during its foundational training—it merely learns to interpret compressed embeddings at inference time.
Generation Impossibility: Perhaps the most severe constraint—late-fusion architectures excel at understanding but fundamentally cannot generate images. The vision encoder is a one-way function, compressing images into embeddings with no path back to pixel space.
The Early-Fusion Revolution: Chameleon’s Unified Token Space
Meta’s Chameleon, released in May 2024, demonstrated a radically different approach. Instead of bolting vision onto language, Chameleon trains a single transformer from scratch on interleaved text and image tokens. Images are quantized into discrete tokens using a VQ-GAN codebook (typically 8,192 entries), and these visual tokens share the same vocabulary space as text tokens.
```mermaid
graph TB
    subgraph "Late Fusion (Traditional)"
        A1[Image Input] --> B1[CLIP Encoder]
        B1 --> C1[Embeddings]
        C1 --> D1[MLP Adapter]
        E1[Text Input] --> F1[LLM Backbone]
        D1 --> F1
        F1 --> G1[Text Output]
    end
    subgraph "Early Fusion (Native)"
        A2[Image Input] --> B2[VQ-GAN Tokenizer]
        B2 --> C2[Image Tokens]
        E2[Text Input] --> F2[Text Tokenizer]
        F2 --> G2[Text Tokens]
        C2 --> H2[Unified Transformer]
        G2 --> H2
        H2 --> I2[Text/Image Output]
    end
```
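In code, the unified token space is little more than nearest-neighbour lookup plus index arithmetic: image latents are snapped to their closest codebook entry, and the resulting ids are offset past the text vocabulary. A sketch with deliberately small illustrative sizes (Chameleon's actual tokenizer, grid size, and vocabulary layout differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)

TEXT_VOCAB = 65_536    # text token ids occupy [0, TEXT_VOCAB)
CODEBOOK_SIZE = 8_192  # VQ-GAN codebook; image ids occupy the next 8,192 slots

codebook = rng.standard_normal((CODEBOOK_SIZE, 16))  # 16-dim latents for brevity

def quantize(latents):
    """Nearest-neighbour lookup: each latent patch becomes one discrete id."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# An 8x8 grid of image latents -> 64 image tokens (Chameleon uses up to 1024)
image_latents = rng.standard_normal((64, 16))
image_ids = quantize(image_latents) + TEXT_VOCAB  # shift into the shared vocabulary

text_ids = np.array([17, 902, 33])  # ordinary text tokens

# One interleaved sequence; a single transformer models text and image ids alike
sequence = np.concatenate([text_ids, image_ids])
```

Because image ids live in the same vocabulary as text ids, generation is just autoregressive sampling over the full range, and decoding an image means running the sampled ids back through the VQ-GAN decoder.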
The implications are profound. A 34B parameter Chameleon model processes images and text through the same attention layers, learning cross-modal relationships at every depth of the network. When generating, the model can produce both text and image tokens autoregressively—a unified generation paradigm that late-fusion architectures simply cannot achieve.
But discrete tokenization comes with trade-offs. VQ-GAN quantization introduces information loss, and the 256-1024 tokens required per image create a computational burden. Image generation quality, while impressive for an autoregressive approach, lagged behind dedicated diffusion models.
Transfusion: When Next-Token Prediction Meets Diffusion
Meta’s Transfusion paper (August 2024) proposed an elegant solution to the quality-efficiency trade-off. Instead of forcing images into discrete tokens, Transfusion represents images as continuous embeddings and applies different loss functions to different modalities:
- For text tokens: standard next-token prediction (cross-entropy loss)
- For image patches: DDPM diffusion loss with bidirectional attention
The key insight: different modalities benefit from different inductive biases within the same model. Text generation benefits from causal (left-to-right) attention, while image generation benefits from bidirectional attention that can reason about spatial relationships globally. Transfusion switches attention patterns per modality while sharing the same transformer backbone.
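The attention switch can be sketched as a mask builder: start from an ordinary causal mask, then open up full attention inside each contiguous image-patch span. This is a simplification of Transfusion's actual masking, but it captures the mechanism:

```python
import numpy as np

def transfusion_mask(modality):
    """Causal attention for text positions; bidirectional attention
    within each contiguous image-patch span (Transfusion-style)."""
    n = len(modality)
    mask = np.tril(np.ones((n, n), dtype=bool))  # start fully causal
    i = 0
    while i < n:
        if modality[i] == "img":
            j = i
            while j < n and modality[j] == "img":
                j += 1
            mask[i:j, i:j] = True  # full attention within the image span
            i = j
        else:
            i += 1
    return mask

# 3 text tokens, then 4 image patches, then 2 text tokens
m = ["txt"] * 3 + ["img"] * 4 + ["txt"] * 2
mask = transfusion_mask(m)
```

Within the image span every patch sees every other patch, while text positions (including those after the image) remain strictly causal, so a single transformer can serve both loss functions.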
The results validated the approach. Transfusion matched the image generation quality of dedicated diffusion models while maintaining strong language capabilities—all within a single 7B parameter model. The paper demonstrated that “Frankenstein” architectures combining transformer and diffusion weren’t necessary; a single model could excel at both paradigms with the right training recipe.
The Encoding Conflict: Janus’s Decoupled Architecture
DeepSeek’s Janus series (October 2024 - January 2025) identified a subtle but critical problem that previous approaches overlooked. The visual encoder optimized for understanding is fundamentally different from one optimized for generation:
- Understanding requires: semantic features, object relationships, OCR capability, spatial reasoning
- Generation requires: spatial layout, texture details, pixel-level fidelity
Using a single visual encoder for both tasks forces a compromise. An encoder trained for semantic understanding may lose the spatial detail needed for generation, while an encoder trained for pixel fidelity may produce embeddings poorly suited for high-level reasoning.
Janus’s solution: decouple the visual encoding entirely. The model uses SigLIP for understanding tasks (semantic-rich embeddings optimized for VQA, captioning, reasoning) and a separate VQ tokenizer for generation (spatial detail preserved). Both encoders feed into a unified transformer head that can switch between understanding and generation modes.
Understanding Path: Image → SigLIP → Semantic Embeddings → LLM → Text
Generation Path: Text → LLM → VQ Tokens → VQ-Decoder → Image
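A toy dispatcher makes the decoupling concrete. The encoder and decoder below are random stand-ins with illustrative shapes, not Janus's real components; the point is only that each path owns its own visual representation while sharing one backbone:

```python
import numpy as np

rng = np.random.default_rng(0)

def siglip_encode(image):
    """Understanding path: semantic-rich embeddings for the LLM
    (24x24 patches projected to the LLM dim; shapes illustrative)."""
    return rng.standard_normal((576, 4096))

def vq_decode(token_ids):
    """Generation path: discrete VQ tokens decoded back to pixels."""
    return rng.standard_normal((384, 384, 3))

def run(mode, image=None, prompt_ids=None):
    if mode == "understand":
        # Image -> SigLIP -> semantic embeddings -> LLM -> text
        visual = siglip_encode(image)
        return {"llm_inputs": visual.shape}
    elif mode == "generate":
        # Text -> LLM -> VQ tokens -> VQ-Decoder -> image
        vq_ids = rng.integers(0, 16_384, size=576)  # hypothetical codebook size
        return {"image": vq_decode(vq_ids).shape}
    raise ValueError(mode)
```

Because neither encoder has to serve both objectives, each can be optimized for its own task without compromising the other.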
The results speak to the validity of this approach. Janus-Pro-7B achieved 80% accuracy on GenEval (surpassing Stable Diffusion 3 Medium at 74%, DALL-E 3 at 67%, and Transfusion at 63%) while maintaining strong multimodal understanding capabilities (79.2 on MMBench).
The Scaling Laws That Changed Everything
In April 2025, Apple researchers released a paper that fundamentally challenged the late-fusion orthodoxy. “Scaling Laws for Native Multimodal Models” trained 457 models with different architectures and training mixtures to answer a simple question: does late-fusion have an inherent advantage?
The answer was definitive: no.
```mermaid
graph LR
    subgraph "Key Findings"
        A[Early Fusion] -->|Lower Param Count| B[Stronger Performance]
        A -->|Unified Training| C[More Efficient]
        A -->|No Encoders| D[Easier Deployment]
        E[Late Fusion] -->|Sample Efficient| F[Fast Iteration]
        E -->|Frozen Encoders| G[Representation Bottleneck]
    end
```
Early-fusion architectures exhibited stronger performance at lower parameter counts, were more efficient to train (no need to align pre-trained components), and were easier to deploy (no separate encoder weights). The paper also found that Mixture of Experts (MoE) provided additional benefits: sparse models learned modality-specific weights implicitly, without any hand-designed modality routing.
Perhaps most importantly, the research established that the sample efficiency advantage of late-fusion was largely an artifact of pre-training, not architecture. When training from scratch with sufficient data, early-fusion matched or exceeded late-fusion performance at equivalent compute.
Llama 4: The Production-Grade Native Multimodal
Meta’s Llama 4 release (April 2025) represented the most ambitious production deployment of native multimodal architecture. Three models were announced: Scout (109B total, 17B active, 16 experts), Maverick (400B total, 17B active, 128 experts), and Behemoth (nearly 2T total, 288B active, 16 experts).
iRoPE: Infinite Context Through Interleaved Attention
Llama 4 introduced iRoPE, an architecture that eliminates positional embeddings from interleaved attention layers. Traditional transformers use positional embeddings to encode sequence order, but these embeddings create a hard ceiling on context length. iRoPE’s approach:
- Traditional: every layer uses RoPE (Rotary Position Embeddings)
- iRoPE: alternating layers with and without positional embeddings, plus inference-time temperature scaling of attention
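The alternating schedule can be sketched in a few lines. The interleave period below is a guess for illustration, and the exact temperature-scaling formula is not fully specified publicly, so treat this as the shape of the idea rather than Llama 4's actual recipe:

```python
N_LAYERS = 8

def layer_uses_rope(layer_idx, interleave=4):
    """iRoPE-style schedule (sketch): most layers apply RoPE as usual,
    while every `interleave`-th layer drops positional embeddings entirely,
    so its attention is position-free and can generalize to longer contexts."""
    return (layer_idx + 1) % interleave != 0

schedule = [layer_uses_rope(i) for i in range(N_LAYERS)]
```

The position-free layers carry long-range information without ever seeing a position index, which is what lets the model extrapolate far beyond its 256K training context.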
This architecture enables Llama 4 Scout to support 10 million tokens of context—a 78x increase from Llama 3’s 128K limit. The model was both pre-trained and post-trained with 256K context, with the interleaved attention enabling generalization to much longer sequences.
Early Fusion at Scale
Llama 4’s vision encoder is based on MetaCLIP but trained in conjunction with a frozen Llama model—a technique that adapts the encoder specifically to the LLM’s embedding space. Visual tokens are injected directly into the transformer alongside text tokens from the first layer.
The training data mixture included over 30 trillion tokens (2x Llama 3), incorporating diverse text, image, and video datasets. Models were pre-trained on up to 48 images simultaneously and tested with up to 8 images during post-training, enabling sophisticated multi-image reasoning.
The MetaP Training Technique
A notable innovation was MetaP, a method for reliably setting critical hyper-parameters like per-layer learning rates and initialization scales. The key finding: chosen hyper-parameters transfer well across different values of batch size, model width, depth, and training tokens—simplifying the notoriously complex process of training massive models.
Training Challenges: The Reality of Native Multimodality
Native multimodal training isn’t without significant challenges. The Apple scaling laws paper and subsequent research identified several critical issues:
Modality Imbalance
Different modalities learn at different rates. Visual features often dominate early training, potentially suppressing linguistic capabilities. The Llama 4 team addressed this through a carefully curated curriculum that maintained balance without sacrificing performance compared to single-modality expert models.
Data Mixture Optimization
With limited computational resources, optimizing the ratio of text-to-image-to-video data becomes crucial. Research on ModalMix showed that the optimal mixture depends on the target task distribution, and suboptimal mixtures can leave significant performance on the table.
The Post-Training Pipeline
Llama 4’s post-training pipeline revealed a counterintuitive finding: over-constraining models during SFT and DPO can restrict exploration during RL, leading to suboptimal performance in reasoning and coding.
- Traditional pipeline: heavy SFT → DPO → light RL
- Llama 4 pipeline: light SFT → online RL → light DPO (after pruning roughly 50% of data tagged as "easy")
The continuous online RL strategy—alternating between training and using the model to filter for medium-to-hard prompts—proved crucial for achieving frontier-level performance.
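A toy version of that filtering loop looks like the following; the difficulty thresholds are made up for illustration, not Llama 4's actual cutoffs:

```python
import random

random.seed(0)

def estimate_difficulty(prompt, model_pass_rate):
    """Proxy: a prompt's difficulty is 1 minus the current model's pass rate on it."""
    return 1.0 - model_pass_rate(prompt)

def filter_medium_to_hard(prompts, model_pass_rate, lo=0.3, hi=0.9):
    """Keep prompts the current model finds medium-to-hard; drop the easy ones."""
    return [p for p in prompts
            if lo <= estimate_difficulty(p, model_pass_rate) <= hi]

# Toy curriculum step: score prompts with the current model, then re-filter
prompts = [f"task-{i}" for i in range(100)]
pass_rates = {p: random.random() for p in prompts}  # stand-in for sampled rollouts

def model_pass_rate(p):
    return pass_rates[p]

kept = filter_medium_to_hard(prompts, model_pass_rate)
```

In the real pipeline this alternates with training: as the model improves, previously hard prompts become easy and are filtered out, keeping the RL signal concentrated where it still teaches something.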
Performance Reality: Benchmarks and Trade-offs
| Model | Params | GenEval | MMBench | Context | Open Source |
|---|---|---|---|---|---|
| Janus-Pro-7B | 7B | 80% | 79.2 | 128K | ✓ |
| Llama 4 Maverick | 400B/17B | - | SOTA | 1M+ | ✓ |
| Llama 4 Scout | 109B/17B | - | High | 10M | ✓ |
| Transfusion-7B | 7B | 63% | Good | 32K | ✗ |
| Chameleon-34B | 34B | Good | Good | 4K | ✓ |
| GPT-4o | Unknown | SOTA | SOTA | 128K | ✗ |
The benchmarks reveal the current state of the field: open-source native multimodal models are competitive with, and in some cases exceed, proprietary alternatives. Janus-Pro's 80% on GenEval surpasses DALL-E 3's 67% while maintaining strong understanding capabilities. Llama 4 Maverick outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, and image benchmarks.
The Road Ahead
Native multimodal models represent more than an architectural improvement—they’re a fundamental shift in how we think about AI systems. The separation between “vision models” and “language models” is dissolving into unified architectures that process all modalities through shared representations.
The implications extend beyond architecture. Native multimodality enables:
- Any-to-Any Generation: Input images, text, or both; output images, text, or both
- Emergent Cross-Modal Reasoning: Capabilities that arise from training on interleaved multimodal data
- Simplified Deployment: One model, one inference pipeline, one serving infrastructure
The remaining challenges are significant: training costs for native multimodal models remain high, data mixture optimization is still more art than science, and the optimal balance between understanding and generation encoders remains an open question.
But the trajectory is clear. The bolt-on era is ending. The future belongs to models that don’t just connect vision and language—they’re born knowing both.