When GPT-4V describes a meme’s irony or Claude identifies a bug in a screenshot, something remarkable happens: an architecture designed purely for text somehow “sees” and “understands” images. The magic isn’t in teaching language models to process pixels directly—it’s in a clever architectural bridge that transforms visual data into something language models already understand: tokens.

Vision Language Models (VLMs) represent one of the most impactful innovations in modern AI, yet their architecture remains surprisingly underexplored compared to their text-only cousins. Let’s dissect how these systems actually work, from the moment an image enters the model to the final text output.

The Fundamental Challenge: Different Modalities, One Model

Language models operate on discrete tokens—sequences of integers representing words or subwords. Images, by contrast, are continuous grids of pixel values. A 224×224 RGB image contains 150,528 numbers that don’t naturally map to any vocabulary.

The key insight behind VLMs is that we don’t need to teach an LLM about pixels. Instead, we can translate images into a “language” the LLM already understands: sequences of embeddings with the same dimensionality as text tokens. This translation happens through a carefully designed pipeline.

The Three-Component Architecture

Every modern VLM follows a tripartite structure:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Vision Encoder │───▶│   Projector     │───▶│       LLM       │
│   (ViT/CLIP)    │    │  (MLP/Q-Former) │    │   (Transformer) │
└─────────────────┘    └─────────────────┘    └─────────────────┘
        │                      │                      │
    Image → Patches      Visual → Language       Token Sequence
      → Features        Embedding Space         → Text Output

Component 1: Vision Encoder

The vision encoder converts raw images into semantically meaningful feature vectors. Most VLMs use Vision Transformers (ViT) rather than CNNs because transformers naturally produce sequence-like outputs.

How ViT Processes Images:

  1. Patchification: An image is divided into fixed-size patches (typically 14×14 or 16×16 pixels). A 224×224 image with 14×14 patches produces 256 patches.

  2. Linear Projection: Each flattened patch (14×14×3 = 588 values) is projected through a learned matrix into the embedding dimension (e.g., 1024 for CLIP-ViT-L).

  3. Positional Encoding: Since patches lose spatial information after flattening, learnable or sinusoidal positional embeddings are added.

  4. Transformer Processing: The patch sequence passes through standard transformer layers with self-attention.
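Steps 1 and 2 can be sketched in a few lines of PyTorch (a minimal illustration using `unfold`; the projection matrix here is randomly initialized, not a trained encoder's weights):

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 14, 1024
image = torch.randn(1, 3, 224, 224)  # [batch, channels, height, width]

# unfold extracts non-overlapping 14x14 patches, each flattened to 14*14*3 = 588 values
patches = nn.functional.unfold(image, kernel_size=patch_size, stride=patch_size)
patches = patches.transpose(1, 2)  # [1, 256, 588]

# learned linear projection into the embedding dimension
projection = nn.Linear(patch_size * patch_size * 3, embed_dim)
patch_embeddings = projection(patches)  # [1, 256, 1024]
```

The 256 rows of `patch_embeddings` correspond exactly to the 256 patches of the 224×224 image; positional embeddings would be added next.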

CLIP vs SigLIP:

Most VLMs use CLIP-trained vision encoders, but SigLIP is increasingly preferred. The difference lies in the training objective:

  • CLIP uses InfoNCE loss with softmax normalization:

    $$\mathcal{L}_{\text{CLIP}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(v_i, t_j)/\tau)}$$
  • SigLIP replaces this with a simpler sigmoid loss:

    $$\mathcal{L}_{\text{SigLIP}} = -\frac{1}{N^2}\sum_{i,j}\log\sigma(\text{sim}(v_i, t_j) \cdot y_{ij})$$

Where $y_{ij} = 1$ for matching pairs and $-1$ otherwise. This eliminates the global softmax dependency, enabling better performance at smaller batch sizes and improved memory efficiency.
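As a sketch, the sigmoid loss can be computed directly from a batch of normalized embeddings (the learnable temperature and bias terms from the SigLIP paper are omitted here for brevity):

```python
import torch
import torch.nn.functional as F

def siglip_loss(v, t):
    # v, t: [N, D] L2-normalized image and text embeddings
    sim = v @ t.T                         # [N, N] pairwise cosine similarities
    y = 2 * torch.eye(v.size(0)) - 1      # +1 on the diagonal (matches), -1 elsewhere
    return -F.logsigmoid(sim * y).mean()  # average over all N^2 pairs

v = F.normalize(torch.randn(8, 512), dim=-1)
t = F.normalize(torch.randn(8, 512), dim=-1)
loss = siglip_loss(v, t)
```

Note that each of the N² terms depends only on one (image, text) pair, which is exactly why no global softmax over the batch is needed.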

Component 2: The Projector (Connector)

The projector is the crucial bridge between vision and language spaces. Vision encoder outputs have dimensions like [batch, num_patches, vision_dim], but the LLM expects [batch, seq_len, llm_dim]. The projector performs this transformation.

Evolution of Projector Designs:

Projector Type       Parameters  Latency  Quality         Use Case
Linear               ~8M         Minimal  Moderate        LLaVA-1.5
MLP (2-layer)        ~20M        Low      Good            LLaVA-1.6
Q-Former             ~50M        Medium   Better          BLIP-2, InstructBLIP
Perceiver Resampler  ~30M        Medium   Good            Flamingo
Cross-Attention      Variable    Higher   Best for video  IDEFICS

MLP Projector (LLaVA Style):

The simplest approach uses a two-layer MLP:

import torch.nn as nn

class MLPProjector(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, hidden_dim=4096):
        super().__init__()
        self.fc1 = nn.Linear(vision_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, llm_dim)
        self.act = nn.GELU()

    def forward(self, vision_features):
        # vision_features: [batch, num_patches, vision_dim]
        x = self.act(self.fc1(vision_features))
        return self.fc2(x)  # [batch, num_patches, llm_dim]

Q-Former (BLIP-2 Style):

The Q-Former is more sophisticated, using learnable query embeddings that attend to vision features:

import torch
import torch.nn as nn

class QFormer(nn.Module):
    def __init__(self, num_queries=32, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim))
        # kdim/vdim let the 4096-dim queries attend over 1024-dim vision features
        self.cross_attention = nn.MultiheadAttention(
            llm_dim, num_heads=8, kdim=vision_dim, vdim=vision_dim, batch_first=True
        )
        self.self_attention = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(llm_dim, llm_dim * 4),
            nn.GELU(),
            nn.Linear(llm_dim * 4, llm_dim)
        )

    def forward(self, vision_features):
        # Cross-attention: queries attend to vision features
        queries = self.queries.unsqueeze(0).expand(vision_features.size(0), -1, -1)
        attended, _ = self.cross_attention(queries, vision_features, vision_features)
        # Self-attention among queries
        refined, _ = self.self_attention(attended, attended, attended)
        return self.ffn(refined)

The Q-Former compresses variable-length vision features into a fixed number of query outputs (typically 32-64 tokens), providing consistent input size regardless of image resolution.
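This compression property is easy to demonstrate with a single cross-attention layer standing in for the full Q-Former stack (shapes follow the CLIP-ViT-L example above):

```python
import torch
import torch.nn as nn

num_queries, dim = 32, 1024
# 32 learned queries, broadcast across the batch
queries = nn.Parameter(torch.randn(1, num_queries, dim))
cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

vision_features = torch.randn(4, 576, dim)  # 4 images, 576 patch features each
q = queries.expand(vision_features.size(0), -1, -1)
compressed, _ = cross_attn(q, vision_features, vision_features)
# compressed: [4, 32, 1024] -- fixed 32 tokens regardless of patch count
```

Whether the input has 576 patches or 5,760, the output stays at 32 tokens per image.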

Component 3: The Language Model

The LLM receives visual tokens interleaved with text tokens and processes them identically. A conversation might look like:

[VISUAL_TOKEN_0]...[VISUAL_TOKEN_255][TEXT_TOKEN_USER_QUESTION]...[TEXT_TOKEN_RESPONSE]

The LLM doesn’t “know” which tokens are visual—they’re all just embeddings. The semantic understanding of visual content emerges from the alignment learned during training.

How Images Become Tokens: The Tokenization Pipeline

Let’s trace a concrete example: processing a 336×336 image in LLaVA-1.5.

Step 1: Patch Embedding

Input: 336×336×3 RGB image
Patch size: 14×14
Number of patches: (336/14)² = 576 patches
Output: [1, 576, 1024] vision features

Step 2: Projector Transformation

Input: [1, 576, 1024] vision features
MLP projection
Output: [1, 576, 4096] visual tokens (matching Vicuna's embedding dim)

Step 3: Token Concatenation

Visual tokens: [1, 576, 4096]
Text tokens for "What's in this image?": [1, 5, 4096]
Combined sequence: [1, 581, 4096]

The LLM processes this 581-token sequence normally, with attention operating across both visual and text tokens.
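The shape arithmetic of the three steps can be traced with random stand-ins for the encoder, projector, and text-embedding outputs:

```python
import torch

visual_tokens = torch.randn(1, 576, 4096)  # projector output
text_tokens = torch.randn(1, 5, 4096)      # embedded "What's in this image?"

# early fusion: concatenate along the sequence dimension
sequence = torch.cat([visual_tokens, text_tokens], dim=1)
# sequence: [1, 581, 4096] -- fed to the LLM like any token sequence
```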

Fusion Strategies: Early vs Cross-Attention

VLMs employ two primary fusion strategies with distinct trade-offs:

Early Fusion (Concatenation)

Used by LLaVA, InternVL, Qwen-VL. Visual tokens are prepended or inserted into the text sequence:

Advantages:

  • Simplicity—no architectural changes to the LLM
  • Full bidirectional attention between all modalities
  • Easy to implement and scale

Disadvantages:

  • Quadratic attention complexity: $O((n_{vision} + n_{text})^2)$
  • Limited context window: 576 visual tokens eat into the 4096 token budget

Cross-Attention Fusion

Used by Flamingo, IDEFICS. Visual tokens are accessed through dedicated cross-attention layers:

import torch
import torch.nn as nn

class GatedCrossAttentionLayer(nn.Module):
    def __init__(self, dim=4096, num_heads=8):
        super().__init__()
        self.cross_attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: the layer starts as an identity mapping
        self.gate_param = nn.Parameter(torch.zeros(1))

    def forward(self, text_features, vision_features):
        # Text queries attend to vision keys/values
        cross_attn_out, _ = self.cross_attention(
            text_features, vision_features, vision_features
        )
        # tanh gating for training stability
        gate = torch.tanh(self.gate_param)
        return text_features + gate * cross_attn_out

Advantages:

  • Attention complexity: $O(n_{text} \times n_{vision})$
  • Can handle unlimited visual tokens
  • Preserves text-only capabilities better

Disadvantages:

  • Requires modifying LLM architecture
  • More complex training
  • Less bidirectional interaction

The Two-Stage Training Pipeline

VLM training follows a carefully designed curriculum:

Stage 1: Vision-Language Alignment

Goal: Connect the pre-trained vision encoder to the pre-trained LLM.

Data: Large-scale image-text pairs (millions to billions).

What’s trained: Only the projector (typically MLP). Vision encoder and LLM remain frozen.

Loss: Next-token prediction on image captions:

$$\mathcal{L} = -\sum_{t=1}^{T}\log P(x_t \mid v, x_{<t})$$

Where $v$ represents visual tokens and $x$ represents text tokens.
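This objective reduces to ordinary next-token cross-entropy in which the visual prefix is excluded from the loss. A minimal sketch, using random tensors in place of real model outputs:

```python
import torch
import torch.nn.functional as F

num_visual, num_text, vocab = 576, 12, 32000
logits = torch.randn(1, num_visual + num_text, vocab)  # stand-in LLM output
caption_ids = torch.randint(0, vocab, (1, num_text))

# -100 is PyTorch's ignore index: visual positions contribute no loss
labels = torch.full((1, num_visual + num_text), -100)
labels[:, num_visual:] = caption_ids

# shift so position t predicts token t+1
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
```

Only the caption tokens produce gradient signal, but that signal flows back through the projector, which is the component being trained.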

Stage 2: Visual Instruction Tuning

Goal: Teach the model to follow multimodal instructions.

Data: Curated instruction-response pairs, often generated by GPT-4:

{
  "image": "example.jpg",
  "conversations": [
    {"role": "user", "content": "<image>\nDescribe this scene."},
    {"role": "assistant", "content": "The image shows..."}
  ]
}

What’s trained: Projector + LLM (often with LoRA). Vision encoder typically remains frozen.

Loss: Standard instruction tuning loss on response tokens.
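A common way to implement "loss on response tokens only" is label masking: every position outside the assistant response gets the ignore label -100. A sketch with made-up token ids:

```python
# Illustrative label construction (token ids here are placeholders, not a real tokenizer's)
image_ids = [-1] * 576               # positions later replaced by visual tokens
prompt_ids = [101, 2054, 1005, 102]  # "Describe this scene." (made-up ids)
response_ids = [1996, 3746, 3065]    # "The image shows..." (made-up ids)

input_ids = image_ids + prompt_ids + response_ids
# mask everything except the assistant response
labels = [-100] * (len(image_ids) + len(prompt_ids)) + response_ids
assert len(input_ids) == len(labels)
```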

This staged approach is crucial—jumping directly to instruction tuning leads to catastrophic forgetting and poor alignment.

Handling Variable Resolutions: A Modern Challenge

Traditional VLMs resize all images to a fixed resolution (e.g., 224×224), losing detail for high-resolution inputs and wasting compute on small images.

LLaVA-NeXT: AnyRes Strategy

LLaVA-NeXT introduces dynamic resolution handling:

  1. Aspect Ratio Matching: Select the best aspect ratio from a predefined grid (e.g., {1:1, 1:2, 2:1, …})
  2. Tile-Based Processing: Split the image into tiles matching the base resolution
  3. Base Thumbnail: Always include a downsampled version for global context

For a 1000×500 image:

  • Selected ratio: 2:1
  • Tiles: Two 336×336 crops
  • Base: One thumbnail of the full image, resized to the 336×336 base resolution
  • Total tokens: 3 × 576 = 1728 visual tokens
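The token arithmetic for this example is simple enough to verify directly (tile-count conventions follow the bullets above):

```python
# AnyRes token count for the 1000x500 example
tokens_per_crop = 576         # one 336x336 crop through CLIP-ViT-L/14

grid_w, grid_h = 2, 1         # selected 2:1 aspect-ratio grid
num_tiles = grid_w * grid_h   # two 336x336 crops
num_crops = num_tiles + 1     # plus the global thumbnail
total_tokens = num_crops * tokens_per_crop  # 1728
```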

Qwen2-VL: Naive Dynamic Resolution

Qwen2-VL takes a more radical approach—no resizing at all:

def naive_dynamic_resolution(image, patch_size=14):
    # No resizing: the patch grid scales with the input resolution
    h, w = image.shape[:2]
    patch_h = h // patch_size
    patch_w = w // patch_size
    # Produces exactly (patch_h * patch_w) visual tokens;
    # a 1080p image yields roughly 11,000 of them
    return encode_patches(image, patch_h, patch_w)

This preserves all image detail but requires sophisticated memory management for extreme resolutions.

Multimodal Positional Encoding: mRoPE

Standard RoPE encodes 1D sequence positions. For images, we need 2D spatial awareness. Qwen2-VL introduces Multimodal RoPE (mRoPE):

$$\text{mRoPE}(t, h, w) = \text{RoPE}_t(t) \oplus \text{RoPE}_h(h) \oplus \text{RoPE}_w(w)$$

Where $t$ is temporal position (for video), and $h, w$ are spatial coordinates. This unified encoding works for both text (where all three components share a single running position index, reducing to standard 1D RoPE) and images, enabling seamless multimodal attention.
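A toy sketch of how such position triples might be laid out (the offset conventions here are an assumption for illustration, not Qwen2-VL's exact implementation):

```python
def mrope_positions(num_text, grid_h, grid_w):
    # Text tokens: all three components share one running index
    positions = [(i, i, i) for i in range(num_text)]
    # Image tokens: one shared temporal index, a 2D grid for height/width
    t = num_text
    for h in range(grid_h):
        for w in range(grid_w):
            positions.append((t, h, w))
    return positions

pos = mrope_positions(num_text=3, grid_h=2, grid_w=2)
# pos[3] == (3, 0, 0): first image patch, top-left of the 2x2 grid
```

Each component of the triple gets its own rotary frequencies, so spatial neighbors in the image end up close in position space even when far apart in the flattened sequence.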

Leading Architectures Compared

Model         Vision Encoder  Projector    LLM Backbone   Key Innovation
LLaVA-1.5     CLIP-ViT-L/14   2-layer MLP  Vicuna-7/13B   Simple, effective baseline
LLaVA-NeXT    CLIP-ViT-L/14   MLP          Vicuna/LLaMA   AnyRes for high-res images
InternVL-1.5  InternViT-6B    MLP          InternLM2-20B  Large-scale vision encoder
Qwen2-VL      SigLIP-ViT      MLP          Qwen2-7B       Naive dynamic resolution, mRoPE
IDEFICS-2    SigLIP-SO400M   Perceiver    Mistral-7B     Cross-attention fusion
NVLM-D        SigLIP          MLP          Qwen2          Decoder-only architecture

The Vision Encoder Scale Paradox:

Interestingly, scaling the vision encoder yields diminishing returns. A 6B parameter vision encoder (InternVL) provides only marginal improvements over a 400M encoder (SigLIP). The bottleneck is often the projector and training data quality, not vision encoder capacity.

The Hallucination Problem

VLMs suffer from a unique failure mode: hallucinating objects not present in the image. Studies show grounding objectives—explicitly training models to connect text spans to image regions—have surprisingly little effect on reducing hallucination.

Root Causes:

  1. Training Bias: Models learn statistical priors (“dogs are usually near grass”) that override visual evidence
  2. Attention Dilution: With 500+ visual tokens, attention spreads thin
  3. Resolution Loss: Fine-grained details disappear in patch embedding

Mitigation Strategies:

  • Object-level Verification: Post-hoc verification using object detection
  • Contrastive Decoding: Penalizing generations that contradict visual evidence
  • Attention Visualization: Identifying when models “look at” relevant regions
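Contrastive decoding, for instance, can be sketched as a logit adjustment: compare next-token logits computed with and without the image, and down-weight tokens the model would predict anyway from its language prior (the weighting scheme here is illustrative, not any specific paper's exact rule):

```python
import torch

def contrastive_logits(logits_with_image, logits_without_image, alpha=1.0):
    # boost tokens whose probability depends on actually seeing the image
    return (1 + alpha) * logits_with_image - alpha * logits_without_image

with_img = torch.tensor([2.0, 1.0, 0.5])      # next-token logits given the image
without_img = torch.tensor([2.0, -1.0, 0.5])  # logits with the image masked out
adjusted = contrastive_logits(with_img, without_img)
# token 1 rises from 1.0 to 3.0: it is supported by visual evidence,
# while tokens the prior alone predicts are left unchanged or suppressed
```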

Performance Benchmarks (2024-2025)

Model                MMMU (val)  MathVista  DocVQA  TextVQA
GPT-4o               69.1        63.8       92.8    -
Claude 3.5 Sonnet    68.3        67.7       -       -
InternVL2-76B        58.2        58.9       94.1    83.6
Qwen2-VL-72B         51.1        61.5       96.5    85.4
LLaVA-OneVision-72B  52.0        60.0       88.7    79.2

The gap between open-source and proprietary models has narrowed dramatically, with InternVL2 approaching GPT-4o on several benchmarks.

The Road Ahead

VLMs continue to evolve rapidly. Key research directions include:

Encoder-Free Architectures: Models like Fuyu and Mono-InternVL eliminate separate vision encoders entirely, processing raw pixels through the LLM. This removes alignment overhead but requires significantly more training.

Native Multimodal Pre-training: Instead of connecting pre-trained components, models like Gemini and GPT-4V likely train on multimodal data from scratch, enabling deeper integration.

Video Understanding: Extending temporal modeling beyond frame-by-frame processing to capture motion and event sequences natively.

Efficient Inference: Reducing visual token counts through learned compression (e.g., LLaVA-UHD’s progressive compression) without sacrificing quality.

The architecture behind VLMs represents a pragmatic triumph: we didn’t need to reinvent language models to understand images. We needed a clever translation layer that speaks both languages fluently. As these systems scale, the distinction between text and vision in AI may ultimately dissolve—not because we built a unified architecture from scratch, but because we built the right bridge.