The LLM serving landscape has fundamentally shifted. What was once a simple choice between HuggingFace Transformers and early optimization frameworks has evolved into a sophisticated ecosystem where three engines dominate: SGLang, vLLM, and LMDeploy. The throughput gap between them—up to 29%—translates to tens of thousands of dollars in monthly GPU costs at production scale.
This isn’t just about speed. Each engine embodies a fundamentally different philosophy about how to solve the same problems: memory fragmentation, computation redundancy, and the tension between latency and throughput. Understanding these architectures is essential for making the right deployment decision.
The Memory Problem That Started Everything
Before diving into the engines, we need to understand what they’re solving. Traditional LLM inference allocates contiguous memory blocks for each sequence’s KV cache. This seems straightforward until you realize the consequences:
- Fragmentation: When sequences of varying lengths complete, they free scattered memory chunks. New sequences can’t use this space efficiently.
- Overallocation: You must reserve memory for maximum possible sequence length, wasting 60-80% of GPU memory on average.
- Throughput ceiling: Memory waste limits concurrent requests, capping throughput well below hardware capability.
The numbers are stark. A 7B parameter model with 128K context window requires reserving approximately 16GB per sequence just for KV cache. With traditional allocation, you might fit 2-3 concurrent requests on an A100 80GB GPU. The same hardware with proper memory management handles 20+ requests.
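The arithmetic behind these figures can be sketched directly. This is a back-of-envelope sizing, assuming a 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, FP16 weights) — the exact config varies by model:

```python
# Assumed 7B-class config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
N_LAYERS, N_KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    # Keys and values (factor 2) for every layer and KV head at each position.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * DTYPE_BYTES * seq_len

per_seq = kv_cache_bytes(128 * 1024)
print(f"KV cache per 128K sequence: {per_seq / 2**30:.0f} GiB")  # 16 GiB
# With ~48 GiB left free on an A100 80GB after weights, that's 3 sequences:
print(48 * 2**30 // per_seq)  # 3
```

Per token this works out to 128 KB of cache, which is why the per-sequence cost explodes with context length.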
This is the problem each engine approaches differently.
vLLM: The Production Standard with PagedAttention
vLLM emerged from UC Berkeley’s Sky Computing Lab with a simple but revolutionary idea: treat the KV cache like operating system virtual memory.
PagedAttention Architecture
Instead of allocating one large contiguous block per sequence, vLLM splits the KV cache into fixed-size pages (typically 16 tokens each). When a sequence needs more cache, vLLM assigns the next available page—anywhere in GPU memory.
```python
# Traditional approach: contiguous allocation per sequence (pseudocode)
sequence_cache = allocate_contiguous(max_seq_len * kv_cache_size)
# Waste: (max_seq_len - actual_len) * kv_cache_size

# PagedAttention: on-demand paging in fixed-size pages
pages = []
for token in generated_tokens:
    if needs_new_page():
        pages.append(get_next_free_page())
```
The memory utilization improvement is dramatic: from ~30% to over 95%. This translates to 2-4x more concurrent requests on identical hardware.
Key Performance Characteristics
On A100 80GB GPUs running Llama 2 7B Chat:
| Metric | Traditional | vLLM PagedAttention |
|---|---|---|
| Memory Utilization | 30-40% | 95%+ |
| Concurrent Requests | 2-3 | 15-20 |
| Throughput | Baseline | 2-4x improvement |
The continuous batching mechanism—allowing new requests to join running batches immediately rather than waiting for batch windows—further reduces average latency under load.
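The mechanism can be illustrated with a toy scheduler. This is a simplified sketch, not vLLM's actual implementation: each request is a `(name, tokens_to_generate)` pair, and queued requests join the running batch the moment a slot frees up rather than waiting for the batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """Toy continuous batching: admit waiting requests into the running
    batch as soon as finished sequences free a slot. Returns the batch
    composition at each decode step."""
    queue = deque(requests)
    running, trace = [], []
    while queue or running:
        while queue and len(running) < max_batch:      # admit immediately
            running.append(list(queue.popleft()))
        trace.append([name for name, _ in running])
        for req in running:                            # one decode step each
            req[1] -= 1
        running = [r for r in running if r[1] > 0]     # finished slots free up
    return trace

trace = continuous_batching([("A", 2), ("B", 3), ("C", 1), ("D", 2), ("E", 1)])
print(trace)  # [['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'D', 'E']]
```

With static batching, D and E would have waited for the entire first batch to finish; here they slot in as soon as C and A complete.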
When vLLM Excels
vLLM’s architecture optimizes for memory-constrained environments and high-concurrency batch processing. The predictable page allocation simplifies capacity planning. The mature ecosystem supports 100+ model architectures out of the box.
However, raw throughput isn’t vLLM’s strength. On H100 GPUs with Llama 3.1 8B, vLLM achieves approximately 12,500 tokens per second—respectable but trailing competitors by significant margins.
SGLang: RadixAttention and the Prefix Reuse Revolution
SGLang, developed by LMSYS (the team behind Chatbot Arena), asked a different question: why compute the same prefix repeatedly across different requests?
The Prefix Reuse Opportunity
Consider a customer support chatbot. Every request shares:
- System prompt (~500 tokens)
- Few-shot examples (~2,000 tokens)
- Conversation history (variable, potentially 10K+ tokens)
Traditional engines—including vLLM with basic PagedAttention—recompute the entire prefix for each new request. SGLang’s RadixAttention eliminates this redundancy.
RadixAttention Implementation
RadixAttention uses a radix tree (also called a compact trie) to store shared prefixes and their associated KV cache tensors:
Radix Tree Structure:

```
                    [root]
                   /      \
        "You are a"        "Translate"
             |                  |
  "helpful assistant"      "this text"
             |                  |
      (KV cache ptr)      (KV cache ptr)
```
When a new request arrives:
- Prefix matching: Traverse the radix tree to find the longest matching prefix
- Cache reuse: Directly use cached KV tensors without recomputation
- Incremental computation: Only process the new tokens beyond the matched prefix
- Tree update: Insert new unique portions into the tree for future reuse
The LRU eviction policy removes least-recently-used leaves when memory pressure requires it, while a cache-aware scheduler maximizes hit rates.
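The lookup-and-insert logic can be sketched in a few lines. This is an illustrative token-level trie, not SGLang's actual implementation — SGLang compresses token runs into radix-tree edges and attaches real KV tensor handles; eviction is omitted here:

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token -> RadixNode (uncompressed for brevity)
        self.kv_handle = None  # stands in for a pointer to cached KV tensors

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Walk the tree; return the length of the longest cached prefix."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens, kv_handle):
        """Record a sequence's tokens so later requests can reuse them."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.kv_handle = kv_handle

cache = PrefixCache()
cache.insert(["You", "are", "a", "helpful", "assistant"], kv_handle="kv#1")
# A new request sharing the first three tokens recomputes only the tail:
hit = cache.match_prefix(["You", "are", "a", "translator"])
print(hit)  # 3
```

Only `len(tokens) - hit` new tokens need prefill computation; the matched prefix's KV tensors are reused directly.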
Performance Impact
The throughput gains depend entirely on prefix overlap:
| Workload Type | Cache Hit Rate | Performance Gain |
|---|---|---|
| Multi-turn chat (shared history) | 75-90% | 10-20% faster |
| Few-shot learning (shared examples) | 85-95% | Up to 5x faster |
| Code analysis (common patterns) | 60-80% | 3-4x faster |
| Single requests (no sharing) | 0% | Equivalent to baseline |
On H100 GPUs with Llama 3.1 8B, SGLang delivers 16,200 tokens per second—29% higher than vLLM. Time to first token drops dramatically when cache hits occur (79ms vs 102ms on vLLM without cache).
Zero-Overhead Scheduling
SGLang v0.4 introduced a critical optimization: CPU scheduling overhead reduced from 15-25% to under 2% of total inference time. The scheduler runs on a dedicated CPU thread without blocking GPU computation, enabling true zero-overhead batch management.
When SGLang Excels
- Multi-turn conversations: Chatbots, dialogue systems, planning agents
- Agentic workflows: Repeated template prefixes across reasoning steps
- RAG applications: Shared document context across multiple queries
- Maximum throughput: When every percentage point matters
SGLang now powers over 400,000 GPUs globally, including xAI’s Grok 3 and Microsoft Azure’s DeepSeek R1 deployment on AMD hardware.
LMDeploy: C++ Performance for Quantized Models
LMDeploy takes a fundamentally different approach. While vLLM and SGLang are Python-first with native kernels for hot paths, LMDeploy’s TurboMind engine is pure C++.
Why C++ Matters
Python’s Global Interpreter Lock (GIL) and interpreter overhead create measurable latency penalties in tight inference loops. For applications demanding sub-100ms response times, these microseconds accumulate.
TurboMind eliminates interpreter overhead entirely:
- Persistent batching in pure C++
- Blocked KV cache management
- Optimized CUDA kernels for attention
- Native weight-only quantization
Quantization Leadership
LMDeploy excels specifically with quantized models:
| Quantization | LMDeploy Speedup vs FP16 |
|---|---|
| Int4 | 2.4x faster |
| Int8 | 1.6x faster |
| FP8 | 1.8x faster |
The Int4 performance is particularly notable: a 70B model fits on a single A100 80GB GPU with acceptable quality degradation, something impossible at FP16 precision.
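The weight-memory arithmetic makes the point concrete (weights only; KV cache and activations come on top):

```python
params = 70e9  # 70B-parameter model
weights_gib = {name: params * bytes_per_param / 2**30
               for name, bytes_per_param in
               [("FP16", 2), ("Int8", 1), ("Int4", 0.5)]}
for name, gib in weights_gib.items():
    print(f"{name}: {gib:.0f} GiB of weights")
# FP16: ~130 GiB (needs multiple GPUs); Int4: ~33 GiB, fitting a single
# A100 80GB with headroom left for KV cache and activations.
```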
Throughput Benchmarks
On A100 80GB with Llama 3 70B (Int4 quantization, 100 concurrent users):
| Metric | LMDeploy | vLLM | SGLang |
|---|---|---|---|
| Throughput | 700 tok/s | ~500 tok/s | ~600 tok/s |
| TTFT | Lowest | Higher | Moderate |
| Memory Efficiency | Excellent | Good | Good |
LMDeploy maintains the lowest time-to-first-token across all concurrency levels, critical for latency-sensitive applications.
When LMDeploy Excels
- Quantized model serving: Best-in-class Int4/Int8 performance
- Memory-constrained deployments: Fit larger models on smaller GPUs
- Latency-sensitive applications: Minimum TTFT requirement
- NVIDIA-specific environments: Optimized CUDA implementation
Comparative Architecture Deep Dive
Understanding the architectural differences explains when each engine wins.
Memory Management Strategies
| Aspect | vLLM | SGLang | LMDeploy |
|---|---|---|---|
| Core Mechanism | PagedAttention | RadixAttention | Blocked Cache |
| Allocation | Fixed pages (16 tokens) | Dynamic tree nodes | Custom blocks |
| Reuse Strategy | Per-sequence | Cross-sequence prefix | Per-sequence |
| Fragmentation | Minimal | Minimal | Minimal |
| Peak Efficiency | Memory-bound | Compute-bound | Both |
Concurrency Handling
```mermaid
graph TD
    A[Incoming Requests] --> B{Batch Manager}
    B --> C[Continuous Batching]
    C --> D{Memory Pressure?}
    D -->|No| E[Execute Batch]
    D -->|Yes| F{Eviction Policy}
    F -->|vLLM| G[Release Oldest Pages]
    F -->|SGLang| H[LRU Tree Pruning]
    F -->|LMDeploy| I[Aggressive Quantization]
```
vLLM handles memory pressure through page eviction within sequences. SGLang may evict cached prefixes from the radix tree under pressure. LMDeploy proactively uses quantization to avoid pressure.
Latest Breakthroughs: Beyond Core Architectures
The field evolves rapidly. Three recent developments merit attention.
Attention Matching: 50x KV Cache Compression
MIT researchers (February 2026) introduced Attention Matching, achieving up to 50x KV cache compression in seconds without accuracy loss.
The technique constructs compact keys and values that reproduce attention outputs at a per-KV-head level. Unlike previous approaches requiring hours of GPU training (like Cartridges), Attention Matching:
- Achieves 50x compression in seconds
- Maintains near-identical accuracy
- Decomposes into simple subproblems with closed-form solutions
- Pushes the Pareto frontier of compaction time vs quality
For production systems, this means the memory tax of long-context inference drops dramatically. A 128K context that previously required 16GB KV cache could fit in 320MB.
FastKV: Decoupling Prefill and Decoding
FastKV addresses a fundamental limitation of prior KV cache compression methods: the coupling between prefill compute reduction and decoding KV budget.
The key insight: early layers require full-context processing to capture diverse token dependencies, but later layers converge on small, stable subsets of important tokens.
FastKV introduces:
- Token-Selective Propagation (TSP): A dedicated layer forwards only salient tokens to subsequent layers
- Independent KV retention: Each layer independently selects which KV entries to cache, decoupled from prefill decisions
Results on LLaMA-3.1-8B-Instruct:
- 1.82x faster prefill compared to full-context baseline
- 2.87x faster decoding
- <1% accuracy drop on LongBench benchmark
This decoupling enables flexible optimization: conservative prefill reduction for accuracy, aggressive KV cache compression for decoding efficiency.
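The token-selection idea can be sketched with a toy scorer. This is purely illustrative — FastKV's actual TSP layer and scoring differ — but it shows the principle of ranking key tokens by the attention mass they receive and forwarding only the top fraction:

```python
def select_salient_tokens(attn, keep_ratio):
    """Toy token selection: sum each key token's attention weight over all
    heads and query positions, keep the top fraction, and return their
    indices in original sequence order."""
    n_keys = len(attn[0][0])
    scores = [sum(attn[h][q][k] for h in range(len(attn))
                                for q in range(len(attn[0])))
              for k in range(n_keys)]
    keep = max(1, int(n_keys * keep_ratio))
    top = sorted(range(n_keys), key=lambda k: scores[k], reverse=True)[:keep]
    return sorted(top)  # restore sequence order for the forwarded tokens

# 2 heads, 3 queries, 8 key tokens; token 5 receives most attention mass.
attn = [[[0.05, 0.05, 0.1, 0.1, 0.1, 0.5, 0.05, 0.05]] * 3] * 2
print(select_salient_tokens(attn, keep_ratio=0.25))  # [2, 5]
```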
FreeKV: Retrieval-Based Compression
FreeKV takes a retrieval-based approach to KV cache compression, aiming to avoid the accuracy loss of dropping-based methods while reducing the overhead of earlier retrieval-based designs.
The core innovation: efficient KV cache retrieval that maintains high compression ratios while preserving critical context. Early results show competitive performance with significantly reduced memory footprint.
Performance Summary: H100 Benchmarks
Testing methodology: Llama 3.1 8B Instruct, H100 80GB GPU, standardized configuration.
| Metric | vLLM | SGLang | LMDeploy |
|---|---|---|---|
| Total Throughput | 12,553 tok/s | 16,215 tok/s | 16,100 tok/s |
| Output Throughput | 413 tok/s | 894 tok/s | ~800 tok/s |
| TTFT | 102.65 ms | 79.42 ms | Lowest |
| ITL | 7.14 ms | 6.03 ms | ~6 ms |
| Concurrency Stability | Good | Excellent | Excellent |
The 29% throughput gap between SGLang/LMDeploy and vLLM represents substantial cost differences. At one million daily requests with average 500-token responses:
- vLLM: ~12 hours to process
- SGLang: ~9 hours to process
- Monthly GPU savings: ~$15,000 at current cloud rates
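The time estimates follow directly from the throughput figures. This is a rough single-instance calculation that ignores batching dynamics and prefill cost, so treat it as an order-of-magnitude check on the ~12 h and ~9 h figures above:

```python
# One million daily requests at 500 output tokens each.
daily_tokens = 1_000_000 * 500
hours = {name: daily_tokens / tok_per_s / 3600
         for name, tok_per_s in [("vLLM", 12_553), ("SGLang", 16_215)]}
for name, h in hours.items():
    print(f"{name}: {h:.1f} h/day")  # vLLM: 11.1 h/day, SGLang: 8.6 h/day
```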
Decision Framework
Choose vLLM When:
- First production deployment: Mature ecosystem, extensive documentation
- Broad model compatibility: Supports any HuggingFace Transformers model
- Hardware flexibility: TPU, AWS Trainium, Intel Gaudi, AMD ROCm
- Memory-constrained environments: PagedAttention excels here
- Team prioritizes stability: Largest community, most battle-tested
Choose SGLang When:
- Multi-turn conversations: Chatbots, dialogue systems, AI agents
- Maximum throughput is critical: 29% advantage compounds at scale
- DeepSeek models: MLA-optimized kernels for best performance
- Structured output generation: Native JSON/XML support
- Prefix reuse is significant: RAG, few-shot learning, code analysis
Choose LMDeploy When:
- Quantized model serving: Best Int4/Int8/FP8 performance
- Memory-constrained hardware: Fit larger models via compression
- Latency-sensitive applications: Lowest TTFT across engines
- NVIDIA-specific deployment: Optimized CUDA implementation
- Real-time applications: Gaming, AR/VR, robotics
Decision Matrix
| Dimension | vLLM | SGLang | LMDeploy |
|---|---|---|---|
| Raw Throughput | Good | Best | Excellent |
| Multi-turn Perf | Good | Best | Good |
| Memory Efficiency | Best | Good | Excellent |
| Model Ecosystem | Best | Good | Moderate |
| Hardware Support | Best | Moderate | Moderate |
| Quantization | Good | Good | Best |
| Setup Simplicity | Best | Good | Moderate |
| Community Size | Largest | Growing | Moderate |
Production Deployment Considerations
Monitoring and Observability
All three engines expose Prometheus metrics, but depth varies:
- vLLM: Most comprehensive metrics, well-documented Grafana dashboards
- SGLang: Core metrics available, community dashboards emerging
- LMDeploy: Basic metrics, requires custom instrumentation for deep monitoring
Key metrics to track:
- KV cache utilization percentage
- Request queue depth
- Time-to-first-token distribution
- Tokens-per-second throughput
- GPU memory fragmentation
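Since all three engines expose a Prometheus-format `/metrics` endpoint, a small parser covers ad-hoc checks. The metric name and sample payload below are illustrative — exact names vary by engine and version:

```python
def parse_gauge(metrics_text, name):
    """Return the first sample value of a metric from Prometheus text
    exposition format, or None if the metric is absent."""
    for line in metrics_text.splitlines():
        if line.startswith(name):            # skip '# HELP' / '# TYPE' lines
            return float(line.rsplit(" ", 1)[-1])
    return None

# Illustrative scrape payload from an engine's /metrics endpoint.
sample = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage fraction.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama"} 0.87
"""
print(parse_gauge(sample, "vllm:gpu_cache_usage_perc"))  # 0.87
```

In production you would scrape these with Prometheus directly; a parser like this is useful for smoke tests and alert debugging.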
Kubernetes Integration
```yaml
# Example vLLM deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          args:
            - --model
            - meta-llama/Llama-3.1-8B-Instruct
            - --gpu-memory-utilization
            - "0.9"
```
SGLang and LMDeploy follow similar patterns with engine-specific arguments.
Multi-Engine Strategy
Sophisticated deployments use multiple engines for different workloads:
- SGLang for chat: Multi-turn conversations benefit from RadixAttention
- vLLM for batch: High-throughput content generation with broad model support
- LMDeploy for edge: Quantized models on memory-constrained hardware
Load balancers route requests based on workload characteristics:
- Session ID → SGLang (prefix reuse opportunity)
- Batch job → vLLM (memory efficiency)
- Edge device → LMDeploy (quantization)
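The routing policy above amounts to a simple dispatch function. This is a minimal sketch with hypothetical field and pool names; a real load balancer would express the same rules in its own config language:

```python
def route_request(req: dict) -> str:
    """Toy workload router implementing the policy above."""
    if req.get("session_id"):      # multi-turn: prefix reuse favors SGLang
        return "sglang-pool"
    if req.get("batch_job"):       # offline batch: vLLM's memory efficiency
        return "vllm-pool"
    if req.get("edge_device"):     # constrained hardware: quantized LMDeploy
        return "lmdeploy-pool"
    return "vllm-pool"             # safe default for unclassified traffic

print(route_request({"session_id": "abc123"}))  # sglang-pool
print(route_request({"batch_job": True}))       # vllm-pool
```

Because all three engines speak the OpenAI-compatible API, the router can forward the same request body to any pool unchanged.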
What’s Next
The inference engine landscape continues evolving. Key trends to watch:
- Unified APIs: OpenAI-compatible interfaces are now standard, reducing switching costs
- Hardware diversity: AMD MI300X, Intel Gaudi, AWS Trainium support expanding
- KV cache compression: Attention Matching and FastKV show 50x+ compression is achievable
- Speculative decoding: EAGLE-3 and Medusa achieving 2-3x latency improvements
- Disaggregated inference: Separating prefill and decode stages for optimal resource utilization
The Bottom Line
The 29% throughput difference between engines isn’t just a benchmark number—it’s real money at production scale. But raw throughput tells only part of the story.
vLLM remains the safest choice for teams prioritizing ecosystem maturity, hardware flexibility, and ease of deployment. SGLang dominates for multi-turn conversations and maximum throughput. LMDeploy wins for quantized models and latency-sensitive applications.
The right choice depends on your specific workload patterns. Benchmark all three with your actual prompts, models, and hardware. Published benchmarks provide guidance, but your traffic patterns determine which optimizations matter.
Start with vLLM for initial deployment, then optimize. The migration cost between engines—with their OpenAI-compatible APIs—is minimal compared to the potential performance gains from matching engine to workload.