The LLM serving landscape has fundamentally shifted. What was once a simple choice between HuggingFace Transformers and early optimization frameworks has evolved into a sophisticated ecosystem where three engines dominate: SGLang, vLLM, and LMDeploy. The throughput gap between them—up to 29%—translates to tens of thousands of dollars in monthly GPU costs at production scale.
This isn’t just about speed. Each engine embodies a fundamentally different philosophy about how to solve the same problems: memory fragmentation, computation redundancy, and the tension between latency and throughput. Understanding these architectures is essential for making the right deployment decision.
The Memory Problem That Started Everything
Before diving into the engines, we need to understand what they’re solving. Traditional LLM inference allocates contiguous memory blocks for each sequence’s KV cache. This seems straightforward until you realize the consequences:
- Fragmentation: When sequences of varying lengths complete, they free scattered memory chunks. New sequences can’t use this space efficiently.
- Overallocation: You must reserve memory for maximum possible sequence length, wasting 60-80% of GPU memory on average.
- Throughput ceiling: Memory waste limits concurrent requests, capping throughput well below hardware capability.
The numbers are stark. A 7B parameter model with 128K context window requires reserving approximately 16GB per sequence just for KV cache. With traditional allocation, you might fit 2-3 concurrent requests on an A100 80GB GPU. The same hardware with proper memory management handles 20+ requests.
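The arithmetic behind these figures can be sketched directly. This is a back-of-envelope sizing, assuming a 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, FP16 weights) — the exact config varies by model:

```python
# Assumed 7B-class config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
N_LAYERS, N_KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 8, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    # Keys and values (factor 2) for every layer and KV head at each position.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * DTYPE_BYTES * seq_len

per_seq = kv_cache_bytes(128 * 1024)
print(f"KV cache per 128K sequence: {per_seq / 2**30:.0f} GiB")  # 16 GiB
# With ~48 GiB left free on an A100 80GB after weights, that's 3 sequences:
print(48 * 2**30 // per_seq)  # 3
```

Per token this works out to 128 KB of cache, which is why the per-sequence cost explodes with context length.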
This is the problem each engine approaches differently.
vLLM: The Production Standard with PagedAttention
vLLM emerged from UC Berkeley’s Sky Computing Lab with a simple but revolutionary idea: treat the KV cache like operating system virtual memory.
PagedAttention Architecture
Instead of allocating one large contiguous block per sequence, vLLM splits the KV cache into fixed-size pages (typically 16 tokens each). When a sequence needs more cache, vLLM assigns the next available page—anywhere in GPU memory.
```python
# Traditional approach: contiguous allocation per sequence (pseudocode)
sequence_cache = allocate_contiguous(max_seq_len * kv_cache_size)
# Waste: (max_seq_len - actual_len) * kv_cache_size

# PagedAttention: on-demand paging in fixed-size pages
pages = []
for token in generated_tokens:
    if needs_new_page():
        pages.append(get_next_free_page())
```
The memory utilization improvement is dramatic: from ~30% to over 95%. This translates to 2-4x more concurrent requests on identical hardware.
Key Performance Characteristics
On A100 80GB GPUs running Llama 2 7B Chat:
| Metric | Traditional | vLLM PagedAttention |
|---|---|---|
| Memory Utilization | 30-40% | 95%+ |
| Concurrent Requests | 2-3 | 15-20 |
| Throughput | Baseline | 2-4x improvement |
The continuous batching mechanism—allowing new requests to join running batches immediately rather than waiting for batch windows—further reduces average latency under load.
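The mechanism can be illustrated with a toy scheduler. This is a simplified sketch, not vLLM's actual implementation: each request is a `(name, tokens_to_generate)` pair, and queued requests join the running batch the moment a slot frees up rather than waiting for the batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=3):
    """Toy continuous batching: admit waiting requests into the running
    batch as soon as finished sequences free a slot. Returns the batch
    composition at each decode step."""
    queue = deque(requests)
    running, trace = [], []
    while queue or running:
        while queue and len(running) < max_batch:      # admit immediately
            running.append(list(queue.popleft()))
        trace.append([name for name, _ in running])
        for req in running:                            # one decode step each
            req[1] -= 1
        running = [r for r in running if r[1] > 0]     # finished slots free up
    return trace

trace = continuous_batching([("A", 2), ("B", 3), ("C", 1), ("D", 2), ("E", 1)])
print(trace)  # [['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'D', 'E']]
```

With static batching, D and E would have waited for the entire first batch to finish; here they slot in as soon as C and A complete.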
When vLLM Excels
vLLM’s architecture optimizes for memory-constrained environments and high-concurrency batch processing. The predictable page allocation simplifies capacity planning. The mature ecosystem supports 100+ model architectures out of the box.
However, raw throughput isn’t vLLM’s strength. On H100 GPUs with Llama 3.1 8B, vLLM achieves approximately 12,500 tokens per second—respectable but trailing competitors by significant margins.
SGLang: RadixAttention and the Prefix Reuse Revolution
SGLang, developed by LMSYS (the team behind Chatbot Arena), asked a different question: why compute the same prefix repeatedly across different requests?
The Prefix Reuse Opportunity
Consider a customer support chatbot. Every request shares:
- System prompt (~500 tokens)
- Few-shot examples (~2,000 tokens)
- Conversation history (variable, potentially 10K+ tokens)
Traditional engines—including vLLM with basic PagedAttention—recompute the entire prefix for each new request. SGLang’s RadixAttention eliminates this redundancy.
RadixAttention Implementation
RadixAttention uses a radix tree (also called a compact trie) to store shared prefixes and their associated KV cache tensors:
Radix Tree Structure:

```
                    [root]
                   /      \
        "You are a"        "Translate"
             |                  |
  "helpful assistant"      "this text"
             |                  |
      (KV cache ptr)      (KV cache ptr)
```
When a new request arrives:
- Prefix matching: Traverse the radix tree to find the longest matching prefix
- Cache reuse: Directly use cached KV tensors without recomputation
- Incremental computation: Only process the new tokens beyond the matched prefix
- Tree update: Insert new unique portions into the tree for future reuse
The LRU eviction policy removes least-recently-used leaves when memory pressure requires it, while a cache-aware scheduler maximizes hit rates.
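The lookup-and-insert logic can be sketched in a few lines. This is an illustrative token-level trie, not SGLang's actual implementation — SGLang compresses token runs into radix-tree edges and attaches real KV tensor handles; eviction is omitted here:

```python
class RadixNode:
    def __init__(self):
        self.children = {}     # token -> RadixNode (uncompressed for brevity)
        self.kv_handle = None  # stands in for a pointer to cached KV tensors

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Walk the tree; return the length of the longest cached prefix."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens, kv_handle):
        """Record a sequence's tokens so later requests can reuse them."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.kv_handle = kv_handle

cache = PrefixCache()
cache.insert(["You", "are", "a", "helpful", "assistant"], kv_handle="kv#1")
# A new request sharing the first three tokens recomputes only the tail:
hit = cache.match_prefix(["You", "are", "a", "translator"])
print(hit)  # 3
```

Only `len(tokens) - hit` new tokens need prefill computation; the matched prefix's KV tensors are reused directly.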
Performance Impact
The throughput gains depend entirely on prefix overlap:
| Workload Type | Cache Hit Rate | Performance Gain |
|---|---|---|
| Multi-turn chat (shared history) | 75-90% | 10-20% faster |
| Few-shot learning (shared examples) | 85-95% | Up to 5x faster |
| Code analysis (common patterns) | 60-80% | 3-4x faster |
| Single requests (no sharing) | 0% | Equivalent to baseline |
On H100 GPUs with Llama 3.1 8B, SGLang delivers 16,200 tokens per second—29% higher than vLLM. Time to first token drops dramatically when cache hits occur (79ms vs 102ms on vLLM without cache).
Zero-Overhead Scheduling
SGLang v0.4 introduced a critical optimization: CPU scheduling overhead reduced from 15-25% to under 2% of total inference time. The scheduler runs on a dedicated CPU thread without blocking GPU computation, enabling true zero-overhead batch management.
When SGLang Excels
- Multi-turn conversations: Chatbots, dialogue systems, planning agents
- Agentic workflows: Repeated template prefixes across reasoning steps
- RAG applications: Shared document context across multiple queries
- Maximum throughput: When every percentage point matters
SGLang now powers over 400,000 GPUs globally, including xAI’s Grok 3 and Microsoft Azure’s DeepSeek R1 deployment on AMD hardware.
LMDeploy: C++ Performance for Quantized Models
LMDeploy takes a fundamentally different approach. While vLLM and SGLang are Python-first with native kernels for hot paths, LMDeploy’s TurboMind engine is pure C++.
Why C++ Matters
Python’s Global Interpreter Lock (GIL) and interpreter overhead create measurable latency penalties in tight inference loops. For applications demanding sub-100ms response times, these microseconds accumulate.
TurboMind eliminates interpreter overhead entirely:
- Persistent batching in pure C++
- Blocked KV cache management
- Optimized CUDA kernels for attention
- Native weight-only quantization
Quantization Leadership
LMDeploy excels specifically with quantized models:
| Quantization | LMDeploy Speedup vs FP16 |
|---|---|
| Int4 | 2.4x faster |
| Int8 | 1.6x faster |
| FP8 | 1.8x faster |
The Int4 performance is particularly notable: a 70B model fits on a single A100 80GB GPU with acceptable quality degradation, something impossible at FP16 precision.
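The weight-memory arithmetic makes the point concrete (weights only; KV cache and activations come on top):

```python
params = 70e9  # 70B-parameter model
weights_gib = {name: params * bytes_per_param / 2**30
               for name, bytes_per_param in
               [("FP16", 2), ("Int8", 1), ("Int4", 0.5)]}
for name, gib in weights_gib.items():
    print(f"{name}: {gib:.0f} GiB of weights")
# FP16: ~130 GiB (needs multiple GPUs); Int4: ~33 GiB, fitting a single
# A100 80GB with headroom left for KV cache and activations.
```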
Throughput Benchmarks
On A100 80GB with Llama 3 70B (Int4 quantization, 100 concurrent users):
| Metric | LMDeploy | vLLM | SGLang |
|---|---|---|---|
| Throughput | 700 tok/s | ~500 tok/s | ~600 tok/s |
| TTFT | Lowest | Higher | Moderate |
| Memory Efficiency | Excellent | Good | Good |
LMDeploy maintains the lowest time-to-first-token across all concurrency levels, critical for latency-sensitive applications.
When LMDeploy Excels
- Quantized model serving: Best-in-class Int4/Int8 performance
- Memory-constrained deployments: Fit larger models on smaller GPUs
- Latency-sensitive applications: Minimum TTFT requirement
- NVIDIA-specific environments: Optimized CUDA implementation
Comparative Architecture Deep Dive
Understanding the architectural differences explains when each engine wins.
Memory Management Strategies
| Aspect | vLLM | SGLang | LMDeploy |
|---|---|---|---|
| Core Mechanism | PagedAttention | RadixAttention | Blocked Cache |
| Allocation | Fixed pages (16 tokens) | Dynamic tree nodes | Custom blocks |
| Reuse Strategy | Per-sequence | Cross-sequence prefix | Per-sequence |
| Fragmentation | Minimal | Minimal | Minimal |
| Peak Efficiency | Memory-bound | Compute-bound | Both |
Concurrency Handling
```mermaid
graph TD
    A[Incoming Requests] --> B{Batch Manager}
    B --> C[Continuous Batching]
    C --> D{Memory Pressure?}
    D -->|No| E[Execute Batch]
    D -->|Yes| F{Eviction Policy}
    F -->|vLLM| G[Release Oldest Pages]
    F -->|SGLang| H[LRU Tree Pruning]
    F -->|LMDeploy| I[Aggressive Quantization]
```
vLLM handles memory pressure through page eviction within sequences. SGLang may evict cached prefixes from the radix tree under pressure. LMDeploy proactively uses quantization to avoid pressure.
Latest Breakthroughs: Beyond Core Architectures
The field evolves rapidly. Three recent developments merit attention.
Attention Matching: 50x KV Cache Compression
MIT researchers (February 2026) introduced Attention Matching, achieving up to 50x KV cache compression in seconds without accuracy loss.
The technique constructs compact keys and values that reproduce attention outputs at a per-KV-head level. Unlike previous approaches requiring hours of GPU training (like Cartridges), Attention Matching:
- Achieves 50x compression in seconds
- Maintains near-identical accuracy
- Decomposes into simple subproblems with closed-form solutions
- Pushes the Pareto frontier of compaction time vs quality
For production systems, this means the memory tax of long-context inference drops dramatically. A 128K context that previously required 16GB KV cache could fit in 320MB.
FastKV: Decoupling Prefill and Decoding
FastKV addresses a fundamental limitation of prior KV cache compression methods: the coupling between prefill compute reduction and decoding KV budget.
The key insight: early layers require full-context processing to capture diverse token dependencies, but later layers converge on small, stable subsets of important tokens.
FastKV introduces:
- Token-Selective Propagation (TSP): A dedicated layer forwards only salient tokens to subsequent layers
- Independent KV retention: Each layer independently selects which KV entries to cache, decoupled from prefill decisions
Results on LLaMA-3.1-8B-Instruct:
- 1.82x faster prefill compared to full-context baseline
- 2.87x faster decoding
- <1% accuracy drop on LongBench benchmark
This decoupling enables flexible optimization: conservative prefill reduction for accuracy, aggressive KV cache compression for decoding efficiency.
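The token-selection idea can be sketched with a toy scorer. This is purely illustrative — FastKV's actual TSP layer and scoring differ — but it shows the principle of ranking key tokens by the attention mass they receive and forwarding only the top fraction:

```python
def select_salient_tokens(attn, keep_ratio):
    """Toy token selection: sum each key token's attention weight over all
    heads and query positions, keep the top fraction, and return their
    indices in original sequence order."""
    n_keys = len(attn[0][0])
    scores = [sum(attn[h][q][k] for h in range(len(attn))
                                for q in range(len(attn[0])))
              for k in range(n_keys)]
    keep = max(1, int(n_keys * keep_ratio))
    top = sorted(range(n_keys), key=lambda k: scores[k], reverse=True)[:keep]
    return sorted(top)  # restore sequence order for the forwarded tokens

# 2 heads, 3 queries, 8 key tokens; token 5 receives most attention mass.
attn = [[[0.05, 0.05, 0.1, 0.1, 0.1, 0.5, 0.05, 0.05]] * 3] * 2
print(select_salient_tokens(attn, keep_ratio=0.25))  # [2, 5]
```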
FreeKV: Retrieval-Based Compression
FreeKV takes a retrieval-based approach to KV cache compression, aiming to avoid the accuracy loss of dropping-based methods while reducing the overhead of earlier retrieval-based designs.
The core innovation: efficient KV cache retrieval that maintains high compression ratios while preserving critical context. Early results show competitive performance with significantly reduced memory footprint.
Performance Summary: H100 Benchmarks
Testing methodology: Llama 3.1 8B Instruct, H100 80GB GPU, standardized configuration.
| Metric | vLLM | SGLang | LMDeploy |
|---|---|---|---|
| Total Throughput | 12,553 tok/s | 16,215 tok/s | 16,100 tok/s |
| Output Throughput | 413 tok/s | 894 tok/s | ~800 tok/s |
| TTFT | 102.65 ms | 79.42 ms | Lowest |
| ITL | 7.14 ms | 6.03 ms | ~6 ms |
| Concurrency Stability | Good | Excellent | Excellent |
The 29% throughput gap between SGLang/LMDeploy and vLLM represents substantial cost differences. At one million daily requests with average 500-token responses:
- vLLM: ~12 hours to process
- SGLang: ~9 hours to process
- Monthly GPU savings: ~$15,000 at current cloud rates
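The time estimates follow directly from the throughput figures. This is a rough single-instance calculation that ignores batching dynamics and prefill cost, so treat it as an order-of-magnitude check on the ~12 h and ~9 h figures above:

```python
# One million daily requests at 500 output tokens each.
daily_tokens = 1_000_000 * 500
hours = {name: daily_tokens / tok_per_s / 3600
         for name, tok_per_s in [("vLLM", 12_553), ("SGLang", 16_215)]}
for name, h in hours.items():
    print(f"{name}: {h:.1f} h/day")  # vLLM: 11.1 h/day, SGLang: 8.6 h/day
```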
Decision Framework
Choose vLLM When:
- First production deployment: Mature ecosystem, extensive documentation
- Broad model compatibility: Supports any HuggingFace Transformers model
- Hardware flexibility: TPU, AWS Trainium, Intel Gaudi, AMD ROCm
- Memory-constrained environments: PagedAttention excels here
- Team prioritizes stability: Largest community, most battle-tested
Choose SGLang When:
- Multi-turn conversations: Chatbots, dialogue systems, AI agents
- Maximum throughput is critical: 29% advantage compounds at scale
- DeepSeek models: MLA-optimized kernels for best performance
- Structured output generation: Native JSON/XML support
- Prefix reuse is significant: RAG, few-shot learning, code analysis
Choose LMDeploy When:
- Quantized model serving: Best Int4/Int8/FP8 performance
- Memory-constrained hardware: Fit larger models via compression
- Latency-sensitive applications: Lowest TTFT across engines
- NVIDIA-specific deployment: Optimized CUDA implementation
- Real-time applications: Gaming, AR/VR, robotics
Decision Matrix
| Dimension | vLLM | SGLang | LMDeploy |
|---|---|---|---|
| Raw Throughput | Good | Best | Excellent |
| Multi-turn Perf | Good | Best | Good |
| Memory Efficiency | Best | Good | Excellent |
| Model Ecosystem | Best | Good | Moderate |
| Hardware Support | Best | Moderate | Moderate |
| Quantization | Good | Good | Best |
| Setup Simplicity | Best | Good | Moderate |
| Community Size | Largest | Growing | Moderate |
Production Deployment Considerations
Monitoring and Observability
All three engines expose Prometheus metrics, but depth varies:
- vLLM: Most comprehensive metrics, well-documented Grafana dashboards
- SGLang: Core metrics available, community dashboards emerging
- LMDeploy: Basic metrics, requires custom instrumentation for deep monitoring
Key metrics to track:
- KV cache utilization percentage
- Request queue depth
- Time-to-first-token distribution
- Tokens-per-second throughput
- GPU memory fragmentation
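Since all three engines expose a Prometheus-format `/metrics` endpoint, a small parser covers ad-hoc checks. The metric name and sample payload below are illustrative — exact names vary by engine and version:

```python
def parse_gauge(metrics_text, name):
    """Return the first sample value of a metric from Prometheus text
    exposition format, or None if the metric is absent."""
    for line in metrics_text.splitlines():
        if line.startswith(name):            # skip '# HELP' / '# TYPE' lines
            return float(line.rsplit(" ", 1)[-1])
    return None

# Illustrative scrape payload from an engine's /metrics endpoint.
sample = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage fraction.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="llama"} 0.87
"""
print(parse_gauge(sample, "vllm:gpu_cache_usage_perc"))  # 0.87
```

In production you would scrape these with Prometheus directly; a parser like this is useful for smoke tests and alert debugging.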
Kubernetes Integration
```yaml
# Example vLLM deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          args:
            - --model
            - meta-llama/Llama-3.1-8B-Instruct
            - --gpu-memory-utilization
            - "0.9"
```
SGLang and LMDeploy follow similar patterns with engine-specific arguments.
Multi-Engine Strategy
Sophisticated deployments use multiple engines for different workloads:
- SGLang for chat: Multi-turn conversations benefit from RadixAttention
- vLLM for batch: High-throughput content generation with broad model support
- LMDeploy for edge: Quantized models on memory-constrained hardware
Load balancers route requests based on workload characteristics:
- Session ID → SGLang (prefix reuse opportunity)
- Batch job → vLLM (memory efficiency)
- Edge device → LMDeploy (quantization)
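The routing policy above amounts to a simple dispatch function. This is a minimal sketch with hypothetical field and pool names; a real load balancer would express the same rules in its own config language:

```python
def route_request(req: dict) -> str:
    """Toy workload router implementing the policy above."""
    if req.get("session_id"):      # multi-turn: prefix reuse favors SGLang
        return "sglang-pool"
    if req.get("batch_job"):       # offline batch: vLLM's memory efficiency
        return "vllm-pool"
    if req.get("edge_device"):     # constrained hardware: quantized LMDeploy
        return "lmdeploy-pool"
    return "vllm-pool"             # safe default for unclassified traffic

print(route_request({"session_id": "abc123"}))  # sglang-pool
print(route_request({"batch_job": True}))       # vllm-pool
```

Because all three engines speak the OpenAI-compatible API, the router can forward the same request body to any pool unchanged.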
What’s Next
The inference engine landscape continues evolving. Key trends to watch:
- Unified APIs: OpenAI-compatible interfaces are now standard, reducing switching costs
- Hardware diversity: AMD MI300X, Intel Gaudi, AWS Trainium support expanding
- KV cache compression: Attention Matching and FastKV show 50x+ compression is achievable
- Speculative decoding: EAGLE-3 and Medusa achieving 2-3x latency improvements
- Disaggregated inference: Separating prefill and decode stages for optimal resource utilization
The Bottom Line
The 29% throughput difference between engines isn’t just a benchmark number—it’s real money at production scale. But raw throughput tells only part of the story.
vLLM remains the safest choice for teams prioritizing ecosystem maturity, hardware flexibility, and ease of deployment. SGLang dominates for multi-turn conversations and maximum throughput. LMDeploy wins for quantized models and latency-sensitive applications.
The right choice depends on your specific workload patterns. Benchmark all three with your actual prompts, models, and hardware. Published benchmarks provide guidance, but your traffic patterns determine which optimizations matter.
Start with vLLM for initial deployment, then optimize. The migration cost between engines—with their OpenAI-compatible APIs—is minimal compared to the potential performance gains from matching engine to workload.