A 70B parameter model with a 128K context window needs approximately 40 GB of GPU memory just for the KV cache. That’s before counting model weights, activations, or any other overhead. The KV cache grows linearly with sequence length, creating a memory wall for long-context inference that adding hardware only postpones, never removes.
The breakthrough came from an unexpected observation: most tokens in your KV cache contribute almost nothing to the final output. Researchers discovered that intelligently evicting 90% of cached tokens often results in negligible accuracy loss. This isn’t compression through quantization—it’s compression through understanding which tokens actually matter.
The Mathematics of KV Cache Memory
Before diving into eviction strategies, let’s establish the memory footprint formula. For a transformer model with $L$ layers, $h$ attention heads, head dimension $d_h$, and sequence length $n$:
$$\text{KV Cache Size} = 2 \times L \times n \times h \times d_h \times \text{bytes\_per\_param}$$

The factor of 2 accounts for separate Key and Value matrices. For Llama-2-70B running at FP16 precision:
- $L = 80$ layers
- $h = 8$ KV heads (GQA reduces the 64 query heads to 8 KV heads)
- $d_h = 128$ head dimension
- 2 bytes (FP16)
At 128K tokens: $2 \times 80 \times 131,072 \times 8 \times 128 \times 2 \approx 43$ GB
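As a sanity check, the arithmetic above can be wrapped in a small helper. This is an illustrative sketch; the function name and defaults are mine, not from any library:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_param: int = 2) -> int:
    """Total KV cache size: 2 (K and V) x layers x tokens x KV heads x head_dim."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_param

# Llama-2-70B with GQA (8 KV heads), FP16, 128K context
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=131_072)
print(f"{size / 1e9:.1f} GB")  # prints 42.9 GB
```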
This explains why running long-context models feels like filling a bucket with a hole—the longer your sequence, the faster memory disappears.
The Attention Sink Phenomenon
In 2023, researchers at MIT discovered something puzzling: when they removed initial tokens from the KV cache, model performance collapsed catastrophically. These tokens weren’t semantically important—they were often just BOS (Beginning of Sequence) tokens or padding. Yet they attracted disproportionate attention across all layers and heads.
The root cause lies in softmax normalization. The attention mechanism computes:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Softmax forces all attention weights to sum to 1. When a token receives very high pre-softmax scores, it becomes an “attention sink”—absorbing excess attention that would otherwise distribute across other tokens. Initial tokens, being visible to all subsequent positions under causal masking, naturally evolve into these sinks during training.
A 2024 ICLR paper demonstrated that over 50% of attention weight often concentrates on the first few tokens, even when they carry no semantic meaning. Removing them doesn’t just lose information—it breaks the normalization structure the model learned to rely on.
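A toy calculation makes the mechanics concrete. In the sketch below (the scores are made up for illustration), a single token with a high pre-softmax score soaks up most of the attention mass:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Pre-softmax scores for 6 tokens; position 0 scores high (a "sink")
scores = np.array([4.0, 0.5, 0.3, 0.6, 0.4, 0.5])
weights = softmax(scores)
print(weights[0])  # the sink token absorbs the bulk of the attention mass
```

Evict that first token and the remaining weights get renormalized over tokens the model never learned to rely on in that way.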
This discovery has profound implications for KV cache eviction: you cannot simply use LRU or FIFO policies. The most recently accessed token might be semantically irrelevant, while a seemingly “unused” initial token might be structurally essential.
H₂O: The Heavy-Hitter Oracle
The H₂O (Heavy-Hitter Oracle) paper from 2023 formulated KV cache eviction as a dynamic submodular optimization problem. Their key insight: a small subset of tokens consistently accumulates high attention scores across all query positions.
These “heavy hitters” aren’t random—they strongly correlate with:
- Frequently co-occurring tokens in the training corpus
- Tokens with high TF-IDF scores in the current context
- Structurally important positions (sentence boundaries, list items)
H₂O maintains a hybrid cache:
- Recent tokens: A sliding window of the most recent $k$ tokens
- Heavy hitters: Tokens with highest cumulative attention scores
The eviction policy is elegantly simple:
```python
import numpy as np

def h2o_evict(attention_matrix: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of tokens to keep: half recent, half heavy hitters.

    attention_matrix: [n_queries, n_tokens] attention weights for one head.
    """
    n_tokens = attention_matrix.shape[1]
    recent_budget = budget // 2
    heavy_budget = budget - recent_budget
    # Keep most recent tokens (sliding window)
    recent = set(range(n_tokens - recent_budget, n_tokens))
    # Keep heavy hitters by accumulated attention (column sums)
    attention_sums = attention_matrix.sum(axis=0)
    heavy = set(np.argsort(attention_sums)[-heavy_budget:].tolist())
    return np.array(sorted(recent | heavy))
```
On OPT-6.7B, H₂O achieved 29× throughput improvement over baseline inference systems with only 20% of the KV cache budget. The theoretical analysis proved that under mild assumptions, this greedy eviction strategy maintains bounded regret compared to an oracle that knows future attention patterns.
SnapKV: Predicting Importance Before Generation
H₂O’s limitation is its reactive nature—it only identifies heavy hitters after observing attention patterns. SnapKV (NeurIPS 2024) asked: can we predict which tokens will be important before generation begins?
The researchers discovered that each attention head exhibits consistent “attention patterns” during generation. These patterns can be approximated by looking at a small observation window at the end of the prompt.
Consider a prompt of 10,000 tokens. Instead of computing full attention, SnapKV:
- Takes the final 64 tokens as an observation window
- Computes attention between this window and all prompt tokens
- Clusters tokens by their attention scores
- Retains representative tokens from each cluster
The mathematics behind this relies on the observation that attention scores are locally stable—if a token is important for queries near the end of the prompt, it’s likely important for queries during generation.
```
Prompt: [P1, P2, ..., P9936] [OW1, OW2, ..., OW64]
                                      ↑
                              Observation Window

Attention from OW → all P tokens identifies important positions.
Cluster similar patterns, retain cluster centroids.
```
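The selection step can be sketched in a few lines of NumPy. This is an illustrative reduction of the idea, not the authors' code, and it omits the pooling step SnapKV applies to smooth per-key scores:

```python
import numpy as np

def snapkv_select(attn_probs: np.ndarray, window: int, budget: int) -> np.ndarray:
    """Select prompt positions to keep, given one head's attention.

    attn_probs: [n_queries, n_keys] softmax attention. The last `window`
    queries vote on prompt keys; the `budget` highest-scoring keys are
    kept, along with the observation window itself.
    """
    obs = attn_probs[-window:, :-window]      # window queries -> prompt keys
    votes = obs.mean(axis=0)                  # per-key importance
    keep_prompt = np.sort(np.argsort(votes)[-budget:])
    window_pos = np.arange(attn_probs.shape[1] - window, attn_probs.shape[1])
    return np.concatenate([keep_prompt, window_pos])

# Toy demo: prompt key 2 gets high attention from the observation window
attn = np.full((8, 8), 0.1)
attn[-2:, 2] = 0.9
kept = snapkv_select(attn, window=2, budget=1)
```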
SnapKV achieved 3.6× faster generation and 8.2× memory efficiency on 16K token inputs. More impressively, it processed 380K context tokens on a single A100-80GB GPU with negligible accuracy drop on Needle-in-a-Haystack tests.
RocketKV: The 400× Compression Frontier
Published in early 2025, RocketKV pushed compression ratios further through a two-stage approach:
Stage 1: Coarse-Grain Permanent Eviction

Apply aggressive permanent eviction on input sequence tokens using accumulated attention scores. This is “permanent” because once evicted, these tokens never return.

Stage 2: Fine-Grain Sparse Attention

During decoding, use hybrid sparse attention that combines:
- Head-level reduction: Different heads get different cache budgets
- Sequence-level reduction: Top-k selection based on query-specific scores
The key innovation is separating the eviction decision from the attention computation. Stage 1 makes a one-time decision about prompt tokens, while Stage 2 continuously refines which generated tokens to keep.
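Under the assumption that per-head budgets are already given (RocketKV derives them adaptively), the Stage 2 hybrid can be sketched for a single decode step. This is a toy single-query, single-layer illustration, not the paper's implementation:

```python
import numpy as np

def sparse_attend(q, K, V, budget):
    """Attend over only the top-`budget` keys by score (one head, one query)."""
    scores = K @ q / np.sqrt(q.shape[-1])     # [n_keys]
    keep = np.argsort(scores)[-budget:]       # sequence-level top-k selection
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()
    return w @ V[keep]

def hybrid_sparse_attention(q_heads, K_heads, V_heads, head_budgets):
    # Head-level reduction: each head attends under its own cache budget
    return np.stack([sparse_attend(q, K, V, b)
                     for q, K, V, b in zip(q_heads, K_heads, V_heads, head_budgets)])

rng = np.random.default_rng(0)
q_heads = rng.normal(size=(2, 4))             # 2 heads, head_dim 4
K_heads = rng.normal(size=(2, 6, 4))          # 6 cached keys per head
V_heads = rng.normal(size=(2, 6, 4))
out = hybrid_sparse_attention(q_heads, K_heads, V_heads, head_budgets=[4, 2])
```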
RocketKV achieved:
- 400× compression ratio with negligible accuracy loss
- 3.7× end-to-end speedup
- 32.6% peak memory reduction
The paper also introduced a multi-turn variant that handles conversation history more gracefully, outperforming all previous methods on dialogue benchmarks.
PyramidKV: Layer-Aware Cache Allocation
Not all transformer layers are created equal. PyramidKV (2024) observed that lower layers tend to require more KV cache than higher layers.
This creates a pyramidal allocation pattern:
| Layer Range | Cache Budget | Rationale |
|---|---|---|
| Layers 0-15 | 100% | Captures fine-grained local patterns |
| Layers 16-31 | 70% | Abstracts local information |
| Layers 32-47 | 50% | Handles semantic composition |
| Layers 48-63 | 30% | Final task-specific processing |
The intuition: early layers process raw token information and need full context, while later layers work with already-abstracted representations where many tokens become redundant.
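The table's schedule can be expressed as a simple lookup. The helper below is hypothetical; its breakpoints mirror the table above rather than the paper's exact allocation rule:

```python
def layer_cache_budget(layer: int, n_layers: int = 64) -> float:
    """Fraction of full KV cache allocated to a layer (pyramidal schedule)."""
    quarter = n_layers // 4
    fractions = [1.0, 0.7, 0.5, 0.3]  # budgets from shallow to deep quarters
    return fractions[min(layer // quarter, 3)]

budgets = [layer_cache_budget(l) for l in range(64)]
```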
PyramidKV’s performance gains compound with other methods—it can be combined with H₂O or SnapKV for additional compression.
The Trade-offs No One Talks About
Every compression technique comes with costs that benchmarks often hide:
Latency vs. Throughput Trade-off

KV cache compression improves memory efficiency but adds computational overhead for eviction decisions. For short sequences (<4K tokens), the overhead often exceeds the benefit.

Task-Specific Fragility

Compression ratios that work for summarization may fail for code generation or mathematical reasoning. A 2025 NeurIPS paper found that reasoning models like o1 and DeepSeek-R1 are particularly sensitive to KV cache compression—aggressive eviction can break chain-of-thought reasoning.

Multi-Turn Complications

In conversations, earlier turns often contain instructions that must be retained. Standard eviction policies may accidentally evict system prompts or few-shot examples.

Needle-in-Haystack Failure Modes

Compression can create “blind spots” where rare but critical information gets evicted. A 100K context document with one crucial sentence might lose that sentence if it’s not identified as important during the observation window.
Practical Recommendations
Based on current research, here’s a decision framework:
- Short contexts (<8K tokens): No compression needed; memory isn’t the bottleneck.
- Medium contexts (8K-32K tokens): Start with H₂O at 50% budget. Add PyramidKV layer allocation for additional gains.
- Long contexts (32K-128K tokens): Use SnapKV with observation window of 64-128 tokens. Consider RocketKV’s two-stage approach for maximum compression.
- Very long contexts (>128K tokens): Combine quantization (FP8 or INT4) with eviction. The 2024 Hugging Face blog showed that 4-bit KV cache quantization plus 90% eviction can reduce memory by 40× with <2% accuracy loss.
- Reasoning tasks: Be conservative. Use at least 30% cache budget and prefer attention-based eviction (H₂O/SnapKV) over simpler heuristics.
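For reference, the framework condenses into a lookup. This helper is a sketch of the recommendations above; the method labels are descriptive strings, not library APIs, and budgets are included only where the framework states one:

```python
def recommend_compression(context_tokens: int, reasoning_task: bool = False) -> dict:
    """Map a workload to a KV cache compression strategy (illustrative only)."""
    if reasoning_task:
        # Be conservative: attention-based eviction, at least 30% cache budget
        return {"method": "H2O or SnapKV", "min_cache_budget": 0.30}
    if context_tokens < 8_000:
        return {"method": "none"}                  # memory isn't the bottleneck
    if context_tokens <= 32_000:
        return {"method": "H2O + PyramidKV", "cache_budget": 0.50}
    if context_tokens <= 128_000:
        return {"method": "SnapKV or RocketKV"}    # observation window 64-128
    return {"method": "FP8/INT4 quantization + eviction"}
```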
The field is evolving rapidly—new methods like CAKE (Cascading Adaptive KV Eviction) and FreeKV are pushing compression ratios even further. But the fundamental insight remains: understanding what tokens matter is more powerful than brute-force memory expansion.