When 90% of Your KV Cache Doesn't Matter: The Mathematics Behind Intelligent Token Eviction

A 70B parameter model with a 128K context window needs approximately 40 GB of GPU memory just for the KV cache. That’s before counting model weights, activations, or any other overhead. TheKV cache grows linearly with sequence length, creating a fundamental barrier to long-context inference that no amount of GPU memory can solve. The breakthrough came from an unexpected observation: most tokens in your KV cache contribute almost nothing to the final output. Researchers discovered that intelligently evicting 90% of cached tokens often results in negligible accuracy loss. This isn’t compression through quantization—it’s compression through understanding which tokens actually matter. ...

7 min · 1331 words

The Hidden Memory Tax: Why Your 80GB GPU Still Can't Handle Long-Context LLMs

In March 2024, a team of researchers attempted to deploy a 70-billion parameter language model on a single NVIDIA H100 GPU with 80GB of VRAM. The model weights alone consumed approximately 140GB in FP16—already exceeding their hardware capacity. But even after applying 4-bit quantization to squeeze the weights down to ~40GB, the system still ran out of memory when processing contexts beyond 8,000 tokens. The culprit wasn’t the model size. It was something far more insidious: the KV cache. ...

9 min · 1846 words