How Speculative Decoding Achieves 3x Faster LLM Inference Without Losing Quality: The Mathematics Behind Draft-Verify Acceleration

The sequential nature of autoregressive language models creates a fundamental bottleneck: generating each token requires a full forward pass through billions of parameters. A 70B-parameter model must load roughly 140GB of weights (FP16) from memory for every token it generates, and memory bandwidth, not compute, becomes the limiting factor. This is why a 70B model might generate only 20-30 tokens per second on an H100, even though the GPU is capable of orders of magnitude more computation. ...

4 min · 737 words
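
As a quick sanity check on the numbers in that teaser, here is a back-of-envelope sketch of the bandwidth-bound decode rate. The ~3.35 TB/s figure is an assumed H100-class HBM bandwidth, not a number taken from the post.

```python
# Rough estimate of the memory-bandwidth ceiling on decode speed.
# Assumption: ~3.35 TB/s of HBM bandwidth (an H100-class figure, not from the post).

PARAMS = 70e9                # 70B parameters
BYTES_PER_PARAM = 2          # FP16
HBM_BANDWIDTH = 3.35e12      # bytes per second, approximate peak

weight_bytes = PARAMS * BYTES_PER_PARAM            # ~140 GB streamed per generated token
tokens_per_second = HBM_BANDWIDTH / weight_bytes   # ceiling if fully bandwidth-bound

print(f"Weights streamed per token: {weight_bytes / 1e9:.0f} GB")
print(f"Bandwidth-bound ceiling:    {tokens_per_second:.0f} tokens/s")
# ~24 tokens/s, in line with the 20-30 tokens/s cited above
```

Under these assumptions the ceiling lands around 24 tokens/s, which is why speculative decoding attacks the per-token weight traffic rather than raw FLOPs.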

The Hidden Memory Tax: Why Your 80GB GPU Still Can't Handle Long-Context LLMs

In March 2024, a team of researchers attempted to deploy a 70-billion-parameter language model on a single NVIDIA H100 GPU with 80GB of VRAM. The model weights alone consumed approximately 140GB in FP16, already exceeding their hardware capacity. But even after applying 4-bit quantization to squeeze the weights down to ~40GB, the system still ran out of memory when processing contexts beyond 8,000 tokens. The culprit wasn’t the model size. It was something far more insidious: the KV cache. ...

9 min · 1846 words
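
To see why the cache, rather than the weights, dominates at long context, here is a minimal sketch of the KV-cache sizing arithmetic. The model shape (80 layers, 64 KV heads, head_dim 128, FP16 cache) is a hypothetical 70B-class configuration assumed for illustration, not taken from the article.

```python
# Minimal sketch of KV-cache sizing for a hypothetical 70B-class model.
# Shape assumptions (80 layers, 64 KV heads, head_dim 128, FP16 cache) are
# illustrative only and not taken from the article.

def kv_cache_bytes(seq_len: int,
                   batch_size: int = 1,
                   num_layers: int = 80,
                   num_kv_heads: int = 64,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Cache size: one K and one V vector per layer, per head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size

for ctx in (2_000, 8_000, 32_000):
    gb = kv_cache_bytes(ctx) / 1e9
    print(f"{ctx:>6} tokens -> {gb:6.1f} GB of KV cache (batch size 1)")
```

At this shape the cache grows by roughly 2.6 MB per token, so an 8,000-token context already costs about 21 GB per sequence; with ~40GB of 4-bit weights resident, an 80GB card runs out of headroom well before very long contexts.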