When 1.58 Bits Beats 16: How Ternary Weights Are Rewriting the Mathematics of LLM Efficiency

The mathematics of neural networks has long been considered settled: gradients flow through continuous-valued weights, updated by backpropagation in floating-point arithmetic. Yet in February 2024, Microsoft Research challenged this orthodoxy with a question that seemed absurd: what if every weight in a large language model could be expressed using only three values, {-1, 0, 1}? The answer, it turns out, rewrites everything we thought we knew about the efficiency-accuracy trade-off. BitNet b1.58, trained natively with ternary weights, matches full-precision LLaMA models in perplexity while consuming 90% less memory. QuEST demonstrates that LLMs can be trained stably at 1-bit precision. NanoQuant pushes further, achieving sub-1-bit compression that runs a 70B model on a consumer 8GB GPU. ...

11 min · 2244 words
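
An aside on the arithmetic in the title: 1.58 bits is just log2(3) ≈ 1.585, the information content of one three-valued weight. Below is a minimal sketch of the absmean ternary quantization the BitNet b1.58 paper describes; the function name and toy tensor are illustrative, not taken from any released implementation.

```python
import numpy as np

def ternary_quantize(w: np.ndarray, eps: float = 1e-8):
    """Map weights to {-1, 0, +1} with one per-tensor scale.

    Follows the absmean scheme described in the BitNet b1.58 paper:
    divide by the mean absolute value, then round and clip.
    """
    gamma = np.abs(w).mean() + eps            # absmean scale
    q = np.clip(np.round(w / gamma), -1, 1)   # ternary codes
    return q.astype(np.int8), gamma           # dequantize as q * gamma

w = np.random.randn(4, 4).astype(np.float32)  # toy weight tensor
q, gamma = ternary_quantize(w)
w_hat = q * gamma                             # ternary approximation of w
print(q)                                      # entries drawn from {-1, 0, 1}
print(f"log2(3) = {np.log2(3):.3f} bits per weight")
```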

How 4 Bits Preserves 99% Quality: The Mathematics Behind LLM Quantization

A 70-billion parameter model requires 140 GB of GPU memory in FP16. A consumer RTX 4090 has 24 GB. This arithmetic gap defined the boundary between “enterprise AI” and “what you can run at home” until quantization mathematics cracked the code. The counterintuitive reality: reducing precision from 16 bits to 4 bits—a 75% compression—often preserves over 95% of model quality. Not through magic, but through a profound understanding of how neural networks encode information. ...

11 min · 2201 words
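
The footprint arithmetic in the teaser is easy to verify, and a toy quantizer makes the mechanism concrete. The sketch below uses a generic per-group absmax scheme purely for illustration; production 4-bit methods such as GPTQ, AWQ, or NF4 are considerably more careful about where they allow error.

```python
import numpy as np

def weight_footprint_gb(n_params: float, bits: float) -> float:
    """Weight memory in GB, ignoring activations, KV cache, and scale overhead."""
    return n_params * bits / 8 / 1e9

print(weight_footprint_gb(70e9, 16))  # 140.0 GB: the FP16 number from the teaser
print(weight_footprint_gb(70e9, 4))   #  35.0 GB: 4-bit weights, a 75% reduction

def quantize_int4(w: np.ndarray, group_size: int = 64):
    """Toy symmetric 4-bit quantization with one absmax scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7 + 1e-12  # absmax -> int4 edge
    q = np.round(groups / scale).astype(np.int8)                   # codes in [-7, 7]
    return q, scale

w = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = (q * scale).ravel()               # dequantized approximation of w
print(np.abs(w - w_hat).max())            # worst-case per-weight error stays small
```

Even at 4 bits, a 70B model's 35 GB of weights still exceeds a 24 GB RTX 4090, which is part of why the sub-1-bit work in the first article above matters.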

How JPEG Compression Actually Works: The Mathematics Behind Every Photo

In September 1992, a committee called the Joint Photographic Experts Group published a standard that would fundamentally change how humanity stores and shares images. The JPEG format, based on the discrete cosine transform (DCT), made digital photography practical by reducing file sizes by a factor of 10 while maintaining acceptable visual quality. Three decades later, JPEG remains the most widely used image format in the world, with billions of images created daily. ...

8 min · 1560 words
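
As a coda to the JPEG piece: the energy-compaction property of the DCT that the teaser alludes to fits in a few lines. The sketch below builds the orthonormal 8×8 DCT-II basis and discards high-frequency coefficients outright; real JPEG instead divides by a perceptual quantization table and rounds, but the compression principle is the same.

```python
import numpy as np

def dct_matrix(n: int = 8) -> np.ndarray:
    """Orthonormal DCT-II basis, the transform JPEG applies to 8x8 blocks."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)                       # DC row has a smaller norm factor
    return c

C = dct_matrix(8)
x = np.arange(8.0)
block = 40 * np.sin(np.add.outer(x, x) / 4.0)        # a smooth toy "image" block
coeffs = C @ block @ C.T                             # 2-D DCT: energy piles up top-left
keep = np.add.outer(np.arange(8), np.arange(8)) < 4  # low-frequency triangle (10 of 64)
recon = C.T @ np.where(keep, coeffs, 0.0) @ C        # inverse DCT of the kept coefficients
print(np.abs(block - recon).mean())                  # small: smooth blocks compact well
```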