When 1.58 Bits Beats 16: How Ternary Weights Are Rewriting the Mathematics of LLM Efficiency

The mathematics of neural networks has long been considered settled: gradients flow through continuous-valued weights, optimized via backpropagation through floating-point arithmetic. Yet in February 2024, Microsoft Research challenged this orthodoxy with a question that seemed absurd: what if every weight in a large language model could be expressed using only three values—{-1, 0, 1}? The answer, it turns out, rewrites everything we thought we knew about the efficiency-accuracy trade-off. BitNet b1.58, trained natively with ternary weights, matches full-precision LLaMA models in perplexity while consuming 90% less memory. QuEST demonstrates that LLMs can be trained stably at 1-bit precision. NanoQuant pushes further, achieving sub-1-bit compression that runs a 70B model on a consumer 8GB GPU. ...

11 min · 2244 words
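The teaser above hinges on expressing every weight as one of {-1, 0, 1}. As a hedged illustration of how that works, here is a minimal NumPy sketch of the absmean ternary quantizer described in the BitNet b1.58 paper; the function name `ternary_quantize` and the example matrix are my own, not from the article.

```python
import numpy as np

def ternary_quantize(w, eps=1e-6):
    """Absmean ternary quantization (BitNet b1.58 style).

    Scale the tensor by its mean absolute value, then round each
    entry to the nearest value in {-1, 0, 1}.
    """
    gamma = np.abs(w).mean()                      # per-tensor absmean scale
    q = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return q, gamma                               # dequantize roughly as q * gamma

w = np.array([[0.8, -0.05, -1.2],
              [0.3,  0.0,  -0.4]])
q, gamma = ternary_quantize(w)
# q is [[1, 0, -1], [1, 0, -1]]
```

Storing a three-valued weight costs log2(3) ≈ 1.58 bits, which is where the "1.58 bits" in the title comes from.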

When Your Phone Becomes the Datacenter: The Engineering Revolution Behind On-Device LLMs

The smartphone in your pocket has more computing power than the entire NASA control room that guided Apollo 11 to the Moon. Yet until 2024, running a useful language model entirely on that device seemed like science fiction. The revolution that made it possible wasn’t a single breakthrough—it was a cascade of engineering innovations that fundamentally rethought how neural networks run on constrained hardware. The Memory Bandwidth Abyss The first and most brutal constraint facing on-device LLMs isn’t compute—it’s data movement. When you run a 7-billion-parameter model on an H100 GPU, you’re working with memory bandwidth of 3.35 TB/s. A flagship smartphone in 2026? You get 50-90 GB/s through its LPDDR5X memory. That’s roughly a 40-70x gap, and it dominates every architectural decision. ...

13 min · 2573 words
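The bandwidth numbers in this teaser translate directly into a back-of-envelope throughput ceiling: in memory-bound autoregressive decoding, every generated token must stream the full set of weights from memory once, so tokens/s ≈ bandwidth / model byte size. A sketch under that assumption (the helper `tokens_per_second` and the 4-bit quantization figure are illustrative, not from the article):

```python
def tokens_per_second(n_params, bytes_per_param, bandwidth_gb_s):
    # Memory-bound decode: each new token streams all weights once,
    # so the throughput ceiling is bandwidth / model byte size.
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# A 7B model quantized to 4 bits (0.5 bytes per parameter):
h100  = tokens_per_second(7e9, 0.5, 3350)  # H100 HBM3: ~957 tok/s ceiling
phone = tokens_per_second(7e9, 0.5, 70)    # phone LPDDR5X: 20 tok/s ceiling
```

The ratio between the two ceilings tracks the raw bandwidth gap exactly, which is why bandwidth, not compute, dominates on-device design decisions.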

Training Trillion-Parameter Models: The Distributed Systems Architecture Behind Modern LLMs

When GPT-4 was released in 2023, rumors suggested it contained over 1.7 trillion parameters. Training such a model requires approximately 25,000 A100 GPUs running for months—a feat that would be impossible without sophisticated distributed training systems. The challenge isn’t merely computational; it’s fundamentally a memory problem. A single 80GB A100 GPU can barely hold the FP16 weights of a 40B parameter model; add the gradients and optimizer states that training requires and even a 5B model fills the card, let alone a trillion-parameter behemoth. This is the story of how systems researchers cracked the memory wall through a decade of innovations in data parallelism, ZeRO, tensor parallelism, and pipeline parallelism. ...

10 min · 1974 words
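The memory-wall claim above can be made concrete with the standard mixed-precision Adam accounting used in the ZeRO paper: FP16 weights (2 bytes) and gradients (2 bytes) plus FP32 master weights, momentum, and variance (4 bytes each) total 16 bytes per parameter. A small illustrative helper (the function name is mine):

```python
def adam_training_memory_gb(n_params):
    # Mixed-precision Adam, ZeRO-paper accounting:
    #   2 B fp16 weights + 2 B fp16 grads
    # + 4 B fp32 master weights + 4 B momentum + 4 B variance
    bytes_per_param = 2 + 2 + 4 + 4 + 4    # = 16 bytes per parameter
    return n_params * bytes_per_param / 1e9

adam_training_memory_gb(40e9)   # 640.0 GB: eight 80GB A100s of state
adam_training_memory_gb(1e12)   # 16000.0 GB for a trillion parameters
```

These totals exclude activations, which is exactly the gap that ZeRO, tensor parallelism, and pipeline parallelism were invented to close.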