When Removing 50% of Parameters Preserves 99% Performance: The Mathematics Behind LLM Pruning
The mathematics of neural network pruning has been studied since the late 1980s, when Yann LeCun and colleagues demonstrated that Optimal Brain Damage could remove redundant weights without harming performance. Yet for decades, pruning remained a niche technique: the computational savings rarely justified the engineering effort. Large Language Models changed that calculus. A 70-billion-parameter model requires approximately 140 GB of memory just to store its weights in FP16. At 50% sparsity, that drops to 70 GB, but only if your inference engine can efficiently skip the zero weights. The potential savings are measured in hundreds of thousands of dollars per deployment. The question is no longer whether to prune, but how to do it without destroying the model's capabilities. ...
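The memory arithmetic above is simple enough to sketch in a few lines. The helper below (a hypothetical name, not from any pruning library) assumes decimal gigabytes, 2 bytes per FP16 parameter, and an inference engine that actually skips pruned weights:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int = 2,
                     sparsity: float = 0.0) -> float:
    """Memory in GB to store model weights.

    Assumes `sparsity` is the fraction of weights pruned to zero
    and that the storage format skips them entirely (an idealized
    best case; real sparse formats carry index overhead).
    """
    dense_bytes = n_params * bytes_per_param
    return dense_bytes * (1.0 - sparsity) / 1e9

# 70B parameters in FP16: 140 GB dense, 70 GB at 50% sparsity
print(weight_memory_gb(70e9))                 # 140.0
print(weight_memory_gb(70e9, sparsity=0.5))   # 70.0
```

In practice, sparse storage formats (e.g. CSR) add index overhead per nonzero, so realized savings fall somewhat short of this ideal.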