From 1% Parameters to Full Capacity: The Mathematics and Engineering Behind LoRA's Evolution

Fine-tuning a 7-billion-parameter model used to demand 100+ GB of VRAM, roughly the memory of four A100 GPUs. Today, the same task runs on a consumer RTX 4090 with 24 GB. This roughly 4× reduction didn’t come from better hardware; it came from a mathematical insight about the structure of neural network adaptations. Low-Rank Adaptation (LoRA), introduced by Microsoft in 2021, fundamentally changed how we think about model fine-tuning. The core idea is deceptively simple: instead of updating all parameters, freeze the pretrained weights and inject small trainable low-rank matrices that modify the model’s behavior. But behind this simplicity lie deep connections to linear algebra, information theory, and the geometry of neural network weight spaces. ...

4 min · 1660 words

When the Path Matters More Than the Answer: How Process Reward Models Transform LLM Reasoning

A math student solves a complex integration problem. Her final answer is correct, but halfway through, she made a sign error that accidentally canceled out in the next step. The teacher gives full marks—after all, the answer is right. But should it count? This scenario captures the fundamental flaw in how we’ve traditionally evaluated Large Language Model (LLM) reasoning: Outcome Reward Models (ORMs) only check the final destination, ignoring whether the path was sound. Process Reward Models (PRMs) represent a paradigm shift—verifying every step of reasoning, catching those hidden errors that coincidentally produce correct answers, and enabling the test-time scaling that powers reasoning models like OpenAI’s o1 and DeepSeek-R1. ...

7 min · 1473 words

When a 1B Model Beats a 405B Giant: How Test-Time Compute Is Rewriting the Rules of LLM Scaling

For years, the path to better LLMs seemed straightforward: more parameters, more training data, more compute. The scaling laws articulated by Kaplan et al. and refined by Chinchilla painted a clear picture—performance improved predictably with model size. Then OpenAI released o1, and suddenly the rules changed. A model that “thinks longer” at inference time was solving problems that eluded models 10x its size. The breakthrough wasn’t just engineering—it was a fundamental shift in how we think about compute allocation. The question flipped from “how big should we train?” to “how long should we let it think?” ...

9 min · 1722 words

How DeepSeek-R1 Learned to Think: The GRPO Algorithm Behind Open-Source Reasoning Models

On January 20, 2025, DeepSeek released R1, a 671B-parameter Mixture-of-Experts model that achieved something remarkable: matching OpenAI’s o1 on reasoning benchmarks while being fully open-source. The breakthrough wasn’t just in scale or architecture, but in a fundamentally different approach to training reasoning capabilities: Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that eliminates the need for a separate critic (value) model while enabling sophisticated reasoning behaviors to emerge naturally. The Problem with Traditional LLM Training: Standard large language models excel at pattern matching and next-token prediction, but struggle with tasks requiring multi-step logical deduction, self-correction, and complex problem decomposition. Chain-of-thought prompting helped, but it required extensive human-annotated demonstrations and still couldn’t match the systematic reasoning humans employ. ...

3 min · 472 words

Why Backpropagation Trains Neural Networks 10 Million Times Faster: The Mathematics Behind Deep Learning

In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper in Nature that would transform artificial intelligence. The paper, “Learning representations by back-propagating errors,” demonstrated that a mathematical technique from the 1970s could train neural networks orders of magnitude faster than existing methods. The speedup wasn’t incremental—it was the difference between a model taking a week to train and taking 200,000 years. But backpropagation wasn’t invented in 1986. Its modern form was first published in 1970 by Finnish master’s student Seppo Linnainmaa, who described it as “reverse mode automatic differentiation.” Even earlier, Henry J. Kelley derived the foundational concepts in 1960 for optimal flight path calculations. What the 1986 paper achieved wasn’t invention—it was recognition. The authors demonstrated that this obscure numerical technique was exactly what neural networks needed. ...

9 min · 1712 words

Why Semantic Search Fails: The Hidden Geometry of Vector Embeddings

In 2013, Tomas Mikolov and his team at Google published a paper that would fundamentally change how machines understand language. They showed that by training a simple neural network to predict surrounding words, you could learn vector representations where “king” minus “man” plus “woman” approximately equals “queen.” This was the birth of modern word embeddings—a technique that compresses the meaning of words into dense numerical vectors. A decade later, embeddings have become the backbone of virtually every AI application involving text. They power semantic search, recommendation systems, and the retrieval component of RAG (Retrieval-Augmented Generation) architectures. But as organizations deploy these systems at scale, many discover an uncomfortable truth: semantic search often fails in ways that are hard to predict and even harder to debug. ...

11 min · 2169 words