From 1% Parameters to Full Capacity: The Mathematics and Engineering Behind LoRA's Evolution

Fine-tuning a 7-billion parameter model used to demand 100+ GB of VRAM—roughly the memory of four A100 GPUs. Today, the same task runs on a consumer RTX 4090 with 24 GB. This 4× reduction didn’t come from better hardware; it came from a mathematical insight about the structure of neural network adaptations. Low-Rank Adaptation (LoRA), introduced by Microsoft in 2021, fundamentally changed how we think about model fine-tuning. The core idea is deceptively simple: instead of updating all parameters, inject small trainable matrices that modify the model’s behavior. But behind this simplicity lies deep connections to linear algebra, information theory, and the geometry of neural network weight spaces. ...

4 min · 1660 words

How Flash Attention Revolutionized LLM Training: The IO-Aware Algorithm Behind Modern Long-Context Models

In 2022, training a transformer with 16K context length required either massive GPU memory or accepting severe approximations. Standard attention’s memory grew quadratically with sequence length—a 32K context demanded over 4GB just for intermediate attention matrices. Then Flash Attention arrived, reducing memory from $O(N^2)$ to $O(N)$ while computing exact attention, not an approximation. This breakthrough enabled GPT-4’s 128K context window, Llama’s extended sequences, and virtually every modern long-context LLM. The key insight wasn’t algorithmic cleverness alone—it was understanding that on modern GPUs, memory bandwidth, not compute, is the bottleneck. ...

10 min · 1924 words

How Ring Attention Breaks the Memory Barrier: Enabling Million-Token Contexts Through Distributed Computation

In April 2025, Meta’s Llama 4 Scout achieved something previously thought impossible: processing 10 million tokens in a single context window. To put this in perspective, that’s roughly 20 novels, 40 hours of video, or an entire mid-sized codebase—all in one prompt. The secret behind this breakthrough isn’t a revolutionary new model architecture or exotic hardware. It’s a clever distributed computing technique called Ring Attention that fundamentally rethinks how we compute attention across multiple GPUs. ...

7 min · 1456 words