Tensor-Parallelism

When GPT-4 was released in 2023, rumors suggested it contained over 1.7 trillion parameters. Training such a model requires approximately 25,000 A100 GPUs running for months—a feat that would be impossible without sophisticated distributed training systems. The challenge isn’t merely computational; it’s fundamentally a memory problem. A single 80GB A100 GPU can barely hold a 40B parameter model during training, let alone a trillion-parameter behemoth. This is the story of how systems researchers cracked the memory wall through a decade of innovations in data parallelism, ZeRO, tensor parallelism, and pipeline parallelism. ...