Beyond Next-Token: How Multi-Token Prediction Is Rewriting LLM Training for 3x Faster Inference
For years, the next-token prediction (NTP) paradigm has been the unquestioned foundation of large language model training. Given a sequence of tokens $x_{1:t}$, the model learns to maximize $P(x_{t+1} | x_{1:t})$. Simple, elegant, and remarkably effective—until you realize the fundamental inefficiency baked into this approach.

The problem is that transformers spend the same computational budget predicting filler words ("the", "and", "is") as they do on information-carrying tokens ("quantum", "entanglement", "superposition"). Research from Apple and EPFL reveals that over 50% of English text consists of function words—linguistic glue that carries minimal semantic weight. Yet models trained on NTP treat every token with equal reverence, creating a massive computational inefficiency. ...
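To make the "equal reverence" point concrete, here is a minimal sketch of the standard NTP cross-entropy objective. The probabilities are invented for illustration (not from any real model); the point is only that each position contributes one equally weighted $-\log P(x_{t+1} \mid x_{1:t})$ term, regardless of whether the target is a trivial function word or a rare content word.

```python
import math

# Toy probabilities a hypothetical model assigns to each ground-truth
# next token (illustrative values, not measured from a real model).
# Function words like "the" are easy to predict; content words like
# "entanglement" are hard.
probs = {
    "the": 0.90,
    "quantum": 0.05,
    "entanglement": 0.02,
    "is": 0.85,
}

def ntp_loss(token_probs):
    """Standard next-token cross-entropy: every position contributes
    one -log P(x_{t+1} | x_{1:t}) term, all weighted equally."""
    return sum(-math.log(p) for p in token_probs.values()) / len(token_probs)

loss = ntp_loss(probs)
```

Here the easy tokens ("the", "is") contribute roughly 0.1 nats each while the hard content words dominate the average, yet all four positions cost the model an identical forward pass—exactly the uniform compute budget the paragraph above describes.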