Post-Training

The standard RLHF pipeline was never elegant. Train a reward model from human preferences, then use Proximal Policy Optimization (PPO) to maximize that reward while staying close to your original model. That means holding four separate models in memory during training (policy, reference, reward model, and value function), sampling from the policy during optimization, and navigating hyperparameter sensitivity that could turn a week of training into a costly failure. Direct Preference Optimization (DPO) changed that. By recognizing that the optimal policy under a KL-constrained reward maximization objective can be derived in closed form, DPO rewrites the reward in terms of the policy itself and trains directly on preference pairs with a simple classification-style loss, eliminating the separate reward model and the reinforcement learning loop. What followed was an explosion of variants (KTO, ORPO, SimPO, IPO, AlphaDPO), each addressing different limitations with different inductive biases. Understanding when to use which method requires understanding not just their formulas, but the assumptions they encode about human preferences and the trade-offs they make between data requirements, computational efficiency, and alignment quality. ...
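To make the closed-form idea concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It assumes you already have summed log-probabilities of the chosen and rejected responses under both the trainable policy and the frozen reference model. The function and argument names, and the default `beta=0.1`, are illustrative choices rather than any library's API.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the implicit KL constraint
) -> float:
    """Per-example DPO loss from sequence log-probabilities.

    The implicit reward of a response is beta * (log pi(y|x) - log pi_ref(y|x)).
    The loss is the negative log-likelihood that the chosen response beats the
    rejected one under a Bradley-Terry model on those implicit rewards.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
    # Policy has shifted probability mass toward the chosen response: lower loss.
    print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Note that no reward model and no sampling from the policy appear anywhere: the only inputs are log-probabilities from two forward passes per response, which is where the memory and stability savings over PPO come from.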