RLHF / DPO (Reinforcement Learning from Human Feedback / Direct Preference Optimization)

Production

Training Method

Post-training alignment techniques that steer model behavior toward human preferences — converting a raw pretrained model into one that is helpful, harmless, and honest by learning from human judgments of output quality.

The RLHF pipeline has three stages:

1. Supervised Fine-Tuning (SFT): train the base model on high-quality instruction-response pairs.
2. Reward Model (RM) training: train a separate model on human preference data (pairs of responses ranked by quality).
3. RL optimization: use PPO (Proximal Policy Optimization) to fine-tune the SFT model to maximize the reward model's score, with a KL-divergence penalty that keeps the policy close to the SFT model and discourages reward hacking.

DPO (Rafailov et al., 2023) collapses stages 2 and 3 into a single supervised loss that optimizes directly on preference pairs, with no explicit reward model: it is simpler to implement, more stable to train, and competitive in quality (see the sketch below). Alternatives include KTO (Kahneman-Tversky Optimization, which works with binary good/bad feedback rather than preference pairs), RLAIF (using an AI model as the preference rater), IPO, and ORPO.
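
As a concrete illustration, here is a minimal sketch of the DPO objective in PyTorch. The function and tensor names are assumptions for this example; each input is the summed token log-probability of a completion under either the trainable policy or the frozen reference (SFT) model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding the summed token
    log-probability log pi(y|x) of the chosen/rejected completion under
    the trainable policy or the frozen reference (SFT) model.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference;
    # anchoring to the reference plays the role of RLHF's KL penalty.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss: push the chosen response's implicit reward
    # above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Here beta controls how tightly the policy is tied to the reference model: larger values penalize divergence more strongly, smaller values let the policy chase the preference signal more aggressively.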

Why Does This Exist?

RLHF transforms raw pretrained models into usable ones. The capability unlock is not in raw benchmark scores but in instruction following, task completion, and output quality — the "last mile" that makes a model practically useful. Without alignment training, even the largest models produce poorly formatted, unhelpful, or unsafe outputs. RLHF/DPO is what makes capability accessible.

Preference training can explicitly reward calibrated uncertainty — training models to say "I'm not sure" when appropriate rather than always producing a confident answer. Constitutional AI and preference data that rewards honesty over helpfulness directly target epistemic calibration. The tension: standard RLHF optimizes for user satisfaction, which often rewards confident-sounding (not necessarily correct) answers.
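
For instance, a preference dataset targeting calibration might contain records like the following hypothetical example (the prompt/chosen/rejected layout is a common convention for preference data; the content is invented for illustration):

```python
# Hypothetical preference record rewarding calibrated uncertainty: the honest,
# hedged answer is labeled "chosen" over a confident fabrication.
calibration_example = {
    "prompt": "What was the population of Carthage in 300 BC?",
    "chosen": "I don't know of reliable records; ancient population figures "
              "for Carthage are rough scholarly estimates at best.",
    "rejected": "Exactly 412,000 people lived in Carthage in 300 BC.",
}
```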

RLHF can suppress specific behaviors and outputs via negative preference data — training the model to avoid producing certain content. However, suppression through RLHF is not true unlearning: the knowledge remains in the parameters and can potentially be extracted via jailbreaks or fine-tuning. RLHF-based suppression is a practical first line of defense, but it highlights the gap between behavioral control and genuine knowledge removal.