RLHF / DPO (Reinforcement Learning from Human Feedback / Direct Preference Optimization)

Production

Training Method

Post-training alignment techniques that steer model behavior toward human preferences — converting a raw pretrained model into one that is helpful, harmless, and honest by learning from human judgments of output quality.

The RLHF pipeline has three stages:

1. Supervised Fine-Tuning (SFT): train the base model on high-quality instruction-response pairs.
2. Reward Model (RM) training: train a separate model on human preference data (pairs of responses ranked by quality).
3. RL optimization: use PPO (Proximal Policy Optimization) to fine-tune the SFT model to maximize the reward model's score, with a KL-divergence penalty that keeps the policy close to the SFT model and discourages reward hacking.

DPO (Rafailov et al., 2023) collapses stages 2 and 3 into a single supervised loss that optimizes directly on preference pairs, with no explicit reward model: it is simpler to implement, more stable to train, and competitive in quality (see the sketch below). Alternatives include KTO (Kahneman-Tversky Optimization, which works with binary good/bad feedback rather than preference pairs), RLAIF (using an AI model as the preference rater), IPO, and ORPO.
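
As a concrete illustration, here is a minimal sketch of the DPO objective in PyTorch. The function and tensor names are assumptions for this example; each input is the summed token log-probability of a completion under either the trainable policy or the frozen reference (SFT) model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding the summed token
    log-probability log pi(y|x) of the chosen/rejected completion under
    the trainable policy or the frozen reference (SFT) model.
    """
    # Implicit rewards are beta-scaled log-ratios against the reference;
    # anchoring to the reference plays the role of RLHF's KL penalty.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss: push the chosen response's implicit reward
    # above the rejected one's.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Here beta controls how tightly the policy is tied to the reference model: larger values penalize divergence more strongly, smaller values let the policy chase the preference signal more aggressively.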

Why Does This Exist?

RLHF transforms raw pretrained models into usable ones. The capability unlock is not in raw benchmark scores but in instruction following, task completion, and output quality — the "last mile" that makes a model practically useful. Without alignment training, even the largest models produce poorly formatted, unhelpful, or unsafe outputs. RLHF/DPO is what makes capability accessible.

Preference training can explicitly reward calibrated uncertainty — training models to say "I'm not sure" when appropriate rather than always producing a confident answer. Constitutional AI and preference data that rewards honesty over helpfulness directly target epistemic calibration. The tension: standard RLHF optimizes for user satisfaction, which often rewards confident-sounding (not necessarily correct) answers.
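
For instance, a preference dataset targeting calibration might contain records like the following hypothetical example (the prompt/chosen/rejected layout is a common convention for preference data; the content is invented for illustration):

```python
# Hypothetical preference record rewarding calibrated uncertainty: the honest,
# hedged answer is labeled "chosen" over a confident fabrication.
calibration_example = {
    "prompt": "What was the population of Carthage in 300 BC?",
    "chosen": "I don't know of reliable records; ancient population figures "
              "for Carthage are rough scholarly estimates at best.",
    "rejected": "Exactly 412,000 people lived in Carthage in 300 BC.",
}
```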

RLHF can suppress specific behaviors and outputs via negative preference data — training the model to avoid producing certain content. However, suppression through RLHF is not true unlearning: the knowledge remains in the parameters and can potentially be extracted via jailbreaks or fine-tuning. RLHF-based suppression is a practical first line of defense, but it highlights the gap between behavioral control and genuine knowledge removal.