Knowledge Distillation
Training Method
A training technique where a large, capable "teacher" model's knowledge is transferred to a smaller, cheaper "student" model — compressing frontier capabilities into deployable sizes by training the student to mimic the teacher's output distribution rather than learning from raw data alone.
Distillation trains a student model to match the teacher's output distribution (soft labels), not just the correct answer (hard labels). The teacher's probability distribution over all tokens contains "dark knowledge": information about which wrong answers are more plausible than others, which hard labels discard. Standard recipe:
1. Generate a large dataset of (input, teacher_output) pairs.
2. Train the student with a loss that combines KL divergence from the teacher's distribution (at temperature T > 1 to soften the probabilities) with standard cross-entropy on the ground truth.
3. Use temperature scaling to control how much of the teacher's uncertainty is transferred.

Practical examples: DeepSeek-R1 distilled into Llama/Qwen variants, GPT-4 knowledge reportedly distilled into smaller OpenAI models, Gemma distilled from Gemini.
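The combined loss in step 2 is compact enough to show directly. Here is a minimal sketch in PyTorch; the function name, the alpha weighting, and the default temperature are illustrative choices rather than a fixed recipe from any particular system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-label KL divergence (teacher -> student) with hard-label CE.

    student_logits, teacher_logits: (batch, vocab) raw logits
    hard_labels: (batch,) ground-truth token ids
    """
    # Soften both distributions with temperature T > 1 so that low-probability
    # ("dark knowledge") tokens carry more of the gradient signal.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable to the unscaled cross-entropy term (Hinton et al. convention).
    kl = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth hard labels.
    ce = F.cross_entropy(student_logits, hard_labels)

    # alpha balances mimicry of the teacher against fitting the ground truth.
    return alpha * kl + (1.0 - alpha) * ce
```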
Why Does This Exist?
The most direct path from frontier capability to affordable deployment. DeepSeek-R1 distilled into Llama-3.1-8B retains substantial reasoning capability at roughly 1/80th the parameter count and a fraction of the inference cost. Distillation compresses the teacher's learned distribution into the student's smaller parameter space, transferring knowledge that the student could not have learned efficiently from raw data alone.
Distillation transfers capabilities to models that are too small to develop them independently from pretraining data. An 8B model cannot learn frontier-level reasoning from next-token prediction alone, but it can learn to mimic a 671B model's reasoning patterns via distillation. This is capability transfer — the student inherits capabilities that emerge only at scales it cannot reach on its own.
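In practice, this kind of reasoning transfer is often done at the sequence level: sample full outputs from the teacher, then fine-tune the student on them with ordinary supervised learning. Below is a rough sketch using Hugging Face transformers; the model name, prompt list, and generation settings are placeholders, not a reproduction of any lab's actual pipeline.

```python
# Sequence-level distillation sketch: collect teacher traces, then fine-tune
# the student on them. Model names and prompts here are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "deepseek-ai/DeepSeek-R1"  # placeholder teacher checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

# Step 1: build (input, teacher_output) pairs by sampling the teacher.
prompts = ["Prove that the sum of two even numbers is even."]  # placeholder data
pairs = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=512,
                                  do_sample=True, temperature=0.7)
    completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    pairs.append({"prompt": prompt, "completion": completion})

# Step 2: fine-tune the smaller student (e.g. an 8B model) on these pairs with
# plain cross-entropy over the teacher's tokens; no access to teacher logits
# is required, which is why this variant works even across different tokenizers.
```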
Task-specific distillation creates specialized compact models — distill a frontier model's medical knowledge into a small medical expert, its coding ability into a code expert, its reasoning into a reasoning expert. Each distilled model is an independent, deployable module with a clear capability scope. This creates de facto modularity at the model level rather than the weight level, enabling mix-and-match deployment architectures.