Knowledge Distillation
Training Method
A training technique where a large, capable "teacher" model's knowledge is transferred to a smaller, cheaper "student" model — compressing frontier capabilities into deployable sizes by training the student to mimic the teacher's output distribution rather than learning from raw data alone.
Distillation trains a student model to match the teacher's output distribution (soft labels), not just the correct answer (hard labels). The teacher's probability distribution over all tokens contains "dark knowledge": information about which wrong answers are more plausible than others, which hard labels discard. Standard recipe:
1. Generate a large dataset of (input, teacher_output) pairs.
2. Train the student with a loss that combines KL divergence from the teacher's distribution (at temperature T > 1 to soften the probabilities) with standard cross-entropy on the ground truth.
3. Use temperature scaling to control how much of the teacher's uncertainty is transferred.

Practical examples: DeepSeek-R1 distilled into Llama/Qwen variants, GPT-4 knowledge reportedly distilled into smaller OpenAI models, Gemma distilled from Gemini.
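The combined loss in step 2 is compact enough to show directly. Here is a minimal sketch in PyTorch; the function name, the alpha weighting, and the default temperature are illustrative choices rather than a fixed recipe from any particular system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-label KL divergence (teacher -> student) with hard-label CE.

    student_logits, teacher_logits: (batch, vocab) raw logits
    hard_labels: (batch,) ground-truth token ids
    """
    # Soften both distributions with temperature T > 1 so that low-probability
    # ("dark knowledge") tokens carry more of the gradient signal.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL(teacher || student); the T^2 factor keeps gradient magnitudes
    # comparable to the unscaled cross-entropy term (Hinton et al. convention).
    kl = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth hard labels.
    ce = F.cross_entropy(student_logits, hard_labels)

    # alpha balances mimicry of the teacher against fitting the ground truth.
    return alpha * kl + (1.0 - alpha) * ce
```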
Why Does This Exist?
The most direct path from frontier capability to affordable deployment. DeepSeek-R1 distilled into Llama-3.1-8B retains substantial reasoning capability at roughly 1/80th the parameter count and a fraction of the inference cost. Distillation compresses the teacher's learned distribution into the student's smaller parameter space, transferring knowledge that the student could not have learned efficiently from raw data alone.
Distillation transfers capabilities to models that are too small to develop them independently from pretraining data. An 8B model cannot learn frontier-level reasoning from next-token prediction alone, but it can learn to mimic a 671B model's reasoning patterns via distillation. This is capability transfer — the student inherits capabilities that emerge only at scales it cannot reach on its own.
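In practice, this kind of reasoning transfer is often done at the sequence level: sample full outputs from the teacher, then fine-tune the student on them with ordinary supervised learning. Below is a rough sketch using Hugging Face transformers; the model name, prompt list, and generation settings are placeholders, not a reproduction of any lab's actual pipeline.

```python
# Sequence-level distillation sketch: collect teacher traces, then fine-tune
# the student on them. Model names and prompts here are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name = "deepseek-ai/DeepSeek-R1"  # placeholder teacher checkpoint
tokenizer = AutoTokenizer.from_pretrained(teacher_name)
teacher = AutoModelForCausalLM.from_pretrained(teacher_name)

# Step 1: build (input, teacher_output) pairs by sampling the teacher.
prompts = ["Prove that the sum of two even numbers is even."]  # placeholder data
pairs = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = teacher.generate(**inputs, max_new_tokens=512,
                                  do_sample=True, temperature=0.7)
    completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    pairs.append({"prompt": prompt, "completion": completion})

# Step 2: fine-tune the smaller student (e.g. an 8B model) on these pairs with
# plain cross-entropy over the teacher's tokens; no access to teacher logits
# is required, which is why this variant works even across different tokenizers.
```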
Task-specific distillation creates specialized compact models — distill a frontier model's medical knowledge into a small medical expert, its coding ability into a code expert, its reasoning into a reasoning expert. Each distilled model is an independent, deployable module with a clear capability scope. This creates de facto modularity at the model level rather than the weight level, enabling mix-and-match deployment architectures.