Speculative Decoding
Inference Optimization
An inference acceleration technique in which a small, fast "draft" model cheaply proposes several candidate tokens and the large "target" model then verifies all of them in a single parallel forward pass, typically yielding a 2-3x speedup while provably preserving the target model's output distribution.
The algorithm:

1. A small draft model (e.g. Llama-3.1-8B drafting for Llama-3.1-70B) autoregressively generates K candidate tokens (K is typically 3-8).
2. The large target model runs a single forward pass over the entire draft sequence, producing probabilities for all K positions simultaneously.
3. Each draft token is accepted or rejected by comparing the draft and target probability distributions with a rejection sampling scheme: a draft token x is accepted with probability min(1, p_target(x) / p_draft(x)). Accepted tokens are kept; at the first rejection, a corrected token is sampled from the adjusted distribution norm(max(0, p_target - p_draft)) and generation continues from there (see the sketch below).

Key property: the output distribution is mathematically identical to running the target model alone, so speculative decoding is lossless. Implementation: vLLM, TensorRT-LLM, and Hugging Face Transformers all support this natively.
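To make the accept/reject step concrete, here is a minimal NumPy sketch of the verification logic for one speculative step. It assumes the caller has already run the draft model to obtain draft_tokens and p_draft, and the target model to obtain p_target from a single forward pass; the function name and array layout are illustrative, not any particular library's API.

```python
import numpy as np

def speculative_step(draft_tokens, p_draft, p_target, rng):
    """Verify K draft tokens against the target model's distributions.

    draft_tokens: (K,) int array of tokens proposed by the draft model.
    p_draft:      (K, V) draft-model probabilities at each draft position.
    p_target:     (K + 1, V) target-model probabilities from one forward
                  pass over the draft sequence (the extra row is the
                  distribution after the last draft token).
    Returns the accepted tokens plus one corrected or bonus token.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p_target / p_draft).
        if rng.random() < min(1.0, p_target[i, tok] / p_draft[i, tok]):
            accepted.append(int(tok))
        else:
            # Rejected: resample from the adjusted distribution
            # norm(max(0, p_target - p_draft)), which is what makes the
            # overall output distribution match the target model exactly.
            residual = np.clip(p_target[i] - p_draft[i], 0.0, None)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted
    # All K draft tokens accepted: sample one bonus token for free from
    # the target's distribution at the position after the last draft token.
    accepted.append(int(rng.choice(p_target.shape[1], p=p_target[-1])))
    return accepted
```

On average, more than one token is emitted per target forward pass (accepted draft tokens plus the corrected or bonus token), which is where the speedup comes from.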
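For the framework support mentioned above, a minimal Hugging Face Transformers usage sketch follows; Transformers calls this assisted generation and enables it via the assistant_model argument to generate(). The model names are illustrative, and any target/draft pair sharing a tokenizer works.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choices: a large target and a smaller draft ("assistant").
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct", device_map="auto"
)
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", device_map="auto"
)

inputs = tokenizer(
    "Explain speculative decoding in one sentence.", return_tensors="pt"
).to(target.device)

# Passing assistant_model switches generate() to assisted/speculative decoding.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```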
Why Does This Exist?
Speculative decoding achieves a 2-3x inference throughput improvement while producing a mathematically identical output distribution: pure cost reduction with no quality tradeoff. For production deployments where inference cost dominates (API serving, high-volume applications), this translates directly into serving the same quality at half to a third of the compute cost. It is one of the few optimization techniques that is a genuine free lunch.