Mixture of Experts (MoE)
Architecture Pattern
A sparse architecture where each input is processed by only a subset of the model's total parameters, selected by a learned routing mechanism — enabling models to scale total knowledge while keeping per-input computation constant.
MoE replaces the dense feed-forward network (FFN) in each transformer block with N parallel FFN 'experts' and a gating/routing network. For each token, the router produces a probability distribution over experts and selects the top-k (often 1-2, more in fine-grained designs). Only those experts execute, so FLOPs scale with active parameters, not total parameters. Example: Mixtral 8x7B has ~47B total parameters but only ~13B active per token; DeepSeek-V3 uses 256 fine-grained experts with 8 active per token. The router is typically a simple linear layer trained end-to-end with the experts.
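The routing logic is compact enough to sketch. Below is a minimal top-k MoE layer in PyTorch; the class name, dimensions, and the per-expert Python loop are illustrative assumptions for clarity (production systems use fused, batched expert dispatch), not the implementation used by Mixtral or DeepSeek-V3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer: a linear router picks top-k experts per token."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model), tokens already flattened from (batch, seq)
        logits = self.router(x)                          # (n_tokens, n_experts)
        weights, indices = logits.topk(self.k, dim=-1)   # per-token top-k experts
        weights = F.softmax(weights, dim=-1)             # renormalize over chosen experts

        out = torch.zeros_like(x)
        # Loop over experts; only tokens routed to an expert pay its FLOPs.
        for e, expert in enumerate(self.experts):
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert received no tokens
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

# Usage: 16 tokens of width 512; only k=2 of the 8 experts run per token.
moe = TopKMoE(d_model=512, d_hidden=2048, n_experts=8, k=2)
y = moe(torch.randn(16, 512))
```

Real deployments typically also add a load-balancing auxiliary loss so the router does not collapse onto a few experts; that term is omitted here for brevity.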
Why Does This Exist?
Core value proposition: scale knowledge (total parameters) while keeping inference cost constant (active parameters). Among the best FLOPs-to-quality ratios currently available.
If experts specialize (an open question), MoE provides architectural modularity: capabilities mapped to specific experts could in principle be added, removed, or updated independently.
Enables larger models at the same compute budget: DeepSeek-V3 and (reportedly) GPT-4 use MoE to reach frontier capability without proportional compute scaling.
If experts specialize interpretably, MoE could make model internals more modular and thus more analyzable, though current evidence for clean specialization is mixed.