Goal Tensions
Improving one research goal often makes another harder. These are the known tradeoffs.
5 documented tensions
1. Interpretability vs. Efficiency
Making models interpretable often requires additional computation (SAE decomposition, activation analysis) or architectural constraints (simpler circuits) that conflict with raw efficiency.
Known Tradeoff
SAE analysis of a single forward pass adds ~10-100x compute overhead. Interpretability-constrained architectures tend to underperform dense models by 2-5% on benchmarks.
Active Research
Anthropic and others working on efficient SAE training. Research into architectures that are inherently more interpretable without capability cost.
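A minimal sketch makes the overhead concrete: decomposing one layer's activations with a sparse autoencoder adds an encode and a decode against a much wider feature dimension. The dimensions, expansion factor, and FLOP accounting below are illustrative assumptions (the quoted 10-100x also covers SAE training and analysis across all layers), not measurements of any particular model.

```python
import numpy as np

# Hypothetical dimensions, chosen only to illustrate the compute overhead.
d_model = 768          # residual-stream width (assumed)
d_sae = 16 * d_model   # SAEs typically use a large expansion factor

rng = np.random.default_rng(0)
W_enc = rng.normal(0.0, 0.02, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.02, (d_sae, d_model))

def sae_decompose(x):
    """Encode activations into sparse features, then reconstruct."""
    features = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU sparsifies
    return features, features @ W_dec

x = rng.normal(size=(1, d_model))  # one token's activations
features, x_hat = sae_decompose(x)

# FLOPs per token: the MLP being analyzed does two matmuls against 4*d_model;
# the SAE adds an encode and a decode against d_sae = 16*d_model.
mlp_flops = 2 * (2 * d_model * 4 * d_model)
sae_flops = 2 * (2 * d_model * d_sae)
print(f"SAE adds ~{sae_flops / mlp_flops:.0f}x the FLOPs of that MLP")
```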
2. Capability vs. Cost
Frontier capability requires massive compute (scaling laws), while cost efficiency means using less compute. The two goals are directly opposed: you can't simultaneously push the frontier and make it cheap.
Known Tradeoff
Roughly 10x compute → 1 benchmark point improvement at the frontier. Distillation can close ~70-80% of the gap at 10-100x less cost, but the remaining 20-30% requires frontier-scale compute.
Active Research
DeepSeek demonstrating much cheaper frontier training. Test-time compute as an alternative scaling axis. Distillation and synthetic data as force multipliers.
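The rule of thumb above can be turned into back-of-envelope arithmetic. All numbers here are illustrative, taken from the figures quoted in this section rather than measured values.

```python
# Assumes a log-linear compute-to-score relation: ~10x compute per point.
def compute_multiplier(points_gained, compute_per_point=10.0):
    """Compute multiplier implied by gaining the given benchmark points."""
    return compute_per_point ** points_gained

for pts in (1, 2, 3):
    print(f"+{pts} points ~ {compute_multiplier(pts):,.0f}x compute")

# The distillation framing: if distillation closes ~75% of a hypothetical
# 4-point gap cheaply, the final ~1 point still demands ~10x frontier compute.
gap_points = 4.0
remaining = gap_points * (1 - 0.75)
print(f"last {remaining:.0f} point needs ~{compute_multiplier(remaining):.0f}x compute")
```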
3. Honesty vs. Benchmark Performance
Training models to express uncertainty or abstain from answering may reduce apparent capability on benchmarks (which reward confident answers). RLHF pressure toward helpfulness conflicts with honest uncertainty expression.
Known Tradeoff
Models trained with strong calibration incentives score 1-3% lower on standard benchmarks due to increased abstention and hedging.
Active Research
Research into reward models that value calibrated uncertainty alongside correctness. Epistemic RLHF that rewards 'I don't know' when appropriate.
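A small expected-value sketch shows why this penalty appears: most benchmarks score an abstention as zero, so a model that declines hard questions loses points even if it is more accurate whenever it does answer. The accuracy and abstention figures are hypothetical, chosen to land in the 1-3% range quoted above.

```python
def benchmark_score(accuracy_when_answering, abstain_rate):
    """Expected benchmark accuracy when abstentions earn zero credit."""
    return (1 - abstain_rate) * accuracy_when_answering

confident = benchmark_score(0.80, abstain_rate=0.00)   # always answers
calibrated = benchmark_score(0.82, abstain_rate=0.05)  # declines hardest 5%

# The calibrated model is more accurate when it answers, yet scores lower.
print(f"always-answer model: {confident:.1%}")   # 80.0%
print(f"calibrated model:    {calibrated:.1%}")  # 77.9%
```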
4. Unlearning vs. Capability
Removing knowledge may degrade model capability in related areas due to the distributed nature of knowledge storage (superposition). Aggressive unlearning risks collateral damage.
Known Tradeoff
ROME/MEMIT edits sometimes cause 0.5-2% degradation on neighboring knowledge when many edits are applied.
Active Research
Modular architectures that localize knowledge, making surgical removal possible without collateral damage.
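The collateral-damage mechanism can be sketched with a ROME-style rank-one update on a linear layer: the edit maps one key vector exactly to a new value, but any nearby key is perturbed too. The dimensions and vectors are hypothetical; this is a toy illustration of the update form, not the full ROME/MEMIT procedure (which also uses covariance statistics to limit such drift).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(0.0, 0.1, (d, d))  # stand-in for an MLP projection

k = rng.normal(size=d)      # key vector for the fact being edited
v_new = rng.normal(size=d)  # desired new value for that key

# Rank-one update chosen so that W_edit @ k == v_new exactly.
W_edit = W + np.outer(v_new - W @ k, k) / (k @ k)

# A neighboring key (high similarity to k) is also perturbed: this is the
# "degradation on neighboring knowledge" described above.
k_near = k + 0.1 * rng.normal(size=d)
drift = np.linalg.norm(W_edit @ k_near - W @ k_near)
print(f"output drift on a neighboring key: {drift:.3f}")
```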
5. Capability vs. Interpretability
The most capable models may be the least interpretable — capability may require complex, entangled representations that resist clean decomposition into human-understandable components.
Known Tradeoff
No quantified tradeoff yet, but toy models that are fully interpretable tend to be less capable than similar-sized opaque models.
Active Research
Hypothesis: interpretability and capability may converge at sufficient scale (features become cleaner in larger models). Active investigation by Anthropic and DeepMind.
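The toy-model observation above can be reproduced in miniature. In the sketch below, a layer with a ReLU readout packs eight sparse binary features into four dimensions using antipodal directions (superposition), and reconstructs them with lower loss than the fully interpretable one-feature-per-dimension layout, which must drop half the features. The setup is a hypothetical illustration in the spirit of toy-model experiments, not a result from any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 8, 4, 0.1  # features, hidden dims, probability a feature is active

# Interpretable layout: features 0-3 get a dimension each; 4-7 are dropped.
W_mono = np.vstack([np.eye(d), np.zeros((n - d, d))])
# Superposed layout: pairs of features share a dimension with opposite signs.
W_super = np.vstack([np.eye(d), -np.eye(d)])

def loss(W, x):
    """Mean squared reconstruction error with a ReLU readout."""
    return np.mean((np.maximum(x @ W @ W.T, 0.0) - x) ** 2)

x = (rng.random((20000, n)) < p).astype(float)  # sparse binary features
print(f"interpretable layout (drops 4 features): {loss(W_mono, x):.3f}")
print(f"superposed layout (keeps all 8):         {loss(W_super, x):.3f}")
```

The superposed layout only wins because the features are sparse: interference costs a pair of features occasionally, while the interpretable layout loses four features always. Denser features reverse the ordering, which is one way of framing why the tradeoff is unresolved.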