Goal Tensions

Improving one research goal often makes another harder. These are the known tradeoffs.

5 documented tensions

Interpretability vs. Efficiency

Making models interpretable often requires additional computation (SAE decomposition, activation analysis) or architectural constraints (simpler circuits) that conflict with raw efficiency.

Known Tradeoff

SAE analysis of a single forward pass adds ~10-100x compute overhead. Interpretability-constrained architectures tend to underperform dense models by 2-5% on benchmarks.

Active Research

Anthropic and others are working on efficient SAE training, alongside research into architectures that are inherently more interpretable without a capability cost.
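
As a sanity check on the overhead figure quoted above, here is a back-of-envelope sketch. The FLOP formulas, the model shape (d_model = 4096, 32 layers), and the 64x dictionary expansion are assumptions chosen for illustration; larger dictionaries, or hooking several sites per layer, push the ratio toward the upper end of the quoted range.

```python
# Back-of-envelope estimate of SAE analysis overhead relative to a plain
# forward pass. All numbers are illustrative assumptions, not measurements.

def transformer_flops_per_token(d_model: int, n_layers: int) -> float:
    # Rough rule of thumb: ~24 * d_model^2 FLOPs per token per layer
    # (attention projections plus MLP), ignoring attention-score terms.
    return 24 * d_model**2 * n_layers

def sae_flops_per_token(d_model: int, n_hooked_layers: int, expansion: int) -> float:
    # One SAE per hooked residual stream: encode (d -> expansion*d) plus
    # decode (expansion*d -> d), roughly 4 * expansion * d_model^2 per token.
    return 4 * expansion * d_model**2 * n_hooked_layers

d_model, n_layers = 4096, 32                      # hypothetical mid-sized model
base = transformer_flops_per_token(d_model, n_layers)
sae = sae_flops_per_token(d_model, n_layers, expansion=64)  # SAE on every layer
print(f"SAE analysis adds roughly {sae / base:.0f}x the base forward-pass compute")
```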

Frontier Capability vs. Cost Efficiency

Frontier capability requires massive compute (scaling laws), while cost efficiency means using less compute. The two goals are directly opposed: you can't simultaneously push the frontier and make it cheap.

Known Tradeoff

At the frontier, roughly 10x more compute buys about 1 benchmark point of improvement. Distillation can close ~70-80% of the gap at 10-100x less cost, but the remaining 20-30% requires frontier-scale compute.

Active Research

DeepSeek has demonstrated much cheaper frontier training. Test-time compute offers an alternative scaling axis, and distillation and synthetic data act as force multipliers.
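
To make the arithmetic of this tradeoff concrete, here is a toy calculation using the rough numbers quoted above. The specific gain (3 points), recovery fraction (75%), and cost ratio (30x) are illustrative assumptions, not measurements.

```python
# Toy arithmetic for the capability-vs-cost tradeoff, using the section's
# rough numbers: ~10x compute per benchmark point at the frontier, and
# distillation recovering ~75% of a gain at ~30x lower cost.

def compute_multiplier(points: float) -> float:
    # "10x compute per benchmark point" implies cost grows exponentially in points.
    return 10 ** points

frontier_gain = 3.0                                # push the frontier by 3 points
frontier_cost = compute_multiplier(frontier_gain)  # 1000x baseline compute

distilled_gain = 0.75 * frontier_gain              # distillation closes ~75% of the gap
distilled_cost = frontier_cost / 30                # at roughly 30x lower cost

print(f"frontier:  +{frontier_gain:.1f} points for {frontier_cost:,.0f}x compute")
print(f"distilled: +{distilled_gain:.2f} points for {distilled_cost:,.1f}x compute")
# Closing the remaining ~0.75 points would still require frontier-scale compute.
```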

Honest Uncertainty vs. Benchmark Performance

Training models to express uncertainty or abstain from answering may reduce apparent capability on benchmarks (which reward confident answers). RLHF pressure toward helpfulness also conflicts with honest expression of uncertainty.

Known Tradeoff

Models trained with strong calibration incentives score 1-3% lower on standard benchmarks due to increased abstention and hedging.

Active Research

Research into reward models that value calibrated uncertainty alongside correctness, and into epistemic RLHF that rewards 'I don't know' when appropriate.
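
A minimal sketch of what such an abstention-aware reward could look like, assuming a hypothetical setting where the model either answers or says "I don't know". The payoff values are made up; the point is only to show how this kind of reward makes abstention optimal below a confidence threshold.

```python
# Sketch of an abstention-aware reward. With these (made-up) payoffs,
# abstaining maximises expected reward whenever the model's probability
# of being correct falls below 2/3.

R_CORRECT = 1.0    # confident answer that turns out to be right
R_WRONG   = -2.0   # confident answer that turns out to be wrong
R_ABSTAIN = 0.0    # calibrated "I don't know"

def expected_reward_if_answering(p_correct: float) -> float:
    return p_correct * R_CORRECT + (1.0 - p_correct) * R_WRONG

def best_action(p_correct: float) -> str:
    return "answer" if expected_reward_if_answering(p_correct) > R_ABSTAIN else "abstain"

for p in (0.9, 0.7, 0.5, 0.3):
    print(f"p(correct)={p:.1f} -> {best_action(p)}")
```

A benchmark that scores only answered questions would record those abstentions as losses, which is the mechanism behind the 1-3% drop noted above.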

Unlearning vs. Capability Retention

Removing knowledge may degrade model capability in related areas due to the distributed nature of knowledge storage (superposition). Aggressive unlearning risks collateral damage.

Known Tradeoff

ROME/MEMIT edits sometimes cause 0.5-2% degradation on neighboring knowledge when many edits are applied.

Active Research

Research into modular architectures that localize knowledge, making surgical removal possible without collateral damage.
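
One way the collateral-damage figure quoted above could be measured is to compare accuracy on a "neighborhood" probe set (facts related to, but not targeted by, the edits) before and after a batch of edits. The sketch below assumes hypothetical `model.generate` and `apply_edits` helpers rather than the actual ROME/MEMIT interfaces.

```python
# Sketch of measuring collateral damage from knowledge edits: accuracy on a
# held-out neighborhood probe set before and after applying a batch of edits.
# `model`, `apply_edits`, and the probe format are hypothetical stand-ins.

from typing import Callable, Iterable, Tuple

def probe_accuracy(model, probes: Iterable[Tuple[str, str]]) -> float:
    probes = list(probes)
    correct = sum(1 for prompt, answer in probes
                  if model.generate(prompt).strip() == answer)
    return correct / len(probes)

def neighborhood_degradation(model, edits, probes, apply_edits: Callable) -> float:
    before = probe_accuracy(model, probes)
    apply_edits(model, edits)          # e.g. a batch of ROME/MEMIT-style edits
    after = probe_accuracy(model, probes)
    return before - after              # the 0.5-2% figure quoted above
```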

Capability vs. Interpretability

The most capable models may be the least interpretable: capability may require complex, entangled representations that resist clean decomposition into human-understandable components.

Known Tradeoff

No quantified tradeoff yet, but toy models that are fully interpretable tend to be less capable than similar-sized opaque models.

Active Research

One hypothesis is that interpretability and capability may converge at sufficient scale (features become cleaner in larger models); this is under active investigation by Anthropic and DeepMind.