Raw Capability Scaling
Active Research

Push the absolute frontier of what AI systems can do — achieving human-expert-level or superhuman performance on reasoning, code, math, science, and creative tasks.
Raw capability is measured by benchmarks like MMLU, GPQA (PhD-level science), SWE-bench (real software engineering), MATH, and ARC-AGI. Key drivers: scale (more parameters, more data), architecture innovations (MoE, test-time compute), training methodology (RLHF, DPO, process reward models), and data quality (synthetic data, curriculum learning). The current frontier is models that can do multi-step reasoning, use tools, and self-correct.
Why Is This Hard?
The Core Difficulty
Scaling laws show diminishing returns. Evaluating 'true intelligence' vs pattern matching is philosophically contested. And the training data is running out — we may be approaching the limit of internet-scale pretraining.
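The diminishing-returns claim can be made concrete with a Chinchilla-style loss curve, L(N, D) = E + A/N^α + B/D^β. A minimal sketch, using the published Chinchilla fit constants purely for illustration (they are not predictive for any particular current model family):

```python
def chinchilla_loss(n_params, n_tokens,
                    e=1.69, a=406.4, b=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss L(N, D) = E + A/N^alpha + B/D^beta.
    Constants are the published Chinchilla fits; illustrative only."""
    return e + a / n_params**alpha + b / n_tokens**beta

# Doubling parameters at fixed data shaves off ever-smaller loss deltas.
base = chinchilla_loss(1e9, 1e12)
x2 = chinchilla_loss(2e9, 1e12)
x4 = chinchilla_loss(4e9, 1e12)
print(base - x2, x2 - x4)  # the second delta is smaller: diminishing returns
```

Each doubling of parameters buys a strictly smaller loss reduction than the last, which is the shape behind "exponentially more resources for linear gains."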
The Fundamental Tension
Scaling requires exponentially more resources for linear capability gains. And the benchmarks we use to measure 'capability' may not capture what matters for real-world usefulness.
Who Feels This
Everyone — this is the primary axis of AI progress that determines what applications become possible.
What Failure Looks Like
Models still struggle with: novel multi-step reasoning, tasks requiring true world models, robust generalization to out-of-distribution problems, consistent reliability (getting it right every time, not just usually).
Where Research Stands
Current Approaches
Scale (more parameters, data, compute), architecture innovation (MoE, SSM hybrids), training methodology (RL for reasoning), test-time compute (extended thinking), multi-agent collaboration, tool use, retrieval augmentation.
Best Result So Far
Frontier models achieve 85-90%+ on MMLU, solve competitive programming problems, pass medical and legal exams, generate production-quality code for moderate-complexity tasks.
Remaining Gaps
Reliability (getting it right consistently, not just on average), novel reasoning (truly out-of-distribution generalization), long-horizon planning, self-awareness of limitations, robust world models.
What a Breakthrough Looks Like
Either: continued scaling with architectural innovations, OR fundamentally new architectures that enable genuine abstraction and reasoning, OR hybrid systems that combine neural capability with formal reasoning.
What Success Looks Like
AI systems that perform at human-expert level across all cognitive tasks — reasoning, code, math, science, writing, planning — with consistent reliability (not just good averages) and the ability to generalize to genuinely novel problems.
Timeline Horizon
3-10 years
Techniques That Address This
Chain-of-thought (CoT) prompting is the single largest capability unlock for complex reasoning tasks. On GSM8K (math), CoT prompting improved PaLM-540B from 56% to 74%. Trained reasoning models (o1, R1) use CoT as the foundation of test-time compute scaling — spending more inference tokens to achieve qualitatively new capabilities on math, code, and science that the base model cannot achieve at all.
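A minimal sketch of few-shot CoT prompting. The exemplar text and the `build_cot_prompt` helper are hypothetical; a real use would send the resulting prompt to a model API:

```python
# Minimal chain-of-thought prompt builder (illustrative exemplar).
COT_EXEMPLAR = (
    "Q: A farmer has 15 sheep and buys 8 more, then sells 5. How many remain?\n"
    "A: Start with 15. Buying 8 gives 15 + 8 = 23. Selling 5 gives "
    "23 - 5 = 18. The answer is 18.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model emits reasoning steps
    before its final answer, instead of answering directly."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA: Let's think step by step."

print(build_cot_prompt("If 3 pencils cost 45 cents, what do 7 cost?"))
```

The point is the shape of the prompt, not the exemplar: demonstrated intermediate steps elicit intermediate steps.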
Distillation transfers capabilities to models that are too small to develop them independently from pretraining data. An 8B model cannot learn frontier-level reasoning from next-token prediction alone, but it can learn to mimic a 671B model's reasoning patterns via distillation. This is capability transfer — the student inherits capabilities that emerge only at scales it cannot reach on its own.
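The core of logit distillation is a KL divergence between temperature-softened teacher and student distributions (the Hinton et al. formulation). A toy sketch with made-up logits, no real models involved:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    the classic distillation objective (sketch only)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, -2.0]
aligned = [3.9, 1.1, -1.8]   # student that mimics the teacher
off = [0.0, 3.0, 1.0]        # student that does not
print(kd_loss(teacher, aligned) < kd_loss(teacher, off))  # True
```

Minimizing this loss pulls the student's full output distribution toward the teacher's, which carries far more signal per example than the hard labels in pretraining data.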
By reducing attention's memory footprint from O(N²) to O(N), FlashAttention made long context windows practically feasible. Models with 128K-1M token contexts (GPT-4, Claude, Gemini) would be economically unviable without it. Longer context directly enables new capabilities: processing entire codebases, books, document collections, and multi-turn agent interactions that were previously impossible.
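FlashAttention's memory saving comes from the online-softmax recurrence: scores are consumed tile by tile and the full N×N score matrix is never materialized. A one-dimensional toy version (scalar keys and values, no GPU tiling) showing just the recurrence:

```python
import math

def streaming_attention(q, keys, values, chunk=2):
    """Single-query attention computed chunk by chunk with the online
    softmax trick: only O(chunk) scores live in memory at once. This is
    the core recurrence FlashAttention applies tile-by-tile on GPU."""
    m = float("-inf")   # running max score
    denom = 0.0         # running softmax denominator
    out = 0.0           # running weighted sum (scalar values here)
    for start in range(0, len(keys), chunk):
        for k, v in zip(keys[start:start+chunk], values[start:start+chunk]):
            s = q * k                     # dot product (1-d toy case)
            new_m = max(m, s)
            scale = math.exp(m - new_m) if m != float("-inf") else 0.0
            denom = denom * scale + math.exp(s - new_m)
            out = out * scale + math.exp(s - new_m) * v
            m = new_m
    return out / denom

# Matches the naive full-softmax result.
q, keys, values = 0.5, [1.0, -0.5, 2.0, 0.3], [10.0, 20.0, 30.0, 40.0]
scores = [q * k for k in keys]
mx = max(scores)
w = [math.exp(s - mx) for s in scores]
naive = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
print(abs(streaming_attention(q, keys, values) - naive) < 1e-9)  # True
```

The rescaling by `exp(m - new_m)` lets earlier partial sums stay valid as new maxima arrive, so nothing about the past tiles needs to be stored.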
Enables larger models at the same compute budget — DeepSeek-V3 and (reportedly) GPT-4 use MoE to achieve frontier capability without proportional compute scaling.
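The routing step that makes this possible is a top-k gate: each token activates only k of E experts, so active compute stays near-constant while total parameters grow with E. A sketch with hypothetical router logits:

```python
import math

def top_k_gate(logits, k=2):
    """Route a token to its top-k experts and renormalize their gate
    weights, as in switch/top-2 MoE routing. Only the selected experts
    run, so compute stays roughly flat as the expert count grows."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical router logits for one token over 8 experts.
print(top_k_gate([0.1, 2.3, -1.0, 0.5, 1.9, -0.2, 0.0, 0.7], k=2))
```

With k=2 and 8 experts, roughly a quarter of the expert parameters are active per token; real routers add load-balancing losses this sketch omits.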
Combines neural flexibility (understanding messy real-world input) with symbolic precision (guaranteed-correct computation) — potentially stronger than either alone on tasks requiring both.
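A toy illustration of that division of labor: a regex stands in for the neural parser (a real system would use an LLM to map language to a formal expression), while exact rational arithmetic plays the symbolic side. All names here are hypothetical:

```python
import re
from fractions import Fraction

def neural_parse(question: str) -> str:
    """Stand-in for the neural half: in a real system an LLM would map
    messy language to a formal expression. Here a regex fakes it."""
    nums = re.findall(r"\d+", question)
    if "split" in question and "among" in question:
        return f"{nums[0]}/{nums[1]}"
    raise ValueError("unparsed")

def symbolic_eval(expr: str) -> Fraction:
    """Symbolic half: exact rational arithmetic, no floating error."""
    a, b = expr.split("/")
    return Fraction(int(a), int(b))

q = "If 7 pizzas are split among 3 people, how much does each get?"
print(symbolic_eval(neural_parse(q)))  # prints 7/3
```

The neural side handles ambiguity; once the expression exists, the symbolic side's answer is guaranteed correct.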
RAG extends a model's effective knowledge far beyond its training data cutoff and memorization capacity. A 7B model with access to a curated retrieval corpus can outperform a 70B model on domain-specific factual questions — capability via access rather than via parameters.
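The retrieve-then-prompt loop in miniature: a bag-of-words cosine retriever over a two-document toy corpus. Real systems use dense embeddings and larger corpora, but the pattern (retrieve, then stuff into the prompt) is the same:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Toy lexical retriever: rank documents by bag-of-words cosine
    similarity to the query and return the top k."""
    qv = Counter(query.lower().split())
    scored = sorted(corpus, key=lambda d: cosine(qv, Counter(d.lower().split())),
                    reverse=True)
    return scored[:k]

corpus = [
    "FlashAttention reduces attention memory from quadratic to linear.",
    "Mixture-of-experts routes each token to a few specialist networks.",
]
query = "how does flashattention reduce memory"
context = retrieve(query, corpus)[0]
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer using only the context."
print(prompt)
```

The capability gain comes entirely from what lands in `context`: the model's parameters never change, but its effective knowledge does.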
RLHF transforms raw pretrained models into usable ones. The capability unlock is not in raw benchmark scores but in instruction following, task completion, and output quality — the "last mile" that makes a model practically useful. Without alignment training, even the largest models produce poorly formatted, unhelpful, or unsafe outputs. RLHF/DPO is what makes capability accessible.
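DPO makes that "last mile" concrete as a single loss per preference pair. A sketch with hypothetical sequence log-probabilities (no model involved):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid(beta * (policy log-ratio minus reference log-ratio)).
    Inputs are hypothetical sequence log-probs, not from a real model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that has shifted mass toward the chosen answer gets lower loss.
before = dpo_loss(-12.0, -10.0, -12.0, -10.0)  # identical to reference
after = dpo_loss(-10.0, -13.0, -12.0, -10.0)   # prefers chosen more
print(before > after)  # True
```

Because the reference model's log-probs appear inside the margin, DPO optimizes preferences directly without training a separate reward model.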
Tensions With Other Goals
Frontier capability requires massive compute (scaling laws), while cost efficiency means using less compute. The two goals are directly opposed — you can't simultaneously push the frontier and make it cheap.
Known Tradeoff
Roughly 10x compute → 1 benchmark point improvement at the frontier. Distillation can close ~70-80% of the gap at 10-100x less cost, but the remaining 20-30% requires frontier-scale compute.
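Under that rough rule, the cost of scaling-only gains compounds multiplicatively. A back-of-envelope helper; the 10x-per-point ratio is the text's estimate, not a law:

```python
def scaling_compute(points, points_per_10x=1.0):
    """Compute multiplier needed to gain `points` benchmark points by
    scaling alone, under the rough '10x compute per point' rule above."""
    return 10 ** (points / points_per_10x)

print(scaling_compute(1))  # 10.0
print(scaling_compute(3))  # 1000.0: three points cost 100x more than one
```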
Active Research
DeepSeek demonstrating much cheaper frontier training. Test-time compute as an alternative scaling axis. Distillation and synthetic data as force multipliers.
Training models to express uncertainty or abstain from answering may reduce apparent capability on benchmarks (which reward confident answers). RLHF pressure toward helpfulness conflicts with honest uncertainty expression.
Known Tradeoff
Models trained with strong calibration incentives score 1-3% lower on standard benchmarks due to increased abstention and hedging.
Active Research
Research into reward models that value calibrated uncertainty alongside correctness. Epistemic RLHF that rewards 'I don't know' when appropriate.
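The benchmark penalty described above falls out directly from selective-prediction scoring: abstaining lowers the benchmark score (abstentions count as wrong) even while accuracy on answered questions rises. A sketch on synthetic data:

```python
def score_with_abstention(predictions, labels, threshold):
    """Selective prediction: answer only when confidence >= threshold.
    `predictions` are (answer, confidence) pairs; data is synthetic."""
    answered = correct = 0
    for (ans, conf), label in zip(predictions, labels):
        if conf >= threshold:
            answered += 1
            correct += (ans == label)
    coverage = answered / len(labels)
    benchmark_score = correct / len(labels)   # abstention scored as wrong
    selective_acc = correct / answered if answered else 0.0
    return coverage, benchmark_score, selective_acc

preds = [("a", 0.95), ("b", 0.90), ("c", 0.55), ("d", 0.51), ("e", 0.97)]
labels = ["a", "b", "x", "d", "e"]
print(score_with_abstention(preds, labels, 0.0))  # answer everything
print(score_with_abstention(preds, labels, 0.6))  # abstain when unsure
```

On this toy data the benchmark score drops from 0.8 to 0.6 under the threshold while accuracy-on-answered rises to 1.0, which is the shape of the tradeoff calibration research is trying to reward rather than punish.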
Removing knowledge may degrade model capability in related areas due to the distributed nature of knowledge storage (superposition). Aggressive unlearning risks collateral damage.
Known Tradeoff
ROME/MEMIT edits sometimes cause 0.5-2% degradation on neighboring knowledge when many edits are applied.
Active Research
Modular architectures that localize knowledge, making surgical removal possible without collateral damage.
The most capable models may be the least interpretable — capability may require complex, entangled representations that resist clean decomposition into human-understandable components.
Known Tradeoff
No quantified tradeoff yet, but toy models that are fully interpretable tend to be less capable than similar-sized opaque models.
Active Research
Hypothesis: interpretability and capability may converge at sufficient scale (features become cleaner in larger models). Active investigation by Anthropic and DeepMind.
Real-World Pressure
Competitive pressure between AI labs. National strategic interest. Market demand for automation.