Raw Capability Scaling
Active Research

Push the absolute frontier of what AI systems can do — achieving human-expert-level or superhuman performance on reasoning, code, math, science, and creative tasks.
Raw capability is measured by benchmarks like MMLU, GPQA (PhD-level science), SWE-bench (real software engineering), MATH, and ARC-AGI. Key drivers: scale (more parameters, more data), architecture innovations (MoE, test-time compute), training methodology (RLHF, DPO, process reward models), and data quality (synthetic data, curriculum learning). The current frontier is models that can do multi-step reasoning, use tools, and self-correct.
Why Is This Hard?
The Core Difficulty
Scaling laws show diminishing returns. Evaluating 'true intelligence' vs pattern matching is philosophically contested. And the training data is running out — we may be approaching the limit of internet-scale pretraining.
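The diminishing-returns claim can be made concrete with a Chinchilla-style loss curve, L(N, D) = E + A/N^α + B/D^β. A minimal sketch, using the published Chinchilla fit constants purely for illustration (they are not predictive for any particular current model family):

```python
def chinchilla_loss(n_params, n_tokens,
                    e=1.69, a=406.4, b=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style loss L(N, D) = E + A/N^alpha + B/D^beta.
    Constants are the published Chinchilla fits; illustrative only."""
    return e + a / n_params**alpha + b / n_tokens**beta

# Doubling parameters at fixed data shaves off ever-smaller loss deltas.
base = chinchilla_loss(1e9, 1e12)
x2 = chinchilla_loss(2e9, 1e12)
x4 = chinchilla_loss(4e9, 1e12)
print(base - x2, x2 - x4)  # the second delta is smaller: diminishing returns
```

Each doubling of parameters buys a strictly smaller loss reduction than the last, which is the shape behind "exponentially more resources for linear gains."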
The Fundamental Tension
Scaling requires exponentially more resources for linear capability gains. And the benchmarks we use to measure 'capability' may not capture what matters for real-world usefulness.
Who Feels This
Everyone — this is the primary axis of AI progress that determines what applications become possible.
What Failure Looks Like
Models still struggle with: novel multi-step reasoning, tasks requiring true world models, robust generalization to out-of-distribution problems, consistent reliability (getting it right every time, not just usually).
Where Research Stands
Current Approaches
Scale (more parameters, data, compute), architecture innovation (MoE, SSM hybrids), training methodology (RL for reasoning), test-time compute (extended thinking), multi-agent collaboration, tool use, retrieval augmentation.
Best Result So Far
Frontier models achieve 85-90%+ on MMLU, solve competitive programming problems, pass medical and legal exams, generate production-quality code for moderate-complexity tasks.
Remaining Gaps
Reliability (getting it right consistently, not just on average), novel reasoning (truly out-of-distribution generalization), long-horizon planning, self-awareness of limitations, robust world models.
What a Breakthrough Looks Like
Either: continued scaling with architectural innovations, OR fundamentally new architectures that enable genuine abstraction and reasoning, OR hybrid systems that combine neural capability with formal reasoning.
What Success Looks Like
AI systems that perform at human-expert level across all cognitive tasks — reasoning, code, math, science, writing, planning — with consistent reliability (not just good averages) and the ability to generalize to genuinely novel problems.
Timeline Horizon
3-10 years
Techniques That Address This
Chain-of-thought (CoT) prompting is the single largest capability unlock for complex reasoning tasks. On GSM8K (math), CoT prompting improved PaLM-540B from 56% to 74%. Trained reasoning models (o1, R1) use CoT as the foundation of test-time compute scaling — spending more inference tokens to achieve qualitatively new capabilities on math, code, and science that the base model cannot achieve at all.
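A minimal sketch of few-shot CoT prompting. The exemplar text and the `build_cot_prompt` helper are hypothetical; a real use would send the resulting prompt to a model API:

```python
# Minimal chain-of-thought prompt builder (illustrative exemplar).
COT_EXEMPLAR = (
    "Q: A farmer has 15 sheep and buys 8 more, then sells 5. How many remain?\n"
    "A: Start with 15. Buying 8 gives 15 + 8 = 23. Selling 5 gives "
    "23 - 5 = 18. The answer is 18.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model emits reasoning steps
    before its final answer, instead of answering directly."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA: Let's think step by step."

print(build_cot_prompt("If 3 pencils cost 45 cents, what do 7 cost?"))
```

The point is the shape of the prompt, not the exemplar: demonstrated intermediate steps elicit intermediate steps.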
Distillation transfers capabilities to models that are too small to develop them independently from pretraining data. An 8B model cannot learn frontier-level reasoning from next-token prediction alone, but it can learn to mimic a 671B model's reasoning patterns via distillation. This is capability transfer — the student inherits capabilities that emerge only at scales it cannot reach on its own.
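The core of logit distillation is a KL divergence between temperature-softened teacher and student distributions (the Hinton et al. formulation). A toy sketch with made-up logits, no real models involved:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    the classic distillation objective (sketch only)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, -2.0]
aligned = [3.9, 1.1, -1.8]   # student that mimics the teacher
off = [0.0, 3.0, 1.0]        # student that does not
print(kd_loss(teacher, aligned) < kd_loss(teacher, off))  # True
```

Minimizing this loss pulls the student's full output distribution toward the teacher's, which carries far more signal per example than the hard labels in pretraining data.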
By reducing attention's memory footprint from O(N²) to O(N), FlashAttention made long context windows practically feasible. Models with 128K-1M token contexts (GPT-4, Claude, Gemini) would be economically unviable without it. Longer context directly enables new capabilities: processing entire codebases, books, document collections, and multi-turn agent interactions that were previously impossible.
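FlashAttention's memory saving comes from the online-softmax recurrence: scores are consumed tile by tile and the full N×N score matrix is never materialized. A one-dimensional toy version (scalar keys and values, no GPU tiling) showing just the recurrence:

```python
import math

def streaming_attention(q, keys, values, chunk=2):
    """Single-query attention computed chunk by chunk with the online
    softmax trick: only O(chunk) scores live in memory at once. This is
    the core recurrence FlashAttention applies tile-by-tile on GPU."""
    m = float("-inf")   # running max score
    denom = 0.0         # running softmax denominator
    out = 0.0           # running weighted sum (scalar values here)
    for start in range(0, len(keys), chunk):
        for k, v in zip(keys[start:start+chunk], values[start:start+chunk]):
            s = q * k                     # dot product (1-d toy case)
            new_m = max(m, s)
            scale = math.exp(m - new_m) if m != float("-inf") else 0.0
            denom = denom * scale + math.exp(s - new_m)
            out = out * scale + math.exp(s - new_m) * v
            m = new_m
    return out / denom

# Matches the naive full-softmax result.
q, keys, values = 0.5, [1.0, -0.5, 2.0, 0.3], [10.0, 20.0, 30.0, 40.0]
scores = [q * k for k in keys]
mx = max(scores)
w = [math.exp(s - mx) for s in scores]
naive = sum(wi * vi for wi, vi in zip(w, values)) / sum(w)
print(abs(streaming_attention(q, keys, values) - naive) < 1e-9)  # True
```

The rescaling by `exp(m - new_m)` lets earlier partial sums stay valid as new maxima arrive, so nothing about the past tiles needs to be stored.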
Enables larger models at the same compute budget — DeepSeek-V3 and (reportedly) GPT-4 use MoE to achieve frontier capability without proportional compute scaling.
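The routing step that makes this possible is a top-k gate: each token activates only k of E experts, so active compute stays near-constant while total parameters grow with E. A sketch with hypothetical router logits:

```python
import math

def top_k_gate(logits, k=2):
    """Route a token to its top-k experts and renormalize their gate
    weights, as in switch/top-2 MoE routing. Only the selected experts
    run, so compute stays roughly flat as the expert count grows."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    chosen = ranked[:k]
    exps = [math.exp(logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical router logits for one token over 8 experts.
print(top_k_gate([0.1, 2.3, -1.0, 0.5, 1.9, -0.2, 0.0, 0.7], k=2))
```

With k=2 and 8 experts, roughly a quarter of the expert parameters are active per token; real routers add load-balancing losses this sketch omits.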
Combines neural flexibility (understanding messy real-world input) with symbolic precision (guaranteed-correct computation) — potentially stronger than either alone on tasks requiring both.
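A toy illustration of that division of labor: a regex stands in for the neural parser (a real system would use an LLM to map language to a formal expression), while exact rational arithmetic plays the symbolic side. All names here are hypothetical:

```python
import re
from fractions import Fraction

def neural_parse(question: str) -> str:
    """Stand-in for the neural half: in a real system an LLM would map
    messy language to a formal expression. Here a regex fakes it."""
    nums = re.findall(r"\d+", question)
    if "split" in question and "among" in question:
        return f"{nums[0]}/{nums[1]}"
    raise ValueError("unparsed")

def symbolic_eval(expr: str) -> Fraction:
    """Symbolic half: exact rational arithmetic, no floating error."""
    a, b = expr.split("/")
    return Fraction(int(a), int(b))

q = "If 7 pizzas are split among 3 people, how much does each get?"
print(symbolic_eval(neural_parse(q)))  # prints 7/3
```

The neural side handles ambiguity; once the expression exists, the symbolic side's answer is guaranteed correct.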
RAG extends a model's effective knowledge far beyond its training data cutoff and memorization capacity. A 7B model with access to a curated retrieval corpus can outperform a 70B model on domain-specific factual questions — capability via access rather than via parameters.
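The retrieve-then-prompt loop in miniature: a bag-of-words cosine retriever over a two-document toy corpus. Real systems use dense embeddings and larger corpora, but the pattern (retrieve, then stuff into the prompt) is the same:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Toy lexical retriever: rank documents by bag-of-words cosine
    similarity to the query and return the top k."""
    qv = Counter(query.lower().split())
    scored = sorted(corpus, key=lambda d: cosine(qv, Counter(d.lower().split())),
                    reverse=True)
    return scored[:k]

corpus = [
    "FlashAttention reduces attention memory from quadratic to linear.",
    "Mixture-of-experts routes each token to a few specialist networks.",
]
query = "how does flashattention reduce memory"
context = retrieve(query, corpus)[0]
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer using only the context."
print(prompt)
```

The capability gain comes entirely from what lands in `context`: the model's parameters never change, but its effective knowledge does.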
RLHF transforms raw pretrained models into usable ones. The capability unlock is not in raw benchmark scores but in instruction following, task completion, and output quality — the "last mile" that makes a model practically useful. Without alignment training, even the largest models produce poorly formatted, unhelpful, or unsafe outputs. RLHF/DPO is what makes capability accessible.
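DPO makes that "last mile" concrete as a single loss per preference pair. A sketch with hypothetical sequence log-probabilities (no model involved):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair:
    -log sigmoid(beta * (policy log-ratio minus reference log-ratio)).
    Inputs are hypothetical sequence log-probs, not from a real model."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A policy that has shifted mass toward the chosen answer gets lower loss.
before = dpo_loss(-12.0, -10.0, -12.0, -10.0)  # identical to reference
after = dpo_loss(-10.0, -13.0, -12.0, -10.0)   # prefers chosen more
print(before > after)  # True
```

Because the reference model's log-probs appear inside the margin, DPO optimizes preferences directly without training a separate reward model.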
Tensions With Other Goals
Frontier capability requires massive compute (scaling laws), while cost efficiency means using less compute. The two goals are directly opposed — you can't simultaneously push the frontier and make it cheap.
Known Tradeoff
Roughly 10x compute → 1 benchmark point improvement at the frontier. Distillation can close ~70-80% of the gap at 10-100x less cost, but the remaining 20-30% requires frontier-scale compute.
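Under that rough rule, the cost of scaling-only gains compounds multiplicatively. A back-of-envelope helper; the 10x-per-point ratio is the text's estimate, not a law:

```python
def scaling_compute(points, points_per_10x=1.0):
    """Compute multiplier needed to gain `points` benchmark points by
    scaling alone, under the rough '10x compute per point' rule above."""
    return 10 ** (points / points_per_10x)

print(scaling_compute(1))  # 10.0
print(scaling_compute(3))  # 1000.0: three points cost 100x more than one
```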
Active Research
DeepSeek demonstrating much cheaper frontier training. Test-time compute as an alternative scaling axis. Distillation and synthetic data as force multipliers.
Training models to express uncertainty or abstain from answering may reduce apparent capability on benchmarks (which reward confident answers). RLHF pressure toward helpfulness conflicts with honest uncertainty expression.
Known Tradeoff
Models trained with strong calibration incentives score 1-3% lower on standard benchmarks due to increased abstention and hedging.
Active Research
Research into reward models that value calibrated uncertainty alongside correctness. Epistemic RLHF that rewards 'I don't know' when appropriate.
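The benchmark penalty described above falls out directly from selective-prediction scoring: abstaining lowers the benchmark score (abstentions count as wrong) even while accuracy on answered questions rises. A sketch on synthetic data:

```python
def score_with_abstention(predictions, labels, threshold):
    """Selective prediction: answer only when confidence >= threshold.
    `predictions` are (answer, confidence) pairs; data is synthetic."""
    answered = correct = 0
    for (ans, conf), label in zip(predictions, labels):
        if conf >= threshold:
            answered += 1
            correct += (ans == label)
    coverage = answered / len(labels)
    benchmark_score = correct / len(labels)   # abstention scored as wrong
    selective_acc = correct / answered if answered else 0.0
    return coverage, benchmark_score, selective_acc

preds = [("a", 0.95), ("b", 0.90), ("c", 0.55), ("d", 0.51), ("e", 0.97)]
labels = ["a", "b", "x", "d", "e"]
print(score_with_abstention(preds, labels, 0.0))  # answer everything
print(score_with_abstention(preds, labels, 0.6))  # abstain when unsure
```

On this toy data the benchmark score drops from 0.8 to 0.6 under the threshold while accuracy-on-answered rises to 1.0, which is the shape of the tradeoff calibration research is trying to reward rather than punish.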
Removing knowledge may degrade model capability in related areas due to the distributed nature of knowledge storage (superposition). Aggressive unlearning risks collateral damage.
Known Tradeoff
ROME/MEMIT edits sometimes cause 0.5-2% degradation on neighboring knowledge when many edits are applied.
Active Research
Modular architectures that localize knowledge, making surgical removal possible without collateral damage.
The most capable models may be the least interpretable — capability may require complex, entangled representations that resist clean decomposition into human-understandable components.
Known Tradeoff
No quantified tradeoff yet, but toy models that are fully interpretable tend to be less capable than similar-sized opaque models.
Active Research
Hypothesis: interpretability and capability may converge at sufficient scale (features become cleaner in larger models). Active investigation by Anthropic and DeepMind.
Real-World Pressure
Competitive pressure between AI labs. National strategic interest. Market demand for automation.