Epoch AI

Have AI Capabilities Accelerated?

Brief

Epoch AI's analysis finds that AI capabilities have recently accelerated on three of four measured metrics (ECI, the log of METR's 50% time horizon, and a composite Math Index), with the acceleration coinciding with the rollout of "reasoning" models in 2024. The best single explanatory fit for these three metrics was a reasoning/non-reasoning split: reasoning models show a discrete level jump plus roughly 2–3× steeper linear progress than non-reasoning models, although several other superlinear fits also perform well. The Combined Math Index was produced via a 2PL IRT fit across multiple math benchmarks (ridge penalties λ_a = 0.378, λ_b = 0.0001) and rescaled so that Sonnet 3.5 scores 130 and GPT-5 scores 150. The study fit eight parametric curves by unweighted least squares and evaluated them with expanding-window cross-validation (primary horizon: 6 months), plus leave-one-out and contiguous-block perturbation checks to assess stability. WeirdML V2 showed no acceleration, possibly because of task constraints and limited pre-2024 data. The authors caution that the acceleration may be concentrated in domains (programming and math) where correctness is easily auto-verified and RL has been heavily applied.
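
As a rough illustration of the Combined Math Index construction, the sketch below evaluates the 2PL item response function and applies an affine rescaling that pins two anchor models to 130 and 150. The raw ability values are hypothetical placeholders (the fitted values are not given in this brief), and the helper names `p_correct` and `make_rescaler` are illustrative, not Epoch AI's code.

```python
import math


def p_correct(theta, a, b):
    """2PL IRT: probability that a model with ability `theta` solves an item
    with discrimination `a` and difficulty `b`, i.e. sigma(a * (theta - b))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def make_rescaler(theta_low, theta_high, target_low=130.0, target_high=150.0):
    """Return an affine map sending theta_low -> target_low and
    theta_high -> target_high (requires theta_low != theta_high)."""
    scale = (target_high - target_low) / (theta_high - theta_low)
    return lambda theta: target_low + scale * (theta - theta_low)


# Hypothetical raw abilities for the two anchor models (assumed values).
theta_sonnet_35 = -0.4
theta_gpt5 = 1.1

rescale = make_rescaler(theta_sonnet_35, theta_gpt5)
index_sonnet_35 = rescale(theta_sonnet_35)  # anchored at 130
index_gpt5 = rescale(theta_gpt5)            # anchored at 150
```

Because the map is affine, differences between models on the rescaled index stay proportional to differences in raw ability; only the origin and unit change.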

Why it matters

Three of four metrics—Epoch Capabilities Index (ECI), log METR 50% time horizon, and a combined Math Index—show clear acceleration coincident with the emergence of "reasoning" models in 2024; reasoning models exhibit a one‑off performance jump and an approximately 2–3× faster linear trend versus non‑reasoning models.
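
To make the "one-off jump plus steeper trend" description concrete, here is a minimal sketch of a reasoning-split fit: one unweighted least-squares line per group, followed by the slope ratio and the level gap at a chosen date. The data points are synthetic and constructed so the slope ratio is 2.5; the function names and the choice of two fully separate lines are assumptions for illustration, not the study's implementation.

```python
def fit_line(ts, ys):
    """Ordinary (unweighted) least squares for y = intercept + slope * t."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_t, slope


def fit_reasoning_split(points):
    """points: list of (t_years, score, is_reasoning) tuples.
    Returns {"non_reasoning": (intercept, slope), "reasoning": (intercept, slope)}."""
    fits = {}
    for label, flag in (("non_reasoning", False), ("reasoning", True)):
        group = [(t, y) for t, y, r in points if r == flag]
        ts, ys = zip(*group)
        fits[label] = fit_line(ts, ys)
    return fits


def level_jump(fits, t):
    """Gap between the reasoning and non-reasoning trend lines at time t."""
    (b0, b1), (r0, r1) = fits["non_reasoning"], fits["reasoning"]
    return (r0 + r1 * t) - (b0 + b1 * t)


# Synthetic scores (not real benchmark data): slope 5/yr vs 12.5/yr.
points = [(t, 100 + 5 * t, False) for t in (0.0, 0.5, 1.0, 1.5, 2.0)]
points += [(t, 112 + 12.5 * (t - 1.5), True) for t in (1.5, 2.0, 2.5, 3.0)]

fits = fit_reasoning_split(points)
slope_ratio = fits["reasoning"][1] / fits["non_reasoning"][1]
jump = level_jump(fits, 1.5)  # gap at the first synthetic reasoning release
```

Fitting the two groups separately is equivalent to a single regression with a reasoning indicator and an indicator-by-time interaction, which is one common way to express a level shift plus a slope change.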

Key details

  • The Combined Math Index was built from MATH Level 5, FrontierMath tiers 1–4, OTIS Mock AIME 2024–2025, and MathArena using a 2-parameter logistic (2PL) IRT model, P = σ(a_j(θ_i − b_j)), where σ is the logistic function, θ_i is the ability of model i, and a_j and b_j are the discrimination and difficulty of item j. The fit used ridge penalties λ_a = 0.378 and λ_b = 0.0001; ability scores were rescaled so that Sonnet 3.5 = 130 and GPT-5 = 150.
  • Methodology: eight candidate parametric curves (global linear; reasoning-split two-trend model; piecewise linear; log-augmented linear; quadratic; power law; exponential; hyperbolic) were each fit by unweighted least squares and ranked by expanding-window cross-validation, with RMSE at a 6-month horizon as the primary metric. Four data modes were tested: SOTA-on-release, named releases, daily max, and daily interpolated.
  • The WeirdML V2 index showed no acceleration (best fit = single global linear trend); possible reasons include its constrained protocol (5 submission attempts, no external tools) and limited pre-reasoning data (~1 year).
  • Minimum training cutoffs used: ECI, June 2024; METR, Jan 2024; Combined Math, Sept 2024; WeirdML V2, Jan 2025.
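
The expanding-window evaluation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the cutoff advances in monthly steps, every held-out point within 6 months after the cutoff is scored, and only a global linear model is shown. The brief does not specify these details, and the data below are synthetic.

```python
import math


def fit_linear(ts, ys):
    """Unweighted least-squares line; returns a predictor function t -> y.
    Assumes the training times are not all identical."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return lambda t: intercept + slope * t


def expanding_window_rmse(ts, ys, fit, first_cutoff, horizon=0.5, step=1 / 12):
    """Train on all points with t <= cutoff, score points in
    (cutoff, cutoff + horizon], then move the cutoff forward by `step`.
    Times are in years, so horizon=0.5 is 6 months."""
    squared_errors = []
    cutoff = first_cutoff
    last_t = max(ts)
    while cutoff < last_t:
        train = [(t, y) for t, y in zip(ts, ys) if t <= cutoff]
        test = [(t, y) for t, y in zip(ts, ys) if cutoff < t <= cutoff + horizon]
        if len(train) >= 2 and test:
            predict = fit(*zip(*train))
            squared_errors.extend((predict(t) - y) ** 2 for t, y in test)
        cutoff += step
    if not squared_errors:
        raise ValueError("no held-out points fell inside any forecast window")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic quarterly observations over three years (not real benchmark data).
ts = [i * 0.25 for i in range(13)]
steady = [1.0 + 2.0 * t for t in ts]   # constant rate of progress
accelerating = [t ** 2 for t in ts]    # rate of progress increases over time

rmse_steady = expanding_window_rmse(ts, steady, fit_linear, first_cutoff=1.0)
rmse_accel = expanding_window_rmse(ts, accelerating, fit_linear, first_cutoff=1.0)
```

On the steady series the linear model forecasts with essentially zero error, while on the accelerating series its out-of-sample error is clearly positive. Comparing such errors across candidate curves is the basis for ranking them.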