Odd Lots

Understanding the Most Viral Chart in Artificial Intelligence

Brief

METR’s time‑horizon chart, the viral upward line that tracks how long a human would take to complete the tasks that current AI models can do, was examined closely by hosts Joe Weisenthal and Tracy Alloway with METR’s president Chris Painter and technical staffer Joel Becker. Painter framed METR’s mission: it is a Bay Area nonprofit building an empirical science of when AI systems achieve the kind of sustained autonomy (especially in software engineering and ML research tasks) that would materially raise the stakes for alignment and catastrophic‑risk discussions. Becker walked through the core methodology: choose engineering‑focused tasks, recruit skilled human practitioners (about three baselines per task), time their completion under similar tools, then evaluate models on the same tasks and report the task length at which a model is predicted to succeed half the time (the 50% time horizon). The episode cited concrete numbers: Claude Opus 4.6 measured at 11h 59m for the 50% horizon (Feb 2026), GPT‑5.3 Codex at 5h 50m, and Gemini 3 Pro at 3h 44m (Nov 2025). It also emphasized that the chart signals a recent acceleration, with the horizon's doubling time shortening from ~6–7 months historically to ~4 months for the latest tranche of models.

The conversation balanced enthusiasm about the clarity of a single, interpretable metric with sober caveats. Becker and Painter acknowledged important limits: narrow task coverage (mostly engineering work), small human sample sizes, grading noise, monetary incentives for human baseliners, and the difficulty of scaling baselines as horizon lengths grow. They defended the 50% threshold as easier to measure and more statistically robust than 80–90% reliability levels (which require far larger samples), but conceded that downstream business utility and real‑world productivity gains will depend on reliability, the messiness of real tasks, and verification overhead. Painter and Becker also discussed external dynamics: compute investments have risen exponentially and are roughly tracking capability gains; Chinese models appear roughly 9–12 months behind frontier US models on METR’s held‑out tasks; investors sometimes use the charts for decision‑making; and METR’s ~30‑person nonprofit team faces a talent and resource bottleneck even as labs are receptive to third‑party evaluations. Hosts and guests agreed that the metric is informative and that the rapid progress it shows is alarming. They also agreed that more breadth, larger human baselines, and broader public and governmental engagement are needed before the chart can definitively answer when autonomy becomes an existential risk.
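The 50% time horizon described above can be illustrated with a minimal sketch. It assumes a simple logistic curve of success against the log of human completion time, fitted by plain gradient ascent; the episode does not specify METR's exact fitting procedure, and the task data, function names, and parameters below are hypothetical.

```python
import math

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * (x - mean_x)) by gradient ascent
    on the average log-likelihood. Centering x keeps the fit well behaved."""
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b, mean_x

def time_horizon(human_minutes, successes, p=0.5):
    """Task length (in human minutes) at which the fitted success rate equals p."""
    xs = [math.log2(m) for m in human_minutes]
    a, b, mean_x = fit_logistic(xs, successes)
    logit = math.log(p / (1.0 - p))
    # Solve a + b * (x - mean_x) = logit for x, then undo the log2.
    return 2 ** (mean_x + (logit - a) / b)

# Hypothetical data: average human baseline time per task (minutes)
# and whether a hypothetical model solved that task (1) or not (0).
human_minutes = [2, 4, 8, 15, 30, 60, 120, 240, 480, 960, 1920]
model_success = [1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

h50 = time_horizon(human_minutes, model_success, p=0.5)
h80 = time_horizon(human_minutes, model_success, p=0.8)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

With only a handful of tasks the 80% figure depends heavily on the tail of the fitted curve, which is one way to see why higher reliability thresholds need larger samples.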

Why it matters

Chris Painter (president, METR) said METR is a Bay Area research nonprofit focused on measuring when AI systems acquire enough autonomy to pose catastrophic risks, especially via software engineering and machine‑learning tasks.

Key details

  • Joel Becker (technical staff, METR) explained METR's core metric: a model's 'time horizon' is the length of task, measured by average human time-to-complete, that the model can finish at a given success rate. METR reported Claude Opus 4.6 (Feb 2026) at a 50% success time horizon of 11 hours 59 minutes versus GPT‑5.3 Codex at 5 hours 50 minutes.
  • Becker said METR uses ~3 human baselines per task (talented practitioners timed under similar conditions) and labels task difficulty by the average human time-to-complete; METR picks the 50% reliability threshold because it is statistically easier to estimate than higher thresholds.
  • Hosts and guests noted limitations: small human sample sizes, narrow task distribution (primarily engineering tasks, not e.g. painting), potential baseline incentives (paid humans, bonuses), and risks of benchmark gaming. Becker acknowledged that all of these reduce the precision of any extrapolation from the chart.
  • Painter and Becker reported a recent acceleration: METR’s observed doubling time in capability (time horizon) has shortened from ~6–7 months historically to roughly 4 months for the most recent model generations (post‑GPT‑4 era).
  • Painter noted compute and R&D spend have risen exponentially and are roughly tracking the time‑horizon improvements; many data‑center commitments are already ‘baked in,’ making short‑term capability slowdowns unlikely.
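
The doubling-time figures above can be turned into simple arithmetic. The sketch below is a naive exponential extrapolation, not a forecast made by METR or by the episode; the function names are hypothetical, and the only inputs taken from the episode are the 11h 59m horizon and the ~4 and ~7 month doubling times.

```python
import math

def doubling_time_months(h_start, h_end, months_elapsed):
    """Doubling time implied by growth from h_start to h_end over months_elapsed."""
    return months_elapsed * math.log(2) / math.log(h_end / h_start)

def project_horizon(h_now, months_ahead, doubling_months):
    """Horizon after months_ahead if it keeps doubling every doubling_months."""
    return h_now * 2 ** (months_ahead / doubling_months)

# 11 hours 59 minutes, the 50% horizon cited for Claude Opus 4.6.
current_minutes = 11 * 60 + 59

# Compare the two doubling times mentioned in the episode over one year.
for doubling in (4, 7):
    projected = project_horizon(current_minutes, 12, doubling)
    print(f"doubling every {doubling} months: {projected / 60:.1f} hours after 12 months")
```

The gap between the two projections shows why the shift from a ~6–7 month to a ~4 month doubling time matters, while the measurement caveats listed above apply to any such extrapolation.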