LessWrong

OpenAI's red line for AI self-improvement is fundamentally flawed

2026-05-02 · 14:44 UTC ·Charbel-Raphaël ·4 min read

Brief

Charbel‑Raphaël contends OpenAI’s Preparedness Framework v2 sets a fundamentally flawed Critical threshold for AI self‑improvement because it is too permissive, unmeasurable, and self‑certified. He argues the leading indicator (“superhuman research‑scientist agent”) arrives after models already outperform humans, while the lagging indicator—5× generational acceleration sustained for several months—can permit roughly three years of progress before firing; Anthropic reportedly uses a 2× threshold. He highlights missing operational definitions, no cross‑release equivalence metric, and cites Epoch’s pre‑training efficiency 95% CI of [4.5,14.3] months to show large measurement uncertainty. He also calls out Section 4.3’s escape hatch and the absence of external evaluators for self‑improvement (unlike bio/cyber/sandbagging). His remedy: independent evaluation and a concrete red line (e.g., halt when METR’s p50 ≤ 2 months), noting at a ~3.5‑month doubling rate this implies ~2 years to prepare. Community replies were brief and largely tangential, offering alternative framing (attribute substitution, enthalpy vs free‑energy) rather than direct refutation.

Why it matters

Charbel-Raphaël (LessWrong, published 2026-05-02) argues OpenAI’s Preparedness Framework v2 sets a Critical red line for AI self‑improvement that fires too late: the leading indicator (“a superhuman research‑scientist agent”) only triggers after models can out‑research top humans, and the lagging indicator requires a 5× generational acceleration “sustained for several months,” which the author shows can allow roughly ~3 years of effective progress to accumulate before the trigger fires (Anthropic uses a 2× threshold).

Key details

The author flags measurement and certification failures: the framework lacks an operational definition of “generational improvement,” no cross‑release equivalence metric, no specification of “several months,” and self‑improvement is the only tracked GPT‑5.5 category with zero external evaluators (bio/cyber/sandbagging have SecureBio, US CAISI, Irregular, UK AISI, Apollo listed).
Quantitative uncertainty undermines the lagging rule: Epoch’s pre‑training algorithmic‑efficiency estimate has a 95% CI of [4.5, 14.3] months (≈3× uncertainty), recent capability leaps are driven by post‑training routes (RL, reasoning, tool use) that are measured even less precisely, and a claimed “5× acceleration sustained for several months” could fall inside that measurement noise.
Concrete fixes proposed: require independent evaluation and concrete precommitments — e.g., halt development when METR’s p50 time horizon crosses 2 months (the author analogizes ~2 months to a human writing a NeurIPS paper); at the current ~3.5‑month doubling rate the author estimates ~2 years to prepare. Community comments were sparse and tangential: romeostevensit suggested upstream attribute‑substitution effects, James Camacho contrasted “enthalpymaxxing” vs “free‑energymaxxing,” and Steffee posted an unrelated note about From games and personal growth.

Reader · no content

No body text on file.

Open the original to read the full piece.