Twitter/X

Gary Marcus (2026-05-10) warns not to panic about the Mythos/METR graph

Brief

Gary Marcus (May 10, 2026) urges calm over the Mythos/METR graph, stressing that it measures ~50% success on software tasks, not high reliability or general intelligence. He calls Claude Code a real advance that Mythos probably builds on, but says recent gains likely stem from symbolic tools (code interpreters, verification, harnesses) rather than model scaling alone, which he reads as a vindication of neurosymbolic AI. Per @ramez, Mythos is not off trend on the broader ECI benchmark.

Why it matters

The graph tracks a 50% success threshold on software tasks only, so it does not demonstrate 90–100% reliability or general intelligence. Marcus argues that progress is real but that the alarm around the graph is an overreaction.

Key details

  • Marcus credits real advances like Claude Code but argues much recent progress likely comes from symbolic tools (code interpreters, verification, harnesses) rather than model scaling per se. He reads this as a vindication of neurosymbolic AI, not proof that LLMs can be scaled indefinitely or that another trillion dollars will extend the graph.
  • Citing @ramez, Marcus notes Mythos is not off trend on the ECI benchmark, a broader measure. He adds that the METR graph does not show Mythos can do most things humans can do in 16 hours, let alone reliably.

Source evidence

PLEASE DO NOT PANIC about the Mythos/METR graph that everyone is panicking about.

Progress is being made but people are totally overreacting.

Here’s some context that is being left out from nearly every comment on that graph.

Gary Marcus (@GaryMarcus)

Hot take on METR’s new graph that so many people are flipping about today.

• Claude Code is a real advance; Mythos probably builds on some of what is learned there. But…

• If you read the graph carefully, it is about achieving 50% success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.

• If you read carefully, it is only about software tasks. Not general intelligence.

• It certainly doesn’t tell you that most (let alone) all things that humans can do in 16 hours can be done in Mythos, let alone reliably

• Aside from this, the graph doesn’t show you how the improvements have been made. As noted in my newsletter a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such this a vindication of neurosymbolic AI – but not a proof that LLMs themselves can be perpetually scaled. As such it’s not a proof that another trillion dollars will continue the graph.

• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.

— https://nitter.net/GaryMarcus/status/2053126547499045306#m