PLEASE DO NOT PANIC about the Mythos/METR graph that everyone is panicking about.
Progress is being made but people are totally overreacting.
Here’s some context that is being left out of nearly every comment on that graph.
Gary Marcus (@GaryMarcus)
Hot take on METR’s new graph that so many people are flipping about today.
• Claude Code is a real advance; Mythos probably builds on some of what was learned there. But…
• If you read the graph carefully, it is about achieving 50% success. Not 100 or 99 or even 90. The key problem with GenAI has been reliability; this graph does not address reliable performance. At all.
• If you read carefully, it is only about software tasks. Not general intelligence.
• It certainly doesn’t tell you that most (let alone all) things that humans can do in 16 hours can be done by Mythos, let alone reliably.
• Aside from this, the graph doesn’t show you how the improvements have been made. As noted in my newsletter, a lot of the advance in recent months is likely from the incorporation of symbolic tools (like code interpreters, verification, and harnesses) rather than from model scaling per se. As such, this is a vindication of neurosymbolic AI, but not a proof that LLMs themselves can be perpetually scaled, nor a proof that another trillion dollars will continue the graph.
• Per @ramez, Mythos is not actually off trend on the ECI benchmark, which is a broader measure.
— https://nitter.net/GaryMarcus/status/2053126547499045306#m