Twitter/X

Gary Marcus (posted 2026-05-13) told @METR_Evals to plot how “task horizon” falls…

Brief

Gary Marcus (2026-05-13) urged @METR_Evals to redesign its time-horizon graph: show directly on the plot how task horizon declines as the accuracy requirement rises, with lines for the 50% criterion and up through 80%, 90%, and 100%, and state in the title that the tasks are software engineering. His post builds on Yafah Edelman’s critique of the current visualization.

Why it matters

Marcus argues the graph would be clearer if it showed, on the main plot rather than across tabs, how much “task horizon” falls off as the accuracy criterion increases.
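To make the falloff concrete, here is a minimal sketch of how a horizon could shrink as the success threshold rises. It assumes success probability follows a logistic curve in log2 of task length, which is an assumption made for illustration and not something the posts confirm. The function name `horizon_minutes` and the values `H50_MINUTES` and `SLOPE` are hypothetical, not METR data.

```python
import math


def horizon_minutes(threshold, h50_minutes, slope):
    """Task length (minutes) at which modeled success equals `threshold`.

    Assumed model (illustrative only):
        success(t) = sigmoid(slope * (log2(h50_minutes) - log2(t)))
    so the 50% horizon is h50_minutes by construction.
    """
    if not 0.0 < threshold < 1.0:
        # Under this model a strict 0% or 100% threshold has no finite,
        # positive horizon, so it is rejected rather than guessed.
        raise ValueError("threshold must be strictly between 0 and 1")
    logit = math.log(threshold / (1.0 - threshold))
    return h50_minutes * 2.0 ** (-logit / slope)


# Hypothetical parameters, chosen only to show the shape of the falloff.
H50_MINUTES = 120.0
SLOPE = 0.6

for p in (0.5, 0.8, 0.9, 0.99):
    print(f"{p:.0%} threshold: {horizon_minutes(p, H50_MINUTES, SLOPE):.1f} min")
```

Under this assumed model the horizon only approaches zero as the threshold approaches 100%, so a literal 100% line would need its own definition before it could be drawn.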

Key details

  • He recommended drawing lines for multiple thresholds (the 50% criterion and up through 80%, 90%, and 100%) on the same plot rather than in separate tabs.
  • Marcus said the graph title must state explicitly that the evaluated tasks are software engineering, not a random sample of all human tasks. His post quotes Yafah Edelman, who called the new METR time-horizon graph “pretty bad” while praising the underlying benchmark.

Source evidence

clarifying how much “task horizon” falls off as a function of the increasing accuracy criterion directly in the graph (not across tabs) would definitely improve this graph, @METR_Evals.

as a variant on the below, i would suggest you include lines for 50% criterion, etc up 80%, 90h, and 100%. not in tabs, but as lines directly reported, somewhat similar to the below.

NB you must always clarify in the title (unlike below) that the tasks in question are software engineering (not randomly sample from all human tasks etc)

Yafah Edelman (@YafahEdelman)

The new METR time horizon graph is pretty bad imo. It's a great benchmark, but the time horizon estimation isn't reasonable rn. I think something like this would be more justified:

— https://nitter.net/YafahEdelman/status/2054456009821966703#m