clarifying how much “task horizon” falls off as a function of the increasing accuracy criterion directly in the graph (not across tabs) would definitely improve this graph, @METR_Evals.
as a variant on the below, i would suggest you include lines for the 50% criterion, etc., up to 80%, 90%, and 100%. not in tabs, but as lines directly reported, somewhat similar to the below.
NB you must always clarify in the title (unlike below) that the tasks in question are software engineering tasks (not randomly sampled from all human tasks, etc.)
Yafah Edelman (@YafahEdelman)
The new METR time horizon graph is pretty bad imo. It's a great benchmark, but the time horizon estimation isn't reasonable rn. I think something like this would be more justified:
— https://nitter.net/YafahEdelman/status/2054456009821966703#m
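
A rough sketch of how the horizon shrinks as the accuracy criterion rises, assuming success probability is modeled as a logistic function of log2 task length (my understanding of how such horizons are typically estimated; the thread itself does not specify this). The function name `horizon_minutes` and the `INTERCEPT` and `SLOPE` values are hypothetical placeholders, not fitted values from any real model.

```python
import math


def horizon_minutes(p, intercept, slope):
    """Task length (minutes) at which the modeled success rate equals p.

    Assumed model: P(success | t) = sigmoid(intercept - slope * log2(t)),
    with slope > 0, so success falls as tasks get longer.
    """
    if not 0.0 < p < 1.0:
        # Under a logistic model the curve never reaches exactly 0% or 100%.
        raise ValueError("p must be strictly between 0 and 1")
    if slope <= 0:
        raise ValueError("slope must be positive")
    logit = math.log(p / (1.0 - p))
    # Solve intercept - slope * log2(t) = logit(p) for t.
    return 2.0 ** ((intercept - logit) / slope)


if __name__ == "__main__":
    # Hypothetical parameters chosen only for illustration.
    INTERCEPT, SLOPE = 4.0, 0.5
    for p in (0.5, 0.8, 0.9, 0.99):
        t = horizon_minutes(p, INTERCEPT, SLOPE)
        print(f"{p:.0%} criterion: {t:.2f} min")
```

Under this assumed model a strict 100% criterion has no finite horizon (the logit diverges, so the horizon shrinks toward zero), which is why the sketch stops at 99%; a 100% line would need a different definition, such as the longest task length with no observed failures.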