Needless to say, this is even truer of standardized benchmarks, where benchmaxing is a real temptation and is in fact part of the game. Yet benchmarks tell us that GPT 5.5 is generally better than Opus, which IMHO is true. So they do have value.