a new best benchmark has entered the arena.
there’s so much value in creating great benchmarks. what are you waiting for? you hate money?
Kilian Lieret (@KLieret)
The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵
— https://nitter.net/KLieret/status/2054215545663144217#m