Twitter/X

@iruletheworldmo:

a new best benchmark has entered the arena.

there’s so much value in creating great benchmarks. what are you waiting for? you hate money?

Kilian Lieret (@KLieret)

The first ProgramBench task was just solved by GPT 5.5 high/xhigh. Interestingly, high/xhigh picked two different languages for the task (C vs Python). GPT 5.5 xhigh was significantly better than Opus 4.7 xhigh in all metrics. 🧵

— https://nitter.net/KLieret/status/2054215545663144217#m