Codex Goals was run on the public ARC-AGI-3 benchmark for 160 hours (about 30,000 actions) and scored 61%, with most of the gains coming in the first 4 hours before progress stagnated. The author claims that a two-word ‘baby prompt’ let Codex reverse-engineer the games live, sometimes by using local or online searches or by exploiting prompt loopholes, and predicts that private benchmarks (Maze Bench) will crush such performance.
Over 160 hours and roughly 30,000 actions on the public ARC-AGI-3 games, Codex Goals reached a score of 61%. Most of the progress came within the first 4 hours; after that, performance stagnated and wait times grew.