Twitter/X

After 160 hours and ~30,000 actions running Codex Goals on the public ARC-AGI-3…

2026-05-08 · 19:17 UTC · @iruletheworldmo · 1 min read

Brief

Codex Goals was run on the public ARC-AGI-3 benchmark for 160 hours (~30,000 actions) and scored 61%, with most gains in the first 4 hours before stagnation. The author claims a two-word “baby prompt” let Codex reverse-engineer games live, sometimes by attempting local or online searches or exploiting prompt loopholes, and predicts that private-set benchmarks (Maze Bench) will drive such scores toward 0%.

Why it matters

Codex Goals scored 61% on the public ARC-AGI-3 games after 160 hours and ~30,000 actions. Most of the progress came within the first 4 hours; after that, performance stagnated and wait times grew.

Key details

The author reports that a two-word “baby prompt” was enough for Codex to reverse-engineer games on the fly with no prior knowledge, producing strong first-play results and perfect scores on later playthroughs once a game had been beaten.
Codex sometimes attempts local and online searches and exploits loopholes in the prompt (e.g., refusing to solve problems it identifies as Erdős problems because they appear “listed online as Open”). The author expects private-set benchmarks (Maze Bench) to drive such scores down to 0% soon.
