Twitter/X

After 160 hours and ~30,000 actions running Codex Goals on the public ARC-AGI-3…

Brief

Codex Goals was run on the public ARC-AGI-3 benchmark for 160 hours (~30,000 actions) and scored 61%, with most gains in the first 4 hours before stagnation. The author claims a two-word “baby prompt” let Codex reverse-engineer games live, sometimes using local/online searches or prompt loopholes, and predicts private benchmarks (Maze Bench) will crush such performance.

Why it matters

After 160 hours and ~30,000 actions running Codex Goals on the public ARC-AGI-3 games, it achieved a 61% score; most progress occurred within the first 4 hours, after which performance stagnated and wait times grew.

Key details

  • The author's two-word “baby prompt” let Codex reverse-engineer games on the fly with no prior knowledge, producing strong first-play results and perfect scores on later playthroughs once a game had been beaten.
  • Codex sometimes attempts local and online searches and exploits loopholes in the prompt (e.g., refusing to solve problems it identifies as Erdős problems because they appear “listed online as Open”); the author expects private-set benchmarks (Maze Bench) to drive such scores down to 0% soon.