Epoch AI

MirrorCode: Evidence AI can already do some weeks-long coding tasks

Brief

MirrorCode is a new long-horizon benchmark from Epoch AI, co-developed with METR, that measures how well autonomous AI agents can reimplement real CLI programs given only an execute-only binary, visible documentation, and a battery of end-to-end tests. The benchmark deliberately withholds source code and internet access, and it scores correctness by requiring output identical to the reference implementation across hundreds to thousands of test cases per target. Test suites mix original tests, real-world corpora, and LLM-assisted test generation; to discourage hard-coded answers, exposed tests are paired with conceptually similar held-out “dual” tests. The authors plan an open-source release with a private test set and a suite of more than 20 programs spanning utilities, bioinformatics, serializers, interpreters, and more.
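
The scoring rule described above (identical output to the reference, with held-out duals to catch hard-coding) can be illustrated with a minimal sketch. This is not the benchmark's harness: the `TestCase` type, the `run` and `score` helpers, and the one-line Python stand-ins for the reference and candidate programs are all hypothetical, chosen only to make the idea concrete.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class TestCase:
    args: list[str]   # command-line arguments passed to both programs
    stdin: str = ""   # text piped to standard input


def run(cmd: list[str], case: TestCase) -> tuple[int, str]:
    """Run one program on one test case; return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def score(reference: list[str], candidate: list[str], cases: list[TestCase]) -> float:
    """Fraction of cases where the candidate matches the reference exactly."""
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)


# Hypothetical stand-ins for real binaries: programs that upper-case stdin.
reference = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
candidate = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
# A "solution" that memorized the visible test's answer instead of the behavior.
hard_coded = [sys.executable, "-c", "print('ABC')"]

visible = [TestCase(args=[], stdin="abc\n")]
held_out = [TestCase(args=[], stdin="xyz\n")]  # conceptual "dual" of the visible case

print("honest:    ", score(reference, candidate, visible), score(reference, candidate, held_out))
print("hard-coded:", score(reference, hard_coded, visible), score(reference, hard_coded, held_out))
```

In this toy setup the hard-coded program matches the reference on the visible case but not on its held-out dual, which is the kind of gap the dual design is meant to expose.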

Using an Inspect/ReAct agent scaffold in a sandboxed Docker environment with no network access, the authors ran four Claude Opus model generations (4.0, 4.1, 4.5, 4.6), with compaction to extend context and inference budgets of up to 1 billion tokens (≈$550). The results show that recent models can autonomously reimplement substantial software when given a precise executable spec: Opus 4.6 effectively solved gotree (a ~16k LoC Go toolkit) by producing a ~7.6k LoC Rust reimplementation that passed 2,000/2,001 tests after ~280M tokens. By contrast, Pkl, a lazily evaluated configuration language with a ~61k LoC reference implementation, remained unsolved after Opus 4.6 consumed its full 1B-token budget. The agent wrote a 17.6k-line Rust interpreter but repeatedly implemented eager evaluation instead of a thunk-based lazy evaluator, and consequently passed only ~35% of the visible Pkl tests.

The study also documents qualitative improvements across model generations: newer models chose more appropriate data structures (an Edge-first graph representation for trees), persevered rather than submitting prematurely (older models hallucinated time pressure and submitted early), and made steady gains with larger inference budgets. The authors highlight three main limitations: reliance on an unusually precise spec, contamination/memorization risk (screened via function-reproduction similarity), and limited coverage of software domains. They intend to run larger budgets and a broader target set to probe scaling behavior further.
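
The eager-versus-lazy distinction behind the Pkl result can be shown with a small, generic sketch of a memoizing thunk. It is an illustration of the general technique only, not Pkl's actual semantics and not the agent's Rust code; the `Thunk` class and the `expensive`, `broken`, and `build_eagerly` helpers are hypothetical names.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Thunk(Generic[T]):
    """A deferred computation that runs at most once, when first forced."""

    def __init__(self, compute: Callable[[], T]) -> None:
        self._compute = compute
        self._done = False
        self._value = None

    def force(self) -> T:
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value


calls = []  # records every time the expensive field is actually computed


def expensive() -> int:
    calls.append("expensive")
    return 42


def broken() -> int:
    raise ValueError("never needed, so a lazy evaluator never runs this")


# Lazy: every field is a thunk, so nothing runs until a field is read.
config = {"port": Thunk(expensive), "unused": Thunk(broken)}

first = config["port"].force()
second = config["port"].force()  # served from the cache; expensive() is not re-run


# Eager: every field is computed up front, so the unused broken field fails.
def build_eagerly() -> dict:
    return {"port": expensive(), "unused": broken()}
```

Under these assumptions, a program that is valid under lazy evaluation (because `unused` is never read) fails under eager evaluation, which suggests why an eager interpreter can diverge from a lazy reference on a large share of tests.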

Why it matters

MirrorCode (Epoch AI / METR), published 2026-04-10 by Tom Adamczewski et al., is a long-horizon coding benchmark that tasks agents with reimplementing real CLI programs from execute-only binaries plus visible tests and documentation, with no access to source code or the internet.

Key details

  • Claude Opus 4.6 effectively reimplemented gotree (a bioinformatics CLI): its 7,644 LoC Rust reimplementation scored 2,000/2,001 (≈99.95%) on the end-to-end tests in a run that used ~280M tokens and 2,989 messages. The best runs of older Opus models scored 15% (4.0), 24% (4.1), and 63% (4.5).
  • The analysis covers four targets (choose, cal, gotree, Pkl), each with the visible/hidden dual test design: choose (127 tests), cal (1,365 tests), gotree (2,001 tests), Pkl (770 tests). Pkl (reference ≈61k LoC of Java/Kotlin) remained unsolved after a 1B-token run by Opus 4.6, which produced a 17,600-line Rust interpreter that passed 256/733 visible tests (~35%).
  • Agents ran in sandboxed Docker with no internet using an Inspect/ReAct scaffold, compaction to extend context, and inference budgets up to 1B tokens (~$550 in their setup); tests combine original test suites, real corpora, and LLM-assisted generation with hidden duals to deter hard-coding.
  • Key failure modes and risks: older models submitted prematurely after hallucinating time pressure, while Opus 4.6 persevered; architectural mistakes (e.g., choosing eager evaluation for Pkl) blocked full solutions; memorization/contamination risk was actively screened via function-reproduction Levenshtein similarity (baseline ≈0.34).
  • Qualitative findings: Opus 4.5/4.6 used a graph/Edge-first representation for phylogenetic trees (better suited to topology edits), while 4.0/4.1 used parent/child models; Opus 4.6’s gotree code passed the tests but had code-quality issues a human reviewer would flag (duplication, a magic-values hack to work around __slots__).
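
The contamination screen mentioned in the list above relies on a Levenshtein similarity between a model's reproduction of a function and the original. The brief does not say how the score is normalized, so the sketch below assumes one common choice, 1 minus the edit distance divided by the longer string's length; the `levenshtein` and `similarity` functions are illustrative, not the authors' code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


original = "fn height(t: &Tree) -> usize { t.depths().max() }"
attempt = "fn height(tree: &Tree) -> usize { tree.max_depth() }"
print(round(similarity(original, attempt), 2))
```

Under this assumed normalization, a score well above the reported baseline of ≈0.34 would indicate that a model can reproduce a function nearly verbatim, which is the signal a memorization screen looks for.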