No body text on file.
Open the original to read the full piece.
MirrorCode is a new long-horizon benchmark (Epoch AI, co-developed with METR) that measures autonomous AI performance at reimplementing real CLI programs from an execute-only binary, visible documentation, and a battery of end-to-end tests. The benchmark purposefully disallows source or internet access and evaluates correctness by requiring identical outputs to the reference implementation across hundreds to thousands of test cases per target. MirrorCode uses a mix of original test suites, real-world corpora, and LLM-assisted test generation; to discourage hard-coded answers it pairs exposed tests with conceptually similar held-out “dual” tests. The authors plan an open-source release with a private test set and a suite of >20 programs spanning utilities, bioinformatics, serializers, interpreters, and more.
Using an Inspect/ReAct agent scaffold in a sandboxed Docker environment (no network), the authors ran four Claude Opus model generations (4.0, 4.1, 4.5, 4.6) with compaction to extend context and inference budgets up to 1 billion tokens (≈$550). Results show that recent models can autonomously reimplement substantial software when given a precise executable spec: Opus 4.6 effectively solved gotree (a ~16k LoC Go toolkit) by producing a ~7.6k LoC Rust reimplementation that passed 2,000/2,001 tests after ~280M tokens. By contrast, Pkl — a lazily-evaluated configuration language with ~61k LoC reference — remained unsolved after Opus 4.6 consumed its 1B-token budget; the agent wrote a 17.6k-line Rust interpreter but repeatedly implemented eager evaluation instead of a thunk-based lazy evaluator and consequently only passed ~35% of visible Pkl tests. The study also documents qualitative improvements across model generations: newer models chose more appropriate data structures (Edge-first graph for trees), better persevered instead of prematurely submitting (older models hallucinated time pressure and submitted early), and made steady gains with larger inference budgets. The authors highlight three main limitations — reliance on an unusually precise spec, contamination/memorization risk (screened via function-reproduction similarity), and limited coverage of software domains — and intend larger-budget runs and a broader target set to probe scaling behavior further.
MirrorCode (Epoch AI / METR), published 2026-04-10 by Tom Adamczewski et al., is a long-horizon coding benchmark that tasks agents to reimplement real CLI programs from execute-only binaries plus visible tests and documentation (no source or internet).
Open the original to read the full piece.