Epoch AI

How to run SWE-bench Verified in one hour on one machine

2025-02-07 · 00:00 UTC · Tom Adamczewski · 17 min read

Brief

Epoch AI released a public registry of Docker images for SWE-bench and applied targeted layering optimizations to cut storage and runtime costs. By reorganizing how images are built (three nominal stages: base, env, instance) and moving large shared artifacts into layers that many images reuse, the team reduced the unique-layer footprint of 2290 x86_64 images to 67 GiB (from 684 GiB) and the 500-image SWE-bench Verified set to 30 GiB (from 189 GiB). Concrete changes included moving git clone operations into env so multiple instances share the .git history (example: a Django instance layer fell from 330 MB to 40 MB because .git accounted for 291 MB) and relocating heavy apt installs (for matplotlib) to env to shrink a 1.9 GB instance layer to ~110 MB. They also disabled pip caching across images with PIP_NO_CACHE_DIR=0, a setting that remains compatible with the ancient pip versions conda installs for older Python targets.
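To make the effect of sharing .git concrete, here is a minimal sketch of the storage arithmetic behind the Django example above. The 330 MB and 291 MB figures come from the text; the instance count and the function name are hypothetical, and the rest of the env layer is left out because it is the same in both layouts.

```python
# Sizes in MB, taken from the Django example in the text.
GIT_HISTORY_MB = 291       # size of the .git directory
INSTANCE_BEFORE_MB = 330   # instance layer when every instance clones the repo itself


def unique_footprint_mb(n_instances: int, clone_in_env: bool) -> int:
    """Unique-layer storage for n instances built on one shared env layer.

    Only the parts that differ between the two layouts are counted.
    """
    if clone_in_env:
        # .git is stored once in the shared env layer; each instance layer
        # keeps only what is left after removing the history.
        per_instance = INSTANCE_BEFORE_MB - GIT_HISTORY_MB
        return GIT_HISTORY_MB + n_instances * per_instance
    # Every instance layer carries its own copy of .git.
    return n_instances * INSTANCE_BEFORE_MB


n = 10  # hypothetical number of instances sharing the same env layer
before = unique_footprint_mb(n, clone_in_env=False)
after = unique_footprint_mb(n, clone_in_env=True)
print(f"clone in instance: {before} MB, clone in env: {after} MB")
```

With ten instances the totals are 3300 MB and 681 MB. The per-instance remainder comes out at 39 MB, which matches the reported 40 MB up to rounding, and the saving grows with every extra instance that shares the env layer.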

With the registry, Epoch ran SWE-bench Verified on a single GitHub Actions VM (32 cores, 128 GB RAM) in 62–73 minutes across several major models (gemini-2.0-flash-001: 62m; claude-3-7-sonnet-20250219: 63m; gpt-4o-2024-11-20: 70m), achieving roughly 8 seconds per sample and sustaining ~2M tokens/min under a 300k-token-per-sample cap. Images are hosted on GitHub Container Registry under ghcr.io/epoch-research/swe-bench.eval.<arch>.<instance_id>, all tagged latest; x86_64 coverage is 2290/2294 and arm64 is provided best-effort for 1819 images. Source commits and sizing scripts (get_registry_size.py) are available in Epoch’s SWE-bench repository.
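The per-sample figure follows directly from the wall-clock times and the 500 samples in SWE-bench Verified. A small sketch of that arithmetic, using only numbers quoted above (the function and variable names are illustrative):

```python
VERIFIED_SAMPLES = 500            # size of SWE-bench Verified
TOKEN_CAP_PER_SAMPLE = 300_000    # per-sample token cap quoted in the text


def seconds_per_sample(wall_clock_minutes: float,
                       n_samples: int = VERIFIED_SAMPLES) -> float:
    """Average wall-clock seconds spent per sample over a whole run."""
    return wall_clock_minutes * 60 / n_samples


# Wall-clock minutes per model, as reported in the text.
runs = {
    "gemini-2.0-flash-001": 62,
    "claude-3-7-sonnet-20250219": 63,
    "gpt-4o-2024-11-20": 70,
}

for model, minutes in runs.items():
    print(f"{model}: {seconds_per_sample(minutes):.2f} s/sample")

# Upper bound on tokens for one run if every sample hit the cap.
max_tokens_per_run = VERIFIED_SAMPLES * TOKEN_CAP_PER_SAMPLE
print(f"token ceiling per run: {max_tokens_per_run:,}")
```

The three runs work out to about 7.4, 7.6, and 8.4 seconds per sample, consistent with the "roughly 8 seconds" summary.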

Why it matters

Epoch AI published a public GitHub Container Registry of SWE-bench Docker images, covering 2290 of 2294 x86_64 images (99.8%) and all 500 SWE-bench Verified images. The optimized registry takes 67 GiB for the full set (a 10x reduction from 684 GiB of unique original layers) and 30 GiB for the Verified subset (a 6x reduction from 189 GiB).
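The sizes quoted here count each layer once, however many images reference it. The sketch below shows one way to compute such a unique-layer footprint from manifest-like data. It is not Epoch's get_registry_size.py; the data layout, digests, and sizes are hypothetical and chosen only to illustrate deduplication by digest.

```python
from typing import Dict, List, Tuple

Layer = Tuple[str, int]  # (layer digest, size in bytes)


def unique_layer_bytes(images: Dict[str, List[Layer]]) -> int:
    """Total size when each distinct layer digest is stored only once."""
    seen: Dict[str, int] = {}
    for layers in images.values():
        for digest, size in layers:
            seen[digest] = size
    return sum(seen.values())


def naive_bytes(images: Dict[str, List[Layer]]) -> int:
    """Total size if every image stored all of its layers separately."""
    return sum(size for layers in images.values() for _, size in layers)


# Hypothetical images that share their base and env layers.
images = {
    "instance-a": [("sha256:base", 100), ("sha256:env", 500), ("sha256:a", 40)],
    "instance-b": [("sha256:base", 100), ("sha256:env", 500), ("sha256:b", 60)],
}

print("unique:", unique_layer_bytes(images), "naive:", naive_bytes(images))

# Reduction factors implied by the figures in the text (GiB before / after).
full_set_factor = round(684 / 67, 1)
verified_factor = round(189 / 30, 1)
print(f"full set: {full_set_factor}x, Verified: {verified_factor}x")
```

The ratios come out at about 10.2 and 6.3, which the text rounds to 10x and 6x.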

Key details

Key optimizations exploited Docker layer caching: moving repo clones into the shared env layer (example: a Django instance top layer dropped from 330 MB to 40 MB because the 291 MB .git history became shared), and moving heavy apt installs (for matplotlib) into env, which reduced a 1.9 GB top layer to 110 MB (17x reduction).
Disabled pip caches across images using PIP_NO_CACHE_DIR=0 to remain compatible with very old pip versions bundled by conda (some images use pip <19); this avoided pipeline breakage while preventing cache bloat.
Using the optimized registry, SWE-bench Verified ran in 62–73 minutes on a single GitHub Actions runner (32 cores, 128 GB RAM) for major models: gemini-2.0-flash-001 = 62 min, claude-3-7-sonnet-20250219 = 63 min, gpt-4o-2024-11-20 = 70 min (≈8 seconds/sample, ~2M tokens/min sustained, 300k token per-sample cap used).
Registry details: images named ghcr.io/epoch-research/swe-bench.eval.<arch>.<instance_id> (all tagged latest); arm64 builds provided best-effort for 1819/2294 images; reproducibility scripts (get_registry_size.py) and full commits are in Epoch's SWE-bench repo.
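As a usage illustration for the naming scheme above, here is a small helper that builds an image reference. The prefix, the `latest` tag, and the two architectures come from the text; the helper itself, the exact architecture strings it accepts, and the example instance ID are assumptions, and an arm64 image may not exist for every instance.

```python
REGISTRY_PREFIX = "ghcr.io/epoch-research/swe-bench.eval"
SUPPORTED_ARCHES = ("x86_64", "arm64")  # assumed spelling of the <arch> field


def image_ref(instance_id: str, arch: str = "x86_64", tag: str = "latest") -> str:
    """Build a reference of the form <prefix>.<arch>.<instance_id>:<tag>."""
    if arch not in SUPPORTED_ARCHES:
        raise ValueError(f"unsupported arch {arch!r}; expected one of {SUPPORTED_ARCHES}")
    if not instance_id:
        raise ValueError("instance_id must be non-empty")
    return f"{REGISTRY_PREFIX}.{arch}.{instance_id}:{tag}"


# Illustrative instance ID; check it against the dataset before pulling.
ref = image_ref("django__django-11099")
print(ref)
```

The resulting string can be passed to `docker pull`. Validating the architecture up front keeps a typo from turning into a confusing "manifest unknown" error at pull time.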
