Epoch AI

Trading off compute in training and inference

2023-07-28 · 00:00 UTC · Pablo Villalobos; David Atkinson · 27 min read

Brief

Across techniques, the authors find typical tradeoff spans of 1–2 orders of magnitude (OOM) on each axis, where the axes are training compute (TC) and per-inference compute (IC): (a) Chinchilla-style scaling can trade ~1.2 OOM of training for ~0.7 OOM of inference; (b) MCTS exhibits an S-curve where, at low performance, ~1 OOM of training ↔ ~1.6 OOM of inference, and the trade shrinks as models near perfect play; (c) IMP pruning to ≈10% density saves ≈1 OOM of inference at the cost of ~0.7 OOM of extra training, because of multi-round re-training; and (d) resampling with cheap verification (e.g., code generation pass@k) extrapolates to much larger spans (the authors estimate up to ~3 OOM of training savings vs a ~6 OOM inference increase), though the authors place low confidence in that extrapolation.

The authors group the tradeoffs into three archetypes: saturating with constant span (e.g., Chinchilla), saturating with decreasing span (e.g., MCTS, limited n@k), and non-saturating with increasing span (resampling with effectively unlimited trials). They show that some techniques combine roughly additively but often with diminishing returns, so realistic combined spans are ~2–3 OOM. Methodologically, they fit empirical curves (smoothly broken power laws, log relations) and focus on the Pareto frontier of TC/IC choices.

The paper emphasizes governance implications. Because aggregate inference cost often exceeds training cost in deployed systems, commercial models will be biased toward low-IC configurations. Researchers or well-resourced actors, however, can spend extra IC to simulate larger-model capabilities for evaluation, internal use, or small-customer deployments, which affects safety assessments and policy proposals aimed at controlling frontier capabilities.
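The IMP figure in (c) can be sanity-checked with a geometric series. The sketch below is one reading that lands near the ~0.7 OOM training overhead, not necessarily the authors' exact accounting. It assumes 20% of the remaining weights are pruned per round (the example rate given under Key details) and, as a labeled assumption, that each re-training round costs compute in proportion to the current density. The function name `imp_training_cost` is hypothetical.

```python
import math

def imp_training_cost(keep_per_round=0.8, target_density=0.1):
    """Estimate total IMP training work relative to one dense training run.

    Assumption (not from the source): a training round at density d costs
    d times the dense run, so the total is a geometric series.
    """
    # Number of pruning rounds needed to get close to the target density.
    rounds = round(math.log(target_density) / math.log(keep_per_round))
    # Density at each training run: dense first, then after each prune.
    densities = [keep_per_round ** i for i in range(rounds + 1)]
    total_work = sum(densities)          # in units of one dense run
    final_density = densities[-1]
    return rounds, final_density, total_work

rounds, final_density, total_work = imp_training_cost()
training_oom = math.log10(total_work)         # extra training, in OOM
inference_oom = math.log10(1 / final_density)  # inference saved, in OOM
print(f"{rounds} rounds, density {final_density:.3f}, "
      f"training x{total_work:.2f} ({training_oom:.2f} OOM), "
      f"inference saved {inference_oom:.2f} OOM")
```

Under these assumptions the overhead comes out a little above 4.5× a single dense run. If every round instead cost a full dense run, the overhead would be noticeably larger, so the proportional-cost assumption matters.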

Why it matters

This Epoch AI report by Pablo Villalobos and David Atkinson, published 2023-07-28, quantifies tradeoffs between training compute (TC) and per-inference compute (IC) across five techniques: model/data scaling, Monte Carlo Tree Search (MCTS), pruning (IMP), repeated sampling/resampling (pass@k / n@k), and chain-of-thought/cascades.

Key details

Varying the Chinchilla-style scaling policy (using TC ≈ 6ND and IC ≈ 2N, with N parameters and D training tokens) can trade ~1.2 orders of magnitude (OOM) of extra training compute for a ~0.7 OOM reduction in inference compute while holding performance constant (Hoffmann et al. 2022 fit used).
MCTS experiments replicated on Hex (extending Jones 2021) show an S-curve relation: at low performance one can trade ~1 OOM of training for ~1.6 OOM of inference, while at high performance the tradeoff inverts (≈1.6 OOM of training buys only ~1.1 OOM of inference) and disappears near perfect play.
Iterative Magnitude Pruning (IMP) can reduce inference compute by ~1 OOM (pruning to ~10% density) at the cost of increasing training compute by ~0.7 OOM, due to repeated re-training rounds (e.g., 20% pruning per iteration → ≈4.5× extra training work to reach 10% density).
Repeated sampling with cheap automatic verification (AlphaCode-style pass@k) potentially permits very large tradeoffs: the authors estimate up to ~6 OOM of extra inference to save ~3 OOM of training for coding/proof tasks, but they flag low confidence because the estimate extrapolates from limited data. With limited verified trials (n@k), the observed trade is ≈1 OOM training ↔ 1.45 OOM inference.
Practical implications: deployed models are economically incentivized to minimize per-inference cost, because aggregate inference spend dominates. However, actors with access to extra inference compute can simulate larger-model capabilities (e.g., GPT-4–level behaviours) for research or limited high-cost applications, which affects safety evaluation and governance strategies.
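The mechanism behind the Chinchilla-style trade in the first item can be sketched as follows: hold the loss fixed, shrink the model, and solve for the extra data needed. This is a minimal illustration, not the authors' procedure. It uses the parametric loss fit reported by Hoffmann et al. (2022), L(N, D) = E + A/N^α + B/D^β, together with TC ≈ 6ND and IC ≈ 2N. The 70B-parameter, 1.4T-token baseline and the function names are illustrative assumptions, and the output is not expected to reproduce the ~1.2 / ~0.7 OOM figures, which depend on the authors' baseline and how they define the span.

```python
import math

# Parametric fit from Hoffmann et al. (2022): L = E + A / N**ALPHA + B / D**BETA
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted loss for a model with n_params trained on n_tokens."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def tokens_for_loss(n_params, target_loss):
    """Invert the fit: tokens needed for n_params to reach target_loss."""
    residual = target_loss - E - A / n_params ** ALPHA
    if residual <= 0:
        # The model-size term alone already exceeds the loss budget.
        raise ValueError("model too small to reach this loss with any amount of data")
    return (B / residual) ** (1 / BETA)

def trade(n_base, d_base, shrink_factor):
    """Return (extra training OOM, saved inference OOM) at constant loss."""
    target = loss(n_base, d_base)
    n_small = n_base / shrink_factor
    d_small = tokens_for_loss(n_small, target)
    tc_base, tc_small = 6 * n_base * d_base, 6 * n_small * d_small  # TC ~ 6ND
    ic_base, ic_small = 2 * n_base, 2 * n_small                     # IC ~ 2N
    return math.log10(tc_small / tc_base), math.log10(ic_base / ic_small)

# Illustrative baseline (assumption): 70e9 parameters, 1.4e12 tokens.
for factor in (2, 5, 10):
    extra_tc, saved_ic = trade(70e9, 1.4e12, factor)
    print(f"shrink x{factor}: +{extra_tc:.2f} OOM training, -{saved_ic:.2f} OOM inference")
```

Because the model-size term A/N^α grows as N shrinks, the data requirement diverges at a finite minimum model size. That is why this kind of trade saturates rather than extending indefinitely.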
