Open the original to read the full piece.
Epoch AI (Pablo Villalobos & David Atkinson), published 2023-07-28, quantifies tradeoffs between training compute (TC) and per-inference compute (IC) across five techniques: model/data scaling, Monte Carlo Tree Search (MCTS), pruning via iterative magnitude pruning (IMP), repeated sampling/resampling (pass@k / n@k), and chain-of-thought/cascades.

Across techniques, the authors find typical trade spans on the order of 1–2 orders of magnitude (OOM) on each axis: (a) Chinchilla-style scaling can trade ~1.2 OOM of training for ~0.7 OOM of inference; (b) MCTS exhibits an S-curve where, at low performance, ~1 OOM of training ↔ ~1.6 OOM of inference, and the trade shrinks as models near perfect play; (c) IMP pruning to ≈10% density saved ≈1 OOM of inference at the cost of ~0.7 OOM of extra training, because of multi-round re-training; and (d) resampling with cheap verification (e.g., code generation scored by pass@k) extrapolates to much larger spans. For (d), the authors estimate up to ~3 OOM of training savings against a ~6 OOM increase in inference, but caution that they have low confidence in the extrapolation.

The authors categorize three tradeoff archetypes: saturating with constant span (e.g., Chinchilla), saturating with decreasing span (e.g., MCTS, limited n@k), and non-saturating with increasing span (resampling with effectively unlimited trials). They show that some techniques combine roughly additively, but often with diminishing returns, so realistic combined spans are ~2–3 OOM. Methodologically, they fit empirical curves (smoothly broken power laws and logarithmic relations) and prioritize Pareto frontiers when identifying optimal TC/IC choices.

The paper emphasizes governance implications. Because aggregate inference cost often exceeds training cost in deployed systems, commercial models will be biased toward low-IC configurations. Researchers or well-resourced actors, however, can invest IC to simulate larger-model capabilities for evaluation, internal use, or small-customer deployments, which affects safety assessments and policy proposals aimed at controlling frontier capabilities.
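To make the OOM arithmetic concrete, the sketch below converts an OOM shift into a multiplicative factor and estimates how many inferences it takes before a higher-TC, lower-IC configuration becomes cheaper in total compute. It is an illustration only, not the authors' method. It assumes total compute is simply TC + N × IC, and the baseline values (1e24 FLOP of training, 1e12 FLOP per inference) are hypothetical placeholders. The only figures taken from the summary are the pruning trade of ~0.7 OOM extra training for ≈1 OOM less inference.

```python
import math
from typing import Optional


def oom_to_factor(oom: float) -> float:
    """Convert a shift in orders of magnitude into a multiplicative factor."""
    return 10.0 ** oom


def total_compute(train: float, per_inference: float, n_inferences: float) -> float:
    """Assumed cost model: one-off training cost plus N identical inferences."""
    return train + per_inference * n_inferences


def break_even_inferences(
    train_a: float, inf_a: float, train_b: float, inf_b: float
) -> Optional[float]:
    """Number of inferences N at which configs A and B cost the same in total.

    Returns None when the two cost lines never cross for N > 0.
    """
    d_inf = inf_b - inf_a
    if d_inf == 0:
        return None
    n = (train_a - train_b) / d_inf
    return n if n > 0 else None


# Hypothetical baseline (placeholder numbers, not from the paper).
BASE_TC = 1e24  # training FLOP
BASE_IC = 1e12  # FLOP per inference

# Pruning trade as reported in the summary: +0.7 OOM training, -1 OOM inference.
PRUNED_TC = BASE_TC * oom_to_factor(0.7)
PRUNED_IC = BASE_IC * oom_to_factor(-1.0)

n_star = break_even_inferences(BASE_TC, BASE_IC, PRUNED_TC, PRUNED_IC)
print(f"extra training factor: {oom_to_factor(0.7):.2f}x")
print(f"break-even inferences: {n_star:.3e}")
```

Under these assumed numbers, the pruned configuration pays off only once the deployment serves more inferences than the break-even count. This is one way to read the summary's point that heavily used commercial models are pushed toward low-IC configurations, while low-volume uses such as evaluation can afford the opposite trade.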