Not Boring by Packy McCormick

World Models: Computing the Uncomputable

2026-03-19 · 12:55 UTC · Packy McCormick · 86 min read

Brief

World Models are presented here as the next major foundation-model paradigm for embodied AI: systems that learn the causal structure of environments from observations plus actions, then let agents train, plan, and act inside those learned environments. The central distinction is between passive next-frame prediction and intervention-aware next-state prediction. In the author’s framing, a video model predicts what likely comes next in a scene, while a World Model predicts what happens next because an agent did something. That action-conditioned loop is the essential technical move: it compresses highly complex, stochastic environments into a fixed-cost neural forward pass, rather than requiring explicit simulation of every object, person, and interaction. The claim is that this makes previously intractable domains—robotics, driving, dense multi-agent environments, and eventually much richer physical systems—computationally manageable in a way symbolic code or language alone cannot.

The essay places this idea in a longer lineage than current hype suggests. It traces conceptual roots back to Schmidhuber’s 1990 proposal to learn differentiable world simulators and Sutton’s 1991 Dyna architecture, both of which anticipated agents learning by interacting with internal models before hardware and datasets made the vision feasible. The modern revival begins with Ha and Schmidhuber’s 2018 'World Models' paper, which decomposed the problem into a vision model, memory model, and controller, and showed that an agent could train successfully in hallucinated game environments before transferring back to the real task. Since then, progress has unfolded in waves. SimPLe showed sample-efficient Atari learning with only 100,000 real steps. DreamerV2 used recurrent state-space models with discrete latents to reach human-level Atari performance across 55 games while training entirely in imagination on a single GPU. MuZero took a different route, learning latent internal dynamics and planning entirely in abstract hidden states, matching AlphaZero in Go, chess, and shogi while extending to 57 Atari games without explicit rules.
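The Dyna idea of learning from an internal model can be sketched in tabular form. What follows is a minimal Dyna-Q-style loop (the textbook tabular variant of Dyna) on a hypothetical five-cell corridor. It is an illustration under stated assumptions, not code from Sutton's paper or from the essay; the environment, the function names `env_step` and `dyna_q`, and all hyperparameters are invented for the example.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (-1, +1)    # step left or step right
GOAL = N_STATES - 1


def env_step(state, action):
    """Real environment (hypothetical toy): returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), GOAL)
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model = {}  # learned model: (state, action) -> (next_state, reward, done)

    def update(s, a, r, s2, done):
        # Standard one-step Q-learning update.
        target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                best = max(q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if q[(s, b)] == best])
            s2, r, done = env_step(s, a)      # one real step
            update(s, a, r, s2, done)         # learn from real experience
            model[(s, a)] = (s2, r, done)     # remember what the world did
            for _ in range(planning_steps):   # replay imagined steps from the model
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q


q = dyna_q()
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(greedy)  # greedy action in each non-terminal cell
```

Each real step here is followed by twenty updates drawn from the learned model, so most of the learning happens in imagination. The greedy action in every non-terminal cell ends up as +1, a move toward the goal.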

A major current debate is whether the best World Models should be latent or generative. The latent camp, associated with MuZero, JEPA, and Yann LeCun’s AMI, argues that predicting full pixels wastes capacity on inherently unpredictable detail; the model should instead predict abstract future representations that preserve what matters for planning. This can make planning dramatically faster, as illustrated by Meta’s V-JEPA 2: a 1.2B-parameter model pre-trained on over 1 million hours of video, then fine-tuned on 62 hours of robot data, which performed zero-shot planning on Franka robot arms in seconds rather than the minutes pixel-space planners require. The tradeoff is visibility and debuggability: if the model reasons in latent space, humans cannot easily inspect failure modes or interact inside the imagined world. Generative approaches (IRIS, DIAMOND, GAIA-2, Genie, and related systems) accept higher computational cost in exchange for playable, human-observable rollouts. IRIS reframed world modeling as autoregressive prediction over visual tokens and was the first imagination-training approach to beat human players given the same two hours of gameplay data. DIAMOND later used diffusion rather than discrete autoregression and showed that richer visual detail translated into better downstream agent behavior; it also produced a playable Counter-Strike-like neural environment trained from about 87 hours of footage on a single GPU.
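Planning in latent space can be sketched with a toy example. The code below assumes a two-dimensional latent state, a hand-written `latent_dynamics` function standing in for a learned predictor, and a goal given as a latent vector. It searches exhaustively over short action sequences, a deliberately simple stand-in for the sampling-based planners real systems use, and it is not the planner of any system named above. All names and numbers are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

Latent = Tuple[float, float]
Action = Tuple[float, float]

# Hypothetical action set: "stay" plus unit moves along each latent axis.
ACTIONS: Sequence[Action] = (
    (0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0),
)


def latent_dynamics(z: Latent, a: Action) -> Latent:
    """Stand-in for a learned predictor: next latent given latent and action."""
    return (z[0] + 0.5 * a[0], z[1] + 0.5 * a[1])


def cost(z: Latent, goal: Latent) -> float:
    """Squared distance between a predicted latent and the goal latent."""
    return (z[0] - goal[0]) ** 2 + (z[1] - goal[1]) ** 2


def plan(z: Latent, goal: Latent, horizon: int = 3) -> Action:
    """Score every action sequence by summed cost; return the best first action."""
    best_cost, best_first = float("inf"), ACTIONS[0]
    for seq in product(ACTIONS, repeat=horizon):
        z_pred, total = z, 0.0
        for a in seq:  # the rollout never leaves latent space: no pixels are decoded
            z_pred = latent_dynamics(z_pred, a)
            total += cost(z_pred, goal)
        if total < best_cost:
            best_cost, best_first = total, seq[0]
    return best_first


def control_loop(start: Latent, goal: Latent, steps: int = 12) -> List[Latent]:
    """Replan at every step and execute only the first planned action."""
    z, path = start, [start]
    for _ in range(steps):
        # Toy simplification: the "real" world is assumed to match the model exactly.
        z = latent_dynamics(z, plan(z, goal))
        path.append(z)
    return path


goal = (2.0, 1.0)
path = control_loop((0.0, 0.0), goal)
print(path[-1], cost(path[-1], goal))  # (2.0, 1.0) 0.0
```

Because each candidate is scored with a few arithmetic operations on a small vector, evaluating all 125 sequences is cheap. A pixel-space planner would have to render every imagined frame before it could score it.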

The article argues that the field crossed from toy benchmarks into deployment between 2023 and 2026. Wayve’s 9B-parameter GAIA-1 showed that scaling laws seem to hold for visual World Models as they do for LLMs. GAIA-2 extended this with latent diffusion, flow matching, and space-time factorized transformers to simulate surround-view driving under controllable conditions including weather, road geometry, and other agents, allowing training on rare edge cases that would be hard or dangerous to collect from real roads. Comma.ai reportedly used a learned World Model to train a driving policy deployed into openpilot, outperforming imitation learning and classical simulators. Google DeepMind’s Genie learned action spaces from unlabeled 2D platformer video, while SIMA 2 used a Gemini backbone plus game-world training to produce an agent that can follow instructions, converse, and generalize across unseen virtual worlds. Meanwhile, the alternative robotics stack is VLAs: models such as RT-2 and Physical Intelligence’s π-series that leverage the enormous infrastructure built for language and vision-language models, then attach action heads for robot control. The essay’s conclusion is that these lines may converge, with World Models providing causal simulation and VLAs contributing semantic reasoning.

The most opinionated part of the piece concerns data strategy. It argues that action labels are the scarce resource in embodied AI and that inferring actions from generic video is still too lossy for hard edge cases. General Intuition’s thesis is that game data offers a uniquely scalable source of ground-truth human action traces. Through Medal, the company claims access to more than 1 billion gaming clips annually, complete with in-game action labels and metadata across tens of thousands of environments. That data, the authors argue, provides trillions of observe-predict-act examples and may be a better pretraining substrate for general embodied intelligence than narrow robotics demonstrations. They propose three transfer curves that determine whether such pretraining will work in reality—input modality transfer, sensor transfer, and environment transfer—and claim their game-controller-centric approach collapses two of those unknowns. Whether that bet pays off is unresolved, but the article makes clear that by March 2026 World Models have moved from a niche research concept to a heavily funded, technically diverse race spanning driving, robotics, simulation, and general-purpose agents.

Why it matters

The article defines a World Model as an action-conditioned predictive model of the form P(s_{t+1} | s_t, a_t), contrasting it with video models that predict P(x_{t+1} | x_t); the key claim is that adding actions lets a model simulate causality and interactivity rather than just generate plausible visuals.
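A minimal sketch of that distinction, assuming a toy one-dimensional world where the state is an integer position and an action is a step of -1, 0, or +1. The functions `video_model_step`, `world_model_step`, and `rollout` are hypothetical stand-ins for learned networks and do not come from the article.

```python
from typing import Callable, List

State = int   # toy state: position on a line
Action = int  # toy action: -1 (left), 0 (stay), +1 (right)


def video_model_step(history: List[State]) -> State:
    """Passive predictor, P(x_{t+1} | x_t): extrapolate the last observed motion.

    It never sees an action, so it can only continue the trend in the data.
    """
    if len(history) < 2:
        return history[-1]
    velocity = history[-1] - history[-2]
    return history[-1] + velocity


def world_model_step(state: State, action: Action) -> State:
    """Action-conditioned predictor, P(s_{t+1} | s_t, a_t).

    Hand-written toy dynamics standing in for a learned network.
    """
    return state + action


def rollout(step: Callable[[State, Action], State],
            start: State, actions: List[Action]) -> List[State]:
    """Imagine a trajectory by feeding each predicted state back in."""
    states = [start]
    for a in actions:
        states.append(step(states[-1], a))
    return states


observed = [0, 1, 2]  # the agent has been moving right

# The passive model gives one answer regardless of what the agent intends.
passive_next = video_model_step(observed)

# The world model gives a different future for each candidate plan.
keep_going = rollout(world_model_step, observed[-1], [+1, +1, +1])
turn_back = rollout(world_model_step, observed[-1], [-1, -1, -1])

print(passive_next)  # 3
print(keep_going)    # [2, 3, 4, 5]
print(turn_back)     # [2, 1, 0, -1]
```

The passive predictor can only say the agent probably keeps moving right. The action-conditioned predictor lets the agent ask what happens if it turns back, which is the property planning and training in imagination depend on.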

Key details

The historical lineage runs from Jürgen Schmidhuber’s 1990 'Making the World Differentiable' and Richard Sutton’s 1991 Dyna architecture to David Ha and Schmidhuber’s 2018 'World Models' paper, which showed an agent could train inside a learned dream-like environment and transfer its policy back to real game environments.
The field progressed through four modern waves: early proof-of-concept systems like SimPLe on Atari 100k (2019), human-level latent-model agents such as DreamerV2 and MuZero (2020), interactive generative models such as Wayve’s 9B-parameter GAIA-1 (2023) and DIAMOND (2024), and 2025-2026 systems aimed at real-world deployment like comma.ai’s world-model-trained openpilot policy and Meta’s V-JEPA 2.
DreamerV2 became the first World Model agent to achieve human-level performance across the 55-game Atari benchmark while training entirely in imagination on a single GPU, while MuZero matched AlphaZero on Go, chess, and shogi and generalized to 57 Atari games by planning in abstract latent states rather than generating visible futures.
The article presents a core split between latent and generative World Models: Yann LeCun’s JEPA/AMI approach predicts abstract future representations without pixels for efficiency and planning speed, while generative systems like IRIS, DIAMOND, GAIA-2, and Genie prioritize pixel-level or video-level rollouts that humans and agents can inspect and interact with.
Meta’s V-JEPA 2 is cited as a leading latent-space result: a 1.2B-parameter model pre-trained on more than 1 million hours of video, then fine-tuned on just 62 hours of robot data from the DROID dataset, enabling zero-shot pick-and-place planning on Franka arms in seconds rather than the minutes required by pixel-space planners.

Cleaned source text

title: World Models: Computing the Uncomputable

author: Packy McCormick

content_type: article

publication: Not Boring by Packy McCormick

published: 2026-03-19T12:55:52+00:00

source_url: https://www.notboring.co/p/world-models

word_count: 19240