What Exactly Is a World Model? History, Technology, and the $10 B Bet
This article traces the two parallel, decades-long research lines that produced video world models (dreaming agents in reinforcement learning, and learning physics from human video), explains how they converged in 2024-2025, evaluates current capabilities and limitations, and analyzes the $10 billion investment landscape and the strategic moves by NVIDIA, OpenAI, and others.
Two independent research lineages
The reinforcement-learning lineage traces back to Kenneth Craik's 1943 proposal that the brain builds a small-scale internal model of reality, and to Jürgen Schmidhuber's 1990 formalisation of a differentiable environment model for neural networks. After a long lull, David Ha and Schmidhuber revived the idea in the 2018 World Models paper (VAE + MDN-RNN + controller), which solved Car Racing and trained a VizDoom agent entirely inside the model's own imagined rollouts. Danijar Hafner extended this line with PlaNet (2019) and the Dreamer series (2020-2025). Dreamer V2 reached human-level Atari performance; Dreamer V3 collected diamonds in Minecraft from scratch, without human demonstrations (Nature 2025). DeepMind's MuZero (2020) showed that a model predicting only reward, value and policy, without pixel reconstruction, can master Go, chess, shogi and Atari.
The computer-vision lineage began with action-conditional video prediction for planning (Oh et al. 2015; Finn et al. 2016). It then shifted to learning visual representations from large-scale egocentric video: R3M (Nair et al. 2022), pretrained on Ego4D, enabled a Franka arm to learn a task from 20 demonstrations; VPT (Baker et al. 2022) was pretrained on 70 k h of Minecraft video and then fine-tuned to perform complex actions from few examples; EgoMimic (Kareer et al. 2025) combined human video demonstrations with robot data, improving performance by 34-228 % and generalising to new objects and scenes.
Convergence into video world models (2024‑2025)
Two breakthroughs enabled the merger. First, interactive video models such as Genie and GameNGen, originally narrow prototypes in 2024, became capable of real‑time, action‑conditional generation after AR‑DiT (Yin et al. 2025) and Self‑Forcing (Huang et al. 2025) introduced autoregressive diffusion. Second, the robot community’s data scarcity was alleviated by pretraining on millions of hours of human video and fine‑tuning with a small amount of robot data.
The resulting video world models inherit (1) dream‑based planning from the RL line (learning dynamics, imagining futures, training policies inside imagined worlds) and (2) high‑fidelity video generation from the vision line.
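The three steps inherited from the RL line (learn dynamics, imagine futures, choose a policy inside the imagined world) can be sketched on a toy problem. This is a minimal illustration, not the method of any system named here: the 1-D world, the linear model, the proportional controller and every function name are assumptions made for the example.

```python
import random

def true_step(state, action):
    """Real dynamics of a toy 1-D world; the agent only sees samples from it."""
    return 0.9 * state + 0.5 * action

def collect_real_data(n, rng):
    """Gather (state, action, next_state) transitions using random actions."""
    data = []
    for _ in range(n):
        s = rng.uniform(-1.0, 1.0)
        a = rng.uniform(-1.0, 1.0)
        data.append((s, a, true_step(s, a)))
    return data

def fit_linear_model(data):
    """Least-squares fit of next_state ~ w_s*state + w_a*action (2x2 normal equations)."""
    ss = sum(s * s for s, a, n in data)
    sa = sum(s * a for s, a, n in data)
    aa = sum(a * a for s, a, n in data)
    sn = sum(s * n for s, a, n in data)
    an = sum(a * n for s, a, n in data)
    det = ss * aa - sa * sa
    w_s = (sn * aa - an * sa) / det
    w_a = (ss * an - sa * sn) / det
    return w_s, w_a

def imagined_return(model, gain, start_state, target, horizon):
    """Roll the controller a = gain*(target - s) forward inside the learned model only."""
    w_s, w_a = model
    s, total = start_state, 0.0
    for _ in range(horizon):
        a = gain * (target - s)
        s = w_s * s + w_a * a          # imagined next state, no real interaction
        total -= (target - s) ** 2     # reward: negative squared distance to target
    return total

rng = random.Random(0)
model = fit_linear_model(collect_real_data(200, rng))   # step 1: learn dynamics
candidate_gains = [0.0, 0.5, 1.0, 1.5]
# steps 2 and 3: imagine each candidate's future and keep the best one
best_gain = max(candidate_gains,
                key=lambda g: imagined_return(model, g, 0.0, 1.0, 20))
print("learned model:", model, "best gain:", best_gain)
```

Real systems replace the linear model with a learned latent or video model and the gain search with policy-gradient training, but the division of labour is the same: real data is spent on the model, and policy search happens in imagination.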
Demonstrated capabilities
Automated-driving simulation: Wayve's GAIA world model and Waymo's learned world models generate diverse driving scenarios for stress-testing.
Games and entertainment: Decart's Oasis, Genie 3 (24 fps, 720p) and GameNGen (20 fps) run fully interactive environments in real time.
Policy evaluation: DreamDojo's predicted success rates correlate with real-world success rates at Pearson r = 0.995, so 20 candidate policies can be ranked without real-world trials.
Synthetic training data: DreamGen (NVIDIA, 2025) fine-tuned a video generator on tele-operation data from a single pick-and-place task; a humanoid robot then performed 22 new behaviours in unseen environments.
Sample-efficient learning: DayDreamer (2022) let a quadruped learn to walk in one hour by imagining thousands of steps between real interactions.
Direct robot control: DreamZero jointly predicts future video and motor commands, reporting a 2× generalisation boost over VLA baselines; independent replication is pending.
Limitations include poor cross-environment generalisation (e.g., Dreamer agents must be retrained from scratch on a new Atari game) and the fact that most production systems still rely on vision-language-action (VLA) models rather than pure world models.
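The Pearson r quoted for DreamDojo measures how closely success rates predicted inside a world model track success rates measured on real hardware. The sketch below shows how such a figure and the resulting ranking could be computed; the policy names and all success rates are made-up illustrative numbers, not DreamDojo results.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_policies(names, predicted):
    """Order policy names from highest to lowest predicted success rate."""
    pairs = sorted(zip(names, predicted), key=lambda p: p[1], reverse=True)
    return [name for name, _ in pairs]

# Hypothetical data: four candidate policies, success rates in [0, 1].
names = ["policy_a", "policy_b", "policy_c", "policy_d"]
predicted_success = [0.20, 0.55, 0.35, 0.80]   # measured in the world model
real_success = [0.25, 0.50, 0.40, 0.85]        # measured on the real robot

r = pearson_r(predicted_success, real_success)
ranking = rank_policies(names, predicted_success)
print(f"r = {r:.3f}, ranking = {ranking}")
```

A high r only justifies skipping real trials if it holds on policies the world model has not been tuned against, which is why the real-robot column is still needed once to validate the evaluator.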
Alternative approach: Joint Embedding Predictive Architecture (JEPA)
Yann LeCun and AMI Labs pursue JEPA, which avoids pixel reconstruction by predicting abstract representations. V‑JEPA 2 was pretrained on >1 M h of internet video and fine‑tuned on 62 h of robot data, achieving 80 % zero‑shot success on grasp‑place tasks.
Technical attributes of video world models
Causality: time flows forward, so each frame may depend only on earlier frames; bidirectional generation violates this constraint.
Interactivity: the model responds to actions in real time; without this the system is a movie, not a simulator.
Persistence: worlds stay consistent for minutes (Genie 3) but not yet for hours.
Real-time performance: current state-of-the-art systems run at 10-30 fps.
Physical accuracy: adherence to real-world physics remains the hardest attribute to achieve.
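Two of the attributes above reduce to simple checks. Causality can be expressed as an attention mask in which frame t sees only frames up to t, and the real-time requirement as a per-frame compute budget. The following is a minimal sketch with hypothetical helper names; it is not taken from any of the models discussed.

```python
def causal_mask(num_frames):
    """mask[row][col] is True when frame `row` may attend to frame `col`."""
    return [[col <= row for col in range(num_frames)]
            for row in range(num_frames)]

def bidirectional_mask(num_frames):
    """Every frame attends to every frame, including future ones."""
    return [[True] * num_frames for _ in range(num_frames)]

def violates_causality(mask):
    """True if any frame is allowed to look at a later frame."""
    n = len(mask)
    return any(mask[row][col]
               for row in range(n)
               for col in range(row + 1, n))

def frame_budget_ms(fps):
    """Milliseconds available to generate one frame at the given rate."""
    return 1000.0 / fps

for row in causal_mask(4):
    print(["x" if allowed else "." for allowed in row])

print("causal mask violates causality:", violates_causality(causal_mask(4)))
print("bidirectional mask violates causality:",
      violates_causality(bidirectional_mask(4)))

for fps in (10, 24, 30):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
```

The budget arithmetic explains why interactivity is demanding: at 24 fps the whole pipeline, from reading the action to emitting the frame, has under 42 ms.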
NVIDIA’s open‑source physical‑AI stack
The stack starts with Cosmos Predict 2.5 (a 14 B-parameter video foundation model trained on 200 M video clips), proceeds to DreamDojo (trained on 44,711 h of first-person human video, r = 0.995 for policy evaluation), then DreamZero (joint video-action prediction, 7 Hz real-time on Blackwell GB200), and includes EgoScale (a scaling law linking human video duration to robot performance, R² = 0.9983). GR00T N2, a robot brain, is slated for release in late 2026. All components are released under Apache 2.0.
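An R² value such as the one quoted for EgoScale describes how well a fitted curve explains the measured points. The sketch below fits a straight line to performance against the logarithm of video hours and computes R². The log-linear form, the performance metric and every data point are assumptions made for illustration; none of the numbers come from EgoScale.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Synthetic data: hours of human video vs. a made-up robot success score.
hours = [100, 1_000, 10_000, 100_000]
score = [0.30, 0.41, 0.49, 0.60]

log_hours = [math.log10(h) for h in hours]
slope, intercept = fit_line(log_hours, score)
r2 = r_squared(log_hours, score, slope, intercept)
print(f"score = {slope:.3f} * log10(hours) + {intercept:.3f}, R^2 = {r2:.4f}")
```

Under this assumed form, the slope is the gain in score per tenfold increase in video, which is the quantity a scaling law makes it possible to budget against.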
Funding landscape
Over the past 18 months more than $10 billion has been invested across four tiers: pure world‑model startups (e.g., AMI Labs, World Labs, Runway), robot‑foundation‑model firms that embed world‑model components (e.g., Skild, Physical Intelligence, Figure), platform providers (NVIDIA, DeepMind) and large‑tech pivots (OpenAI’s Sora, Tesla, xAI). Companies that use world models tend to raise more capital than those that build them, suggesting a shift toward internal development.
References
Schmidhuber, J. (1990). Making the World Differentiable. Technical Report FKI-126-90, TU Munich.
Craik, K. (1943). The Nature of Explanation. Cambridge University Press.
Oh, J. et al. (2015). Action-Conditional Video Prediction using Deep Networks in Atari Games. NeurIPS 2015.
Finn, C., Goodfellow, I. & Levine, S. (2016). Unsupervised Learning for Physical Interaction through Video Prediction. NeurIPS 2016.
Ha, D. & Schmidhuber, J. (2018). World Models. NeurIPS 2018. Interactive demos: worldmodels.github.io.
Hafner, D. et al. (2019). Learning Latent Dynamics for Planning from Pixels. ICML 2019. (PlaNet)
Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature. (MuZero)
Hafner, D. et al. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR 2020. (Dreamer V1)
Hafner, D. et al. (2021). Mastering Atari with Discrete World Models. ICLR 2021. (DreamerV2)
Nair, S. et al. (2022). R3M: A Universal Visual Representation for Robot Manipulation. CoRL 2022.
Kareer, S. et al. (2025). EgoMimic: Scaling Imitation Learning via Egocentric Video. ICRA 2025.
Baker, B. et al. (2022). Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos. NeurIPS 2022.
Wu, P. et al. (2022). DayDreamer: World Models for Physical Robot Learning. CoRL 2022. Project: danijar.com/project/daydreamer.
LeCun, Y. (2022). A Path Towards Autonomous Machine Intelligence. OpenReview.
Hafner, D. et al. (2025). Mastering Diverse Domains through World Models. Nature. (DreamerV3)
Bruce, J. et al. (2024). Genie: Generative Interactive Environments. ICML 2024. (Genie 1)
Yang, S. et al. (2024). Learning Interactive Real-World Simulators. ICLR 2024 Outstanding Paper. (UniSim). Project: universal-simulator.github.io/unisim.
Valevski, D. et al. (2024). Diffusion Models Are Real-Time Game Engines. (GameNGen). Project: gamengen.github.io.
Yin, T., Huang, X. et al. (2025). From Slow Bidirectional to Fast Autoregressive Video Diffusion Models. CVPR 2025. (AR-DiT / CausVid)
Huang, X. et al. (2025). Self Forcing. NeurIPS 2025.
Hafner, D. & Yan, W. (2025). Training Agents Inside of Scalable World Models. (Dreamer 4)
Jang, J. et al. (2025). DreamGen: Unlocking Generalization in Robot Learning through Video World Models. CoRL 2025.
Gao, S., Liang, W. et al. (2026). DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos. Project: dreamdojo-world.github.io.
Ye, S., Ge, Y. et al. (2026). DreamZero: World Action Models as Zero-shot Policies. Project: dreamzero0.github.io.
Zheng, K., Niu, D. et al. (2026). EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data.
Physical Intelligence. (2025). Pi-0.5: a Vision-Language-Action Model with Open-World Generalization.