Can World Models Be Simplified? Two Approaches from LeCun’s Team and Tsinghua
This article reviews two recent papers: LeWorldModel, which uses a minimal JEPA framework to train an end-to-end world model from pixels with only two loss terms, and Fast-WAM, which questions whether explicit future imagination is needed at test time and matches conventional performance with a much faster inference pipeline.
Recent research on world models has produced two notable works that explore how to simplify the learning and inference processes. The first, LeWorldModel (LeWM), originates from Yann LeCun's team and demonstrates that a Joint Embedding Predictive Architecture (JEPA) can be trained end-to-end directly from raw pixels using only two loss components: a next-step embedding prediction loss and a Gaussian-distribution regularizer. This design eliminates the need for multiple auxiliary losses, exponential moving averages, pre-trained encoders, or extra supervision.
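To make the two-term objective concrete, here is a minimal PyTorch sketch. The toy `encoder` and `predictor`, the shapes, and the moment-matching form of the Gaussian regularizer are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a pixel encoder and an action-conditioned predictor.
# Sizes are illustrative, not the paper's.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
predictor = nn.Sequential(nn.Linear(64 + 2, 128), nn.ReLU(), nn.Linear(128, 64))

def jepa_losses(obs_t, action_t, obs_next):
    z_t = encoder(obs_t)                                    # current embedding
    z_next = encoder(obs_next)                              # target embedding
    z_pred = predictor(torch.cat([z_t, action_t], dim=-1))  # predicted next embedding

    # Term 1: next-step embedding prediction.
    pred_loss = F.mse_loss(z_pred, z_next)

    # Term 2: pull the batch of embeddings toward a standard Gaussian.
    # Moment matching is one simple choice; the paper's regularizer may
    # differ. A term like this is what lets the model avoid collapse
    # without EMA targets or stop-gradients.
    gauss_reg = z_t.mean(0).pow(2).mean() + (z_t.var(0) - 1.0).pow(2).mean()
    return pred_loss, gauss_reg

# A single weight combines the two terms, matching the article's claim of
# reducing six tunable loss hyper-parameters to one:
obs_t, obs_next = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)
action_t = torch.rand(8, 2)
pred_loss, gauss_reg = jepa_losses(obs_t, action_t, obs_next)
loss = pred_loss + 1.0 * gauss_reg   # lam = 1.0 is the one knob
```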
LeWM contains roughly 15 million parameters and can be trained on a single GPU in a few hours. Compared with prior end-to-end alternatives, it cuts the number of tunable loss hyper-parameters from six to one and plans up to 48× faster than baseline pixel-based world models. Experiments across diverse 2D and 3D control tasks show competitive performance, and probing the latent space reveals that it encodes meaningful physical structure. A “surprise” evaluation further confirms the model’s ability to detect physically implausible events.
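The article does not spell out how the surprise evaluation is computed. One natural probe, sketched below reusing the toy `encoder` and `predictor` from the snippet above, is to score each observed transition by the model's next-step prediction error in latent space, on the assumption that physically implausible events spike the error:

```python
@torch.no_grad()
def surprise_scores(frames, actions):
    """Score each transition by the next-step prediction error in latent
    space (an assumed probe, not the paper's stated method). Implausible
    transitions should yield noticeably higher scores."""
    scores = []
    for t in range(len(frames) - 1):
        z_t = encoder(frames[t].unsqueeze(0))
        z_next = encoder(frames[t + 1].unsqueeze(0))
        z_pred = predictor(torch.cat([z_t, actions[t].unsqueeze(0)], dim=-1))
        scores.append(F.mse_loss(z_pred, z_next).item())
    return scores  # e.g. flag the frames where the score spikes

frames = torch.rand(10, 3, 32, 32)   # a short clip
actions = torch.rand(9, 2)
print(surprise_scores(frames, actions))
```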
The second work, Fast‑WAM, comes from a Tsinghua University team and revisits a core assumption of World Action Models (WAMs): that explicit future imagination is required at test time. Traditional WAMs follow an “imagine‑then‑act” pipeline that incurs significant inference latency from iterative video denoising; Fast‑WAM instead keeps video co‑training only during the training phase and skips any explicit future rollout at inference.
Fast‑WAM’s architecture still processes visual observations and actions to learn environment dynamics, but at test time it maps the current observation directly to an action without generating future video trajectories. Empirical results on the LIBERO benchmark, RoboTwin, and a real‑world towel‑folding task show that Fast‑WAM matches or exceeds conventional WAMs while cutting inference latency to about 190 ms, a more than four‑fold speedup. Ablation studies confirm that the key to WAM performance is the video co‑training performed during training, not explicit test‑time imagination.
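The inference-time contrast can be sketched in a few lines of PyTorch. The `video_model`, `policy`, and the denoising loop below are hypothetical stand-ins for the diffusion rollout and action head the article describes, not Fast-WAM's actual interfaces:

```python
import torch
import torch.nn as nn

# Toy stand-ins; names and sizes are illustrative, not the papers'.
obs_dim, act_dim = 128, 7
video_model = nn.Linear(obs_dim, obs_dim)   # stand-in for a video diffusion model
policy = nn.Linear(obs_dim, act_dim)        # action head on the shared backbone

@torch.no_grad()
def act_imagine_then_act(obs, denoise_steps=50):
    """Conventional WAM inference: roll out an imagined future through
    iterative denoising, then act. The loop dominates latency."""
    future = obs
    for _ in range(denoise_steps):          # stand-in for iterative video denoising
        future = video_model(future)
    return policy(future)

@torch.no_grad()
def act_direct(obs):
    """Fast-WAM-style inference: the video branch served only as a
    training-time co-objective, so the observation maps straight to an
    action with no future rollout, hence the roughly four-fold speedup."""
    return policy(obs)

obs = torch.rand(1, obs_dim)
a_slow = act_imagine_then_act(obs)   # many model calls per action
a_fast = act_direct(obs)             # one forward pass per action
```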
Both papers contribute complementary perspectives on “doing less” in world‑model research: LeWorldModel simplifies the learning objective and model size, whereas Fast‑WAM simplifies the inference procedure. Together they suggest that compact representations and training‑phase video modeling can achieve efficient, high‑performing world models without the overhead of complex loss designs or costly test‑time simulation.