Scaling World Model Dynamics to Over a Thousand Steps in Two ICLR Papers
This article reviews two ICLR papers by Haoxin Lin that advance world‑model dynamics from single‑step bootstrapping to any‑step direct prediction and introduce structured uncertainty via back‑tracking. The resulting models sustain stable full‑horizon roll‑outs of over a thousand steps, substantially improving both online and offline reinforcement‑learning performance.
World models aim to construct an internal environment that can simulate future trajectories. While recent work has rapidly improved state representation, the dynamics component—predicting how states evolve under actions—has lagged, limiting the model’s ability to serve as a true "internal simulator".
Two ICLR contributions
Haoxin Lin and the LAMDA reinforcement‑learning group published two consecutive ICLR papers: the Any‑step Dynamics Model (ADM, ICLR 2025) and ADM‑v2 (ICLR 2026). The first replaces the traditional single‑step bootstrapping pipeline with a back‑tracking scheme that directly predicts a state several steps ahead from an earlier latent state and a sequence of actions.
From single‑step bootstrapping to any‑step direct prediction
Conventional dynamics models predict the next state given the current state and action, then feed that prediction back as input for the next step. This bootstrapping approach accumulates error: a single deviation propagates and amplifies across the rollout horizon, causing short‑term success but long‑term instability.
ADM instead selects a historical latent state, concatenates a multi‑step action sequence, and predicts the future state in one shot. By shortening the error‑propagation chain, ADM reduces drift and enables more reliable long‑horizon roll‑outs.
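The contrast between the two roll-out styles can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `step_model` and `anystep_model` are hypothetical stand-ins, and the 1-D dynamics with a fixed per-call bias `eps` simply makes the error-accumulation argument concrete.

```python
def bootstrap_rollout(step_model, s0, actions):
    """Roll out one step at a time, feeding each prediction back in.
    An error made at step t contaminates every later prediction."""
    s, traj = s0, []
    for a in actions:
        s = step_model(s, a)  # prediction becomes the next input
        traj.append(s)
    return traj

def anystep_rollout(anystep_model, s0, actions):
    """Predict each future state directly from the original state and
    the action prefix, so no prediction is ever re-consumed as input."""
    return [anystep_model(s0, actions[:k + 1]) for k in range(len(actions))]

# Toy 1-D dynamics s' = s + a, with a small model bias eps per call.
eps = 0.01
step = lambda s, a: s + a + eps                # biased single-step model
anystep = lambda s, acts: s + sum(acts) + eps  # one biased call in total

traj_b = bootstrap_rollout(step, 0.0, [1.0] * 100)   # bias compounds 100x
traj_a = anystep_rollout(anystep, 0.0, [1.0] * 100)  # bias incurred once
```

Under this toy model the bootstrapped trajectory drifts by `eps` per step, while the direct prediction pays the bias only once per query, which is exactly the shortened error-propagation chain the ADM design targets.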
Structured uncertainty via back‑tracking
Instead of training an ensemble of separate dynamics models, ADM exploits the variance among predictions made with different back‑tracking lengths. When the model operates in well‑covered data regions, predictions across time‑scales are consistent; in sparse or out‑of‑distribution regions, they diverge, providing a built‑in uncertainty signal.
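A minimal sketch of this idea, assuming a hypothetical any-step model interface: predict the same target state from several earlier anchor states (different back-tracking lengths `k`) and treat the spread of the predictions as the uncertainty signal. The `toy_model` below, with a `drift` term standing in for out-of-distribution error, is invented purely for illustration.

```python
def backtrack_uncertainty(anystep_model, states, actions, t, ks=(1, 2, 4)):
    """Predict state t from several earlier anchors (back-tracking
    lengths k); the variance of the predictions is the uncertainty."""
    preds = [anystep_model(states[t - k], actions[t - k:t]) for k in ks]
    mean = sum(preds) / len(preds)
    var = sum((p - mean) ** 2 for p in preds) / len(preds)
    return mean, var

# Toy: a model that is exact in-distribution but whose error grows with
# the prediction span when the region is poorly covered.
def toy_model(s, acts, drift=0.0):
    return s + sum(acts) + drift * len(acts)

states = [float(i) for i in range(10)]  # true s_t = t under a_t = 1
actions = [1.0] * 10
good = lambda s, a: toy_model(s, a, drift=0.0)  # well-covered region
bad = lambda s, a: toy_model(s, a, drift=0.5)   # stand-in for OOD error

_, var_in = backtrack_uncertainty(good, states, actions, t=8)
_, var_out = backtrack_uncertainty(bad, states, actions, t=8)
```

In the well-covered case all back-tracking lengths agree and the variance collapses to zero; in the drifting case the predictions fan out, giving the built-in uncertainty signal without training a separate ensemble.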
Empirical results for ADM
Using ADM, the authors built two RL agents:
ADMPO‑ON for online model‑based RL, which shows higher sample efficiency.
ADMPO‑OFF for offline model‑based RL, which outperforms strong baselines (BC, CQL, MOPO, MOBILE, etc.) on D4RL and NeoRL benchmarks.
Tables in the paper report consistent performance gains, confirming that better future prediction translates into better policy learning.
ADM‑v2: full‑horizon roll‑out up to a thousand steps
ADM‑v2 tackles the next question: can dynamics models support a full‑horizon roll‑out that approaches an entire episode? The paper demonstrates, for the first time in an offline RL setting, stable roll‑outs of over a thousand steps.
Key architectural changes separate state initialization (encoded once as a latent vector) from action‑driven evolution, removing repeated re‑encoding of the start state. This yields a cleaner recurrent loop that better handles multi‑step direct prediction.
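The separation described above can be pictured as follows. The function names (`encode`, `evolve`, `decode`) are illustrative placeholders, not the paper's API; the point is the shape of the loop: the start state is encoded exactly once, and everything after that is driven by actions alone.

```python
def rollout_v2(encode, evolve, decode, s0, actions):
    """Encode the start state once, then evolve the latent purely from
    actions -- no repeated re-encoding of the start state per step."""
    z = encode(s0)           # one-time state initialization
    out = []
    for a in actions:
        z = evolve(z, a)     # action-driven latent evolution
        out.append(decode(z))
    return out

# Identity toy instantiation, just to show the control flow.
out = rollout_v2(lambda s: s, lambda z, a: z + a, lambda z: z,
                 0.0, [1.0] * 5)
```

Because the start state never re-enters the loop, the recurrence depends only on the latent and the action sequence, which is what makes multi-step direct prediction composable over long horizons.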
PARoll – Parallel Any‑step Roll‑out
ADM‑v2 adds PARoll, a parallel any‑step roll‑out mechanism that maintains multiple prediction streams with different step lengths simultaneously. This not only speeds up inference but also generates uncertainty estimates naturally from the divergence among parallel predictions.
PARoll is the crucial component that moves ADM‑v2 from “conceptually able to predict long horizons” to “actually executing thousand‑step roll‑outs”.
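One way to picture PARoll, as a hedged sketch rather than the paper's algorithm: run several roll-out streams over the same action sequence, each advancing the latent in jumps of a different chunk size, then read off their disagreement wherever two streams land on the same time step. The `biased` toy model below is an invented stand-in that adds a small error per model call.

```python
def paroll(anystep_model, z0, actions, step_sizes=(1, 2, 4)):
    """Sketch of a parallel any-step roll-out: streams advance the same
    latent with different chunk sizes; their disagreement at shared
    time steps doubles as an uncertainty estimate."""
    horizon = len(actions)
    streams = {}
    for k in step_sizes:
        z, preds = z0, {}
        for t in range(k, horizon + 1, k):
            z = anystep_model(z, actions[t - k:t])  # one k-step jump
            preds[t] = z
        streams[k] = preds
    # Divergence among streams at time steps every stream reached.
    shared = set.intersection(*(set(p) for p in streams.values()))
    spread = {t: max(p[t] for p in streams.values())
                 - min(p[t] for p in streams.values())
              for t in sorted(shared)}
    return streams, spread

biased = lambda z, acts: z + sum(acts) + 0.01  # eps-per-call toy model
streams, spread = paroll(biased, 0.0, [1.0] * 8)
```

Streams that take fewer, longer jumps make fewer model calls and so accumulate less per-call error, so the spread at shared time steps grows with compounded drift, which is the "uncertainty for free" that the divergence among parallel predictions provides.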
Policy evaluation and learning with ADM‑v2
ADM‑v2 is applied to offline policy evaluation on the DOPE benchmark, where it surpasses existing evaluation methods and other dynamics‑model baselines. For offline policy learning, the ADM2PO‑fh agent achieves new state‑of‑the‑art results on D4RL and NeoRL, improving average performance by 4.6% and 12.8% respectively.
Importantly, unlike many methods whose performance degrades as rollout length increases, ADM‑v2 continues to benefit from longer horizons, indicating that the error accumulation problem has been substantially mitigated.
Implications
These two works illustrate a clear research trajectory: first prove that dynamics need not rely on single‑step bootstrapping (ADM), then demonstrate that a properly structured any‑step model can sustain full‑episode roll‑outs (ADM‑v2). This shifts world models from short‑range predictors toward genuine data‑driven simulators capable of long‑term planning, strategy evaluation, and embodied intelligence.