From Prediction to Planning: WLA Unifies World Modeling, Language Reasoning, and Action Generation

The paper introduces the World‑Language‑Action (WLA) model, which replaces the pixel‑level future prediction of World‑Action Models with two jointly modeled signals: textual intent and fine‑grained physical dynamics. With 2 B active parameters, WLA‑0 runs inference in 40 ms, nearly doubles the success rate of the second‑best method on the RMBench benchmark, and outperforms prior WAM and VLA baselines in simulation and real‑robot tests.

Machine Heart

Recent World‑Action Models (WAMs) have three major drawbacks: they spend capacity reconstructing irrelevant visual details, they incur high inference latency when generating images or video, and they lack the semantic information needed for long‑term planning.

To address these issues, the DENG Lab at Shanghai Jiao Tong University proposes the World‑Language‑Action (WLA) model, which jointly models two key signals for future states: coarse‑grained textual intent and fine‑grained physical dynamics.

Textual intent is expressed in natural language, providing a concise, interpretable semantic representation that filters out unnecessary visual details and supports goal decomposition, memory organization, logical reasoning, and long‑term planning.

Physical dynamics describe how actions affect the environment, encoding object poses, contact relations, and motion trends. This bridges high‑level intent with low‑level control, allowing the robot to understand not only what to do but also the consequences of doing it.

The architecture uses an autoregressive Transformer backbone initialized from a pretrained vision‑language model (VLM). A "world expert" and a set of meta‑queries are appended to the input sequence; the world expert predicts future visual states conditioned on current observations and meta‑queries, producing latent representations that capture core physical dynamics without reconstructing low‑level details. These latent dynamics then condition an "action expert" that generates executable robot actions.

During inference, the world expert can be disabled so that actions are generated directly, with a latency of 40 ms. This avoids the "imagine‑then‑act" delay of pixel‑based WAM approaches.
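The data flow described above can be sketched in a few lines of plain Python. This is a toy illustration only, not the authors' implementation: the functions `backbone`, `world_expert`, `action_expert`, and `wla_step` are hypothetical names, the arithmetic inside them is a placeholder for real networks, and the sketch assumes that the latent dynamics are read at the meta‑query positions so that the action path does not depend on the world expert being run.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class StepOutput:
    actions: Vector
    predicted_future: Optional[Vector]  # None when the world expert is skipped


def backbone(observation: Vector, instruction: Vector, meta_queries: Vector) -> Vector:
    """Toy stand-in for the VLM-initialized Transformer backbone.

    Returns one latent value per meta-query, each mixed with the context.
    """
    context = sum(observation) + sum(instruction)
    return [query + context for query in meta_queries]


def world_expert(latent_dynamics: Vector) -> Vector:
    """Toy stand-in: maps latent dynamics to a predicted future state."""
    return [2.0 * z for z in latent_dynamics]


def action_expert(latent_dynamics: Vector) -> Vector:
    """Toy stand-in: maps latent dynamics to robot actions."""
    return [0.5 * z for z in latent_dynamics]


def wla_step(observation: Vector, instruction: Vector, meta_queries: Vector,
             use_world_expert: bool) -> StepOutput:
    # Assumption: the latents come from the meta-query positions of the backbone.
    latent = backbone(observation, instruction, meta_queries)
    # The future prediction is optional; the action path never reads it.
    future = world_expert(latent) if use_world_expert else None
    return StepOutput(actions=action_expert(latent), predicted_future=future)


observation = [1.0, 2.0]
instruction = [3.0]
meta_queries = [0.0, 1.0]

with_world = wla_step(observation, instruction, meta_queries, use_world_expert=True)
fast_path = wla_step(observation, instruction, meta_queries, use_world_expert=False)

print(with_world.actions == fast_path.actions)  # True
print(fast_path.predicted_future)               # None
```

In this sketch, skipping the world expert changes nothing about the generated actions; it only removes the cost of predicting the future state, which is the property the fast inference path relies on.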

Experimental evaluation on the RoboTwin 2.0 and LIBERO simulation benchmarks shows that WLA‑0, with only 2 B active parameters and no embodied pre‑training, achieves competitive results. On the long‑horizon RMBench benchmark, WLA‑0 attains a 56.5 % average success rate, almost twice that of the second‑best method (Mem‑0 at 28.5 %) and far above Fast‑WAM (13.3 %) and VLA baselines (5.5 %). An ablation that removes the sub‑task prediction loss drops the success rate to 17.3 %, confirming the importance of language reasoning for long‑term tasks.
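The RMBench comparison can be checked with simple arithmetic. The short script below uses only the success rates quoted above; the variable and function names are illustrative.

```python
# Average RMBench success rates (%) as quoted in the article.
rmbench_success = {
    "WLA-0": 56.5,
    "Mem-0": 28.5,
    "Fast-WAM": 13.3,
    "VLA baselines": 5.5,
}
ablation_success = 17.3  # WLA-0 without the sub-task prediction loss


def advantage_over(method: str) -> float:
    """Ratio of the WLA-0 success rate to that of another method."""
    return rmbench_success["WLA-0"] / rmbench_success[method]


for method in ("Mem-0", "Fast-WAM", "VLA baselines"):
    print(f"WLA-0 vs {method}: {advantage_over(method):.2f}x")

ablation_drop = rmbench_success["WLA-0"] - ablation_success
print(f"Drop without sub-task loss: {ablation_drop:.1f} points")
```

The ratios come out to roughly 1.98x over Mem‑0, 4.25x over Fast‑WAM, and 10.27x over the VLA baselines, and the ablation costs 39.2 percentage points.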

Real‑robot experiments further show that WLA‑0 outperforms two pretrained baselines, including Motus, especially on dynamic tasks such as "Dispose Trash," where Motus is held back by its higher inference latency. WLA‑0's inference latency is only 1/40 that of Motus, which lets it track rotating objects in time.
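To see why latency matters for a rotating target, consider how far the object turns while the policy is still computing. The sketch below is a back‑of‑the‑envelope estimate under stated assumptions: it takes the 40 ms latency reported for WLA‑0 as applying to the real‑robot setup, derives the Motus latency from the 1/40 ratio, and uses a hypothetical rotation speed of 30 degrees per second that does not come from the article.

```python
WLA_LATENCY_MS = 40                      # reported WLA-0 inference latency
MOTUS_LATENCY_MS = 40 * WLA_LATENCY_MS   # inferred from the 1/40 ratio (assumption)
ROTATION_SPEED_DEG_PER_S = 30            # hypothetical value, for illustration only


def drift_deg(latency_ms: int, speed_deg_per_s: float) -> float:
    """Angle the object rotates between observation and action output."""
    return speed_deg_per_s * latency_ms / 1000


wla_drift = drift_deg(WLA_LATENCY_MS, ROTATION_SPEED_DEG_PER_S)
motus_drift = drift_deg(MOTUS_LATENCY_MS, ROTATION_SPEED_DEG_PER_S)

print(f"WLA-0 drift: {wla_drift:.1f} deg")
print(f"Motus drift: {motus_drift:.1f} deg")
```

Under these assumptions the object moves about 1.2 degrees during one WLA‑0 inference but about 48 degrees during one Motus inference, so the slower policy would be acting on a badly stale observation.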

Finally, the authors explore cross‑embodiment transfer by training on 45 seen tasks and testing on 5 unseen tasks, using video supervision for the unseen tasks. Adding video supervision from the same embodiment raises success from ~12 % to 34.4 % (Clean) and 30.0 % (Random), while video supervision from a different embodiment still yields notable gains (28.8 % / 27.4 %). A case study on the Beat‑Block‑Hammer task shows that the model can learn to grasp a hammer and strike a target despite never seeing action annotations, highlighting the potential for learning from unlabeled videos.
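One way to read these numbers is to ask how much of the same‑embodiment result survives when the video comes from a different embodiment. The script below computes that ratio from the four success rates quoted above; the dictionary layout and function name are illustrative.

```python
# Success rates (%) on the 5 unseen tasks, as quoted in the article.
success = {
    "same_embodiment": {"Clean": 34.4, "Random": 30.0},
    "cross_embodiment": {"Clean": 28.8, "Random": 27.4},
}


def retained_fraction(setting: str) -> float:
    """Cross-embodiment success as a fraction of same-embodiment success."""
    return success["cross_embodiment"][setting] / success["same_embodiment"][setting]


for setting in ("Clean", "Random"):
    print(f"{setting}: {retained_fraction(setting):.1%} retained")
```

By this measure, cross‑embodiment video retains about 83.7 % of the same‑embodiment success rate in the Clean setting and about 91.3 % in the Random setting.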

The code, model weights, and paper are all publicly available (arXiv:2606.05979, GitHub repository https://github.com/SJTU-DENG-Lab/WLA, HuggingFace model hub).

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
