Beyond VLA and World Models: Galaxy General Unveils LDA‑1B to Scale Embodied Data
LDA‑1B unifies world modeling and vision‑language‑action (VLA) learning in a single latent dynamics action model. Trained on more than 30 000 hours of heterogeneous embodied data ingested through the five‑layer AstraData pipeline, it standardizes actions in a unified end‑effector space, allocates data by quality tier, achieves state‑of‑the‑art success rates on RoboCasa‑GR1, and is fully open‑sourced.
Background
Recent breakthroughs in embodied intelligence center on scaling the data that robots can learn from. Generalist AI’s GEN‑1 and Physical Intelligence’s π 0.7 both target the same core problem: ingesting massive, heterogeneous real‑world data.
LDA‑1B Overview
The cross‑ontology Latent Dynamics Action Model (LDA‑1B) is a 1.6 B‑parameter foundation model that jointly learns world modeling and vision‑language‑action (VLA) control in a shared implicit latent space. It is trained on more than 30 000 hours of diverse embodied data, spanning simulated, real‑robot, and human‑recorded data, low‑quality “dirty” trajectories, and unlabeled video.
Data Pipeline (AstraData)
AstraData is a five‑layer data pyramid: internet‑scale data → human behavior data → multi‑ontology synthetic simulation data → real tele‑operation data → on‑robot autonomous data. Using this pipeline, the team constructed EI‑30K, the 30 K‑hour embodied interaction dataset used to train the 1.6 B‑parameter model.
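For concreteness, the pyramid can be pictured as a tiered data mixture. A minimal sketch follows; the layer names come from the article, while the DataLayer class, the has_actions flags, and the sampling weights are illustrative assumptions, not values reported by the team.

```python
from dataclasses import dataclass

# Illustrative sketch of the AstraData five-layer pyramid as a tiered data
# mixture. Layer names follow the article; the class, flags, and sampling
# weights are made-up placeholders, not values reported by the team.
@dataclass
class DataLayer:
    name: str
    has_actions: bool       # whether trajectories carry usable action labels
    sampling_weight: float  # hypothetical fraction of each training batch

ASTRADATA_PYRAMID = [
    DataLayer("internet_scale_data",       has_actions=False, sampling_weight=0.10),
    DataLayer("human_behavior_data",       has_actions=False, sampling_weight=0.15),
    DataLayer("multi_ontology_simulation", has_actions=True,  sampling_weight=0.30),
    DataLayer("real_teleoperation",        has_actions=True,  sampling_weight=0.30),
    DataLayer("on_robot_autonomous",       has_actions=True,  sampling_weight=0.15),
]

# The weights of a mixture must sum to one.
assert abs(sum(layer.sampling_weight for layer in ASTRADATA_PYRAMID) - 1.0) < 1e-9
```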
Unified End‑Effector Space
All recordings are converted to a standard LeRobot format. A unified end‑effector action space maps 6‑DoF end‑effector poses (plus gripper width or MANO hand parameters) across different hardware, enabling cross‑body generalization.
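As a concrete illustration, a unified action vector might be packed as below. This is a sketch under stated assumptions: the 51‑dimensional layout, the function name to_unified_action, and the zero‑padding convention are hypothetical, not the paper’s actual encoding (MANO’s 45 articulation parameters correspond to 15 hand joints × 3 axis‑angle values).

```python
import numpy as np

# Hypothetical packing of one embodiment's action into a shared end-effector
# space: a 6-DoF pose (xyz + axis-angle rotation) plus either a scalar
# gripper width or MANO hand parameters, zero-padded to a fixed-size vector.
POSE_DIM = 6
HAND_DIM = 45  # MANO articulation: 15 hand joints x 3 axis-angle values

def to_unified_action(pose_6dof: np.ndarray,
                      gripper_width: float | None = None,
                      mano_params: np.ndarray | None = None) -> np.ndarray:
    """Map a robot- or hand-specific action into the unified action space."""
    hand = np.zeros(HAND_DIM)
    if mano_params is not None:        # dexterous hand: full MANO parameters
        hand[:mano_params.shape[0]] = mano_params
    elif gripper_width is not None:    # parallel gripper: one scalar, rest is padding
        hand[0] = gripper_width
    return np.concatenate([pose_6dof, hand])  # shape: (POSE_DIM + HAND_DIM,) = (51,)

# A two-finger gripper and a dexterous hand now share one 51-D action layout.
gripper_action = to_unified_action(np.zeros(POSE_DIM), gripper_width=0.04)
hand_action = to_unified_action(np.zeros(POSE_DIM), mano_params=np.zeros(HAND_DIM))
```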
Quality‑Based Data Allocation
High‑quality action data participates fully in both policy learning and dynamics training.
Sub‑optimal or noisy action data is excluded from policy learning but still used for dynamics and visual prediction; the team reports that mixing in 30 % of such trajectories improves task success by 10 %.
Action‑free video (e.g., first‑person footage) feeds only the visual‑prediction objective, letting the model learn physical priors without action labels. A minimal routing sketch follows this list.
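The allocation rule can be summarized as a routing table. This is a minimal sketch assuming hypothetical tier and objective names; the paper’s actual configuration schema is not shown in the article.

```python
# Illustrative routing of the three data tiers onto the training objectives
# described above. Tier and objective names are assumptions made for
# clarity, not the paper's actual configuration schema.
OBJECTIVES = ("policy", "dynamics", "visual_prediction")

ROUTING = {
    "high_quality_action": {"policy", "dynamics", "visual_prediction"},
    "noisy_action":        {"dynamics", "visual_prediction"},  # excluded from policy loss
    "action_free_video":   {"visual_prediction"},              # physical priors only
}

def active_losses(tier: str) -> list[str]:
    """Return the loss terms a batch from this tier should switch on."""
    return [obj for obj in OBJECTIVES if obj in ROUTING[tier]]

print(active_losses("noisy_action"))  # ['dynamics', 'visual_prediction']
```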
Three Systematic Unifications
Unified task form: policy learning, forward dynamics, inverse dynamics, and visual prediction are all cast as the same problem of predicting a future state plus a future action (see the sketch after this list).
Unified representation space: DINO latent features replace pixel‑level reconstruction, yielding embeddings that are background‑invariant yet sensitive to object semantics and geometry.
Unified model architecture: a multimodal Diffusion Transformer (MM‑DiT) processes action sequences and future visual tokens with shared attention, so the “thinking” process is shared across tasks.
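One way to read the unified task form is that every capability predicts the same target pair and differs only in what it conditions on. The masking scheme below, and the note that policy additionally sees a language instruction, are illustrative assumptions, not the paper’s exact formulation.

```python
# Sketch of the unified task form: all four capabilities predict the same
# target pair (future state s_t1, future action a_t) and differ only in
# which inputs are visible. The masking scheme, and the note that policy
# additionally sees a language instruction, are illustrative assumptions.
VISIBLE_INPUTS = {
    "policy":            ("s_t", "instruction"),  # act from observation + command
    "forward_dynamics":  ("s_t", "a_t"),          # how the world responds to an action
    "inverse_dynamics":  ("s_t", "s_t1"),         # which action explains a transition
    "visual_prediction": ("s_t",),                # how the world evolves on its own
}
TARGETS = ("s_t1", "a_t")  # shared prediction target across all four tasks

for task, visible in VISIBLE_INPUTS.items():
    print(f"{task:17s} sees {visible} -> predicts {TARGETS}")
```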
Model Architecture (MM‑DiT)
MM‑DiT incorporates Task Embedding and Register Token mechanisms to switch among the four capabilities within a single network. Shared attention between action and visual streams enables simultaneous prediction of world evolution and the actions that cause it, embedding causal reasoning in the attention structure.
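To make the shared‑attention idea concrete, here is a minimal PyTorch sketch of one such block. The class name SharedAttentionBlock, the dimensions, the number of register tokens, and the pre‑norm layout are all assumptions for illustration; the actual MM‑DiT block is a diffusion transformer with timestep conditioning, omitted here for brevity.

```python
import torch
import torch.nn as nn

# Minimal PyTorch sketch of an MM-DiT-style block: a task embedding selects
# the capability, register tokens provide scratch slots, and one shared
# attention pass lets action and visual tokens attend to each other. All
# names and dimensions are assumptions; the real block is a diffusion
# transformer with timestep conditioning, omitted here for brevity.
class SharedAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8, n_tasks=4, n_registers=4):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, dim)  # policy / fwd / inv / visual
        self.registers = nn.Parameter(torch.zeros(n_registers, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, action_tokens, visual_tokens, task_id):
        b = action_tokens.shape[0]
        task = self.task_emb(task_id).unsqueeze(1)            # (B, 1, D)
        regs = self.registers.unsqueeze(0).expand(b, -1, -1)  # (B, R, D)
        # One joint sequence: action and visual streams share attention, so
        # predicting world evolution and the action causing it are coupled.
        x = torch.cat([task, regs, action_tokens, visual_tokens], dim=1)
        h = self.norm(x)
        x = x + self.attn(h, h, h)[0]
        start = 1 + regs.shape[1]
        n_act = action_tokens.shape[1]
        return x[:, start:start + n_act], x[:, start + n_act:]  # split streams

# Usage: batch of 2, 16 action tokens, 64 visual tokens, task 0 = policy.
block = SharedAttentionBlock()
acts, vis = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512), torch.tensor([0, 0]))
```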
Experimental Results
On the RoboCasa‑GR1 benchmark, LDA‑1B achieves a 55.4 % success rate, surpassing GR00T‑N1.6 (47.6 %) and π 0.5. An ablation that replaces the DINO latent space with VAE pixel‑level reconstruction drops success to 20.0 %, underscoring how critical the latent representation is for scaling.
Real‑World Demonstrations
In tests on Galbot and Unitree robots, LDA‑1B shows strong few‑shot cross‑body generalization, adapting to new hardware with roughly one hour of post‑training data. It succeeds on long‑horizon tasks such as multi‑step stacking, dynamic object manipulation, and high‑precision dexterous tasks (e.g., flipping a steak), and it corrects errors on the fly.
Future Directions
Planned extensions include end‑to‑end joint learning of visual features and latent dynamics, incorporation of additional perception modalities, and automated optimization of data‑quality allocation during training.
Publication and Resources
Paper: LDA‑1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion, RSS 2024. arXiv: https://arxiv.org/abs/2602.12215. Project page: https://pku-epic.github.io/LDA/. Code repository: https://github.com/jiangranlv/LDA-1B.