GigaWorld-Policy Boosts Inference Speed 10× and Success Rate 30%
The newly released GigaWorld-Policy world-action model (WAM) replaces the video-prediction-heavy branches of traditional WAM designs with an action-centered architecture. Its makers report a ten-fold inference speedup, a ten-fold gain in training efficiency, and a 30% absolute increase in real-robot task success rate, alongside lower memory usage than Motus and Cosmos-Policy.
GigaWorld-Policy, a new world-action model announced by GigaAI, directly targets the latency and training cost of existing embodied large models. It reports a ten-fold inference speedup, a ten-fold training-efficiency improvement, and a 30% absolute increase in real-robot task success rate, claiming state-of-the-art performance among mainstream WAM models.
Traditional WAM architectures carry a tightly coupled video-prediction branch that must generate future visual frames alongside actions at inference time, incurring substantial latency. GigaWorld-Policy breaks this bottleneck by adopting an action-centered paradigm.
The model builds on the lightweight GigaWorld‑0.5 world model and unifies visual observations, robot state, and action sequences into a single embedding space. A single Transformer backbone jointly models these modalities, eliminating the modal fragmentation of multi‑branch designs.
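The single-backbone design described above can be sketched roughly as follows. All names, dimensions, and the trivial zero-padding "projection" are illustrative assumptions, not GigaWorld-Policy's actual API; the point is only that three modalities end up as one token sequence for one Transformer.

```python
# Hypothetical sketch: fold visual, robot-state, and action tokens into one
# shared embedding space so a single backbone can attend over all of them.

def embed(tokens, dim):
    """Stand-in for a learned per-modality projection: pad or truncate each
    token's feature vector to the shared embedding width `dim`."""
    return [(vec + [0.0] * dim)[:dim] for vec in tokens]

def build_sequence(visual, state, actions, dim=8):
    """Concatenate the three modalities into the single token sequence a
    shared Transformer backbone would process jointly."""
    seq = embed(visual, dim) + embed(state, dim) + embed(actions, dim)
    # Record each token's modality so downstream masking can tell them apart.
    kinds = (["visual"] * len(visual) + ["state"] * len(state)
             + ["action"] * len(actions))
    return seq, kinds

seq, kinds = build_sequence(
    visual=[[0.1, 0.2]] * 4,   # 4 visual patch tokens
    state=[[0.5]],             # 1 robot-state token
    actions=[[0.0, 0.0]] * 2,  # 2 action tokens
)
```

Unifying the modalities this way is what lets one attention stack replace the multi-branch designs the article contrasts against.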
Training phase (adding load): a causal-mask mechanism integrates action tokens with future visual tokens, so that action prediction benefits from the dense supervision provided by future video dynamics.
Inference phase (lightening the load): the heavy video-prediction branch is dropped entirely, leaving only a lightweight action-generation module and eliminating the structural computational redundancy.
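The "add in training, drop at inference" idea can be illustrated with a toy attention mask. The actual mask layout is not public; this sketch assumes one plausible arrangement: action tokens never attend to future-frame tokens, while future-frame tokens attend causally to everything before them, so the video-prediction loss supplies extra gradient through the shared backbone. At inference, the future-visual tokens are simply omitted.

```python
# Illustrative assumption, not the published mechanism: a causal mask in
# which 'action' tokens are blocked from reading 'future_vis' tokens.

def attention_mask(kinds):
    """kinds[i] is 'obs', 'action', or 'future_vis'.
    mask[i][j] is True when token i may attend to token j."""
    n = len(kinds)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):           # causal: never attend ahead
            if kinds[i] == "action" and kinds[j] == "future_vis":
                continue                 # actions never peek at future frames
            mask[i][j] = True
    return mask

# Training sequence (interleaving is an assumption): observations, then
# alternating future-frame and action tokens.
train_kinds = ["obs", "obs", "future_vis", "action", "future_vis", "action"]
train_mask = attention_mask(train_kinds)

# Inference "lightening": drop the future-visual tokens and reuse the same
# mask logic on the shorter, action-only sequence.
infer_kinds = [k for k in train_kinds if k != "future_vis"]
infer_mask = attention_mask(infer_kinds)
```

Because the action tokens never depend on the future-frame tokens, removing the video branch at inference leaves their computation unchanged, which is what makes the speedup possible without retraining.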
Compared with current leading models such as Motus and Cosmos‑Policy, GigaWorld-Policy maintains high‑quality policy output while delivering a ten‑fold inference speed increase and substantially lower GPU memory consumption, paving the way for large‑scale industrial deployment.
The model’s training pipeline consists of three stages:
Universal physical-world pre-training: massive internet video corpora give GigaWorld-0.5 a basic understanding of general physical laws and visual dynamics.
Immersive fine-tuning on embodied scenarios: thousands of hours of first-person, real-robot, and simulated manipulation videos teach the model the spatio-temporal patterns specific to embodied interaction.
Few-shot action alignment: with strong world knowledge already in place, only a small amount of real-robot action-labeled data is needed to align the pretrained world model with precise action prediction, establishing a causal mapping from observation to action to future vision.
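The final alignment stage can be sketched in miniature: keep the pretrained backbone frozen and fit only a small action head on a handful of labeled demonstrations. The function names, the two-feature "backbone", and the linear head are all illustrative assumptions standing in for components the article does not specify.

```python
# Hypothetical sketch of few-shot action alignment on a frozen backbone.

def frozen_backbone(obs):
    """Stand-in for the pretrained world model: maps an observation to a
    fixed feature vector. Frozen, so alignment never updates it."""
    return [obs[0] + obs[1], obs[0] - obs[1]]

def train_action_head(pairs, lr=0.05, steps=500):
    """Fit a tiny linear head w on top of the frozen features with plain
    stochastic gradient descent on squared error."""
    w = [0.0, 0.0]
    for _ in range(steps):
        for obs, target in pairs:
            f = frozen_backbone(obs)
            err = w[0] * f[0] + w[1] * f[1] - target
            w[0] -= lr * err * f[0]
            w[1] -= lr * err * f[1]
    return w

# A "tiny sample": just three labeled (observation -> action) demonstrations.
pairs = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0)]
w = train_action_head(pairs)
```

Because only the head's two weights are trained, very little labeled robot data is needed, which is the efficiency argument the three-stage pipeline rests on.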
This hierarchical "large-scale pre-training + few-shot task-specific fine-tuning" paradigm yields an overall ten-fold training-efficiency gain over traditional VLA training schemes.
In real-robot evaluations covering grasping, assembly, and object-sorting tasks, GigaWorld-Policy achieves an average success rate close to 85%, an absolute improvement of more than 30 percentage points over competitors such as Cosmos-Policy, and it outperforms even the fast Pi-series models. The millisecond-level response made possible by the 10× inference speedup is crucial for handling dynamic disturbances and execution errors, and underpins the high success rate.
In summary, GigaWorld-Policy reconstructs the paradigm of embodied policy learning by using dense future‑visual supervision during training and a lightweight action‑only inference path, making world‑model‑driven robotics more practical and valuable for real‑time, high‑efficiency control.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
