Goal-VLA: Generative Large‑Model World Model Enables Zero‑Shot Robot Manipulation (ICRA 2026)

Goal-VLA introduces an image‑generative vision‑language model as an object‑centric world model that decouples high‑level semantic reasoning from low‑level control, using a reflection‑through‑synthesis loop and spatial grounding to achieve around 60% zero‑shot success on both RLBench simulations and real‑world UFACTORY X‑ARM tasks.

Machine Heart

Goal‑VLA Overview

Goal‑VLA is a decoupled hierarchical framework that uses an image‑generative vision‑language model (VLM) as an object‑centric world model. The desired pose of the target object, expressed in image space, serves as the interface between high‑level vision‑language reasoning and low‑level motion execution. This enables zero‑shot robot manipulation without task‑specific fine‑tuning or paired action data.

Execution Pipeline

1. Goal State Reasoning

A text‑based VLM first expands the short natural‑language command into a detailed prompt, from which Gemini 2.5 Flash Image generates candidate goal images. An iterative Synthesis‑Reflection loop then validates each candidate: Grounded SAM segments the object in the candidate image and overlays it onto the initial scene, and a Reflector VLM assesses whether the result is physically feasible. If it is not, the Reflector returns corrective feedback that guides regeneration. The loop terminates when an image passes validation or the maximum iteration count is reached.
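The control flow of the loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate`, `overlay`, `reflect`, and the `Verdict` type are hypothetical stand-ins for the image generator, the Grounded SAM overlay step, and the Reflector VLM, and the way feedback is appended to the prompt is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical result type returned by the reflector."""
    feasible: bool
    feedback: str = ""  # corrective hint when the image is infeasible


def synthesis_reflection(
    prompt: str,
    generate: Callable[[str], object],     # hypothetical: prompt -> candidate goal image
    overlay: Callable[[object], object],   # hypothetical: segment object, paste onto initial scene
    reflect: Callable[[object], Verdict],  # hypothetical: judge physical feasibility
    max_iters: int = 3,
) -> Tuple[Optional[object], int]:
    """Return (validated goal image or None, number of generations used)."""
    current_prompt = prompt
    for iteration in range(1, max_iters + 1):
        candidate = generate(current_prompt)
        verdict = reflect(overlay(candidate))
        if verdict.feasible:
            return candidate, iteration
        # Assumed feedback mechanism: append the reflector's hint to the prompt.
        current_prompt = f"{prompt}\nCorrection: {verdict.feedback}"
    # No candidate passed validation within the iteration budget.
    return None, max_iters
```

Passing the three stages in as callables keeps the sketch independent of any particular model API; a caller would decide what to do when `None` is returned.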

2. Spatial Grounding

Pixel‑level semantic features from the current observation and the validated goal image are matched to establish 2‑D correspondences. Depth Anything V2 predicts depth maps for both frames; after depth alignment, the correspondences are lifted to 3‑D point clouds. The Umeyama algorithm solves the least‑squares problem to obtain the optimal rotation and translation that align the current scene to the goal pose.
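As a rough sketch of the geometry in this step, the NumPy code below lifts matched pixels to 3‑D with a pinhole camera model and then solves the rigid least‑squares alignment in the style of Umeyama [8] (rotation and translation only, no scale). The function names, the pinhole intrinsics `fx, fy, cx, cy`, and the assumption that depth maps are already aligned and metric are illustrative choices, not details taken from the paper.

```python
import numpy as np


def backproject(pixels, depth, fx, fy, cx, cy):
    """Lift (N, 2) pixel coordinates (u, v) to (N, 3) camera-frame points.

    Assumes a pinhole camera and a depth map indexed as depth[v, u].
    """
    pixels = np.asarray(pixels, dtype=float)
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v.astype(int), u.astype(int)]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def umeyama_rigid(src, dst):
    """Least-squares R, t such that dst is approximately src @ R.T + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    t = mu_d - R @ mu_s
    return R, t
```

In practice the correspondences would contain outliers, so a robust wrapper (for example RANSAC around `umeyama_rigid`) is a common addition; whether Goal‑VLA uses one is not stated here.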

3. Low‑level Policy

The derived object pose is passed to a contact module that samples collision‑free grasp poses on the object’s point cloud. Assuming the gripper‑object relative pose remains constant after grasp, the transformation computed in the spatial grounding step is applied to the gripper. A motion planner then generates a collision‑free trajectory from the current robot configuration to the target gripper pose.
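The constant gripper‑object assumption means the object's rigid motion can be applied directly to the grasp pose. The sketch below shows this with 4x4 homogeneous transforms; the helper names are hypothetical, and it assumes the object motion and the grasp pose are expressed in the same reference frame (a transform estimated in the camera frame would first need to be converted with the camera extrinsics).

```python
import numpy as np


def to_homogeneous(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def target_gripper_pose(T_object_motion, T_grasp):
    """Apply the object's rigid motion to the grasp pose.

    Both inputs are 4x4 transforms in the same frame. Because the gripper is
    assumed rigidly attached to the object after grasping, moving the gripper
    by the same transform carries the object to its goal pose.
    """
    return T_object_motion @ T_grasp


# Example: rotate 90 degrees about z and shift, starting from a grasp at x = 1.
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_motion = to_homogeneous(Rz, [0.5, 0.0, 0.1])
T_grasp = to_homogeneous(np.eye(3), [1.0, 0.0, 0.0])
T_target = target_gripper_pose(T_motion, T_grasp)
```

The resulting `T_target` is what a motion planner would receive as its goal; the planner itself is outside the scope of this sketch.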

Experimental Evaluation

Simulation (RLBench)

Eight RLBench tasks (100 trials each) were evaluated under a strict zero‑shot setting. Goal‑VLA achieved a 59.9% average success rate, compared with 26.0% for the key‑point‑based hierarchical model MOKA [4] and near‑zero performance for end‑to‑end models OpenVLA [2] and Pi0 [10] without fine‑tuning.

Real‑World Robot

Four tasks—tomato‑into‑pot, desk‑cleaning, precise weighing, and upright‑bottle placement—were tested on a 7‑DOF UFACTORY X‑ARM 7 robot. Goal‑VLA attained a 60% average success rate, markedly higher than all baselines, demonstrating that explicit 3‑D goal pose generation provides reliable spatial guidance in physical settings.

Ablation Study

Adding the input‑enhancement prompt increased success by 27.5%. Without the Synthesis‑Reflection loop (i.e., using a single generation step), the baseline succeeded in 40.0% of trials; enabling reflection raised success to 83.8%, and allowing up to three reflection iterations raised it further to 88.8%, confirming the importance of visual feedback and self‑correction.

Conclusion

Goal‑VLA shows that an image‑generative VLM can serve as an object‑centric world model, that iterative synthesis‑reflection improves the physical plausibility of generated goals, and that decoupling high‑level reasoning from low‑level control enables robust zero‑shot manipulation across diverse tasks, environments, objects, and robot morphologies.

References

RT‑2 [1]

OpenVLA [2]

Language‑affordance grounding [3]

MOKA [4]

VoxPoser [5]

Grounded SAM [6]

Depth Anything V2 [7]

Umeyama (1991) [8]

RLBench [9]

Pi0 [10]

[1] Zitkovich, Brianna, et al. "RT-2: Vision-language-action models transfer web knowledge to robotic control." Conference on Robot Learning. PMLR, 2023.
[2] Kim, Moo Jin, et al. "OpenVLA: An open-source vision-language-action model." arXiv preprint arXiv:2406.09246 (2024).
[3] Ahn, Michael, et al. "Do as I can, not as I say: Grounding language in robotic affordances." arXiv preprint arXiv:2204.01691 (2022).
[4] Liu, Fangchen, et al. "MOKA: Open-world robotic manipulation through mark-based visual prompting." arXiv preprint arXiv:2403.03174 (2024).
[5] Huang, Wenlong, et al. "VoxPoser: Composable 3D value maps for robotic manipulation with language models." arXiv preprint arXiv:2307.05973 (2023).
[6] Ren, Tianhe, et al. "Grounded SAM: Assembling open-world models for diverse visual tasks." arXiv preprint arXiv:2401.14159 (2024).
[7] Yang, Lihe, et al. "Depth Anything V2." Advances in Neural Information Processing Systems 37 (2024): 21875-21911.
[8] Umeyama, Shinji. "Least-squares estimation of transformation parameters between two point patterns." IEEE Transactions on Pattern Analysis and Machine Intelligence 13.4 (1991): 376-380.
[9] James, Stephen, et al. "RLBench: The robot learning benchmark & learning environment." IEEE Robotics and Automation Letters 5.2 (2020): 3019-3026.
[10] Black, Kevin, et al. "$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).