Goal-VLA: Generative Large‑Model World Model Enables Zero‑Shot Robot Manipulation (ICRA 2026)
Goal-VLA introduces an image‑generative vision‑language model as an object‑centric world model that decouples high‑level semantic reasoning from low‑level control, using a reflection‑through‑synthesis loop and spatial grounding to achieve around 60% zero‑shot success on both RLBench simulations and real‑world UFACTORY X‑ARM tasks.
Goal‑VLA Overview
Goal‑VLA is a decoupled hierarchical framework that uses an image‑generative visual‑language model (VLM) as an object‑centric world model. The target object’s desired pose in image space serves as the interface between high‑level visual‑language reasoning and low‑level motion execution, enabling zero‑shot robot manipulation without task‑specific fine‑tuning or paired action data.
Execution Pipeline
1. Goal State Reasoning
A text VLM first expands a short natural‑language command into a detailed prompt. Gemini 2.5 Flash Image then generates candidate goal images from this prompt. An iterative Synthesis‑Reflection loop validates each image: Grounded SAM segments the candidate object and overlays it onto the initial scene, and a Reflector VLM assesses physical feasibility. If the image is infeasible, the Reflector returns corrective feedback that guides regeneration. The loop terminates when an image passes validation or a maximum iteration count is reached.
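The control flow of the Synthesis‑Reflection loop can be sketched as follows. This is a minimal illustration, not the authors' implementation: `generate` and `reflect` are hypothetical callables standing in for the image generator and the Reflector VLM (with the segmentation and overlay step folded into `reflect`), and returning `None` when the iteration budget runs out is an assumption, since the source does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical Reflector output: a pass/fail flag plus corrective text."""
    feasible: bool
    feedback: str = ""


def synthesis_reflection(
    prompt: str,
    generate: Callable[[str], object],
    reflect: Callable[[object], Verdict],
    max_iters: int = 3,
) -> Tuple[Optional[object], int]:
    """Return (validated goal image or None, number of iterations used)."""
    current_prompt = prompt
    for iteration in range(1, max_iters + 1):
        image = generate(current_prompt)   # synthesis step
        verdict = reflect(image)           # reflection step
        if verdict.feasible:
            return image, iteration
        # Assumed feedback mechanism: append the correction to the base prompt.
        current_prompt = f"{prompt}\nCorrection: {verdict.feedback}"
    return None, max_iters
```

Passing the two models in as callables keeps the loop testable with stubs and independent of any particular model API.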
2. Spatial Grounding
Pixel‑level semantic features from the current observation and the validated goal image are matched to establish 2‑D correspondences. Depth Anything V2 predicts depth maps for both frames; after depth alignment, the correspondences are lifted to 3‑D point clouds. The Umeyama algorithm solves the least‑squares problem to obtain the optimal rotation and translation that align the current scene to the goal pose.
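The lifting and alignment steps can be illustrated with a short NumPy sketch. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy` are hypothetical parameters not given in the source) and metric depth after alignment. The solver is the rotation‑and‑translation case of Umeyama's closed‑form solution, without the optional scale factor. Feature matching and depth alignment are out of scope here.

```python
import numpy as np


def backproject(pixels, depth, fx, fy, cx, cy):
    """Lift integer (u, v) pixels to camera-frame 3-D points (pinhole model)."""
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v, u]                      # depth is indexed as [row, col]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def umeyama_rigid(src, dst):
    """Least-squares R, t such that dst ≈ R @ src + t (no scale).

    src, dst: (N, 3) arrays of corresponding points, N >= 3.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance between the centred point sets.
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0
    R = U @ S @ Vt
    t = mu_d - R @ mu_s
    return R, t
```

In practice the 2‑D correspondences contain outliers, so a robust wrapper such as RANSAC around `umeyama_rigid` would likely be needed; the source does not state whether the authors use one.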
3. Low‑level Policy
The derived object pose is passed to a contact module that samples collision‑free grasp poses on the object’s point cloud. Assuming the gripper‑object relative pose remains constant after grasp, the transformation computed in the spatial grounding step is applied to the gripper. A motion planner then generates a collision‑free trajectory from the current robot configuration to the target gripper pose.
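The rigid‑grasp assumption reduces the target gripper pose to a single matrix product. The sketch below assumes both the object motion and the grasp pose are expressed as 4×4 homogeneous transforms in the same fixed frame (for example the camera or robot base frame); the function names are illustrative and do not come from the paper.

```python
import numpy as np


def make_transform(R, t):
    """Pack a 3x3 rotation and a 3-vector translation into a 4x4 transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def target_gripper_pose(T_object_motion, T_gripper_at_grasp):
    """Gripper pose after the object moves, assuming a rigid grasp.

    T_object_motion: transform mapping the object's current points to its goal
        points, expressed in a fixed frame (the spatial-grounding output).
    T_gripper_at_grasp: gripper pose at grasp time in the same fixed frame.
    """
    # The gripper moves rigidly with the object, so the same fixed-frame
    # motion is applied to the gripper pose.
    return T_object_motion @ T_gripper_at_grasp
```

Left‑multiplying both the object pose and the gripper pose by the same transform leaves their relative pose unchanged, which is the constant gripper‑object relationship the text assumes.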
Experimental Evaluation
Simulation (RLBench)
Eight RLBench tasks (100 trials each) were evaluated under a strict zero‑shot setting. Goal‑VLA achieved a 59.9% average success rate, compared with 26.0% for the key‑point‑based hierarchical model MOKA [4] and near‑zero performance for end‑to‑end models OpenVLA [2] and Pi0 [10] without fine‑tuning.
Real‑World Robot
Four tasks—tomato‑into‑pot, desk‑cleaning, precise weighing, and upright‑bottle placement—were tested on a 7‑DOF UFACTORY X‑ARM 7 robot. Goal‑VLA attained a 60% average success rate, markedly higher than all baselines, demonstrating that explicit 3‑D goal pose generation provides reliable spatial guidance in physical settings.
Ablation Study
Adding the input‑enhancement prompt increased success by 27.5%. Without the Synthesis‑Reflection loop (i.e., with a single generation step), success was 40.0%; adding the loop raised it to 83.8%. Allowing up to three reflection iterations further raised success to 88.8%, confirming the importance of visual feedback and self‑correction.
Conclusion
Goal‑VLA shows that an image‑generative VLM can serve as an object‑centric world model, that iterative synthesis‑reflection improves the physical plausibility of generated goals, and that decoupling high‑level reasoning from low‑level control enables robust zero‑shot manipulation across diverse tasks, environments, objects, and robot morphologies.
References
RT‑2 [1]
OpenVLA [2]
Language‑affordance grounding [3]
MOKA [4]
Voxposer [5]
Grounded SAM [6]
Depth Anything V2 [7]
Umeyama (1991) [8]
RLBench [9]
Pi0 [10]
[1] Zitkovich, Brianna, et al. "Rt-2: Vision-language-action models transfer web knowledge to robotic control." Conference on Robot Learning. PMLR, 2023.
[2] Kim, Moo Jin, et al. "Openvla: An open-source vision-language-action model." arXiv preprint arXiv:2406.09246 (2024).
[3] Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." arXiv preprint arXiv:2204.01691 (2022).
[4] Liu, Fangchen, et al. "Moka: Open-world robotic manipulation through mark-based visual prompting." arXiv preprint arXiv:2403.03174 (2024).
[5] Huang, Wenlong, et al. "Voxposer: Composable 3d value maps for robotic manipulation with language models." arXiv preprint arXiv:2307.05973 (2023).
[6] Ren, Tianhe, et al. "Grounded sam: Assembling open-world models for diverse visual tasks." arXiv preprint arXiv:2401.14159 (2024).
[7] Yang, Lihe, et al. "Depth anything v2." Advances in Neural Information Processing Systems 37 (2024): 21875-21911.
[8] Umeyama, Shinji. "Least-squares estimation of transformation parameters between two point patterns." IEEE Transactions on pattern analysis and machine intelligence 13.4 (1991): 376-380.
[9] James, Stephen, et al. "Rlbench: The robot learning benchmark & learning environment." IEEE Robotics and Automation Letters 5.2 (2020): 3019-3026.
[10] Black, Kevin, et al. "$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).