Jun 23, 2026 · Artificial Intelligence

Can VLA‑JEPA Achieve Robust Vision‑Language‑Action with Few Robot Trajectories and Lots of Human Video?

This article analyzes VLA‑JEPA, a JEPA‑style pre‑training framework that combines a small number of robot trajectories with abundant human video to learn a latent world model for Vision‑Language‑Action (VLA) tasks. It reports improved robustness and high success rates on both simulated and real‑robot benchmarks.

VLA-JEPA · benchmark · latent world modeling

