Why VLA Pioneers Are Abandoning Vision‑Language‑Action Models

Generalist AI’s GEN-1 model achieves over 99% task success and 2–3× speed gains with only a tenth of the data, and its founders argue that vision‑language‑action (VLA) models are merely a crutch, urging a shift toward goal‑driven models trained fully from scratch on the road to physical AGI.

Machine Heart

Last week Generalist AI unveiled the GEN-1 model, which delivers over 99% success rates on robot tasks, runs 2–3 times faster than its predecessors, and requires only one‑tenth of the data and fine‑tuning needed by earlier models.

Founded in 2024, Generalist AI is backed by investors such as NVIDIA and Boldstart Ventures. Its leadership includes CEO Pete Florence (formerly Google DeepMind), CTO Andrew Barry (formerly Boston Dynamics), and chief scientist Andy Zeng (formerly a DeepMind research scientist). The team previously released GEN-0, demonstrating that physical interaction data can be turned into predictable, scalable machine intelligence.

Following the GEN-1 launch, CEO Pete Florence published a blog post critiquing the current trend of vision‑language‑action (VLA) models. He claims that the very creators of the VLA concept now intend to abandon VLA and even the “world‑model” label, arguing that over‑emphasis on tool labels limits imagination toward physical AGI.

In GEN-1, roughly 99% of the parameters are trained from scratch.

This decision, once seen as reckless, reflects a two‑year‑long conviction: with sufficient data and full control over a base model, breakthroughs can be accelerated.

GEN-1 is neither a fine‑tuned vision‑language model (VLM) nor a pure “world model.” It is a foundation model built natively for physical interaction, treating that interaction as a first‑class citizen.

Training from scratch remains the winning strategy when ample data and compute are available.

From 2023 to 2025, VLA models dominated the field; by early 2026, “world models” are expected to peak. The authors note that Generalist AI never categorizes its own model as VLA or world model, despite being co‑creators of the VLA concept and publishing robot‑related world‑model research since 2023.

Goal More Important Than Tool Labels

John Schulman’s comparison of “idea‑driven” versus “goal‑driven” research shows that the latter, which first defines a concrete outcome and then tackles obstacles, tends to be more effective. The authors argue that current world‑model research is idea‑driven and may not align with the ultimate goal.

The proposed ultimate goal is “fully zero‑shot” robot capability: robots should execute a wide variety of unseen tasks with high success and speed, without any task‑specific training data, a hallmark of full physical AGI.

To approach this, the authors suggest incremental milestones: start with a specific task (Task X) using a small amount of robot training data, then progressively reduce the required data while maintaining >99% success, aiming for a one‑hour data requirement across tasks.

Goal‑driven roadmaps have historically yielded broader impact, as illustrated by an early multimodal language model originally built for robotics that later excelled in medical diagnosis benchmarks.

How Far Can We Go?

Binary “either‑or” thinking (e.g., choosing between method A and method B) is limiting. A deeper inquiry asks how far we can push under given constraints, and which of those constraints can be relaxed. Historical examples include early robotics debates over focusing solely on perception versus control, and early‑2020s AI product managers who insisted on bespoke models for each niche instead of leveraging large‑scale co‑training.

The Chinchilla paper exemplifies the power of questioning data‑efficiency assumptions, earning a NeurIPS Outstanding Paper award and influencing industry.
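As a reminder of what that paper actually questioned (the details below come from the Chinchilla paper itself, not from this article): it fit a parametric loss model and showed that, for a fixed compute budget, parameters and training tokens should scale roughly in step, contradicting the then‑common practice of growing models much faster than datasets.

```latex
% Chinchilla's fitted loss model (Hoffmann et al., 2022):
% E is the irreducible loss; A, B, \alpha, \beta are fitted constants;
% N is parameter count, D is training tokens.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Minimizing L under a fixed compute budget C \approx 6ND yields
N_{\mathrm{opt}} \propto C^{a}, \qquad
D_{\mathrm{opt}} \propto C^{b}, \qquad
a \approx b \approx 0.5
```

With a ≈ b ≈ 0.5, doubling compute should double both parameters and tokens, which is why subsequent models shifted toward smaller networks trained on far more data, exactly the kind of assumption‑questioning the authors advocate.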

Supply‑side constraints are also evolving. Robot data scarcity was once the bottleneck, but Generalist AI now holds over 500,000 hours of physical interaction data, enough, it argues, to move beyond the VLA crutch.

During periods of limited robot data, visual‑language training serves as a helpful “crutch.” However, once data sufficiency is achieved, reliance on this crutch should be reconsidered.

Moving Toward Physical AGI

The authors emphasize that goals outweigh specific methods; under existing constraints, the optimal solution should be sought rather than being confined to predefined categories.

Since its inception, Generalist AI has aimed to reconstruct and rethink everything in pursuit of embodied general intelligence (physical AGI). GEN-1 embodies this vision: a fully scratch‑trained model built on the world’s largest physical interaction dataset, with a meticulously designed architecture, training pipeline, and inference stack, unconstrained by decisions inherited from external base models.

GEN-1 showcases impressive capabilities: scaling laws in robotics, rapid adaptation to new environments and embodiments with only hours of data, and emergent improvisational intelligence from large‑scale pretraining—just the beginning of the journey toward physical AGI.

Tags: Vision-Language-Action, GEN-1, Generalist AI, Goal-driven research, Physical AGI
Written by

Machine Heart

Professional AI media and industry service platform