Embodied AI Unveiled: Ted Xiao Revisits Three Eras of Robot Learning from Google RT‑1/2 to SayCan

In a detailed interview, Ted Xiao, former Google DeepMind researcher, walks through the existence‑proof, foundation‑model, and scaling eras of embodied robot learning, explaining the technical challenges, pivotal decisions, and the evolving role of large language and vision models in robotics.

Machine Heart

Existence Proof Era

In 2015‑2016, breakthroughs such as DQN and AlphaGo demonstrated the power of end‑to‑end data‑driven methods, while robot hardware had already matured. Ted Xiao joined a small Google Brain robot team to test whether reinforcement learning could be applied directly to real robots. The team built a 24‑hour arm farm of KUKA manipulators and ran online RL, confronting the high‑dimensional continuous action space of real arms, which differed fundamentally from the discrete spaces of Atari and Go.

To handle this, they introduced QT‑Opt, which uses the cross‑entropy method (CEM) to approximate the maximization over actions inside the Bellman update, enabling continuous control. QT‑Opt required a full system stack: concurrent RL, a CycleGAN to bridge the sim‑to‑real visual gap, and an evaluation pipeline. The arm‑farm experiments showed that end‑to‑end robot learning works in the real world.
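The role of CEM in the Bellman update can be illustrated with a minimal sketch. It is not the QT‑Opt implementation: `toy_q` is a hypothetical stand‑in for a learned Q‑network, and the sample count, elite count, and iteration count are illustrative assumptions.

```python
import random
import statistics


def cem_argmax(q_fn, state, action_dim, iters=3, samples=64, elites=6, seed=0):
    """Approximate max_a Q(state, a) for actions in [-1, 1]^action_dim."""
    rng = random.Random(seed)
    mean = [0.0] * action_dim
    std = [1.0] * action_dim
    best_action, best_q = None, float("-inf")
    for _ in range(iters):
        # Sample candidate actions from the current Gaussian, clipped to bounds.
        candidates = [
            [max(-1.0, min(1.0, rng.gauss(m, s))) for m, s in zip(mean, std)]
            for _ in range(samples)
        ]
        # Keep the highest-scoring candidates as the elite set.
        ranked = sorted(candidates, key=lambda a: q_fn(state, a), reverse=True)
        top = ranked[:elites]
        # Refit the Gaussian to the elites so the next round searches near them.
        mean = [statistics.fmean([a[i] for a in top]) for i in range(action_dim)]
        std = [statistics.pstdev([a[i] for a in top]) + 1e-6 for i in range(action_dim)]
        top_q = q_fn(state, top[0])
        if top_q > best_q:
            best_action, best_q = top[0], top_q
    return best_action, best_q


def toy_q(state, action):
    # Hypothetical Q-function: highest (zero) when the action equals the state.
    return -sum((a - s) ** 2 for a, s in zip(action, state))


def bellman_target(reward, next_state, gamma=0.9, done=False):
    """Target r + gamma * max_a' Q(s', a'), with the max approximated by CEM."""
    if done:
        return reward
    _, max_q = cem_argmax(toy_q, next_state, action_dim=len(next_state))
    return reward + gamma * max_q
```

In a discrete setting such as Atari, the maximization is a lookup over a handful of actions; here it becomes a small sampling‑based search that runs every time a target is computed.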

After achieving reliable grasping, the team explored multi‑task learning. Notable projects included BC‑Z, the first large‑scale, language‑conditioned imitation‑learning policy, and MT‑Opt, a multi‑task extension of QT‑Opt. They also pursued Learning from Play, which applies hindsight relabeling to unstructured play data. However, both reinforcement learning and imitation learning hit diminishing returns, leading to a "Code Yellowish" period in which the research direction stalled.
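Hindsight relabeling on play data can be sketched in a few lines. This is an illustration of the general idea, not the Learning from Play codebase: the trajectory format and the fixed window length are assumptions made for the example.

```python
def hindsight_relabel(trajectory, window):
    """Turn an unlabeled play trajectory into goal-conditioned examples.

    trajectory: list of (observation, action) pairs with no task labels.
    window: number of steps per segment (an illustrative fixed length).
    The last observation of each segment is treated, in hindsight, as the
    goal the earlier steps were heading toward.
    """
    examples = []
    for start in range(len(trajectory) - window + 1):
        segment = trajectory[start:start + window]
        goal = segment[-1][0]
        for obs, action in segment[:-1]:
            examples.append({"obs": obs, "goal": goal, "action": action})
    return examples
```

Every segment becomes a successful demonstration of reaching its own final state, so no one has to annotate what the operator was trying to do.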

During this pause, the team collected ~87,000 high‑quality tele‑operated trajectories in a miniature kitchen, betting on offline supervised learning. A rewrite of the training infrastructure by Yao Lu dramatically improved behavior cloning performance from a ~80% ceiling to 90‑95%, showing that large‑scale, high‑quality data could break the previous limits.

This "slow‑down to speed‑up" phase established data as the primary bottleneck and set the stage for the next era.

Foundation Model Era

Around 2022, large language models (LLMs) and vision‑language models (VLMs) began showing emergent capabilities, offering a "perfect storm" for robotics. The field shifted from online RL to offline large‑scale imitation learning, creating a window to introduce foundation models.

The first work, SayCan, used an LLM as a planner: given a natural‑language instruction, the model proposed high‑level steps, while a learned value function evaluated the feasibility of each sub‑step in the robot's current situation. This combined commonsense reasoning with physical constraints.
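A minimal sketch of how the two signals combine: each candidate skill gets a language score (how useful the LLM thinks it is for the instruction) and an affordance score (how likely the value function thinks it is to succeed right now), and the product decides the next step. The skill names and numbers below are invented for illustration.

```python
def pick_skill(llm_scores, affordances):
    """Choose the skill with the highest (LLM score x affordance) product.

    llm_scores: dict mapping skill name to the LLM's relevance score.
    affordances: dict mapping skill name to the value function's
        estimated success probability in the current state.
    """
    combined = {
        skill: score * affordances.get(skill, 0.0)
        for skill, score in llm_scores.items()
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "bring me a snack".
llm = {"pick up the sponge": 0.6, "pick up the apple": 0.3, "go to the table": 0.1}
aff = {"pick up the sponge": 0.1, "pick up the apple": 0.9, "go to the table": 0.8}
skill, scores = pick_skill(llm, aff)
```

Here the LLM alone would favor the sponge, but the low affordance (say, no sponge is in view) pushes the choice to the apple.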

Subsequently, RT‑1 tokenized language commands and image observations and output discrete robot‑action tokens at 3 Hz. Trained on ~87,000 trajectories covering ~500 tasks, RT‑1 outperformed all ResNet‑18 behavior‑cloning baselines and provided a reusable research infrastructure.
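Discrete action tokens come from binning each continuous action dimension. The sketch below uses 256 uniform bins per dimension, as in the RT‑1 paper; the action bounds and the three‑dimensional example are assumptions for illustration.

```python
def tokenize(action, low, high, bins=256):
    """Map each continuous action dimension to an integer token in [0, bins)."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)          # clip to the valid range
        frac = (a - lo) / (hi - lo)      # normalize to [0, 1]
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens


def detokenize(tokens, low, high, bins=256):
    """Map tokens back to continuous values at the center of each bin."""
    return [
        lo + (t + 0.5) * (hi - lo) / bins
        for t, lo, hi in zip(tokens, low, high)
    ]


# Hypothetical 3-dimensional action with bounds [-1, 1] per dimension.
low, high = [-1.0] * 3, [1.0] * 3
tokens = tokenize([-1.0, 0.0, 1.0], low, high)
recovered = detokenize(tokenize([0.37, -0.2, 0.9], low, high), low, high)
```

The round trip loses at most half a bin width per dimension, which is the price of turning control into a classification problem over tokens.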

Building on this, the team used a VLM to relabel the dataset (DIAL), expanding task descriptions from hundreds to millions of language annotations, akin to hindsight relabeling but in language space.

The next leap, RT‑2, reframed robot action prediction as a visual‑question‑answering task, turning the VLM into the core policy engine. Models ranging from 5B to 55B parameters exhibited new reasoning and generalization abilities beyond RT‑1.
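Treating action prediction as question answering means the action itself has to be written as text. A minimal sketch, assuming the action has already been discretized into integer tokens; the prompt wording and the helper names are illustrative assumptions, not RT‑2's actual interface.

```python
def build_prompt(instruction):
    """Phrase the control problem as a question for a vision-language model."""
    return f"Q: what action should the robot take to {instruction}? A:"


def action_to_text(tokens):
    """Serialize integer action tokens as a space-separated answer string."""
    return " ".join(str(t) for t in tokens)


def text_to_action(text, dims, bins=256):
    """Parse a model answer back into action tokens, rejecting malformed output."""
    parts = text.split()
    if len(parts) != dims:
        raise ValueError(f"expected {dims} tokens, got {len(parts)}")
    tokens = [int(p) for p in parts]  # int() raises ValueError on non-numbers
    if any(t < 0 or t >= bins for t in tokens):
        raise ValueError("token out of range")
    return tokens
```

Because the answer is ordinary text, the same model and the same output vocabulary serve both web‑scale VQA data and robot trajectories during training.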

Open X‑Embodiment unified data from 34 research labs, demonstrating cross‑embodiment skill transfer: behaviors learned on one robot could be transferred zero‑shot to others, especially for language‑described actions.

This era highlighted that leveraging external AI knowledge bases could dramatically accelerate robot research, turning the field from building everything from scratch to adapting powerful pre‑trained models.

Scaling Era

In what Xiao calls the scaling era, model size, data volume, and embodiment complexity all grow at once. DeepMind’s 2025 Gemini Robotics project (building on RT‑2) exemplifies this, integrating massive multimodal data and addressing VLM shortcomings in physical, spatial, and temporal reasoning.

Key innovations include Gemini Robotics ER, an enhanced VLM with embodied reasoning (3D detection, grasp angle prediction), and Gemini Robotics 1.5, which introduces a “think‑first” step where the policy plans in natural language before execution, mirroring recent LLM post‑training trends.

Scaling also brings motion‑transfer capabilities: a single network can transfer motions across disparate platforms (humanoid robots, Franka arms, ALOHA dual‑arm systems) without additional training.

Multiple dimensions now evolve in parallel: model architectures (post‑training fine‑tuning), evaluation pipelines (Sim‑to‑Real, RoboArena, world‑model verification), data strategies (egocentric human video, large‑scale interaction hours), and commercial data‑flywheel loops where deployed robots generate valuable training data.

Researchers now face open problems such as integrating video‑action models, leveraging first‑person human data, applying verifiable‑reward training, and reconciling manipulation versus locomotion paradigms.

World Models / Video Action Models: generative models for physical understanding

First‑person human data and sensor‑rich recordings

Verifiable reward training: bringing RLHF concepts to the physical world

Philosophical split between locomotion and manipulation

In the interview’s closing, Xiao separates the question of a robot “ChatGPT moment” into a product timeline and a technical timeline, suggesting a consumer‑grade embodied AI operating system may emerge within a few years, driven by advances in video‑action modeling and egocentric data.
