How ABot‑M0 Achieves Generalist Robot Intelligence with Action Manifold Learning

ABot‑M0 tackles the three long‑standing "Babel Tower" challenges of embodied AI—data fragmentation, inconsistent representations, and training mismatches—by releasing the massive UniACT dataset, introducing Action Manifold Learning for direct action prediction, and designing a plug‑and‑play dual‑path perception architecture that outperforms prior models on multiple robot benchmarks.

Amap Tech

Key Challenges

Embodied AI aims to turn large language models into generalist robot brains, but three fundamental obstacles have hindered progress: (1) Data islands: robotic data are expensive to collect and come in heterogeneous formats (joint angles vs. end-effector poses, single-arm vs. dual-arm setups); (2) Representation chaos: different robots use different action spaces, coordinate systems, and control frequencies, forcing models to memorize many "dialects"; (3) Training mismatch: most vision-language-action (VLA) models are fine-tuned from vision-language models that excel at semantic recognition but lack precise 3D spatial awareness.

Core Contributions

To overcome these obstacles, Amap CV Lab and Alibaba jointly released ABot-M0, a VLA foundation model rebuilt along three dimensions:

Data layer: Construction of UniACT, the largest open-source heterogeneous robot dataset (over 6 million trajectories, 9,500 hours of video, more than 20 robot morphologies). The dataset integrates six public sources, removes roughly 16% of samples as noise, and standardises formats.

Representation layer: Introduction of the Action Manifold Learning (AML) paradigm, which assumes successful robot motions lie on a low-dimensional smooth manifold and shifts prediction from noise denoising to direct action generation.

Architecture layer: A plug-and-play dual-path perception design that fuses semantic features from a Vision-Language Model (VLM) with geometric priors from 3D modules (VGGT and Qwen-Image-Edit) without modifying the VLM backbone.

Data Governance Pipeline

The raw open‑source data are highly noisy (multilingual task descriptions, low‑frame‑rate videos, missing or ambiguous pose data). A three‑stage cleaning pipeline was built:

Language purification: Machine-translate the multilingual instructions into a unified format and fill in missing sub-task descriptions.

Visual quality inspection: Remove black frames, severe motion blur, and occluded views; discard invalid camera angles.

Action verification: Filter out trajectories of abnormal length, eliminate samples with abrupt action jumps, and enforce a unified rotation representation (rotation vectors) to avoid gimbal lock; a minimal filtering sketch follows this list.
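To make the action-verification stage concrete, here is a minimal sketch of such a filter; the thresholds, the 7-D pose layout, and the helper name verify_trajectory are illustrative assumptions rather than the actual pipeline code.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def verify_trajectory(ee_poses, min_len=8, max_len=2000, jump_thresh=0.05):
    """Illustrative action-verification filter (thresholds are assumptions).

    ee_poses: (T, 7) array of end-effector poses [x, y, z, qx, qy, qz, qw].
    Returns (keep, cleaned): rotations are re-expressed as rotation vectors
    to avoid gimbal lock, giving a (T, 6) array [x, y, z, rx, ry, rz].
    """
    T = len(ee_poses)
    if not (min_len <= T <= max_len):                    # drop trajectories of abnormal length
        return False, None
    pos = ee_poses[:, :3]
    steps = np.linalg.norm(np.diff(pos, axis=0), axis=1)
    if np.any(steps > jump_thresh):                      # eliminate abrupt action jumps
        return False, None
    rotvec = R.from_quat(ee_poses[:, 3:7]).as_rotvec()   # unified rotation representation
    return True, np.concatenate([pos, rotvec], axis=1)
```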

Unified Principles for "One Brain, Many Forms"

Action representation unification: Use delta actions in end-effector coordinates; each arm's action is a 7-dimensional vector [Δx, Δy, Δz, Δr, gripper], where Δr is a 3-dimensional rotation vector.

Single-/dual-arm unification: Apply zero-padding so single-arm data are treated as right-arm actions, allowing the model to always output a 14-dimensional dual-arm vector (see the packing sketch after this list).

Task-uniform sampling: Sample uniformly across tasks rather than trajectories, improving long-tail skill coverage and cross-morphology generalisation.
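A minimal sketch of the unified packing described above; the field ordering and the helper name pack_dual_arm_action are assumptions made for illustration.

```python
import numpy as np

def pack_dual_arm_action(right_delta=None, left_delta=None):
    """Pack per-arm 7-D deltas [dx, dy, dz, rx, ry, rz, gripper] into one
    14-D dual-arm vector. Single-arm data go into the right-arm slot and the
    left-arm slot is zero-padded; the exact layout is an assumption."""
    right = np.zeros(7) if right_delta is None else np.asarray(right_delta, dtype=float)
    left = np.zeros(7) if left_delta is None else np.asarray(left_delta, dtype=float)
    return np.concatenate([right, left])

# A single-arm sample becomes a right-arm action with a zeroed left arm.
single_arm = pack_dual_arm_action(right_delta=[0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0])
assert single_arm.shape == (14,)
```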

Action Manifold Learning

AML posits that feasible action sequences occupy a low-dimensional manifold constrained by physics, task goals, and the environment. The model uses a Diffusion Transformer (DiT) as the action generator and directly predicts the clean action chunk \hat{A}_t instead of noise or velocity. The loss remains an MSE in velocity space: the predicted chunk is mapped to a velocity through the Jacobian of the interpolation, so the benefits of flow matching are retained.
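As a worked example, assume a rectified-flow (linear) interpolant between Gaussian noise \epsilon and the ground-truth chunk A; this schedule is our assumption, since the article does not spell it out. Predicting the clean chunk and scoring it in velocity space then amounts to a time-weighted action MSE:

```latex
A_t = (1 - t)\,\epsilon + t\,A, \qquad v = A - \epsilon, \qquad
\hat{v} = \frac{\hat{A}_t - A_t}{1 - t}, \qquad
\mathcal{L} = \lVert \hat{v} - v \rVert^2
            = \frac{1}{(1 - t)^2}\,\lVert \hat{A}_t - A \rVert^2 .
```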

During inference, the model starts from pure noise and iteratively denoises using an ODE solver: each step predicts an action chunk, converts it to velocity, and updates the state, keeping the focus on the intrinsic structure of motions.
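A minimal sketch of that inference loop, again under the linear-interpolant assumption above; predict_action is a hypothetical stand-in for the DiT action generator.

```python
import numpy as np

def aml_inference(predict_action, action_dim=14, chunk_len=16, num_steps=10, seed=0):
    """Euler-step ODE sampler for a-prediction (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    A_t = rng.standard_normal((chunk_len, action_dim))   # start from pure noise
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        A_hat = predict_action(A_t, t)                    # predict the clean action chunk
        v_hat = (A_hat - A_t) / (1.0 - t)                 # convert to velocity via the interpolant
        A_t = A_t + (t_next - t) * v_hat                  # one ODE step toward the action manifold
    return A_t
```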

Dual‑Path Perception

While VLMs provide strong semantic understanding, they lack precise 3D geometry. Two optional 3D modules are introduced:

VGGT: Extracts 3D features from a single RGB image and fuses them with VLM features via cross-attention (outperforming simple concatenation or Q-Former).

Qwen-Image-Edit: Generates additional viewpoints to implicitly capture 3D layout; fine-tuning on just 50 paired samples yields a 14% accuracy boost under camera perturbations.

Both modules are plug‑and‑play, leave the VLM backbone untouched, and can be stacked for flexible deployment.
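The sketch below shows one way the cross-attention fusion of the two paths could look; the module name, dimensions, and residual design are our assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class GeometryFusion(nn.Module):
    """Illustrative dual-path fusion: VLM tokens attend to 3D-geometry tokens."""

    def __init__(self, vlm_dim=1024, geo_dim=768, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(geo_dim, vlm_dim)    # map 3D features (e.g. from VGGT) into VLM space
        self.attn = nn.MultiheadAttention(vlm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vlm_dim)

    def forward(self, vlm_tokens, geo_tokens):
        # vlm_tokens: (B, N, vlm_dim) from the frozen VLM backbone
        # geo_tokens: (B, M, geo_dim) from an optional 3D module
        geo = self.proj(geo_tokens)
        fused, _ = self.attn(query=vlm_tokens, key=geo, value=geo)
        return self.norm(vlm_tokens + fused)       # residual add leaves the VLM backbone untouched
```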

Experimental Results

ABot‑M0 was evaluated on four major benchmarks and consistently outperformed strong baselines:

LIBERO (single-arm): 98.6% average success rate, surpassing π0.5 (96.9%) and OpenVLA-OFT (97.1%).

LIBERO-Plus (robustness): 80.5% success under camera, morphology, and language perturbations, setting a new state of the art.

RoboCasa GR1 (dual-arm desktop tasks): 58.3% success in a high-dimensional action space, far above GR00T-N1.6 and OpenVLA-OFT.

RoboTwin 2.0 (cross-scene generalisation): >80% success in both clean and randomised scenes, markedly better than π0.5 and X-VLA.

Long-sequence prediction experiments on LIBERO-Plus showed that traditional v-prediction models (e.g., GR00T) suffer severe performance drops as action chunk length increases, whereas AML's a-prediction retains >60% success, demonstrating robustness to longer horizons and higher-dimensional action spaces.

Future Outlook

Scale data further by incorporating human‑demonstration trajectories and self‑evolving data engines (model execution → failure analysis → data augmentation → model update).

Integrate additional modalities such as force, tactile, and temperature sensing for truly multimodal perception.

Move 3D representation from post‑hoc injection to pre‑training acquisition via self‑supervised depth and pose estimation.

Extend the "one brain, many forms" vision to legged robots, drones, and full‑size humanoids, abstracting hardware specifics to learn universal physical interaction laws.

Conclusion

ABot‑M0 demonstrates that systematic data engineering, a principled action‑manifold formulation, and modular dual‑path perception can deliver high‑performance, generalist embodied intelligence without relying on private datasets or specialised hardware, paving the way toward truly open‑source, scalable robot AI.

ABot-M0 overview (figure)