OpenHLM Enables Whole‑Body Loco‑Manipulation for Humanoid Robots

OpenHLM is an open‑source Vision‑Language‑Action (VLA) recipe that lets humanoid robots coordinate arms, torso, legs, and feet under vision‑language commands. It combines decoupled whole‑body teleoperation, multi‑step flow generation, and low‑cost data sources to achieve strong whole‑body loco‑manipulation performance on the HLM‑12 benchmark.

Machine Heart

Motivation

Everyday human activities require coordinated whole‑body motions such as squatting to pick up low objects, stepping on a trash‑can pedal, and synchronizing arm grasping with leg locomotion. For humanoid robots, this means the robot must be treated as an integrated motion system rather than a simple "arm + mobile platform".

Key System Requirements

Full‑body involvement – the VLA policy must be able to activate arms, torso, knees and feet, enabling actions like squatting to retrieve objects or using a foot to press a pedal.

Language‑driven – a single model should execute diverse tasks from different language commands without retraining per task.

Low‑cost data extensibility – besides expensive full‑body teleoperation data, the system should leverage cheaper sources such as stationary teleoperation or HuMI (hand‑held demonstrations without a robot) to expand capabilities.

Experiment 1 – Whole‑Body Controller and Teleoperation Interface

The design of the teleoperation interface determines which degrees of freedom are exposed to the model. Three factors were found to produce the most effective data for subsequent VLA training:

Decoupled upper‑body / lower‑body control, allowing independent manipulation of torso/arms and legs.

A VR 3‑point interface that captures head, hand and foot positions.

A high‑dimensional SMPL‑based human pose representation that maps directly to robot joint space (joint‑based full‑body teleoperation).

These choices maximize the richness of the supervision signal for whole‑body learning.
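To make the decoupled idea concrete, here is a minimal sketch of how two independently produced command streams could be merged into one whole‑body joint target. The joint names and the `merge_whole_body` helper are hypothetical and are not taken from the OpenHLM code; the source only states that upper and lower body are controlled independently.

```python
from typing import Dict


def merge_whole_body(upper: Dict[str, float], lower: Dict[str, float]) -> Dict[str, float]:
    """Combine independently produced upper- and lower-body joint targets.

    Both inputs map a joint name to a target angle in radians. The two
    halves must not share joints; otherwise they would not be decoupled.
    """
    overlap = upper.keys() & lower.keys()
    if overlap:
        raise ValueError(f"joints claimed by both halves: {sorted(overlap)}")
    return {**upper, **lower}


# Hypothetical joint names, for illustration only.
upper_targets = {"torso_pitch": 0.10, "left_elbow": -0.40, "right_elbow": -0.35}
lower_targets = {"left_knee": 0.80, "right_knee": 0.80}

whole_body = merge_whole_body(upper_targets, lower_targets)
```

Keeping the two halves as separate inputs means either stream can be replaced (for example, a different leg controller) without touching the other.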

Experiment 2 – Transferring Existing VLA to Humanoid Action Space

Many Vision‑Language‑Action (VLA) models are pretrained on fixed‑arm or wheeled dual‑arm platforms, whose action spaces are lower‑dimensional than those of humanoids. The study evaluated three design dimensions and reports the following findings:

Pretraining on non‑humanoid data still provides useful priors for humanoid tasks.

The specific action format (e.g., joint values vs. end‑effector poses) and the body‑perception input (e.g., proprioceptive signals) have limited impact on final performance; no single choice is a bottleneck.

A multi‑step flow that generates actions sequentially outperforms a single‑step generation scheme.

OpenHLM adopts the following recipe based on these findings: retain non‑humanoid pretraining, keep body‑perception inputs, output absolute joint values, and employ a multi‑step flow for action synthesis.
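The difference between single‑step and multi‑step generation can be sketched as follows, assuming that "multi‑step flow" refers to flow‑matching‑style sampling in which an action is produced by integrating a velocity field over several Euler steps. That reading is an assumption, and `sample_actions`, `toy_velocity`, and `TARGET` are hypothetical names; a real policy would use a learned network conditioned on images, language, and proprioception in place of the toy field.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_actions(
    velocity_fn: Callable[[Vector, float], Vector],
    action_dim: int,
    num_steps: int,
    rng: random.Random,
) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.

    num_steps=1 corresponds to single-step generation; larger values give
    multi-step generation.
    """
    # Start from Gaussian noise, as in flow-matching samplers.
    x = [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned model: absolute joint targets to reach.
TARGET = [0.2, -0.4, 0.9]


def toy_velocity(x: Vector, t: float) -> Vector:
    """Straight-line field that carries any sample to TARGET at t=1."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


single_step = sample_actions(toy_velocity, 3, num_steps=1, rng=random.Random(0))
multi_step = sample_actions(toy_velocity, 3, num_steps=10, rng=random.Random(0))
```

In this toy field the path is a straight line, so one step and ten steps both land on `TARGET`. With a learned velocity field the path is generally not straight, which is where the number of integration steps can matter.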

Experiment 3 – Low‑Cost Data for Task Expansion

Full‑body teleoperation yields high‑quality supervision but is costly and slow to scale. OpenHLM introduces two cheaper data streams:

Stationary teleoperation, where the robot remains fixed while the operator demonstrates upper‑body motions.

HuMI (Hand‑held Manipulation Imitation), a hand‑held demonstration dataset collected without a robot.

Joint training with these sources enables the VLA to generalize to new objects and commands. Although HuMI data exhibit a visual and motion domain gap relative to real robot data, they still improve performance on novel tasks. Extending to entirely new motion patterns remains challenging under the current setup.
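One simple way to train jointly on several data sources is to draw each training example from a weighted mixture. The sketch below illustrates that idea only; the source names, the weights in `SOURCE_WEIGHTS`, and the `sample_batch_sources` helper are hypothetical, and the article does not state which mixing ratios OpenHLM uses.

```python
import random
from collections import Counter
from typing import Dict, List

# Hypothetical mixing weights; the real ratios are not given in the article.
SOURCE_WEIGHTS: Dict[str, float] = {
    "whole_body_teleop": 0.5,   # expensive, highest-quality supervision
    "stationary_teleop": 0.3,   # robot stays in place, upper-body motions only
    "humi": 0.2,                # hand-held demonstrations, no robot
}


def sample_batch_sources(
    weights: Dict[str, float], batch_size: int, rng: random.Random
) -> List[str]:
    """Pick the data source for each example in a batch, proportional to weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)


rng = random.Random(0)
batch = sample_batch_sources(SOURCE_WEIGHTS, batch_size=1000, rng=rng)
counts = Counter(batch)
```

Inspecting `counts` shows roughly how many examples each source contributes to a batch, which makes it easy to check that a cheap source does not crowd out the whole‑body teleoperation data.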

Benchmark – HLM‑12 Task Suite

The HLM‑12 benchmark comprises twelve tasks covering four categories of whole‑body loco‑manipulation:

Basic walk‑and‑place combinations.

Torso‑extended reach (e.g., squatting to pick up low objects).

Foot‑based actions (e.g., stepping on a pedal before placing an item).

Constrained‑environment operations (e.g., pushing a cart while holding the handle).

This suite provides a comprehensive evaluation platform for assessing full‑body capabilities.

Final Comparison with Baselines

A long‑horizon language‑conditioned task requires the robot to fetch specified fruits from two tables of different heights and place them on high shelves, involving repeated walking, posture adjustment, grasping, placing, turning, and high‑reach actions. Results:

OpenHLM trained with HuMI data completes the task in less than half the time of strong baselines GR00T N1.6 and Ψ0.

Average task progress: OpenHLM 87.5 % vs. GR00T N1.6 57.5 % and Ψ0 48.8 %.

Performance approaches the oracle full‑body teleoperation level of 97.5 %.

Open Roadmap

OpenHLM does not claim to be a finished solution; instead it provides an open experimental roadmap addressing:

How to collect full‑body behavior data.

How to adapt VLA models to the high‑dimensional humanoid action space.

How to use low‑cost data streams for task expansion.

How to evaluate these capabilities with a unified benchmark.

Researchers aiming to build general‑purpose humanoid manipulation systems can use this recipe as a starting point.

Paper: OpenHLM: An Empirical Recipe for Whole‑Body Humanoid Loco‑Manipulation (https://arxiv.org/abs/2606.22174)

Project website: https://openhlm-project.github.io/
