Control Humanoid Robot Motion with a Sentence or Music via OMG Framework

OMG introduces a hierarchical “generation brain + tracking cerebellum” framework that combines a large multimodal dataset with a diffusion‑based OMG‑DiT network. It lets humanoid robots synthesize full‑body motions from a single sentence, music clip, or pose, and reports state‑of‑the‑art results on text, audio, and motion benchmarks.

Machine Heart

Jun 29, 2026

Industry Pain Point: Lack of Autonomous Interaction in Humanoid Robots

Current humanoid robot motion control relies on pre‑recorded reference trajectories, so robots cannot generate new actions on their own, which limits flexible human‑robot interaction.

OMG‑Data: A Thousand‑Hour Multimodal Motion Dataset

The team built a standardized data‑cleaning pipeline that merges public sources such as AMASS and LAFAN with dance and speech‑gesture data, removes corrupted frames and abnormal joint angles, and uses General Motion Retargeting (GMR) to map heterogeneous data (SMPL models, video‑reconstructed bodies, FBX animations) into the Unitree G1 robot’s action space.

Unlabeled segments are rendered in MuJoCo from multiple viewpoints, annotated with fine‑grained temporal semantics via a vision‑language model, and segmented using text boundaries, musical phrases, and sliding windows to suit short‑term prediction training.
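
The sliding-window part of that segmentation can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name, window length, and stride are hypothetical, since the article does not state the values used.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame-index pairs for fixed-size training windows.

    `end` is exclusive. A trailing remainder shorter than `window` is dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if num_frames < window:
        return []  # clip is too short to yield even one window
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


# Hypothetical numbers: a 10-frame clip cut into 4-frame windows every 3 frames.
print(sliding_windows(10, 4, 3))
```

Overlapping windows (stride smaller than window) multiply the number of short training samples drawn from each clip, which suits a model trained to predict a short horizon.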

Physical feasibility is enforced by simulating each candidate motion and checking body height, tilt, fall frames, and joint limits; samples that fail are discarded. The final OMG‑Data dataset totals 1,174.66 hours, including 1,166.6 h of text‑annotated motions, 958.77 h of human reference motions, and 191.6 h of audio‑paired motions. These subsets overlap (their sum exceeds the total), and all of the data is ready for direct robot training.
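
A feasibility filter of this kind might look like the sketch below. It assumes the simulator has already produced per-frame root height, torso tilt, and joint positions; the `Frame` type, the thresholds, and the rule for counting a "fall frame" are all hypothetical, because the article does not give them.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Frame:
    root_height: float        # pelvis height above the ground, metres
    tilt_rad: float           # torso angle from vertical, radians
    joints: Sequence[float]   # joint positions, radians


def is_feasible(
    frames: Sequence[Frame],
    joint_limits: Sequence[Tuple[float, float]],
    min_height: float = 0.3,     # hypothetical threshold
    max_tilt: float = 1.0,       # hypothetical threshold
    max_fall_frames: int = 0,    # hypothetical tolerance
) -> bool:
    """Reject a simulated motion that breaks joint limits or falls too often."""
    fall_frames = 0
    for frame in frames:
        # Any joint outside its [low, high] range invalidates the whole clip.
        for q, (low, high) in zip(frame.joints, joint_limits):
            if q < low or q > high:
                return False
        # Count frames where the robot is too low or too tilted.
        if frame.root_height < min_height or abs(frame.tilt_rad) > max_tilt:
            fall_frames += 1
    return fall_frames <= max_fall_frames
```

Only clips for which `is_feasible` returns `True` would be kept in the training set under this scheme.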

OMG‑DiT: Extensible Lightweight DiT Motion Generation Backbone

OMG‑DiT implements a “shared backbone + lightweight modality adapters” design, decoupling generic humanoid motion priors from multimodal condition inputs, allowing new control modalities to be added via small adapters without retraining the backbone.

The hierarchical generation‑tracking architecture uses OMG‑DiT to predict the next 60 frames of the Unitree G1 full‑body trajectory from historical states plus text, audio, or pose inputs, while the HoloMotion tracker converts that trajectory into joint commands and handles balance, disturbance rejection, and tracking.
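
The division of labor between generator and tracker can be illustrated with a simple chunked rollout loop. This is a sketch under assumptions: `generate` and `track` are hypothetical stand-ins for OMG-DiT and HoloMotion, and the loop executes each 60-frame chunk in full before replanning, which the article does not confirm.

```python
from typing import Callable, List, Sequence

HORIZON = 60  # frames predicted per generator call, as stated in the article


def rollout(
    generate: Callable[[Sequence[float]], List[float]],  # stand-in for the generator
    track: Callable[[float], float],                      # stand-in for the tracker
    history: Sequence[float],
    num_chunks: int,
) -> List[float]:
    """Alternate between planning a 60-frame chunk and tracking it frame by frame."""
    executed: List[float] = []
    for _ in range(num_chunks):
        plan = generate(history)
        if len(plan) != HORIZON:
            raise ValueError(f"generator must return {HORIZON} frames")
        for target in plan:
            executed.append(track(target))  # tracker yields the achieved state
        # Assumption: the last HORIZON achieved states condition the next chunk.
        history = executed[-HORIZON:]
    return executed
```

Each frame is reduced to a single float here to keep the sketch small; in practice it would be a full-body state vector.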

Training occurs directly in the robot’s 125‑dimensional action space, using a DiT‑based denoising backbone with RoPE positional encoding and temporal self‑attention. Random modality dropout during training and classifier‑free guidance during inference enable seamless switching between single‑modality and multimodal commands.
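
The two mechanisms named above can be sketched as follows. The guidance formula is the standard classifier-free guidance combination; the dropout probability, the guidance scale, and the choice to drop each modality independently are assumptions, as the article gives none of these details.

```python
import random
from typing import Dict, List, Optional, Sequence


def drop_modalities(
    conditions: Dict[str, object], p_drop: float, rng: random.Random
) -> Dict[str, Optional[object]]:
    """Training time: independently replace each condition with None (the null condition)."""
    return {
        name: (None if rng.random() < p_drop else value)
        for name, value in conditions.items()
    }


def cfg_combine(
    pred_uncond: Sequence[float], pred_cond: Sequence[float], scale: float
) -> List[float]:
    """Inference time: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# scale = 1 recovers the conditional prediction; scale > 1 pushes further
# along the direction in which the condition changes the prediction.
print(cfg_combine([0.0, 1.0], [1.0, 1.0], scale=2.0))
```

Because the network sees every subset of conditions during training, the same weights can serve text-only, audio-only, or combined commands at inference.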

For three core control modalities, distinct feature‑injection schemes are used: frozen T5‑Base encodes text semantics injected via cross‑attention; audio and pose signals are aligned frame‑wise, projected through MLPs, and modulated per frame via FiLM to achieve precise rhythm matching and pose replication.
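
For the text branch, cross-attention lets every motion frame read from all text tokens. The sketch below shows plain single-head scaled dot-product cross-attention in NumPy. It is a simplification: it omits the learned query, key, and value projections and multiple heads that a real DiT block would have, and the array shapes are illustrative.

```python
import numpy as np


def cross_attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product cross-attention.

    queries: (T, d)   one row per motion frame
    keys:    (N, d)   one row per text token (e.g. frozen encoder outputs)
    values:  (N, d_v) one row per text token
    returns: (T, d_v) text information gathered for each frame
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                # (T, N)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ values
```

Text has no frame-level alignment with motion, so attention over the whole token sequence is a natural fit; audio and pose already line up frame by frame, which is why per-frame modulation suits them instead.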

The framework’s modality‑extensibility is demonstrated with a Pico VR key‑point teleoperation task, where a zero‑initialization FiLM adapter integrates the new modality without altering pretrained weights, and a few‑shot fine‑tune adapts the model to the task.
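
Why zero initialization leaves the pretrained model untouched can be seen in a small NumPy sketch. It assumes the common FiLM form `(1 + gamma) * h + beta` with a single linear projection per term; the class name and dimensions are hypothetical, and the article's adapters use MLPs rather than one linear layer.

```python
import numpy as np


class ZeroInitFiLMAdapter:
    """Per-frame FiLM modulation whose projections start at zero."""

    def __init__(self, cond_dim: int, hidden_dim: int) -> None:
        # Zero weights give gamma = 0 and beta = 0 at initialization,
        # so the adapter starts out as an identity map on the backbone features.
        self.w_gamma = np.zeros((cond_dim, hidden_dim))
        self.w_beta = np.zeros((cond_dim, hidden_dim))

    def __call__(self, hidden: np.ndarray, cond: np.ndarray) -> np.ndarray:
        """hidden: (T, hidden_dim) backbone features; cond: (T, cond_dim) new-modality features."""
        gamma = cond @ self.w_gamma   # (T, hidden_dim), one scale per frame
        beta = cond @ self.w_beta     # (T, hidden_dim), one shift per frame
        return (1.0 + gamma) * hidden + beta


rng = np.random.default_rng(0)
hidden = rng.normal(size=(60, 8))   # 60 frames of backbone features
cond = rng.normal(size=(60, 3))     # e.g. per-frame key-point features
adapter = ZeroInitFiLMAdapter(cond_dim=3, hidden_dim=8)
print(np.allclose(adapter(hidden, cond), hidden))
```

Fine-tuning then moves the adapter weights away from zero, so the new modality gains influence gradually while the backbone weights stay frozen.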

Comprehensive Experiments Validate Generation Performance and Generality

Evaluations cover head‑to‑head comparison against baselines, few‑shot transfer to downstream tasks, and foundation‑model properties such as scaling and zero‑shot composition. All generated trajectories are executed in simulation by the tracker, and the evaluation measures generation quality, tracking stability, and fall rate.

In multimodal generation benchmarks, OMG‑XL achieves the lowest text‑driven FID (6.03), R‑Precision@1 of 65.43 %, and a fall rate of 0.78 %, outperforming GENMO, HYMotion, Kimodo, and others.
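
To make the R-Precision@1 figure concrete, here is a simplified version of the metric: the fraction of text queries whose nearest motion embedding is the paired one. The article does not describe its retrieval protocol or embedding model, so the function below (Euclidean distance, ranking against every motion in the set) is an illustrative assumption rather than the benchmark's exact procedure.

```python
import math
from typing import Sequence


def r_precision_at_1(
    text_emb: Sequence[Sequence[float]], motion_emb: Sequence[Sequence[float]]
) -> float:
    """Fraction of texts whose closest motion embedding is their own pair (same index)."""
    if len(text_emb) != len(motion_emb) or not text_emb:
        raise ValueError("need equally sized, non-empty embedding lists")
    hits = 0
    for i, text in enumerate(text_emb):
        nearest = min(
            range(len(motion_emb)), key=lambda j: math.dist(text, motion_emb[j])
        )
        hits += int(nearest == i)
    return hits / len(text_emb)
```

A higher value means generated motions are easier to match back to the sentence that produced them.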

For audio‑driven dance, the model attains an audio‑matching FID_k of 40.46 with zero falls, faithfully following diverse musical styles.

In human‑pose retargeting, OMG‑DiT records an MPJPE of 18.84, surpassing traditional GMR, NMR, and OmniRetarget pipelines while delivering stable, robot‑trackable trajectories.
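
MPJPE (mean per-joint position error) is the average Euclidean distance between predicted and reference joint positions over all joints and frames. A minimal sketch follows; the input layout is an assumption, and the article does not state the unit of the reported value, so the result is simply in whatever unit the inputs use.

```python
import math
from typing import Sequence

Joint = Sequence[float]          # (x, y, z)
Pose = Sequence[Joint]           # all joints in one frame


def mpjpe(pred: Sequence[Pose], target: Sequence[Pose]) -> float:
    """Mean Euclidean distance between matching joints across all frames."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of frames")
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        if len(pred_frame) != len(target_frame):
            raise ValueError("frames must have the same number of joints")
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)
            count += 1
    if count == 0:
        raise ValueError("no joints to compare")
    return total / count


# One frame, two joints, with errors of 3 and 4 units.
print(mpjpe([[(0, 0, 0), (1, 0, 0)]], [[(0, 0, 3), (1, 4, 0)]]))
```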

Downstream fine‑tuning shows that adapting the pretrained model with only 1 % of AMASS‑CMU data matches training from scratch on the full dataset, and on the Pico key‑point task, the pretrained model outperforms random initialization, confirming strong cross‑scene and cross‑modality generalization.

Scaling experiments reveal that larger model sizes consistently improve motion generation metrics under fixed data and evaluation conditions, demonstrating that humanoid motion generation benefits from model scaling.

Zero‑shot multimodal composition experiments show the model can fuse unseen text‑audio combinations during inference, preserving semantic coherence and rhythmic alignment, and supporting real‑time modality switching for interactive applications.

Open‑Source Release

The authors provide the full OMG framework, the OMG‑Data dataset, and code repositories (https://github.com/Tsinghua-MARS-Lab/OMG) for reproducibility and further research.
