Control Humanoid Robot Motion with a Sentence or Music via the OMG Framework
OMG is a hierarchical “generation brain + tracking cerebellum” framework for humanoid robots. It pairs a large multimodal dataset with a diffusion‑based network, OMG‑DiT, to synthesize full‑body motions from a single sentence, music clip, or pose, and it reports state‑of‑the‑art performance across text, audio, and motion benchmarks.
