Beyond SONIC: Humanoid Robot Cerebellum Hits GPT‑Level Performance with 2 B Motion‑Capture Frames
Galaxy General unveils AstraBrain‑WBC 0.5, a transformer‑based humanoid robot control model trained on datasets ranging from 200 K to 2 billion motion‑capture frames. It reports up to 92.58% tracking success, 0.39 ms inference latency, and a roughly five‑fold speedup over TWIST, results the team presents as confirming a scaling law for robot motion control.
The research addresses a limitation of current humanoid robots, whose controllers are fitted to a limited set of motion samples and struggle with unseen motions and out‑of‑distribution tasks. Galaxy General proposes AstraBrain‑WBC 0.5, a GPT‑style causal Transformer that it presents as the first validation of a scaling law for motion control.
Data Scaling Experiments
Using what is described as the largest motion‑capture dataset assembled to date (2 billion frames drawn from AMASS, LAFAN1, Motion‑X++, PHUMA, MotionMillion and over a thousand hours of self‑collected data), the team trained models at increasing data and model scales. On the AMASS test set, an MLP trained on 200 K frames achieved a 76.89% success rate (SR) and a TCN 81.48%, while AstraBrain‑WBC 0.5‑S (Transformer, same data) reached 83.26%.
When training data grew to 2 billion frames, AstraBrain‑WBC 0.5‑B improved SR to 90.43% and the largest model, AstraBrain‑WBC 0.5‑L, attained 92.58%.
Comparing MPJPE (mean per‑joint position error), the best TCN variant recorded 56.15 mm, whereas AstraBrain‑WBC 0.5‑S achieved 43.25 mm, a reduction of roughly 23%. The Transformer's loss continued to decline without saturating.
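As a rough illustration of the metric, the sketch below computes MPJPE as the mean Euclidean distance between predicted and reference joint positions, and then the relative error reduction implied by the two figures quoted above. The function names and the two-joint toy pose are hypothetical; the article does not describe the team's evaluation code.

```python
import math


def mpjpe(pred, target):
    """Mean per-joint position error: average Euclidean distance over joints.

    pred, target: equally long lists of (x, y, z) joint positions,
    expressed in the same unit (mm here).
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Toy pose with two joints (hypothetical values, in mm).
target = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]
pred = [(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]
toy_error = mpjpe(pred, target)  # joint errors are 5 mm and 12 mm

# Figures quoted in the article: best TCN vs. AstraBrain-WBC 0.5-S.
reduction = relative_reduction(56.15, 43.25)
print(f"toy MPJPE = {toy_error:.2f} mm, quoted reduction = {reduction:.1%}")
```

Note that the same two numbers can also be read the other way round: the TCN error is about 30% higher than the Transformer error, which is a different statement from a 30% reduction.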
Data‑Size Ablation
Fixing model size (Humanoid‑GPT‑B) and varying frame count showed a clear power‑law trend: each ten‑fold increase in data reduced error monotonically, with no inflection point, which the authors take as confirmation of the scaling law in the physical domain.
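A power law of the form error ≈ a · N^(−b) appears as a straight line on log-log axes, so one common way to check for it is a least-squares line fit on the logarithms. The sketch below does this on synthetic points generated from a known exponent; the data, the constants 500 and 0.1, and the function name are all hypothetical and are not taken from the paper.

```python
import math


def fit_power_law(frames, errors):
    """Fit errors ~ a * frames**(-b) by least squares on log10 values.

    Returns (a, b). A positive b means error falls as data grows.
    """
    xs = [math.log10(n) for n in frames]
    ys = [math.log10(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 10 ** intercept, -slope


# Synthetic points from a known power law (NOT the paper's measurements).
frames = [2e5, 2e6, 2e7, 2e8, 2e9]
errors = [500.0 * n ** -0.1 for n in frames]

a, b = fit_power_law(frames, errors)
per_decade = 10 ** (-b)  # factor applied to the error per 10x more data
print(f"a = {a:.1f}, b = {b:.3f}, error x{per_decade:.3f} per decade")
```

On real measurements the fit would not be exact, and the residuals around the line are what reveal a saturation point or inflection if one exists.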
Real‑World Validation
AstraBrain‑WBC 0.5 was deployed on the Galbot G1 robot and evaluated on four unseen dance sequences. Compared against GMT, TWIST, and Any2Track under identical protocols, it consistently matched or outperformed all baselines in MPJPE, despite using raw internet video motions without fine‑tuning.
Optimized with TensorRT and C++ pipelines, inference latency dropped to 0.39 ms, well within the 20 ms budget of a 50 Hz control loop. The source describes this as approximately five times faster than TWIST (2.79 ms), although the two quoted latencies imply a ratio closer to seven. Larger models reportedly ran faster due to specialized kernels for causal attention and MLP fusion.
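The timing figures can be put in context with a little arithmetic. The sketch below uses only the numbers quoted above (50 Hz, 0.39 ms, 2.79 ms) to compute the control period, the share of that period spent on inference, and the latency ratio against TWIST. The helper names are illustrative.

```python
def control_period_ms(rate_hz):
    """Time available per control step, in milliseconds."""
    return 1000.0 / rate_hz


def budget_fraction(latency_ms, rate_hz):
    """Share of one control period consumed by inference."""
    return latency_ms / control_period_ms(rate_hz)


period = control_period_ms(50.0)         # one step of a 50 Hz loop
fraction = budget_fraction(0.39, 50.0)   # AstraBrain-WBC 0.5 latency
twist_ratio = 2.79 / 0.39                # TWIST latency / AstraBrain latency

print(f"period = {period:.1f} ms, inference uses {fraction:.2%} of it")
print(f"TWIST / AstraBrain latency ratio = {twist_ratio:.2f}")
```

Both latencies fit inside a 50 Hz period, so the practical benefit of the lower figure is headroom for the rest of the control stack rather than a higher loop rate as such.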
Architecture and Training Pipeline
The system replaces traditional MLP/TCN backbones with a causal Transformer that employs a temporal causal mask, ensuring only past frames influence predictions. This enables parallel processing of entire sequences during training and low‑latency autoregressive inference.
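The effect of a temporal causal mask can be shown with a minimal single-head attention sketch. It assumes plain scaled dot-product attention with no learned projections, and it implements the mask by scoring only keys at or before the current step. The actual AstraBrain‑WBC layers are not described in the article, so this is illustrative only.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a temporal causal mask.

    Each argument is a list of T vectors (lists of floats). The output at
    step t is a weighted average of values[0..t] only; later steps are
    masked out by never being scored.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)  # causal mask: positions <= t only
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][d] for s, w in enumerate(weights)) / total
            for d in range(len(values[0]))
        ])
    return outputs


# Hypothetical 3-frame sequence with 2-dimensional features.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = causal_attention(seq, seq, seq)

# Replace the last frame: outputs for the first two steps must not change.
altered = seq[:2] + [[9.0, -9.0]]
changed = causal_attention(altered, altered, altered)
print(full[:2] == changed[:2])
```

Because each step depends only on earlier frames, all steps of a recorded sequence can be computed in one pass during training, while at inference time the newest frame is simply appended and attended from.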
Training proceeds in three stages: (a) data curation and Harmonic Motion Embedding (HME) to balance diversity and coverage across ~300 motion clusters; (b) PPO‑based expert policies trained on each cluster (~384 experts); (c) DAgger distillation of all experts into a single Transformer policy. Ablations showed ~384 experts provide the best trade‑off between diversity and training cost.
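The distillation stage can be sketched as a DAgger loop: roll out the current student, relabel the states it visits with the expert's action, aggregate the data, and refit. The toy below is a deliberately simplified stand-in: states are scalars, each "expert" is a hypothetical proportional controller rather than a PPO policy, and the student is a per-cluster linear gain rather than a Transformer. All names and values are assumptions for illustration.

```python
import random


def make_expert(gain):
    """Hypothetical expert: a proportional controller driving x toward 0."""
    return lambda x: -gain * x


def rollout(policy, x0, steps):
    """Run `policy` on toy dynamics x' = x + action; return visited states."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + policy(x)
    return states


def fit_gain(samples):
    """Least-squares fit of action = w * x through the origin."""
    num = sum(x * a for x, a in samples)
    den = sum(x * x for x, _ in samples)
    return num / den if den else 0.0


def dagger(experts, iterations=5, steps=10, seed=0):
    """Distill several experts into one student conditioned on cluster id."""
    rng = random.Random(seed)
    data = {c: [] for c in experts}      # aggregated (state, action) pairs
    student = {c: 0.0 for c in experts}  # student gains, initially untrained
    for _ in range(iterations):
        for c, expert in experts.items():
            x0 = rng.uniform(-1.0, 1.0)
            # 1) Roll out the student so the data covers states it visits.
            states = rollout(lambda x: student[c] * x, x0, steps)
            # 2) Relabel those states with the expert's action and aggregate.
            data[c].extend((x, expert(x)) for x in states)
            # 3) Refit the student on everything collected so far.
            student[c] = fit_gain(data[c])
    return student


experts = {"walk": make_expert(0.5), "dance": make_expert(0.8)}
student = dagger(experts)
print(student)
```

The point of rolling out the student rather than the expert is that the training data then includes the states the student's own mistakes lead to, which plain behavior cloning on expert trajectories would never see.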
The full training consumed ~15 000 GPU‑hours (75% expert training on RTX 4090, 25% distillation on H100). After distillation, only the unified model is needed for deployment.
Broader Implications and Future Work
The study demonstrates that MLP limitations stem from architectural scalability, not data scarcity; switching to Transformers unlocks continuous performance gains as data scales. With 804 M parameters, AstraBrain‑WBC 0.5 approaches the size of early large‑language models (GPT‑1).
Limitations include the lack of visual or semantic understanding. The authors plan to integrate vision‑language‑action (VLA) modalities to create a full embodied foundation model.
Related work includes the SONIC system (1 × 10⁸ frames, MLP‑based) and the Humanoid‑GPT paper (arXiv:2606.03985), accepted at CVPR 2026.
