Real-World Humanoid RL: LIFT’s Pretraining and On-Device Fine‑Tuning Paradigm

The paper presents LIFT, a framework that combines large-scale off-policy SAC pretraining with fine-tuning inside a physics-informed world model to enable safe, sample-efficient continual reinforcement learning on real humanoid robots. It demonstrates zero-shot deployment and rapid adaptation across diverse target-speed distributions.

Machine Learning Algorithms & Natural Language Processing

Humanoid robots can now dance, run, and even perform backflips, but real-world deployment raises a critical question: can they keep learning safely and efficiently afterward? Traditional Sim2Real pipelines rely on massive domain randomization and on-policy PPO training. The policy is frozen once deployed, and any further learning would require costly, unsafe exploration on the real robot.

Key Challenges

Deterministic data collection and limited data diversity make off‑policy and model‑based RL unstable or slow.

World‑model errors accumulate in high‑dimensional contact dynamics, degrading generated data quality.

Jointly training world models and policies in thousands of parallel simulations is computationally prohibitive.

LIFT Framework

LIFT (Large‑Scale Pretraining and Efficient Fine‑Tuning) addresses these challenges with three insights:

SAC outperforms PPO when data volume and diversity are limited. Its off‑policy nature yields higher sample efficiency, and its stochastic policy promotes richer exploration within the world model.

Large-scale SAC pretraining yields a policy that can be deployed zero-shot on real hardware. Hyper-parameter search with Optuna reduced convergence time on the Booster T1 walking task from 7 hours to under 30 minutes, and scaling up the update-to-data (UTD) ratio, batch size, and replay-buffer size further lowered sample requirements.

Physics-informed world models improve prediction and fine-tuning. An ensemble network is combined with the robot's dynamics (Eq. 2 in the paper) to predict contact forces and their uncertainties (Eq. 3 in the paper). Accelerations are then computed from the dynamics and integrated to obtain the next state.
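
The paper's equations are not reproduced in this summary, so the following is only a minimal sketch of the general idea under stated assumptions: a toy 1-DoF vertical point mass stands in for the humanoid, known physics handles gravity and the actuator force, and an ensemble supplies the contact force, with its spread used as an uncertainty estimate. The ensemble members are hand-written stubs rather than trained networks, and every name, constant, and the semi-implicit Euler integrator is hypothetical.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List, Tuple

GRAVITY = 9.81  # m/s^2


@dataclass
class State:
    height: float    # m, body height above the ground
    velocity: float  # m/s, positive upward


# Each ensemble member maps a state to a predicted vertical contact force in N.
ForceModel = Callable[[State], float]


def make_stub_ensemble(stiffnesses: List[float]) -> List[ForceModel]:
    """Stand-ins for learned networks (assumption): a spring-like ground
    reaction that is active only below a 1.0 m rest height."""
    def make(k: float) -> ForceModel:
        return lambda s: max(0.0, k * (1.0 - s.height))
    return [make(k) for k in stiffnesses]


def physics_informed_step(
    state: State,
    actuator_force: float,
    ensemble: List[ForceModel],
    mass: float = 30.0,
    dt: float = 0.02,
) -> Tuple[State, float]:
    """Predict the next state and a contact-force uncertainty."""
    forces = [member(state) for member in ensemble]
    contact_mean = statistics.fmean(forces)
    contact_std = statistics.pstdev(forces)  # ensemble disagreement
    # Known physics: Newton's second law with gravity.
    accel = (actuator_force + contact_mean) / mass - GRAVITY
    # Semi-implicit Euler: update velocity first, then position.
    velocity = state.velocity + dt * accel
    height = state.height + dt * velocity
    return State(height, velocity), contact_std


ensemble = make_stub_ensemble([2000.0, 3000.0, 4000.0])
next_state, uncertainty = physics_informed_step(State(0.9, 0.0), 0.0, ensemble)
print(next_state, uncertainty)
```

Because the network only predicts the contact force, the integrated state always obeys the known dynamics, which is the kind of inductive bias the summary credits for the model's robustness.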

During real‑world fine‑tuning, LIFT collects a short deterministic data segment, updates the physics‑informed world model, then uses the SAC stochastic policy inside the model to generate synthetic trajectories for actor‑critic updates. The updated policy is returned to the robot for the next iteration, keeping risky exploration confined to the model.
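
The iteration described above can be sketched as a plain control loop. This is an illustrative skeleton only: the four callables, their signatures, the default counts, and the choice to refit the model on all real data gathered so far are assumptions, not details taken from the paper. The stubs at the bottom merely record the call order.

```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float]  # (state, action, next_state), toy 1-D types


def finetune_loop(
    collect_real_segment: Callable[[], List[Transition]],
    fit_world_model: Callable[[List[Transition]], None],
    imagine_rollouts: Callable[[int], List[Transition]],
    update_actor_critic: Callable[[List[Transition]], None],
    iterations: int = 3,
    horizon: int = 20,
    updates_per_iteration: int = 4,
) -> List[Transition]:
    """Run the collect -> fit model -> imagine -> update cycle; return all real data."""
    real_data: List[Transition] = []
    for _ in range(iterations):
        # 1. Short segment on the robot, collected with the deterministic policy.
        real_data.extend(collect_real_segment())
        # 2. Update the world model (assumption: refit on all real data so far).
        fit_world_model(real_data)
        # 3. Exploration happens only inside the model, with the stochastic policy.
        for _ in range(updates_per_iteration):
            synthetic = imagine_rollouts(horizon)
            # 4. Actor-critic update on the synthetic transitions.
            update_actor_critic(synthetic)
        # 5. The next collect_real_segment() call runs the updated policy.
    return real_data


# Stub callables that only record the call order, to show the control flow.
events: List[str] = []


def _collect() -> List[Transition]:
    events.append("collect")
    return [(0.0, 0.0, 0.0)] * 5


def _fit(data: List[Transition]) -> None:
    events.append(f"fit({len(data)})")


def _imagine(horizon: int) -> List[Transition]:
    events.append("imagine")
    return [(0.0, 0.0, 0.0)] * horizon


def _update(batch: List[Transition]) -> None:
    events.append("update")


real = finetune_loop(_collect, _fit, _imagine, _update,
                     iterations=2, updates_per_iteration=2)
print(events)
```

Running several actor-critic updates per real segment is what keeps the amount of real-robot data small: the robot only supplies data for the model, while the policy learns from imagined rollouts.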

Experimental Results

Experiments on Booster T1 and Unitree G1 compare LIFT against PPO and baseline SAC:

Pretraining convergence: In MuJoCo Playground, LIFT's pretraining reward matches or exceeds that of PPO and FastTD3 within the same wall-clock time, enabling zero-shot deployment.

Sample efficiency in Brax fine‑tuning: Across in‑distribution, long‑tail, and out‑of‑distribution speed targets, LIFT converges with ~4×10⁴ environment samples (~800 s of real‑world time) and accurately tracks target speeds.
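
As a quick consistency check on the two figures above, dividing samples by seconds gives the implied sampling rate. The interpretation that one sample corresponds to one control step is an assumption, and the function name is hypothetical.

```python
def implied_sample_rate_hz(num_samples: float, seconds: float) -> float:
    """Samples per second implied by a sample count and an equivalent duration."""
    return num_samples / seconds


# ~4x10^4 environment samples over ~800 s, as quoted in the text.
rate = implied_sample_rate_hz(4e4, 800.0)
print(rate)  # 50.0
```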

Real-world fine-tuning: Starting from a pretrained policy that fails on the real robot, LIFT improves stability and corrects unsafe behaviors using only 80–590 seconds of real data.

Ablation studies show that removing world‑model pretraining slows convergence, while omitting pretraining altogether leads to local optima. Pure ensemble world models produce physically implausible predictions (e.g., abnormal body height), causing critic loss explosion, whereas the physics‑informed model provides stronger inductive bias and robustness.

Conclusion and Future Directions

The results suggest that moving high‑risk exploration into a controllable world model makes real‑world humanoid RL feasible. Scaling this approach requires solving three bottlenecks: reliable state estimation without external motion capture, automated safety and reset mechanisms, and asynchronous data‑collection‑training pipelines to maintain throughput.

[Figure: LIFT framework diagram]
[Figure: Zero-shot deployment on real robot]
[Figure: Training curves in Brax]
[Figure: Real-world fine-tuning process]
[Figure: Pretraining ablation]
[Figure: Physics-informed model ablation]
[Figure: Pretraining effect on Booster T1]
[Figure: Post-fine-tuning performance]
[Figure: Full-body tracking pretraining]