Embodied AI Breakthrough: Beijing Humanoid’s Pelican‑Unify 1.0 Tops WorldArena and Wins Dual Crown
The article details how Beijing Humanoid’s Pelican‑Unify 1.0 model achieved top scores on WorldArena—including a 66.03 overall rating and 98.12% 3D accuracy—by unifying perception, reasoning, imagination and action in a single latent space, marking a milestone for model‑based end‑to‑end embodied intelligence.
Figure AI’s humanoid robot streamed live for dozens of hours on a logistics conveyor, continuously recognizing, grasping and sorting packages using only its onboard visual system, demonstrating sustained embodied intelligence in a production setting.
Model‑Based End‑to‑End Reasoning
Robots must now perceive their environment, understand goals, plan motions and execute tasks in a constantly changing physical world. This shifts the field from preset single‑action programs toward unified agents that carry a task through to completion.
WorldArena Performance
Pelican‑Unify 1.0 achieved the highest composite EWM score, 66.03, and a 3D Accuracy of 98.12% on the WorldArena benchmark. It led both core tracks, earning the first “dual‑crown” for embodied intelligence. The model had earlier topped the WorldArena Data Engine track, receiving a recommendation from Hugging Face and citations from Stanford and Physical Intelligence.
Huisi Kaifu Platform
In March 2025 the “Huisi Kaifu” platform was announced as a “one‑brain‑many‑abilities, one‑brain‑many‑machines” embodied AI system that combines a large‑model‑driven planning brain with a data‑driven skill‑execution cerebellum, enabling a single software stack to run on robotic arms, wheeled robots and humanoids. Pelican‑Unify 1.0, released in May 2026, serves as the unified foundational model for this platform.
Technical Report
Technical report: Pelican‑Unify 1.0: A Unified Embodied Intelligence Model (UEI) for Understanding, Reasoning, Imagination and Action – https://arxiv.org/pdf/2605.15153
Architecture
The model implements three kinds of unification:
Understanding, reasoning and generation share a dense latent variable z, allowing gradients from language, video and action losses to jointly shape the same representation.
A unified encoder based on Qwen3‑VL 4B ingests a multimodal context c_t, the concatenation of past observations o (image frames), the action history a and the current language instruction l, and encodes it into a shared semantic space.
A Unified Future Generator (UFG), a diffusion Transformer (DiT) initialized from Wan2.2‑5B and conditioned on z, jointly denoises future video tokens and low‑level action tokens in a single diffusion process.
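As a rough illustration of the data flow described above, the following Python sketch uses toy stand-ins. `Context`, `encode` and `joint_denoise` are hypothetical names, the arithmetic is a placeholder for Qwen3‑VL and the DiT, and nothing here reflects the report's actual implementation. It only shows the shape of the idea: one latent z conditions a single denoising loop over a sequence that holds both video and action tokens.

```python
import random
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical container for the multimodal context c_t."""
    observations: list  # past image frames, here tiny feature vectors
    actions: list       # action history, one vector per step
    instruction: str    # current language instruction


def encode(ctx, dim=4):
    """Toy stand-in for the unified encoder: fold all modalities into one z."""
    flat = [x for frame in ctx.observations for x in frame]
    flat += [x for act in ctx.actions for x in act]
    flat += [float(ord(ch) % 7) for ch in ctx.instruction]
    z = [0.0] * dim
    for i, x in enumerate(flat):
        z[i % dim] += x
    n = max(len(flat), 1)
    return [v / n for v in z]


def joint_denoise(z, n_video, n_action, steps=10, seed=0):
    """Toy joint denoiser: video and action tokens share ONE noisy sequence
    and are refined together, conditioned on the same z."""
    rng = random.Random(seed)
    tokens = [rng.gauss(0.0, 1.0) for _ in range(n_video + n_action)]
    target = sum(z) / len(z)  # assumed conditioning signal derived from z
    for _ in range(steps):
        # Each step pulls every token halfway toward the z-conditioned target.
        tokens = [t + 0.5 * (target - t) for t in tokens]
    # Split the single sequence back into its two modalities.
    return tokens[:n_video], tokens[n_video:]


ctx = Context(
    observations=[[0.1, 0.2], [0.3, 0.4]],
    actions=[[0.0, 1.0]],
    instruction="pick up the cup",
)
z = encode(ctx)
video_tokens, action_tokens = joint_denoise(z, n_video=6, n_action=2)
print(len(video_tokens), len(action_tokens))  # 6 2
```

The design point the toy preserves is that the split into video and action happens only after denoising, so both outputs are produced by the same process under the same conditioning.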
The reasoning trace τ_t is generated autoregressively by the VLM, projected to z via a linear layer P_ϕ, and then fed to the diffusion model. Three losses act on z:
Language loss 𝓛_text aligns z with task semantics.
Video loss 𝓛_video forces z to predict physical dynamics.
Action loss 𝓛_action anchors z within a feasible control space.
Only representations that satisfy all three pressures survive training, ensuring consistency among understanding, reasoning, imagination and action.
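One way to read "three losses act on z" is that the gradient reaching z is the sum of the three per-loss gradients, so z settles where all of them are jointly small. The sketch below is a toy under stated assumptions: each loss is replaced by a quadratic pull toward a made-up target, the weights are hypothetical (this summary does not say how the report weights the terms), and plain gradient descent stands in for real training.

```python
def quadratic_grad(z, target):
    """Gradient of 0.5 * ||z - target||^2 with respect to z."""
    return [zi - ti for zi, ti in zip(z, target)]


def fit_shared_latent(targets, weights, lr=0.1, steps=200):
    """Gradient descent on a weighted sum of toy losses that all act on one z."""
    dim = len(next(iter(targets.values())))
    z = [0.0] * dim
    for _ in range(steps):
        total_grad = [0.0] * dim
        for name, target in targets.items():
            grad = quadratic_grad(z, target)
            # Every loss contributes to the same gradient on the same z.
            total_grad = [acc + weights[name] * g for acc, g in zip(total_grad, grad)]
        z = [zi - lr * g for zi, g in zip(z, total_grad)]
    return z


# Made-up targets standing in for the text, video and action objectives.
targets = {"text": [1.0, 0.0], "video": [0.0, 1.0], "action": [1.0, 1.0]}
weights = {"text": 1.0, "video": 1.0, "action": 1.0}
z = fit_shared_latent(targets, weights)
print([round(v, 3) for v in z])  # [0.667, 0.667]
```

With equal weights the latent lands at the mean of the three targets. No single toy loss is fully satisfied, which is the small-scale analogue of a representation shaped by all three pressures at once.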
Experimental Results
WorldArena: composite score 66.03, 3D Accuracy 98.12%, indicating near‑perfect spatial modeling.
Across eight VLM benchmarks the model achieved an average score of 64.7, surpassing specialist models with gains of 28.2 on Where2Place and 20.6 on PhyX.
RoboTwin 50‑task dual‑arm benchmark: overall success rate 93.5%; 31 tasks reached at least 95% success, and 15 tasks reached 100%.
Real‑world validation on a UR5e arm and the TianGong humanoid showed zero‑shot and compositional generalization outperforming modular baselines. In compositional tests the model, trained only on atomic tasks, executed unseen task sequences without explicit state‑machine logic.
Action‑conditioned video prediction experiments demonstrated fine‑grained alignment between commanded actions and generated video frames, confirming the model’s ability to imagine future outcomes conditioned on planned motions.
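The RoboTwin figures above also bound how the remaining tasks performed. The short calculation below is a sketch resting on two assumptions not confirmed by the source: that the overall rate is the unweighted mean of the 50 per-task rates, and that the 15 perfect tasks are counted among the 31 tasks at or above 95%. The function name is hypothetical, and rounding in the reported 93.5% is ignored.

```python
def remaining_mean_bounds(n_tasks, overall, n_high, high_floor, n_perfect):
    """Bounds on the mean success rate of tasks below the high-score floor."""
    total = overall * n_tasks            # sum of per-task success rates
    n_rest = n_tasks - n_high            # tasks below the floor
    n_mid = n_high - n_perfect           # high tasks that are not known to be perfect
    high_max = n_perfect + n_mid * 1.0   # high group at its best possible total
    high_min = n_perfect + n_mid * high_floor  # high group at its worst possible total
    return (total - high_max) / n_rest, (total - high_min) / n_rest


low, high = remaining_mean_bounds(
    n_tasks=50, overall=0.935, n_high=31, high_floor=0.95, n_perfect=15
)
print(f"{low:.1%} to {high:.1%}")  # 82.9% to 87.1%
```

Under those assumptions, the 19 tasks below 95% would average somewhere between roughly 83% and 87% success.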
Deployment Highlights
September 2025: “TianGong 2.0” equipped with Huisi Kaifu performed material handling on an unmanned production line at a Cummins engine factory.
October 2025: Huisi Kaifu SDK released as open‑source, enabling academic and industrial partners to build on the unified model.
World AI Conference: the platform coordinated four heterogeneous robots for asynchronous multi‑task collaboration.
Future Outlook
The unified approach embodied by Pelican‑Unify 1.0 provides a concrete answer to the central question of embodied AI: can a single model reliably understand, imagine and act in previously unseen physical scenarios? The authors argue that the multiplicative synergy of shared representation will drive the next wave of physical‑world AI breakthroughs.
