Artificial Intelligence 17 min read

Embodied AI Breakthrough: Beijing Humanoid’s Pelican‑Unify 1.0 Tops WorldArena and Wins Dual Crown

The article details how Beijing Humanoid’s Pelican‑Unify 1.0 model achieved top scores on WorldArena—including a 66.03 overall rating and 98.12% 3D accuracy—by unifying perception, reasoning, imagination and action in a single latent space, marking a milestone for model‑based end‑to‑end embodied intelligence.

Machine Heart

May 16, 2026

Figure AI’s humanoid robot streamed live for dozens of hours on a logistics conveyor, continuously recognizing, grasping and sorting packages using only its onboard visual system, demonstrating sustained embodied intelligence in a production setting.

Model‑Based End‑to‑End Reasoning

Robots must now perceive their environment, understand goals, plan motions and execute tasks in a constantly changing physical world. This marks a shift from preset single‑action programs to unified agents that complete whole tasks.

WorldArena Performance

Pelican‑Unify 1.0 achieved the highest composite EWM score of 66.03 and a 3D Accuracy of 98.12% on the WorldArena benchmark, leading both core tracks and earning the first “dual‑crown” for embodied intelligence. The model had earlier topped the WorldArena Data Engine track, receiving Hugging Face’s recommendation and citations from Stanford and Physical Intelligence.

Huisi Kaifu Platform

In March 2025 the “Huisi Kaifu” platform was announced as a “one‑brain‑many‑abilities, one‑brain‑many‑machines” embodied AI system that combines a large‑model‑driven planning brain with a data‑driven skill‑execution cerebellum, enabling a single software stack to run on robotic arms, wheeled robots and humanoids. Pelican‑Unify 1.0, released in May 2026, serves as the unified foundational model for this platform.

Technical Report

Technical report: Pelican‑Unify 1.0: A Unified Embodied Intelligence Model (UEI) for Understanding, Reasoning, Imagination and Action – https://arxiv.org/pdf/2605.15153

Architecture

The model implements three kinds of unification:

Understanding, reasoning and generation share a dense latent variable z, allowing gradients from language, video and action losses to jointly shape the same representation.

A unified encoder based on Qwen3‑VL 4B ingests a multimodal context c_t that concatenates past observations o (image frames), the action history a, and the current language instruction l, encoding them into a shared semantic space.

A Unified Future Generator (UFG) uses a diffusion Transformer (DiT) initialized from Wan2.2‑5B and conditioned on z to jointly denoise and generate future video tokens and low‑level action tokens in a single diffusion process.
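The data flow described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the names `Context`, `encode` and `generate_future` are hypothetical, the "encoder" is a simple mean pool rather than Qwen3‑VL, and the "denoiser" is a plain contraction toward z rather than a diffusion Transformer. The only point it shows is structural: one context c_t becomes one latent z, and video and action tokens sit in a single sequence refined by the same loop.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class Context:
    """Multimodal context c_t (toy version; field names are hypothetical)."""
    observations: List[Vector]  # o: one feature vector per past image frame
    actions: List[Vector]       # a: one vector per past action
    instruction: str            # l: current language instruction


def encode(ctx: Context, dim: int = 4) -> Vector:
    """Toy stand-in for the unified encoder: pool every input into one latent z."""
    # Crude text features: word lengths, truncated or zero-padded to `dim`.
    text_vec = [float(len(word)) for word in ctx.instruction.split()][:dim]
    text_vec += [0.0] * (dim - len(text_vec))
    tokens = ctx.observations + ctx.actions + [text_vec]
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def generate_future(z: Vector, n_video: int, n_action: int,
                    steps: int = 10, seed: int = 0) -> Tuple[List[Vector], List[Vector]]:
    """Toy joint refinement: video and action tokens share one sequence and one loop."""
    rng = random.Random(seed)
    tokens = [[rng.gauss(0.0, 1.0) for _ in z] for _ in range(n_video + n_action)]
    for _ in range(steps):
        # Each step moves every token halfway toward the conditioning latent z.
        # A real DiT would instead predict a denoising direction for all tokens.
        tokens = [[x + 0.5 * (zi - x) for x, zi in zip(tok, z)] for tok in tokens]
    return tokens[:n_video], tokens[n_video:]


ctx = Context(
    observations=[[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.5]],
    actions=[[0.0, 0.1, 0.0, 0.1]],
    instruction="pick up the red cup",
)
z = encode(ctx)
video_tokens, action_tokens = generate_future(z, n_video=3, n_action=2)
print(len(z), len(video_tokens), len(action_tokens))
```

Because both token types come out of the same loop, nothing in this sketch lets the predicted video and the predicted actions drift apart, which is the property the single diffusion process is meant to provide.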

The reasoning trace τ_t is generated autoregressively by the VLM, projected to z via a linear layer P_ϕ, and then fed to the diffusion model. Three losses act on z:

Language loss 𝓛_text aligns z with task semantics.

Video loss 𝓛_video forces z to predict physical dynamics.

Action loss 𝓛_action anchors z within a feasible control space.

Only representations that satisfy all three pressures survive training, ensuring consistency among understanding, reasoning, imagination and action.
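One common way to let several losses shape a single representation is a weighted sum, for example 𝓛 = λ_text·𝓛_text + λ_video·𝓛_video + λ_action·𝓛_action. The summary above does not state how the three losses are combined or weighted, so the weights and the quadratic stand‑in losses below are assumptions for illustration only. The sketch shows the basic effect: gradient descent on the summed loss drives z to a compromise point that no single objective would pick on its own.

```python
from typing import List

Vector = List[float]


def grad_sq_dist(z: Vector, target: Vector) -> Vector:
    """Gradient of the stand-in loss 0.5 * ||z - target||^2 with respect to z."""
    return [zi - ti for zi, ti in zip(z, target)]


def train_shared_latent(targets: List[Vector], weights: List[float],
                        lr: float = 0.1, steps: int = 500) -> Vector:
    """Gradient descent on a weighted sum of quadratic losses over one shared z."""
    z = [0.0] * len(targets[0])
    for _ in range(steps):
        total = [0.0] * len(z)
        for w, target in zip(weights, targets):
            g = grad_sq_dist(z, target)
            total = [acc + w * gi for acc, gi in zip(total, g)]
        z = [zi - lr * gi for zi, gi in zip(z, total)]
    return z


# Hypothetical optima that each loss would prefer if it acted alone.
text_target = [1.0, 0.0]
video_target = [0.0, 1.0]
action_target = [1.0, 1.0]

z_star = train_shared_latent(
    targets=[text_target, video_target, action_target],
    weights=[1.0, 1.0, 1.0],  # assumed equal weights (lambda values)
)
print(z_star)
```

With equal weights the result is the mean of the three targets, roughly (0.667, 0.667). Raising one weight pulls z toward that objective's preferred point, which is why the relative weighting matters in any multi‑loss setup of this kind.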

Experimental Results

WorldArena: composite score 66.03, 3D Accuracy 98.12%, indicating near‑perfect spatial modeling.

Across eight VLM benchmarks the model achieved an average score of 64.7, surpassing specialist models with gains of 28.2 on Where2Place and 20.6 on PhyX.

RoboTwin 50‑task dual‑arm benchmark: overall success rate 93.5%; 31 tasks ≥ 95% success, 15 tasks reached 100%.
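The RoboTwin figures allow a rough consistency check. The sketch below rests on two assumptions that the summary does not confirm: that the 93.5% overall rate is an unweighted mean over the 50 tasks, and that the 15 tasks at 100% are counted among the 31 tasks at or above 95%. Under those assumptions it bounds the mean success rate of the remaining 19 tasks. The function name is hypothetical.

```python
from typing import Tuple


def remaining_mean_bounds(n_tasks: int, overall: float, n_high: int,
                          high_floor: float, n_perfect: int) -> Tuple[float, float]:
    """Bound the mean success rate of the tasks below `high_floor`.

    Assumes `overall` is an unweighted per-task mean and that the perfect
    tasks are a subset of the `n_high` tasks at or above `high_floor`.
    """
    total = overall * n_tasks
    n_rest = n_tasks - n_high
    # Best case for the top group: all n_high tasks at 100%.
    lower = (total - n_high * 1.0) / n_rest
    # Worst case for the top group: perfect tasks at 100%, the rest at the floor.
    top_min = n_perfect * 1.0 + (n_high - n_perfect) * high_floor
    upper = (total - top_min) / n_rest
    return lower, upper


share_high = 31 / 50      # fraction of tasks at or above 95%
share_perfect = 15 / 50   # fraction of tasks at 100%
lower, upper = remaining_mean_bounds(50, 0.935, 31, 0.95, 15)
print(f"{share_high:.0%} of tasks >= 95%, {share_perfect:.0%} at 100%")
print(f"remaining 19 tasks average between {lower:.1%} and {upper:.1%}")
```

Under these assumptions the 19 tasks below 95% would average between about 82.9% and 87.1%, so the weaker tasks still succeed most of the time rather than dragging the mean down with outright failures.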

Real‑world validation on a UR5e arm and the TianGong humanoid showed zero‑shot and compositional generalization outperforming modular baselines. In compositional tests the model, trained only on atomic tasks, executed unseen task sequences without explicit state‑machine logic.

Action‑conditioned video prediction experiments demonstrated fine‑grained alignment between commanded actions and generated video frames, confirming the model’s ability to imagine future outcomes conditioned on planned motions.

Deployment Highlights

September 2025: “TianGong 2.0” equipped with Huisi Kaifu performed material handling on an unmanned production line at a Cummins engine factory.

October 2025: Huisi Kaifu SDK released as open‑source, enabling academic and industrial partners to build on the unified model.

World AI Conference: the platform coordinated four heterogeneous robots for asynchronous multi‑task collaboration.

Future Outlook

The unified approach embodied by Pelican‑Unify 1.0 provides a concrete answer to the central question of embodied AI: can a single model reliably understand, imagine and act in previously unseen physical scenarios? The authors argue that the multiplicative synergy of shared representation will drive the next wave of physical‑world AI breakthroughs.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

benchmark Embodied AI robotics Multimodal Learning Unified Model WorldArena Pelican-Unify
