How a Chinese Company Swept the Embodied Intelligence Olympics with Faster, More Precise, Lower‑Data Robotics
A Chinese robotics firm used its self‑developed VLA model to win all three core tasks at Benjie's Embodied Intelligence Olympics (peeling oranges, unlocking doors, and flipping socks), beating industry leader Physical Intelligence with completion times up to 35% faster, roughly 30% fewer training samples, and higher precision in fully autonomous, real‑world scenarios.
For years the robot industry has showcased polished demo videos that hide the difficulty of operating in messy, real‑world environments; Benjie’s Humanoid Olympic Games were created to expose those demos by demanding fully autonomous, zero‑intervention performance on a set of 15 practical challenges.
The competition divides tasks into gold, silver, and bronze difficulty tiers. The three core tasks highlighted here are a gold‑level door‑unlocking challenge, a gold‑level orange‑peeling challenge, and a silver‑level sock‑flipping challenge. All demand millimetre‑level accuracy; an error of just 1–3 mm causes failure.
Star Motion Era (星动纪元) entered the arena with its self‑developed VLA (Vision‑Language‑Action) model and secured first place in all three tasks, beating the recognized industry leader Physical Intelligence (PI), which fielded its closed‑source π*0.6 model. The results were:
Orange‑peeling (gold): completed in 1 min 47 s, 35% faster than PI's 2 min 46 s, and performed bare‑handed with no tools.
Unlocking (gold): completed in 49 s, 25% faster than PI's 66 s, despite heavy glare interfering with vision.
Sock‑flipping (silver): completed in 1 min 04 s using only 120 training samples (32% fewer than PI's 176), a 30% speed improvement over PI.
The VLA model’s advantage stems from three technical innovations:
High sample efficiency: By leveraging large‑scale pre‑training, the model transfers generic visuomotor knowledge to new tasks, enabling the sock‑flipping task to succeed with far fewer examples.
Adaptive visual attention: The model dynamically focuses on critical visual cues (e.g., keyholes under reflective glare), maintaining stable perception in high‑interference environments.
Asynchronous high‑frequency inference with short‑term planning: Instead of generating a single long trajectory, the system predicts the next trajectory chunk before the current one finishes executing, allowing rapid correction of deviations caused by dynamic object changes.
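The third innovation, asynchronous chunked inference, can be illustrated with a minimal sketch. Star Motion Era's actual model and control stack are not public, so `predict_chunk` and `execute_step` below are hypothetical stand‑ins: the sketch only shows the overlap pattern, where inference of the next action chunk (from the latest observation) runs in a background worker while the current chunk executes.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a VLA policy: maps an observation to a short
# chunk of actions (e.g., 8 waypoints). Real interfaces are not public.
def predict_chunk(observation, chunk_len=8):
    time.sleep(0.05)  # simulate inference latency
    return [observation + i for i in range(chunk_len)]

# Hypothetical stand-in for one control-loop tick; returns a new observation.
def execute_step(action):
    time.sleep(0.01)
    return action

def run_async_chunked(initial_obs, num_chunks=3):
    """Overlap execution of the current chunk with inference of the next one."""
    executed = []
    obs = initial_obs
    with ThreadPoolExecutor(max_workers=1) as pool:
        chunk = predict_chunk(obs)  # first chunk must be inferred up front
        for _ in range(num_chunks - 1):
            # Start inferring the NEXT chunk from the latest observation
            # while the current chunk is still being executed.
            future = pool.submit(predict_chunk, obs)
            for action in chunk:
                obs = execute_step(action)
                executed.append(action)
            chunk = future.result()  # ready (or nearly so) by now
        for action in chunk:  # final chunk has no successor to prefetch
            obs = execute_step(action)
            executed.append(action)
    return executed

actions = run_async_chunked(initial_obs=0)  # 3 chunks of 8 actions each
```

Because each new chunk is conditioned on a fresh observation taken mid‑execution rather than on a stale plan, deviations caused by a moving object can be corrected within one chunk boundary instead of after a full trajectory.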
Beyond the competition, Star Motion Era collaborated with Stanford's Chelsea Finn team to release the Ctrl‑World controllable world model, which topped the World Arena leaderboard, ahead of Google and Nvidia, across consistency, trajectory accuracy, depth accuracy, and policy evaluation. Its ERA‑42 humanoid robot, equipped with the VLA model and the open‑source VPP (Video Prediction Policy) framework, has been deployed in logistics, manufacturing, and service scenarios, delivering 70–80% efficiency gains.
Overall, the three‑task sweep at Benjie’s Olympics demonstrates that Star Motion Era’s VLA architecture delivers superior data efficiency, robust perception, and fast reactive control, establishing a new benchmark for embodied intelligence in real‑world robotics.