DexJoCo: First High‑Difficulty Benchmark with 11 Dexterous Manipulation Tasks Covering Four Core Abilities
DexJoCo, a new MuJoCo‑based benchmark from the Chinese Academy of Sciences, introduces 11 complex dexterous‑hand tasks spanning tool use, bimanual collaboration, long‑horizon execution, and reasoning, and reveals that even state‑of‑the‑art robot learning models still struggle with reliable fine‑grained manipulation.
Recent advances in robot foundation models and dexterous‑hand hardware have shifted robotic manipulation from simple grasping toward complex functional interactions, raising the question of how to systematically evaluate true dexterous capabilities. Existing benchmarks focus on arm‑gripper pick‑and‑place tasks and cannot assess tool use, bimanual coordination, long‑horizon execution, or fine‑grained interaction.
To address this gap, the Institute of Automation of the Chinese Academy of Sciences introduced DexJoCo, a MuJoCo‑based benchmark and toolkit for task‑oriented dexterous manipulation. DexJoCo defines 11 functional tasks that cover four core ability dimensions:
Tool use – e.g., watering a plant, hammering a nail, storing glasses, operating a mouse.
Bimanual collaboration – e.g., assembling with two hands, unlocking a tablet, taking a photo.
Long‑horizon execution – e.g., opening a microwave, placing food, closing the door and starting it.
Reasoning – e.g., solving a Tower of Hanoi step or entering a password based on a language instruction.
Unlike traditional pick‑and‑place benchmarks, DexJoCo emphasizes functional interaction, finger‑level control, task‑sequence understanding, and two‑hand coordination, enabling researchers to probe the limits of dexterous hands in realistic scenarios.
The benchmark provides a complete workflow: task construction → human tele‑operation → trajectory collection → data format conversion → model training → policy evaluation. Human demonstrations are captured with Rokoko Smartgloves for finger motion, HTC Vive Tracker and Base Station for wrist tracking, and a remapping module that transfers human hand motions to an Allegro Hand. The hardware setup costs roughly $2,300, lowering the barrier for collecting high‑quality dexterous data.
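The remapping step can be illustrated with a minimal sketch. The article does not describe how DexJoCo's remapping module works, so the code below substitutes the simplest possible approach: scale each human joint angle and clip it to the robot joint's range. The function names, the four‑joints‑per‑finger layout, and the joint limits are all hypothetical placeholders, not values taken from DexJoCo or from the Allegro Hand specification.

```python
from typing import Dict, List, Tuple

# Hypothetical (lower, upper) limits in radians for one 4-joint robot finger.
# Real limits should be read from the robot's model file, not from this sketch.
ROBOT_FINGER_LIMITS: List[Tuple[float, float]] = [
    (-0.4, 0.4),  # side-to-side (abduction) joint
    (0.0, 1.6),   # base flexion joint
    (0.0, 1.7),   # middle flexion joint
    (0.0, 1.6),   # tip flexion joint
]


def clamp(value: float, lower: float, upper: float) -> float:
    """Restrict value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def retarget_finger(human_angles: List[float], scale: float = 1.0) -> List[float]:
    """Map one finger's human joint angles to robot joint targets.

    Each angle is scaled, then clipped to the matching robot joint range.
    """
    if len(human_angles) != len(ROBOT_FINGER_LIMITS):
        raise ValueError("expected one angle per robot joint")
    return [
        clamp(scale * angle, lower, upper)
        for angle, (lower, upper) in zip(human_angles, ROBOT_FINGER_LIMITS)
    ]


def retarget_hand(human_hand: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Apply the per-finger mapping to every finger in a glove reading."""
    return {finger: retarget_finger(angles) for finger, angles in human_hand.items()}
```

A direct joint‑angle mapping like this ignores the differences in finger length and thumb placement between a human hand and a robot hand, which is why practical retargeting often matches fingertip positions instead.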
The collected data (about 1,100 human tele‑operation trajectories) can be exported to common formats such as LeRobot and Diffusion‑Policy Zarr, allowing direct training and evaluation of models such as ACT, Diffusion Policy, π₀.₅, and GR00T‑N1.5.
Evaluation on DexJoCo shows that even the most advanced robot learning policies still face significant challenges. Success rates drop when visual conditions (camera angle, lighting, table texture) change, and failures are frequent in bimanual, insertion, and button‑pressing tasks. Models often succeed at the initial grasp but become unstable during fine interaction steps such as precise button pressing, accurate hole insertion, or sustained tool handling, and they may lose track of the task sequence in long‑horizon scenarios.
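A robustness comparison of this kind reduces to counting successes per visual condition and subtracting from a baseline. The sketch below shows that arithmetic only. The condition labels, the function names, and the episode outcomes are invented for illustration and are not DexJoCo results or part of its toolkit.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rates(episodes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful episodes for each condition label."""
    counts = defaultdict(lambda: [0, 0])  # condition -> [successes, total]
    for condition, succeeded in episodes:
        counts[condition][0] += int(succeeded)
        counts[condition][1] += 1
    return {cond: wins / total for cond, (wins, total) in counts.items()}


def drop_from_baseline(rates: Dict[str, float], baseline: str = "nominal") -> Dict[str, float]:
    """Return how far each condition's success rate falls below the baseline."""
    base = rates[baseline]
    return {cond: base - rate for cond, rate in rates.items() if cond != baseline}


# Made-up outcomes purely for illustration; NOT measurements from DexJoCo.
example_episodes = [
    ("nominal", True), ("nominal", True), ("nominal", False), ("nominal", True),
    ("lighting", True), ("lighting", False),
]
example_rates = success_rates(example_episodes)
example_drop = drop_from_baseline(example_rates)
```

Reporting the drop per perturbation, rather than a single averaged score, keeps it visible which visual change hurts a policy most.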
These results indicate a substantial gap between current robot policies and stable, reliable human‑level dexterous manipulation, highlighting the need for better unified modeling of vision, language, touch, and high‑dimensional hand actions.
DexJoCo’s broader goal is not to produce a leaderboard but to offer a standardized, reproducible, and extensible platform that helps the community answer key questions: where do dexterous hands truly outperform simple grippers, can current VLA models handle high‑dimensional hand spaces, what data‑capture methods best support dexterous tasks, and how should task design drive progress toward human‑level robot operation.
