How JD‑Tech’s AnchorDP3 Dominated the CVPR 2025 Dual‑Arm Robotics Challenge
JD‑Tech leveraged large‑model innovations and a novel AnchorDP3 3D diffusion policy to win both stages of the CVPR 2025 dual‑arm manipulation competition, showcasing breakthroughs in synthetic data generation, multimodal perception, and precise trajectory control for embodied AI robots.
The rise of large models has revitalized robotics, and Stanford's "stir‑fry robot" project sparked a global surge of interest in embodied intelligence. Against this backdrop, embodied manipulation has become a key focus for both academia and industry. JD‑Tech's JD‑TFS team won both the first and second stages of the CVPR 2025 RoboTwin dual‑arm manipulation challenge, outperforming teams from Horizon, NIO, Tsinghua, Harbin Institute of Technology, and other leading groups.
CVPR 2025 Challenge Overview
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025, held in the United States, centered on multi‑agent embodied systems in the generative‑AI era. The RoboTwin dual‑arm competition, one of its three core events, used the RoboTwin simulation platform and the Cobot‑Magic physical platform, with a simulated track and an on‑site track addressing the complexity of robot operation in virtual and real environments alike.
The challenge emphasized dual‑arm tasks such as block stacking, phone placement, shoe placement, and mouse placement. Performance was measured by task success rate across randomized simulated environments.
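As a rough illustration of that metric, an evaluation harness in this spirit runs each task over many randomized seeds and reports the mean success rate. The `make_env` and `policy` interfaces below are placeholders, not the RoboTwin API.

```python
# Hypothetical evaluation loop: mean success rate over randomized episodes.
def evaluate(policy, make_env, task, episodes=100):
    successes = 0
    for seed in range(episodes):
        env = make_env(task, seed=seed)  # seed randomizes pose, texture, lighting
        obs = env.reset()
        done, success = False, False
        while not done:
            obs, done, success = env.step(policy(obs))
        successes += int(success)
    return successes / episodes
```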
Real‑World Task Example
In the second simulation stage, the "store shoes based on language instruction" task required the robot to locate a shoe on a cluttered desk, with desk height, textures, and lighting randomized, and then follow the natural‑language command "place the shoe in the box." This coupling of multimodal perception and cross‑modal control brings the task close to real‑world robot work scenarios.
Technical Breakthrough: AnchorDP3
Building on the 3D diffusion policy, JD‑Tech introduced the AnchorDP3 model, which builds task‑centric 3D visual representations and is trained on data spanning diverse backgrounds and objects. The architecture employs a simplified PointNet for point‑cloud feature extraction, a lightweight BERT for language understanding, and a unified diffusion action expert that generates task‑specific action sequences. This modular design yields a small‑parameter, multi‑task, multi‑head model that can be trained end‑to‑end for embodied multimodal operations.
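The following is a minimal sketch, not JD‑Tech's released code, of how such a pipeline fits together: a simplified PointNet global feature, a language embedding (stood in here by a random tensor where a lightweight BERT would sit), and a diffusion head that predicts the noise on an action chunk. All dimensions, the 14‑DoF dual‑arm action space, and the module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SimplePointNet(nn.Module):
    """Per-point MLP followed by max pooling: the 'simplified PointNet' idea."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )
    def forward(self, pts):              # pts: (B, N, 3) point cloud
        feats = self.mlp(pts)            # (B, N, out_dim) per-point features
        return feats.max(dim=1).values   # (B, out_dim) global scene feature

class DiffusionActionHead(nn.Module):
    """Predicts the noise added to an action chunk, conditioned on scene + language."""
    def __init__(self, act_dim=14, horizon=16, cond_dim=256 + 128):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(horizon * act_dim + cond_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim),
        )
    def forward(self, noisy_actions, t, cond):
        # noisy_actions: (B, horizon, act_dim); t: (B, 1) diffusion timestep
        x = torch.cat([noisy_actions.flatten(1), t, cond], dim=-1)
        return self.net(x).view(-1, self.horizon, self.act_dim)

# Usage: 14-DoF dual-arm action chunk (7 joints per arm, an assumption here).
pc_enc = SimplePointNet()
pts = torch.randn(2, 1024, 3)
lang_feat = torch.randn(2, 128)          # stand-in for a lightweight BERT embedding
cond = torch.cat([pc_enc(pts), lang_feat], dim=-1)
noisy = torch.randn(2, 16, 14)
t = torch.rand(2, 1)
eps_pred = DiffusionActionHead()(noisy, t, cond)  # (2, 16, 14) predicted noise
```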
Data construction was optimized in three ways:

1. Action‑expert outputs were discretized into keypoints representing pre‑grasp poses, expanding the trajectory dataset from thousands to hundreds of thousands of samples and greatly improving coverage of randomized settings (see the first sketch after this list).

2. Failed trajectories were retained using a DAgger‑style random‑perturbation method, allowing the model to learn recovery strategies from failure to success (second sketch below).

3. Both joint coordinates and end‑effector coordinates were output, providing direct control of the arm while ensuring precise task‑level positioning, thereby boosting success rates on complex grasping tasks (third sketch below).
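To make item 1 concrete, here is a hypothetical sketch of the keypoint idea: a dense demonstration is reduced to a few semantically meaningful waypoints (such as the pre‑grasp pose), and full trajectories are then regenerated for many randomized scenes by re‑planning between those waypoints. `extract_keypoints`, `plan_between`, and `randomize_scene` are illustrative stand‑ins, not JD‑Tech's actual pipeline.

```python
import numpy as np

def extract_keypoints(traj, grasp_idx, pregrasp_offset=10):
    """Keep only semantically meaningful waypoints from a dense demo."""
    idxs = [0, max(grasp_idx - pregrasp_offset, 0), grasp_idx, len(traj) - 1]
    return traj[idxs]

def plan_between(a, b, steps=20):
    """Linear joint-space interpolation as a stand-in for a real planner."""
    return np.linspace(a, b, steps)

def augment(keypoints, randomize_scene, n_variants=100):
    """Re-anchor keypoints to each randomized scene and re-plan the path."""
    dataset = []
    for _ in range(n_variants):
        kps = randomize_scene(keypoints)  # e.g. shift grasp pose to new object pose
        segs = [plan_between(kps[i], kps[i + 1]) for i in range(len(kps) - 1)]
        dataset.append(np.concatenate(segs))
    return dataset
```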
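Item 2 can be sketched as a perturbed expert rollout: the expert's action remains the training label, but the executed action is occasionally knocked off course, so the dataset contains off‑distribution states followed by the expert's recovery. The `env` and `expert` interfaces and the noise model are assumptions.

```python
import numpy as np

def collect_with_perturbation(env, expert, steps=200, p_perturb=0.1, noise=0.05):
    obs = env.reset()
    traj = []
    for _ in range(steps):
        label = expert(obs)               # expert action is always the training label
        executed = label
        if np.random.rand() < p_perturb:  # occasionally knock the rollout off course
            executed = label + np.random.normal(0.0, noise, size=label.shape)
        traj.append((obs, label))         # keep every transition, even off-course ones
        obs, done, _ = env.step(executed) # the expert then recovers on later steps
        if done:
            break
    return traj
```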
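And item 3 amounts to attaching two output heads to a shared trunk, so the same features are supervised in both joint space and end‑effector space. A minimal sketch, with all dimensions assumed:

```python
import torch.nn as nn

class DualOutputHead(nn.Module):
    """One trunk feature, two supervision targets: joints and end-effector poses."""
    def __init__(self, feat_dim=512, n_joints=14, ee_dim=14):  # 2 arms x 7-DoF (assumed)
        super().__init__()
        self.joint_head = nn.Linear(feat_dim, n_joints)  # direct joint-level control
        self.ee_head = nn.Linear(feat_dim, ee_dim)       # precise task-level positioning
    def forward(self, feat):
        return self.joint_head(feat), self.ee_head(feat)
```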
Supply‑Chain Integration and Future Outlook
In April 2024, JD‑Tech released China's first dual‑arm mobile robot manipulation dataset, offering a benchmark for the industry. Leveraging its extensive supply‑chain infrastructure, JD integrates its large‑model‑driven dialogue agents (Joy Inside) into hardware and collaborates with leading robot brands for the 618 shopping festival, forming a "robotic super‑team."
The CVPR challenge not only provided JD with a high‑quality platform for synthetic‑data training but also bridged the gap between embodied‑intelligence research and practical applications. JD plans to continue attracting talent and driving the robotics industry forward.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
