IPNet: Image‑Point Cloud Network for Accurate and Robust 3D Hand Pose Estimation
IPNet introduces a hybrid Image‑Point Cloud architecture that first extracts 2D visual features with a CNN, projects them into 3D point‑cloud space, and iteratively refines hand pose using a sparse‑anchor “aggregate‑interact‑propagate” scheme, achieving state‑of‑the‑art results on challenging hand‑object datasets.
Introduction
Estimating 3D hand pose from depth data is essential for human‑computer interaction, virtual reality, and augmented reality, but it suffers from occlusion and finger self‑similarity. Prior works treat depth either as a 2D image, discarding the 3D geometric structure, or as an independent 3D point cloud, incurring heavy computation from irregular memory access.
IPNet Architecture
The proposed Image‑Point cloud Network (IPNet) combines both representations. First, the depth map is treated as a 2D image and processed by a fully convolutional 2D‑CNN to learn visual features and produce an initial hand pose estimate.
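The article does not detail how the initial pose is decoded from the 2D‑CNN output, but fully convolutional pose estimators commonly decode per‑joint heatmaps with a soft‑argmax. The sketch below illustrates that standard decoding step only; the function name and shapes are illustrative, not the paper's API.

```python
import numpy as np

def soft_argmax_2d(heatmaps):
    """Decode 2D joint locations from per-joint heatmaps via soft-argmax.

    heatmaps: (J, H, W) array of unnormalized scores, one map per joint.
    Returns a (J, 2) array of (x, y) coordinates in pixel units.
    """
    J, H, W = heatmaps.shape
    flat = heatmaps.reshape(J, -1)
    flat = flat - flat.max(axis=1, keepdims=True)      # numerical stability
    prob = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
    prob = prob.reshape(J, H, W)
    xs = np.arange(W, dtype=np.float64)
    ys = np.arange(H, dtype=np.float64)
    x = (prob.sum(axis=1) * xs).sum(axis=1)            # expectation over x
    y = (prob.sum(axis=2) * ys).sum(axis=1)            # expectation over y
    return np.stack([x, y], axis=1)
```

A sharply peaked heatmap decodes to (approximately) its peak location, while softer maps yield sub‑pixel estimates.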
Next, a 2D‑3D projection module maps the 2D visual features, the re‑parameterized hand pose, and the original point coordinates into an initial point‑cloud feature set. For each point p, the K nearest image features are selected and interpolated based on their 3D Euclidean distances.
The 3D points are then re‑parameterized into a 3D heatmap and a unit direction vector using the initial pose.
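One way to realize this re‑parameterization is to score each point's closeness to each joint and record the unit offset toward it. The linear closeness measure and the `radius` parameter below are hypothetical choices for illustration, not necessarily the paper's formulation.

```python
import numpy as np

def reparameterize_pose(points, joints, radius=0.08, eps=1e-8):
    """Encode an initial pose as per-point heatmaps and unit directions.

    points: (N, 3) point cloud; joints: (J, 3) initial joint estimates.
    radius: hypothetical support radius for the closeness measure
            (1 - d / radius, clipped to [0, 1]) -- an assumed choice.
    Returns heatmap (N, J) and unit directions (N, J, 3) pointing from
    each point toward each joint.
    """
    offset = joints[None, :, :] - points[:, None, :]    # (N, J, 3)
    dist = np.linalg.norm(offset, axis=-1)              # (N, J)
    heatmap = np.clip(1.0 - dist / radius, 0.0, 1.0)    # 1 at the joint
    unit = offset / (dist[..., None] + eps)             # unit direction
    return heatmap, unit
```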
Sparse‑Anchor Iterative Refinement
IPNet introduces a sparse‑anchor “aggregate‑interact‑propagate” paradigm. Estimated hand joints serve as sparse anchors; local neighborhoods around each anchor are built and their features aggregated, dramatically reducing irregular memory access.
SemGCN is employed to propagate information between anchors, modeling long‑range relationships. For every point, the K nearest anchors are queried, their interpolated features are combined with the point’s own features, and the point‑cloud representation is updated. The refined point cloud is finally regressed to a more accurate 3D hand pose.
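The three stages described above can be sketched as follows. The simple radius‑based averaging, the normalized‑adjacency mixing (a stand‑in for SemGCN's learned graph convolution), and the residual fusion are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def refine_point_features(points, point_feats, anchors, adjacency,
                          radius=0.05, k=2, eps=1e-8):
    """One hedged 'aggregate-interact-propagate' iteration (sketch).

    points:      (N, 3) point cloud; point_feats: (N, C) features.
    anchors:     (A, 3) estimated joints used as sparse anchors.
    adjacency:   (A, A) hand-skeleton graph (SemGCN stand-in).
    """
    # aggregate: average point features within `radius` of each anchor
    d = np.linalg.norm(points[:, None, :] - anchors[None, :, :], axis=-1)
    mask = (d < radius).astype(np.float64)              # (N, A)
    counts = mask.sum(axis=0) + eps
    anchor_feats = (mask.T @ point_feats) / counts[:, None]

    # interact: mix anchor features along the skeleton graph
    A_norm = adjacency / (adjacency.sum(axis=1, keepdims=True) + eps)
    anchor_feats = A_norm @ anchor_feats

    # propagate: splat each point's k nearest anchors back onto it
    idx = np.argsort(d, axis=1)[:, :k]
    knn_d = np.take_along_axis(d, idx, axis=1)
    w = 1.0 / (knn_d + eps)
    w = w / w.sum(axis=1, keepdims=True)
    interp = (anchor_feats[idx] * w[:, :, None]).sum(axis=1)
    return point_feats + interp                         # residual update
```

Grouping points around a small, fixed set of anchors is what keeps memory access regular: the expensive neighborhood queries run against A joints rather than N points.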
Experiments
IPNet was evaluated on three challenging datasets, including a hand‑object interaction benchmark. It achieved state‑of‑the‑art performance across all datasets, with a particularly large margin over previous methods in the hand‑object interaction scenario.
Conclusion
The paper demonstrates that fusing depth‑image and point‑cloud representations via IPNet yields efficient and robust 3D hand pose estimation. The anchor‑based iterative correction leverages 3D geometric structure, reduces computational overhead, and improves robustness to occlusion and depth holes.
Network Intelligence Research Center (NIRC)
NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains — intelligent cloud networking, natural language processing, computer vision, and machine learning systems — and is dedicated to solving real‑world problems, building top‑tier systems, and publishing high‑impact research that advances China's network technology.
