Camera-Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D-1D Registration
Camera-space hand mesh recovery (CMR) leverages semantic aggregation of 2D cues and adaptive 2D‑1D registration to predict absolute 3D hand pose and shape directly in camera coordinates, improving accuracy on benchmarks such as FreiHAND, RHD, and Human3.6M.
Introduction
Today we introduce a work by Kuaishou Y‑tech accepted to CVPR 2021, Camera‑Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D‑1D Registration. Its main contributions are semantic aggregation of 2D cues and multi‑dimensional (2D‑1D) registration, which together enable hand‑mesh reconstruction directly in camera space.
1. Background
For virtual‑interaction tasks we study 3D hand reconstruction, known as hand‑mesh recovery: a mesh encodes both pose and shape. Because multi‑view images or 3D sensors are rarely available in practice, recovering a 3D mesh from a single 2D image has strong practical value. Human hands also provide strong structural priors, so the core problem is learning the relationship between image features, geometric shape, and hand kinematics.
2. Motivation
Predicting absolute 3D scale from a monocular image is ambiguous. In hand reconstruction there are two notions of absolute 3D scale: (1) the absolute coordinate of a point in camera space, and (2) the length of hand bones. Both are fundamentally unsolvable without additional assumptions, so most methods fix a root (the wrist for hands) and predict geometry in a root‑relative space. For many downstream tasks (e.g., hand‑object interaction) absolute camera‑space information is required, and because bone length varies little, recovering the root position in camera space (root recovery) becomes the primary goal.
Typical 2D‑to‑3D pipelines process the whole image with convolutions, generating many irrelevant 2D features. Feature selection often relies on 2D joint landmarks or silhouettes, yet the exact impact of these 2D priors on 3D mesh and root recovery has not been clearly analyzed. We therefore investigate the influence of 2D priors and design better ones.
3. Method
Figure 1: Overall CMR framework
CMR consists of three stages:
Prior stage (2D cue extraction): An hourglass‑style encoder‑decoder predicts 2D hand pose (21 keypoints) and silhouette.
Mesh generation stage (3D mesh recovery): The 2D cues are fused with shallow image features, down‑sampled again, and two decoders produce refined 2D attributes and a 3D mesh in root‑relative space.
Global registration stage (global mesh registration): The spatial relationship between 2D cues and the 3D mesh is used to optimize the absolute camera‑space root coordinate.
3.1 Semantic Aggregation
Silhouettes can be treated as 2D shape heatmaps, and 2D pose is usually regressed via heatmaps for each joint. Each joint heatmap encodes location and semantics, but standard heatmap regression loses the relational information between joints. We propose a semantic aggregation method that combines multiple 2D cues to better support 3D mesh reconstruction. The design includes:
Heatmaps for each 2D pose joint.
Heatmap for the silhouette.
Concatenation of the above heatmaps.
A compact representation that keeps location while discarding semantics.
Fusion of the compact representation with the aggregated heatmaps.
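The steps above can be sketched in NumPy. This is an illustrative fusion, not the paper's implementation; the function name, channel counts, and the use of a channel‑wise max as the "compact" location‑only map are assumptions:

```python
import numpy as np

def aggregate_cues(joint_heatmaps, silhouette):
    """Fuse per-joint heatmaps, the silhouette, and a compact location map.

    joint_heatmaps: (21, H, W), one heatmap per 2D hand joint.
    silhouette:     (H, W), binary or soft hand mask.
    Returns a (23, H, W) stack: 21 joint maps + silhouette + compact map.
    """
    stacked = np.concatenate([joint_heatmaps, silhouette[None]], axis=0)
    # Channel-wise max keeps "where" responses fire but drops "which joint"
    # fired them -- location preserved, per-joint semantics discarded.
    compact = stacked.max(axis=0, keepdims=True)
    return np.concatenate([stacked, compact], axis=0)
```

In a real network the fused stack would be concatenated with shallow image features before the mesh-generation stage.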
Figure 2: Sub‑pose representation of high‑level semantics
We emphasize the sub‑pose concept, which we find emerges spontaneously in the encoder even when only the 2D pose heatmap is used. Sub‑poses are expressed in three ways: part‑based, level‑based, and tip‑based. Tip‑based aggregation is motivated by (1) empirical observations that tip‑based sub‑poses appear most often, and (2) the importance of fingertip positions under inverse‑kinematics constraints.
3.2 3D Spiral Decoder
We adopt the SpiralConv graph convolution for mesh decoding and introduce three improvements:
Inception Spiral Module (ISM): An Inception‑style layer built on SpiralConv enlarges the receptive field for mesh data.
Multi‑scale fusion / supervision: A pyramid of mesh features is constructed; predictions at different scales are supervised and fused by addition.
Autoregressive connection: Features at each scale are concatenated with the prediction at that scale.
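A minimal sketch of the SpiralConv gather and an Inception‑style variant with parallel spiral lengths. The spiral index layout and weight shapes here are assumptions for illustration; the paper's ISM may combine branches differently:

```python
import numpy as np

def spiral_conv(feats, spiral_idx, W):
    """One SpiralConv layer: gather each vertex's features along a precomputed
    spiral sequence of neighbors, flatten, and apply a shared linear map.

    feats:      (V, C)   per-vertex features.
    spiral_idx: (V, L)   spiral neighbor indices (vertex itself first).
    W:          (L*C, C_out) shared weight matrix.
    """
    V = feats.shape[0]
    gathered = feats[spiral_idx]          # (V, L, C)
    return gathered.reshape(V, -1) @ W    # (V, C_out)

def inception_spiral(feats, spiral_idx, branches):
    """Inception-style module: parallel SpiralConvs over spiral prefixes of
    different lengths (different receptive fields), concatenated channel-wise.

    branches: list of (length, weight_matrix) pairs, one per parallel branch.
    """
    outs = [spiral_conv(feats, spiral_idx[:, :L], W) for L, W in branches]
    return np.concatenate(outs, axis=1)
```

Longer spirals see farther along the mesh surface, so concatenating branches of different lengths enlarges the receptive field without deepening the decoder.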
Figure 3: Spiral decoder with ISM, multi‑scale, and autoregressive design
3.3 Adaptive 2D‑1D Registration
Mesh reconstruction is performed in root‑relative space; however, monocular images alone cannot predict absolute camera‑space coordinates. Classical PnP exploits the relationship between 2D observations and 3D geometry to infer camera pose. We use the spatial relationship between 2D pose, silhouette, and the 3D mesh to optimize the absolute root coordinate.
Given camera parameters, a joint‑vertex matrix, and the mesh vertices in root‑relative space, we project the mesh to obtain 2D keypoints that correspond one‑to‑one with the network‑predicted 2D pose. The 2D optimization objective is a least‑squares reprojection error.
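A simplified sketch of the 2D objective, assuming known intrinsics and solving only the root translation t (the paper additionally optimizes a scale s via quadratic programming). Each perspective constraint fx·(Xx+tx)/(Xz+tz) + cx = u rearranges into an equation linear in t, so ordinary least squares suffices here:

```python
import numpy as np

def register_root_2d(X, uv, fx, fy, cx, cy):
    """Recover the camera-space root translation t from root-relative 3D
    joints X (n, 3) and predicted 2D keypoints uv (n, 2).

    fx*(Xx+tx)/(Xz+tz) + cx = u  rearranges to the linear equation
    fx*tx + (cx-u)*tz = (u-cx)*Xz - fx*Xx   (and analogously for v).
    """
    n = X.shape[0]
    A = np.zeros((2 * n, 3))
    b = np.zeros(2 * n)
    for i in range(n):
        Xx, Xy, Xz = X[i]
        u, v = uv[i]
        A[2 * i]     = [fx, 0.0, cx - u]
        b[2 * i]     = (u - cx) * Xz - fx * Xx
        A[2 * i + 1] = [0.0, fy, cy - v]
        b[2 * i + 1] = (v - cy) * Xz - fy * Xy
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t
```

With noise-free correspondences the linear system recovers t exactly; with noisy network predictions the least-squares residual becomes a natural confidence signal for the adaptive weighting described below.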
Silhouette loss requires differentiable rendering and dense computation, which is unsuitable for real‑time mobile optimization. Instead, we propose a 1D projection and registration scheme: we sample 12 uniformly spaced 1D axes, project the 2D mesh onto each axis, and extract the endpoints as 1D representations. The 1D objective is also a least‑squares error, solved via quadratic programming.
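The 1D representation can be sketched as follows; the choice of angle spacing over [0, π) is an assumption, since each axis and its opposite give mirrored projections:

```python
import numpy as np

def project_1d(points2d, n_axes=12):
    """Project 2D mesh points onto n_axes uniformly spaced 1D axes and keep
    only the two endpoints (min/max) per axis.

    points2d: (N, 2) projected 2D vertex positions.
    Returns (n_axes,) minima and (n_axes,) maxima along each axis direction.
    """
    angles = np.linspace(0.0, np.pi, n_axes, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n_axes, 2)
    proj = points2d @ dirs.T                                   # (N, n_axes)
    return proj.min(axis=0), proj.max(axis=0)
```

This reduces the dense silhouette comparison to 2 × 12 scalars per shape, which is why the scheme is cheap enough for real-time mobile optimization.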
Both registration problems are solved as quadratic programs, yielding the optimal root translation t and scale s. An adaptive weighting scheme blends the 2D and 1D solutions based on their residuals, trusting the 2D term more when its error is small and shifting weight to the 1D term when the predicted 2D pose is unreliable.
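One possible form of such residual-based blending is sketched below; the paper's exact weighting function may differ, so this inverse-residual scheme is illustrative only:

```python
import numpy as np

def adaptive_blend(sol_2d, sol_1d, res_2d, res_1d, eps=1e-8):
    """Blend the 2D- and 1D-registration solutions by their residuals:
    the smaller a solution's residual, the more it is trusted.

    sol_2d, sol_1d: candidate solutions (e.g., stacked (s, t) parameters).
    res_2d, res_1d: nonnegative scalar least-squares residuals.
    """
    # Weight on the 2D solution grows as its own residual shrinks.
    w2d = res_1d / (res_2d + res_1d + eps)
    return w2d * sol_2d + (1.0 - w2d) * sol_1d
```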
4. Experiments
4.1 Baseline
We re‑implemented the YouTubeHand baseline and replaced its 3D decoder with our spiral decoder.
Table 1: Comparison between our 3D decoder and the baseline
4.2 Semantic Aggregation
Table 2: Effect of different 2D priors
Results show that both 2D pose and silhouette provide useful priors, but pose contributes more to performance. Semantic aggregation of pose heatmaps (sub‑pose) yields richer relational features than treating each joint independently.
Figure 6: Feature expression based on 2D priors, showing overall hand shape and keypoint localization.
4.3 Adaptive 2D‑1D Registration
Table 3: Impact of 2D‑1D registration
We compare three variants (CMR‑P, CMR‑PG, CMR‑SG) that differ in the quality of predicted 2D pose and silhouette. The best 2D pose (CMR‑PG) benefits more from 2D registration, while the best silhouette (CMR‑SG) benefits more from 1D registration. Combining both yields the highest accuracy.
4.4 Comparison with Related Work
Table 4: Comparison on the FreiHAND dataset
Figure 7: 3D PCK comparison
Table 5: Comparison on Human3.6M
CMR achieves competitive or superior performance on FreiHAND, RHD, and Human3.6M datasets.
4.5 Visualization
Figure 8: Predicted results including silhouette, 2D pose, registered mesh, multi‑view mesh, and camera‑space mesh with pose.
5. Future Directions
Current pipelines (MANO‑based, voxel‑based, vertex‑based) are not lightweight; we are exploring mobile‑friendly designs.
Introducing biomechanical and physical constraints into the 2D space is a promising scientific problem.
Our method relies heavily on 3D data, which is costly to acquire and annotate; weak‑supervision may become a key research direction.
Dynamics are a distinctive property of the human body; studying dynamics can advance 3D human‑centric tasks.
References
Xingyu Chen, Yufeng Liu, Chongyang Ma, Jianlong Chang, Huayan Wang, Tian Chen, Xiaoyan Guo, Pengfei Wan, and Wen Zheng. "Camera‑Space Hand Mesh Recovery via Semantic Aggregation and Adaptive 2D‑1D Registration." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
"Graph Convolution: From Spectral Filtering to Spatial Spiral." https://mp.weixin.qq.com/s/bYi8kSUQ7jHeSJ5fts9pJQ