How Regressive Domain Adaptation Boosts Unsupervised Keypoint Detection
This article reviews the CVPR 2021 paper on Regressive Domain Adaptation (RegDA) for unsupervised keypoint detection, explaining its motivation, novel adversarial regression framework, sparse output-space modeling, min‑min training strategy, extensive experiments, and the resulting performance gains across multiple datasets.
Introduction
Deep networks rely on large labeled datasets, but annotating keypoints is costly. Domain adaptation aims to transfer models trained on labeled source domains to unlabeled target domains, reducing annotation effort. While virtual data provides cheap keypoint labels, most existing domain‑adaptation methods fail on regression tasks because regression outputs lack clear decision boundaries.
Method
Inspired by recent transfer‑learning theory, RegDA introduces an adversarial regressor f' that maximizes prediction discrepancy on the target domain while encouraging the feature extractor ψ to minimize it, thus learning domain‑invariant features. Instead of the mean‑squared error used for heat‑map regression, KL divergence is employed to compare heat‑maps, avoiding gradient explosion. An error‑probability distribution over the output space exploits its sparsity to guide the adversarial regressor, and the traditional minimax game is reformulated as a min‑min problem with two opposite objectives.
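To make the heat-map comparison concrete, here is a minimal NumPy sketch of a KL-divergence loss between two spatially normalized heat-maps. The function names and the 8×8 toy heat-maps are illustrative, not the paper's code; the point is that normalizing each heat-map into a probability distribution before comparing keeps the loss bounded, unlike raw MSE on unbounded activations.

```python
import numpy as np

def spatial_softmax(heatmap):
    """Normalize a raw heatmap into a probability distribution over locations."""
    flat = heatmap.reshape(-1)
    flat = flat - flat.max()                 # subtract max for numerical stability
    probs = np.exp(flat) / np.exp(flat).sum()
    return probs.reshape(heatmap.shape)

def kl_heatmap_loss(pred, target, eps=1e-12):
    """KL(target || pred) between two spatially normalized heatmaps."""
    p = spatial_softmax(target).reshape(-1)
    q = spatial_softmax(pred).reshape(-1)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# A prediction peaked near the target location yields a much smaller loss
# than one peaked elsewhere.
h, w = 8, 8
target = np.zeros((h, w)); target[3, 4] = 10.0
pred_good = np.zeros((h, w)); pred_good[3, 4] = 9.0
pred_bad = np.zeros((h, w)); pred_bad[6, 1] = 10.0

assert kl_heatmap_loss(pred_good, target) < kl_heatmap_loss(pred_bad, target)
```

Because both inputs are normalized to sum to one, the loss depends only on where the probability mass sits, not on the raw magnitude of the activations, which is what tames the gradients relative to MSE on raw heat-maps.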
Sparse Spatial Probability Density
The output space of keypoint detectors is high‑dimensional and sparse: most locations have near‑zero probability, while a few have high probability. RegDA aggregates heat‑maps of incorrect keypoints to form an error distribution, which then steers f' toward less‑confident regions, effectively reducing the output space.
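One plausible way to read this aggregation, sketched below in NumPy: for keypoint k, sum the normalized heat-maps of the other keypoints to obtain a sparse distribution whose mass marks locations that are plausible for some keypoint but wrong for k. The helper names and shapes are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def normalize(hm, eps=1e-12):
    """Clip negatives and rescale a heatmap so it sums to one."""
    hm = np.clip(hm, 0.0, None)
    return hm / (hm.sum() + eps)

def error_distribution(heatmaps, k):
    """Aggregate normalized heatmaps of all keypoints other than k.

    The result concentrates on locations that look like *some* keypoint but
    are incorrect for keypoint k -- a compact target region for the
    adversarial regressor, covering far less than the full output space.
    """
    K = heatmaps.shape[0]
    others = [normalize(heatmaps[j]) for j in range(K) if j != k]
    return normalize(np.sum(others, axis=0))

# Three keypoints with single-pixel peaks: the error distribution for
# keypoint 0 splits its mass over the peaks of keypoints 1 and 2.
hms = np.zeros((3, 8, 8))
hms[0, 1, 1] = hms[1, 4, 4] = hms[2, 6, 2] = 1.0
ed = error_distribution(hms, k=0)
```

Since nearly all locations have near-zero mass, the adversarial regressor only needs to probe the handful of high-probability error regions rather than the entire heat-map grid.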
Min‑max Game or Min‑min Game?
Standard adversarial training maximizes the KL divergence between the main regressor f and the adversarial regressor f', but in a high-dimensional output space this mainly widens the variance of the predictions without shifting their mean. RegDA instead minimizes two opposite losses: (1) the KL divergence between f' and the "ground false" targets of f on the target domain, i.e., the error distribution over locations where f does not predict, which maximizes the discrepancy; and (2) the KL divergence between f' and the predictions of f, treated as pseudo ground truth, which the feature extractor ψ minimizes to close the discrepancy. Replacing the minimax game with a pair of minimizations simplifies optimization in high-dimensional output spaces.
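The two alternating minimizations can be sketched on a toy 1-D output space. Everything here is illustrative: a one-hot "ground truth" standing in for f's prediction, a uniform-off-peak "ground false" target, and plain gradient descent on logits in place of the real network updates (in the actual model, step 2 moves the feature extractor ψ rather than f' itself).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

n = 16
gt = np.zeros(n); gt[5] = 1.0                  # where f currently predicts
gf = np.ones(n); gf[5] = 0.0; gf /= gf.sum()   # "ground false": mass everywhere else

logits_adv = np.zeros(n)                       # adversarial regressor f' (pre-softmax)

# Step 1 (update f'): minimize KL(ground_false || f').
# Gradient of KL(p || softmax(z)) w.r.t. z is softmax(z) - p.
for _ in range(200):
    logits_adv -= 0.5 * (softmax(logits_adv) - gf)

# f' now disagrees with f: almost no mass on f's predicted location.
logits_feat = logits_adv.copy()

# Step 2 (emulating the ψ update): minimize KL(f || f') to pull the two
# regressors back into agreement on the target domain.
for _ in range(200):
    logits_feat -= 0.5 * (softmax(logits_feat) - gt)

assert kl(gt, softmax(logits_feat)) < kl(gt, softmax(logits_adv))
```

Note that both steps descend a loss; neither player ever ascends, which is exactly the min-min structure that sidesteps the instability of a true minimax game in this high-dimensional setting.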
Experiments
RegDA was evaluated on several transfer tasks: dSprites (shape dataset), virtual hand to real hand (RHD→H3D), virtual human to real human (Surreal→Human3.6M), and virtual indoor to real outdoor human (Surreal→LSP). On hand keypoint detection, RegDA improved average accuracy by 10.7% over source‑only training. On human pose datasets, it raised PCK scores by 8.3% (Human3.6M) and 10.7% (LSP). Visual results show better alignment with true hand and body structures compared to source‑only models.
Conclusion
RegDA uncovers the sparsity of regression output spaces and bridges the gap between regression and classification tasks. By converting the adversarial game to a pair of minimization objectives, it alleviates optimization difficulties in high‑dimensional spaces and consistently delivers 8%–11% PCK improvements across diverse keypoint detection benchmarks.
References
Yuchen Zhang, Tianle Liu, Mingsheng Long, and Michael I. Jordan. Bridging theory and algorithm for domain adaptation. ICML 2019.
Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3D human pose estimation in the wild: a weakly‑supervised approach. ICCV 2017.
Gyeongsik Moon and Kyoung Mu Lee. I2L‑MeshNet: Image‑to‑Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image. ECCV 2020.
Fu Xiong, Boshen Zhang, Yang Xiao, Zhiguo Cao, Taidong Yu, Joey Tianyi Zhou, and Junsong Yuan. A2J: Anchor‑to‑Joint Regression Network for 3D Articulated Pose Estimation from a Single Depth Image. ICCV 2019.
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain‑adversarial training of neural networks. JMLR 2016.
Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwanghee Lee. U‑GAT‑IT: Unsupervised Generative Attentional Networks with Adaptive Layer‑Instance Normalization for Image‑to‑Image Translation. ICLR 2020.
Bin Xiao, Haiping Wu, and Yichen Wei. Simple Baselines for Human Pose Estimation and Tracking. ECCV 2018.