
Winning the AI Challenger Human Pose Keypoint Contest with Multi‑Scale Fusion

The Firefly team secured first place in the AI Challenger human skeletal keypoint detection contest with a top‑down approach that combines Faster R‑CNN person detection, a multi‑scale region‑feature fusion strategy, varied Gaussian and binary supervision, and OKS‑NMS post‑processing. Extensive experiments show the impact of input size, supervision radius, and model stacking.

Baobao Algorithm Notes

Jan 21, 2018

Competition and Evaluation

The AI Challenger human skeletal keypoint detection competition requires locating 14 predefined joints per person in natural images. Performance is measured by mean Average Precision (mAP) using Object Keypoint Similarity (OKS) as the similarity metric, analogous to IoU in object detection.
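To make the metric concrete, here is a minimal sketch of an OKS computation in the COCO style, where each labeled joint contributes exp(-d² / (2·s²·κ²)) and the result is averaged over labeled joints. The function name, the use of an area for s², and the κ values are assumptions for illustration; the competition's evaluation defines the actual constants.

```python
import math

def oks(pred, gt, labeled, area, kappas):
    """Object Keypoint Similarity between a predicted and a ground-truth pose.

    pred, gt : lists of (x, y) pairs, one per keypoint
    labeled  : list of bools, True where the ground-truth joint is annotated
    area     : scale term s^2 (assumed here to be the person's area in pixels)
    kappas   : per-keypoint falloff constants (placeholder values, not official)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), is_labeled, kappa in zip(pred, gt, labeled, kappas):
        if not is_labeled:
            continue  # unlabeled joints do not contribute to the score
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * kappa ** 2))
        count += 1
    return total / count if count else 0.0

# Two-joint toy example: the second joint is off by 5 pixels.
gt_pose = [(10.0, 10.0), (20.0, 20.0)]
pred_pose = [(10.0, 10.0), (23.0, 24.0)]
score = oks(pred_pose, gt_pose, [True, True], area=100.0, kappas=[0.5, 0.5])
```

A prediction counts as correct at a given OKS threshold in the same way a box counts as correct at a given IoU threshold, and mAP averages precision over a range of such thresholds.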

Dataset

Data are split into training (70%), validation (10%), test‑A (10%) and test‑B (10%). Each person is annotated with 14 keypoints (right/left shoulder, elbow, wrist, hip, knee, ankle, head top, neck). Keypoints have three visibility states: visible, invisible, or out‑of‑frame.

Methodology

A top‑down framework is employed:

Human detection : Faster R‑CNN (with a backbone that is state of the art on COCO) generates bounding boxes. Boxes with a confidence score of at least 0.4 are kept.

Pose estimation : A 2‑stack Hourglass network predicts heatmaps for each keypoint inside each detected box.

Multi‑scale region‑feature fusion :

Train separate models with different Gaussian supervision radii (large‑radius, medium‑radius, small‑radius). Each model outputs a heatmap.

During inference, multiply the heatmaps element‑wise (fused = H_large * H_medium * H_small) to enforce consensus, effectively combining coarse part localization with fine joint refinement.
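The fusion step can be sketched as follows. This is an illustrative pure‑Python version on tiny hand‑made heatmaps, not the team's code; the helper names and the toy values are assumptions. It shows how a spurious sharp peak in the small‑radius map is cancelled because the large‑radius map assigns that location no mass.

```python
def fuse_heatmaps(heatmaps):
    """Element-wise product of equally sized 2D heatmaps (lists of rows)."""
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    fused = [[1.0] * cols for _ in range(rows)]
    for hm in heatmaps:
        for r in range(rows):
            for c in range(cols):
                fused[r][c] *= hm[r][c]
    return fused

def argmax_2d(heatmap):
    """Return (row, col) of the largest value (first one on ties)."""
    best = max(
        ((value, r, c) for r, row in enumerate(heatmap) for c, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]

# Toy heatmaps for a single joint (values are made up for illustration).
h_large = [[0.6, 0.8, 0.6, 0.0],
           [0.8, 1.0, 0.8, 0.0],
           [0.6, 0.8, 0.6, 0.0]]
h_medium = [[0.2, 0.5, 0.2, 0.0],
            [0.5, 1.0, 0.5, 0.1],
            [0.2, 0.5, 0.2, 0.0]]
h_small = [[0.0, 0.1, 0.0, 0.0],
           [0.1, 0.8, 0.1, 0.9],   # (1, 3) is a spurious sharp response
           [0.0, 0.1, 0.0, 0.0]]

fused = fuse_heatmaps([h_large, h_medium, h_small])
small_only_peak = argmax_2d(h_small)   # picks the spurious location
fused_peak = argmax_2d(fused)          # picks the location all three agree on
```

Multiplication acts like a logical AND over the three maps: a location only scores highly if every radius supports it, which is a stricter rule than averaging.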

Enhanced supervision :

Large‑radius models use Gaussian‑shaped targets and mean‑square error loss.

Small‑radius models treat each pixel as a binary classification (keypoint vs background) and use binary cross‑entropy loss.
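The two supervision styles above can be sketched in a few lines. The source does not say how a "radius" maps to a Gaussian sigma or to a disk size, so the parameterization below (sigma for the Gaussian target, a hard disk radius for the binary target) is an assumption, as are all function names. Plain Python lists stand in for tensors.

```python
import math

def gaussian_target(height, width, cx, cy, sigma):
    """Soft target: an unnormalized Gaussian bump centered on the joint."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def binary_target(height, width, cx, cy, radius):
    """Hard target: 1 inside a small disk around the joint, 0 elsewhere."""
    return [[1.0 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0.0
             for x in range(width)] for y in range(height)]

def mse_loss(pred, target):
    """Mean squared error over all pixels (paired with Gaussian targets)."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row)) / n

def bce_loss(pred, target, eps=1e-7):
    """Mean per-pixel binary cross-entropy (paired with binary targets)."""
    n = len(pred) * len(pred[0])
    total = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / n

soft = gaussian_target(5, 5, cx=2, cy=2, sigma=1.0)
hard = binary_target(5, 5, cx=2, cy=2, radius=1)
inverted = [[1.0 - v for v in row] for row in hard]  # a maximally wrong prediction
```

The binary target has a sharp boundary, which suits the small‑radius model's job of precise localization, while the smooth Gaussian gives the large‑radius model a gentler regression objective.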

Post‑processing :

Apply standard NMS with IoU threshold 0.7 to remove duplicate detection boxes.

Apply OKS‑NMS with threshold 0.5 to the poses decoded from the fused heatmaps, suppressing duplicate person instances while preserving closely interacting people.
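The two suppression passes share the same greedy procedure and differ only in the similarity function, which the sketch below makes explicit. It is a simplified illustration: the helper names, the single shared κ, the fixed area, and the toy boxes and poses are all assumptions rather than details from the source.

```python
import math

def greedy_nms(items, scores, similarity, threshold):
    """Visit items by descending score; keep one only if its similarity to
    every already kept item is at most `threshold`. Returns kept indices."""
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(similarity(items[i], items[k]) <= threshold for k in kept):
            kept.append(i)
    return kept

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pose_oks(a, b, area, kappa):
    """OKS-style similarity between two poses (lists of (x, y)).
    Simplification: one shared kappa and a fixed area for all joints."""
    sims = [math.exp(-((ax - bx) ** 2 + (ay - by) ** 2) / (2.0 * area * kappa ** 2))
            for (ax, ay), (bx, by) in zip(a, b)]
    return sum(sims) / len(sims)

# Pass 1: box NMS at IoU 0.7. Box 1 nearly coincides with box 0 and is dropped;
# box 2 overlaps box 0 with IoU of about 0.67 and survives.
boxes = [(0, 0, 100, 200), (2, 0, 102, 200), (20, 0, 120, 200), (60, 0, 160, 200)]
box_scores = [0.9, 0.85, 0.8, 0.7]
kept_boxes = greedy_nms(boxes, box_scores, box_iou, threshold=0.7)

# Pass 2: OKS-NMS at 0.5 on the poses estimated inside the surviving boxes.
# Boxes 0 and 2 yielded almost the same skeleton; box 3 holds another person.
poses = [
    [(50, 20), (50, 60), (40, 120)],     # from box 0
    [(51, 21), (50, 61), (41, 120)],     # from box 2, same person as box 0
    [(110, 20), (110, 60), (100, 120)],  # from box 3, a neighbor
]
pose_scores = [box_scores[i] for i in kept_boxes]
kept_poses = greedy_nms(
    poses, pose_scores,
    lambda p, q: pose_oks(p, q, area=20000.0, kappa=0.1),
    threshold=0.5,
)
final_boxes = [kept_boxes[i] for i in kept_poses]
```

In this toy case the box pass alone would report the first person twice, while the pose pass removes the duplicate and still keeps the neighbor whose box overlaps heavily.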

Training Details

Input resolution experiments: 256×256, 384×384, and 448×448. Larger inputs consistently improve AP (e.g., 384×384 > 256×256).

Batch size 96 on an 8‑GPU server (Titan XP). Batch size has limited impact on final AP.

Data augmentation follows the Hourglass implementation (random rotation, scaling, flip).

Training uses only the AI Challenger data; no extra background images are added, and unlabeled pedestrians need no special handling.
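Of the augmentations listed above, the horizontal flip has one pitfall worth spelling out: mirroring the image turns a right shoulder into a left shoulder, so the keypoint labels must be swapped as well as the x coordinates. The sketch below assumes a 14‑joint index order (right arm, left arm, right leg, left leg, head top, neck); that ordering and the names used are illustrative assumptions and should be checked against the dataset's own documentation.

```python
# Assumed 0-based index order (hypothetical, verify against the dataset):
# 0-2 right shoulder/elbow/wrist, 3-5 left shoulder/elbow/wrist,
# 6-8 right hip/knee/ankle,       9-11 left hip/knee/ankle,
# 12 head top, 13 neck
FLIP_PAIRS = [(0, 3), (1, 4), (2, 5), (6, 9), (7, 10), (8, 11)]

def flip_keypoints(keypoints, image_width, flip_pairs=FLIP_PAIRS):
    """Mirror (x, y) keypoints horizontally and swap left/right labels."""
    # Mirror x around the vertical center line of the image.
    flipped = [(image_width - 1 - x, y) for x, y in keypoints]
    # After mirroring, what was the right limb is now the left limb.
    for a, b in flip_pairs:
        flipped[a], flipped[b] = flipped[b], flipped[a]
    return flipped

# Toy pose: joint i sits at x = 10 * i, y = i.
toy_pose = [(10 * i, i) for i in range(14)]
mirrored = flip_keypoints(toy_pose, image_width=200)
restored = flip_keypoints(mirrored, image_width=200)
```

Applying the flip twice returns the original pose, which is a convenient sanity check for any augmentation pipeline of this kind.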

Experiments and Results

2‑stack Hourglass with three‑radius fusion achieves the highest AP on both test‑A and test‑B, securing first place.

Increasing the number of stacks (4‑stack, 8‑stack) yields marginal AP gains while significantly increasing memory and computation.

On the COCO validation set, the same pipeline (2‑stack, 448×448 input, three‑radius fusion) attains state‑of‑the‑art performance when combined with the 2017 team OKS technique.

Practical Insights

Confidence threshold for Faster R‑CNN boxes (0.4) was chosen via coarse grid search and visual inspection.

OKS‑NMS after standard NMS effectively handles cases where people are close together (e.g., parent‑child) without suppressing true positives.

Training time for 256×256 input on an 8‑GPU machine is roughly 10–12 hours per full training run.

Q&A Highlights

4‑stack models improve AP only slightly; 8‑stack provides modest gains but at higher cost.

Batch size variations (e.g., 96 vs smaller) have negligible effect on final AP.

No experiments were conducted with mixed Gaussian/binary supervision across scales.

Hardware: 8 GPU server (Titan XP). Data augmentations include rotation, scaling, and horizontal flip.

Conclusion

The described top‑down pipeline (Faster R‑CNN detection, 2‑stack Hourglass pose estimation, multi‑radius Gaussian and binary supervision, heatmap multiplication, and OKS‑aware NMS) delivers top‑rank performance on the AI Challenger human keypoint benchmark and transfers effectively to COCO.
