HRNet Source Code Walkthrough: Keypoint Dataset Construction, Online Data Augmentation, and Training Pipeline
This article provides a detailed, English-language walkthrough of the HRNet source code, covering how the COCO keypoint dataset is built, the online data‑augmentation techniques applied during training, and the end‑to‑end training and inference procedures for human pose estimation.
Introduction
The author revisits the HRNet model for human pose estimation, aiming to deepen understanding beyond the high‑level principles previously discussed. The article is a source‑code‑centric tutorial that explains dataset preparation, data augmentation, network construction, training, and inference.
Keypoint Dataset Construction
The COCO human‑keypoint dataset is used. The core class `CocoKeypoint` loads image paths and annotation files, and builds several dictionaries (`self.dataset`, `self.anns`, `self.cats`, `self.imgs`, `self.imgToAnns`, `self.catToImgs`) that map between images, annotations, and categories.
The training and validation transform pipelines are defined as follows:

```python
data_root = args.data_path  # e.g., 'D://Dataset//coco2017'
data_transform = {
    "train": transforms.Compose([
        transforms.HalfBody(0.3, person_kps_info["upper_body_ids"], person_kps_info["lower_body_ids"]),
        transforms.AffineTransform(scale=(0.65, 1.35), rotation=(-45, 45), fixed_size=fixed_size),
        transforms.RandomHorizontalFlip(0.5, person_kps_info["flip_pairs"]),
        transforms.KeypointToHeatMap(heatmap_hw=heatmap_hw, gaussian_sigma=2, keypoints_weights=kps_weights),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ]),
    "val": transforms.Compose([
        transforms.AffineTransform(scale=(1.25, 1.25), fixed_size=fixed_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
}
```

During initialization, paths such as `self.anno_path = os.path.join(root, "annotations", anno_file)` are constructed, and the COCO API (`self.coco = COCO(self.anno_path)`) parses the JSON annotation file.
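What the COCO API's index-building step produces can be illustrated with a toy reconstruction (the annotation records below are made up; the real work is done by `pycocotools`' `COCO.createIndex`, which fills exactly these lookup tables):

```python
from collections import defaultdict

# Hypothetical miniature annotation file: two images, two person annotations.
dataset = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 1, "category_id": 1},
    ],
    "categories": [{"id": 1, "name": "person"}],
}

# Rebuild the same lookup tables the COCO API exposes.
anns = {a["id"]: a for a in dataset["annotations"]}
imgs = {i["id"]: i for i in dataset["images"]}
cats = {c["id"]: c for c in dataset["categories"]}
imgToAnns = defaultdict(list)
catToImgs = defaultdict(list)
for a in dataset["annotations"]:
    imgToAnns[a["image_id"]].append(a)
    catToImgs[a["category_id"]].append(a["image_id"])
```

With this index in place, "all annotations for image 1" or "all images containing a person" become single dictionary lookups.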
The parsing step fills `self.valid_person_list` with one dictionary per person instance, storing the bounding box, image metadata, and keypoint array. After iterating over all images, the list contains 149,813 valid samples.
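The per-instance validity check can be sketched with a toy record; the field names mirror COCO's annotation format, while the exact thresholds here are assumptions, not the repository's code:

```python
import numpy as np

def is_valid_person(ann):
    """Illustrative filter: keep an annotation only if it has a usable box
    and at least one labeled keypoint (thresholds are assumptions)."""
    xmin, ymin, w, h = ann["bbox"]
    if w <= 0 or h <= 0:
        return False
    kps = np.array(ann["keypoints"], dtype=np.float32).reshape(-1, 3)
    # the third column is the COCO visibility flag (0 = not labeled)
    return int(np.sum(kps[:, 2] > 0)) > 0

good = {"bbox": [10, 20, 50, 100], "keypoints": [30, 40, 2] + [0, 0, 0] * 16}
empty = {"bbox": [10, 20, 50, 100], "keypoints": [0, 0, 0] * 17}
```

`good` passes (one labeled keypoint, positive box area), while `empty` is rejected because all 17 visibility flags are zero.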
Online Data Augmentation
Four augmentation modules are applied to the training set (the validation set uses only a subset of these):
- `transforms.HalfBody`
- `transforms.AffineTransform`
- `transforms.RandomHorizontalFlip`
- `transforms.KeypointToHeatMap`
Note that transforms.ToTensor and transforms.Normalize are preprocessing steps, not augmentations.
HalfBody
With probability `p` (default 0.3), the transform selects either the upper‑ or lower‑body keypoints, computes a tight bounding box around the selected points, expands it by a factor of 1.5, and replaces the original `target["box"]`. This simulates partial‑occlusion scenarios.
```python
def __call__(self, image, target):
    if random.random() < self.p:
        kps = target["keypoints"]
        vis = target["visible"]
        upper_kps, lower_kps = [], []
        # split visible keypoints into upper-body and lower-body groups
        for i, v in enumerate(vis):
            if v > 0.5:
                (upper_kps if i in self.upper_body_ids else lower_kps).append(kps[i])
        selected = upper_kps if random.random() < 0.5 else lower_kps
        if len(selected) > 2:
            selected = np.array(selected, dtype=np.float32)
            xmin, ymin = np.min(selected, axis=0)
            xmax, ymax = np.max(selected, axis=0)
            w, h = xmax - xmin, ymax - ymin
            if w > 1 and h > 1:
                # expand the tight box by 1.5x and overwrite the person box
                xmin, ymin, w, h = scale_box(xmin, ymin, w, h, (1.5, 1.5))
                target["box"] = [xmin, ymin, w, h]
    return image, target
```

AffineTransform
The transform first adjusts the original bounding box to the fixed input size (256×192) while preserving aspect ratio, then optionally scales (random factor) and rotates (random angle between –45° and 45°). An affine matrix is computed from three reference points (center, top‑middle, right‑middle) and applied to both the image and the keypoint coordinates.
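The three‑point construction can be sanity‑checked without OpenCV: an affine map has six unknowns, so three non‑collinear point correspondences determine it exactly. The helper below is illustrative, not part of the repository:

```python
import numpy as np

def affine_from_3pts(src, dst):
    """Solve for the 2x3 affine matrix M with M @ [x, y, 1]^T = dst,
    equivalent to cv2.getAffineTransform for non-degenerate points."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    A = np.hstack([src, np.ones((3, 1))])  # each row is [x, y, 1]
    M = np.linalg.solve(A, dst).T          # 2x3 affine matrix
    return M

# center, top-middle, right-middle of a 100x200 (w x h) box,
# mapped to the same reference points of a 192x256 destination image
src = [(100, 100), (100, 0), (150, 100)]
dst = [(95.5, 127.5), (95.5, 0.0), (191.0, 127.5)]
M = affine_from_3pts(src, dst)
```

Applying `M` to any of the three source points reproduces the corresponding destination point, which is exactly the guarantee the code below relies on.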
```python
src_center = np.array([(src_xmin + src_xmax) / 2, (src_ymin + src_ymax) / 2])
src_p2 = src_center + np.array([0, -src_h / 2])  # top middle
src_p3 = src_center + np.array([src_w / 2, 0])   # right middle
dst_center = np.array([(fixed_w - 1) / 2, (fixed_h - 1) / 2])
dst_p2 = dst_center + np.array([0, -(fixed_h - 1) / 2])  # top middle
dst_p3 = dst_center + np.array([(fixed_w - 1) / 2, 0])   # right middle
src = np.stack([src_center, src_p2, src_p3]).astype(np.float32)
dst = np.stack([dst_center, dst_p2, dst_p3]).astype(np.float32)
trans = cv2.getAffineTransform(src, dst)
resize_img = cv2.warpAffine(img, trans, (fixed_w, fixed_h), flags=cv2.INTER_LINEAR)
```

RandomHorizontalFlip
When triggered (probability 0.5), the image is flipped horizontally (`np.flip(image, axis=1)`). Keypoint x‑coordinates are mirrored (`keypoints[:, 0] = width - keypoints[:, 0] - 1`), and the symmetric joint pairs listed in `flip_pairs` are swapped so that left/right semantics stay correct.
```python
# Flip the image horizontally
image = np.ascontiguousarray(np.flip(image, axis=1))
# Mirror keypoint x-coordinates and swap symmetric left/right pairs
keypoints[:, 0] = width - keypoints[:, 0] - 1
for a, b in self.matched_parts:
    keypoints[[a, b]] = keypoints[[b, a]].copy()
```

KeypointToHeatMap
Each keypoint is rasterized into a Gaussian heatmap (default sigma = 2, kernel size ≈ 13×13). The heatmap resolution is 1/4 of the input image, so coordinates are divided by 4 and rounded. Only visible keypoints (visibility > 0.5) contribute; the Gaussian is clipped to the heatmap borders.
```python
heatmap = np.zeros((num_kps, H, W), dtype=np.float32)
# map keypoints to heatmap coordinates (1/4 resolution), rounding to nearest pixel
heatmap_kps = (kps / 4 + 0.5).astype(np.int64)
for kp_id in range(num_kps):
    if visible[kp_id] < 0.5:
        continue
    x, y = heatmap_kps[kp_id]
    ul = [x - r, y - r]  # upper-left corner of the Gaussian patch
    br = [x + r, y + r]  # bottom-right corner
    # compute the overlap between the patch and the heatmap, then copy the kernel
    heatmap[kp_id][img_y0:img_y1 + 1, img_x0:img_x1 + 1] = kernel[g_y0:g_y1 + 1, g_x0:g_x1 + 1]
```

After augmentation, the transformed image and its corresponding heatmap are fed to the HRNet backbone.
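Since the snippet above elides the clipping arithmetic, the whole rasterization can be reproduced end to end in a self‑contained sketch (variable names follow the snippet only loosely; the 1/4 downscale and sigma = 2 match the text):

```python
import numpy as np

def kps_to_heatmap(kps, visible, heatmap_hw=(64, 48), sigma=2):
    """Rasterize keypoints (given in input-image pixels) into Gaussian heatmaps."""
    H, W = heatmap_hw
    num_kps = len(kps)
    r = sigma * 3                      # radius 6 -> 13x13 kernel
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)).astype(np.float32)
    heatmap = np.zeros((num_kps, H, W), dtype=np.float32)
    # 1/4 resolution, rounded to the nearest heatmap pixel
    hm_kps = (np.asarray(kps, dtype=np.float32) / 4 + 0.5).astype(np.int64)
    for k in range(num_kps):
        if visible[k] < 0.5:
            continue
        x, y = hm_kps[k]
        if x - r >= W or y - r >= H or x + r < 0 or y + r < 0:
            continue                   # Gaussian falls entirely off the heatmap
        # clip the kernel patch to the heatmap borders
        gx0, gy0 = max(0, r - x), max(0, r - y)
        ix0, iy0 = max(0, x - r), max(0, y - r)
        ix1, iy1 = min(W - 1, x + r), min(H - 1, y + r)
        gx1, gy1 = gx0 + (ix1 - ix0), gy0 + (iy1 - iy0)
        heatmap[k, iy0:iy1 + 1, ix0:ix1 + 1] = kernel[gy0:gy1 + 1, gx0:gx1 + 1]
    return heatmap

hm = kps_to_heatmap([(100, 60), (2, 2)], visible=[2, 0])
```

The first keypoint (visible) produces a Gaussian peaking at heatmap cell (25, 15); the second is skipped because its visibility flag is 0.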
Network Construction
The HRNet architecture itself is not re‑implemented here; the author notes that building the network is straightforward once the code is examined, likening it to assembling building blocks.
Training and Inference
During training, a warm‑up learning‑rate scheduler is applied in the first epoch (if `warmup=True`), followed by the usual optimizer steps. The main training loop iterates over `data_loader` through the `log_every` generator, which prints periodic metrics without materializing the whole epoch.
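The schedule itself is easy to state: the lr multiplier ramps linearly from `warmup_factor` up to 1 over `warmup_iters` iterations. A standalone sketch of that formula (the torchvision-style `warmup_lr_scheduler` wraps the same expression in a `LambdaLR`):

```python
def warmup_multiplier(it, warmup_iters=1000, warmup_factor=1.0 / 1000):
    """Linear warmup: factor the base lr is multiplied by at iteration `it`."""
    if it >= warmup_iters:
        return 1.0
    alpha = it / warmup_iters
    # interpolate from warmup_factor (at it=0) to 1.0 (at it=warmup_iters)
    return warmup_factor * (1 - alpha) + alpha

factors = [warmup_multiplier(i) for i in (0, 500, 1000)]
```

The first iteration therefore trains at lr/1000, and the ramp ends at the full learning rate.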
```python
lr_scheduler = None
if epoch == 0 and warmup:
    # ramp the learning rate up from lr/1000 over the first (up to) 1000 iterations
    warmup_factor = 1.0 / 1000
    warmup_iters = min(1000, len(data_loader) - 1)
    lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)

for i, [images, targets] in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
    # forward, loss, backward, optimizer step ...
```

For inference, the model outputs a heatmap tensor of shape (1, 17, 64, 48). The function `get_max_preds` extracts the arg‑max location for each joint, rescales it back to the original image using the inverse affine matrix, and applies a sub‑pixel refinement based on neighboring heatmap values.
```python
def get_max_preds(batch_heatmaps):
    batch_size, num_joints, h, w = batch_heatmaps.shape
    heatmaps_reshaped = batch_heatmaps.view(batch_size, num_joints, -1)
    maxvals, idx = torch.max(heatmaps_reshaped, dim=2)
    preds = torch.zeros(batch_size, num_joints, 2).to(batch_heatmaps)
    preds[..., 0] = idx % w                                   # column (x)
    preds[..., 1] = torch.div(idx, w, rounding_mode="floor")  # row (y)
    # mask out low-confidence points
    pred_mask = (maxvals > 0).float().unsqueeze(-1).repeat(1, 1, 2)
    preds *= pred_mask
    return preds, maxvals
```

After mapping the coordinates back to the original image space, a small offset (±0.25 pixel, chosen from the neighboring heatmap values) is applied to improve precision.
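The ±0.25 rule shifts each prediction a quarter pixel toward whichever horizontal and vertical neighbor has the larger response, on the reasoning that the true (continuous) peak lies between the arg‑max cell and its stronger neighbor. A numpy sketch of that rule, simplified to a single joint and without the border handling of the real code:

```python
import numpy as np

def refine_coord(heatmap, x, y):
    """Shift (x, y) by 0.25 px toward the higher-valued neighbor in each axis."""
    h, w = heatmap.shape
    px, py = int(x), int(y)
    if 1 < px < w - 1 and 1 < py < h - 1:
        # sign of the discrete gradient decides the direction of the nudge
        dx = np.sign(heatmap[py, px + 1] - heatmap[py, px - 1])
        dy = np.sign(heatmap[py + 1, px] - heatmap[py - 1, px])
        return x + 0.25 * dx, y + 0.25 * dy
    return x, y

hm = np.zeros((64, 48), dtype=np.float32)
hm[15, 25] = 1.0
hm[15, 26] = 0.6   # the true peak lies slightly to the right of the arg-max
x, y = refine_coord(hm, 25, 15)
```

Here the x coordinate moves to 25.25 (right neighbor is stronger) while y stays at 15 (vertical neighbors are equal), matching the behavior described above.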
Conclusion
The article concludes that the HRNet source‑code analysis, especially the data‑augmentation pipeline and the training/inference utilities, provides a solid foundation for anyone wishing to experiment with human‑pose estimation or adapt the pipeline to other keypoint‑based tasks.