HRNet Source Code Walkthrough: Keypoint Dataset Construction, Online Data Augmentation, and Training Pipeline
This article provides a detailed, English-language walkthrough of the HRNet source code, covering how the COCO keypoint dataset is built, the online data‑augmentation techniques applied during training, and the end‑to‑end training and inference procedures for human pose estimation.
Introduction
The author revisits the HRNet model for human pose estimation, aiming to deepen understanding beyond the high‑level principles previously discussed. The article is a source‑code‑centric tutorial that explains dataset preparation, data augmentation, network construction, training, and inference.
Keypoint Dataset Construction
The COCO human‑keypoint dataset is used. The core class `CocoKeypoint` loads image paths and annotation files, and builds several dictionaries (`self.dataset`, `self.anns`, `self.cats`, `self.imgs`, `self.imgToAnns`, `self.catToImgs`) that map between images, annotations, and categories.
The training and validation transform pipelines are defined as follows:

```python
data_root = args.data_path  # e.g., 'D://Dataset//coco2017'
data_transform = {
    "train": transforms.Compose([
        transforms.HalfBody(0.3, person_kps_info["upper_body_ids"], person_kps_info["lower_body_ids"]),
        transforms.AffineTransform(scale=(0.65, 1.35), rotation=(-45, 45), fixed_size=fixed_size),
        transforms.RandomHorizontalFlip(0.5, person_kps_info["flip_pairs"]),
        transforms.KeypointToHeatMap(heatmap_hw=heatmap_hw, gaussian_sigma=2, keypoints_weights=kps_weights),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ]),
    "val": transforms.Compose([
        transforms.AffineTransform(scale=(1.25, 1.25), fixed_size=fixed_size),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
}
```

During initialization, paths such as `self.anno_path = os.path.join(root, "annotations", anno_file)` are constructed, and the COCO API (`self.coco = COCO(self.anno_path)`) parses the JSON annotation file.
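What the COCO API's index-building step produces can be illustrated with a toy reconstruction (the annotation records below are made up; the real work is done by `pycocotools`' `COCO.createIndex`, which fills exactly these lookup tables):

```python
from collections import defaultdict

# Hypothetical miniature annotation file: two images, two person annotations.
dataset = {
    "images": [{"id": 1}, {"id": 2}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 1, "category_id": 1},
    ],
    "categories": [{"id": 1, "name": "person"}],
}

# Rebuild the same lookup tables the COCO API exposes.
anns = {a["id"]: a for a in dataset["annotations"]}
imgs = {i["id"]: i for i in dataset["images"]}
cats = {c["id"]: c for c in dataset["categories"]}
imgToAnns = defaultdict(list)
catToImgs = defaultdict(list)
for a in dataset["annotations"]:
    imgToAnns[a["image_id"]].append(a)
    catToImgs[a["category_id"]].append(a["image_id"])
```

With this index in place, "all annotations for image 1" or "all images containing a person" become single dictionary lookups.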
The parsing step fills `self.valid_person_list` with one dictionary per person instance, storing the bounding box, image metadata, and keypoint array. After iterating over all images, the list contains 149,813 valid samples.
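The per-instance validity check can be sketched with a toy record; the field names mirror COCO's annotation format, while the exact thresholds here are assumptions, not the repository's code:

```python
import numpy as np

def is_valid_person(ann):
    """Illustrative filter: keep an annotation only if it has a usable box
    and at least one labeled keypoint (thresholds are assumptions)."""
    xmin, ymin, w, h = ann["bbox"]
    if w <= 0 or h <= 0:
        return False
    kps = np.array(ann["keypoints"], dtype=np.float32).reshape(-1, 3)
    # the third column is the COCO visibility flag (0 = not labeled)
    return int(np.sum(kps[:, 2] > 0)) > 0

good = {"bbox": [10, 20, 50, 100], "keypoints": [30, 40, 2] + [0, 0, 0] * 16}
empty = {"bbox": [10, 20, 50, 100], "keypoints": [0, 0, 0] * 17}
```

`good` passes (one labeled keypoint, positive box area), while `empty` is rejected because all 17 visibility flags are zero.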
Online Data Augmentation
Four augmentation modules are applied to the training set (the validation set uses only a subset of these):
- `transforms.HalfBody`
- `transforms.AffineTransform`
- `transforms.RandomHorizontalFlip`
- `transforms.KeypointToHeatMap`
Note that transforms.ToTensor and transforms.Normalize are preprocessing steps, not augmentations.
HalfBody
With probability `p` (default 0.3), the transform selects either the upper‑ or lower‑body keypoints, computes a tight bounding box around the selected points, expands it by a factor of 1.5, and replaces the original `target["box"]`. This simulates partial‑occlusion scenarios.
```python
def __call__(self, image, target):
    if random.random() < self.p:
        kps = target["keypoints"]
        vis = target["visible"]
        upper_kps, lower_kps = [], []
        # split visible keypoints into upper-body and lower-body groups
        for i, v in enumerate(vis):
            if v > 0.5:
                (upper_kps if i in self.upper_body_ids else lower_kps).append(kps[i])
        selected = upper_kps if random.random() < 0.5 else lower_kps
        if len(selected) > 2:
            selected = np.array(selected, dtype=np.float32)
            xmin, ymin = np.min(selected, axis=0)
            xmax, ymax = np.max(selected, axis=0)
            w, h = xmax - xmin, ymax - ymin
            if w > 1 and h > 1:
                # expand the tight box by 1.5x and overwrite the person box
                xmin, ymin, w, h = scale_box(xmin, ymin, w, h, (1.5, 1.5))
                target["box"] = [xmin, ymin, w, h]
    return image, target
```

AffineTransform
The transform first adjusts the original bounding box to the fixed input size (256×192) while preserving aspect ratio, then optionally scales (random factor) and rotates (random angle between –45° and 45°). An affine matrix is computed from three reference points (center, top‑middle, right‑middle) and applied to both the image and the keypoint coordinates.
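The three‑point construction can be sanity‑checked without OpenCV: an affine map has six unknowns, so three non‑collinear point correspondences determine it exactly. The helper below is illustrative, not part of the repository:

```python
import numpy as np

def affine_from_3pts(src, dst):
    """Solve for the 2x3 affine matrix M with M @ [x, y, 1]^T = dst,
    equivalent to cv2.getAffineTransform for non-degenerate points."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    A = np.hstack([src, np.ones((3, 1))])  # each row is [x, y, 1]
    M = np.linalg.solve(A, dst).T          # 2x3 affine matrix
    return M

# center, top-middle, right-middle of a 100x200 (w x h) box,
# mapped to the same reference points of a 192x256 destination image
src = [(100, 100), (100, 0), (150, 100)]
dst = [(95.5, 127.5), (95.5, 0.0), (191.0, 127.5)]
M = affine_from_3pts(src, dst)
```

Applying `M` to any of the three source points reproduces the corresponding destination point, which is exactly the guarantee the code below relies on.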
```python
src_center = np.array([(src_xmin + src_xmax) / 2, (src_ymin + src_ymax) / 2])
src_p2 = src_center + np.array([0, -src_h / 2])  # top middle
src_p3 = src_center + np.array([src_w / 2, 0])   # right middle
dst_center = np.array([(fixed_w - 1) / 2, (fixed_h - 1) / 2])
dst_p2 = dst_center + np.array([0, -(fixed_h - 1) / 2])  # top middle
dst_p3 = dst_center + np.array([(fixed_w - 1) / 2, 0])   # right middle
src = np.stack([src_center, src_p2, src_p3]).astype(np.float32)
dst = np.stack([dst_center, dst_p2, dst_p3]).astype(np.float32)
trans = cv2.getAffineTransform(src, dst)
resize_img = cv2.warpAffine(img, trans, (fixed_w, fixed_h), flags=cv2.INTER_LINEAR)
```

RandomHorizontalFlip
When triggered (probability 0.5), the image is flipped horizontally (`np.flip(image, axis=1)`). Keypoint x‑coordinates are mirrored (`keypoints[:, 0] = width - keypoints[:, 0] - 1`), and the symmetric joint pairs listed in `flip_pairs` are swapped so that left/right semantics stay correct.
```python
# Flip the image horizontally
image = np.ascontiguousarray(np.flip(image, axis=1))
# Mirror keypoint x-coordinates and swap symmetric left/right pairs
keypoints[:, 0] = width - keypoints[:, 0] - 1
for a, b in self.matched_parts:
    keypoints[[a, b]] = keypoints[[b, a]].copy()
```

KeypointToHeatMap
Each keypoint is rasterized into a Gaussian heatmap (default sigma = 2, kernel size ≈ 13×13). The heatmap resolution is 1/4 of the input image, so coordinates are divided by 4 and rounded. Only visible keypoints (visibility > 0.5) contribute; the Gaussian is clipped to the heatmap borders.
```python
heatmap = np.zeros((num_kps, H, W), dtype=np.float32)
# map keypoints to heatmap coordinates (1/4 resolution), rounding to nearest pixel
heatmap_kps = (kps / 4 + 0.5).astype(np.int64)
for kp_id in range(num_kps):
    if visible[kp_id] < 0.5:
        continue
    x, y = heatmap_kps[kp_id]
    ul = [x - r, y - r]  # upper-left corner of the Gaussian patch
    br = [x + r, y + r]  # bottom-right corner
    # compute the overlap between the patch and the heatmap, then copy the kernel
    heatmap[kp_id][img_y0:img_y1 + 1, img_x0:img_x1 + 1] = kernel[g_y0:g_y1 + 1, g_x0:g_x1 + 1]
```

After augmentation, the transformed image and its corresponding heatmap are fed to the HRNet backbone.
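Since the snippet above elides the clipping arithmetic, the whole rasterization can be reproduced end to end in a self‑contained sketch (variable names follow the snippet only loosely; the 1/4 downscale and sigma = 2 match the text):

```python
import numpy as np

def kps_to_heatmap(kps, visible, heatmap_hw=(64, 48), sigma=2):
    """Rasterize keypoints (given in input-image pixels) into Gaussian heatmaps."""
    H, W = heatmap_hw
    num_kps = len(kps)
    r = sigma * 3                      # radius 6 -> 13x13 kernel
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2)).astype(np.float32)
    heatmap = np.zeros((num_kps, H, W), dtype=np.float32)
    # 1/4 resolution, rounded to the nearest heatmap pixel
    hm_kps = (np.asarray(kps, dtype=np.float32) / 4 + 0.5).astype(np.int64)
    for k in range(num_kps):
        if visible[k] < 0.5:
            continue
        x, y = hm_kps[k]
        if x - r >= W or y - r >= H or x + r < 0 or y + r < 0:
            continue                   # Gaussian falls entirely off the heatmap
        # clip the kernel patch to the heatmap borders
        gx0, gy0 = max(0, r - x), max(0, r - y)
        ix0, iy0 = max(0, x - r), max(0, y - r)
        ix1, iy1 = min(W - 1, x + r), min(H - 1, y + r)
        gx1, gy1 = gx0 + (ix1 - ix0), gy0 + (iy1 - iy0)
        heatmap[k, iy0:iy1 + 1, ix0:ix1 + 1] = kernel[gy0:gy1 + 1, gx0:gx1 + 1]
    return heatmap

hm = kps_to_heatmap([(100, 60), (2, 2)], visible=[2, 0])
```

The first keypoint (visible) produces a Gaussian peaking at heatmap cell (25, 15); the second is skipped because its visibility flag is 0.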
Network Construction
The HRNet architecture itself is not re‑implemented here; the author notes that building the network is straightforward once the code is examined, likening it to assembling building blocks.
Training and Inference
During training, a warm‑up learning‑rate scheduler is applied in the first epoch (if `warmup=True`), followed by the usual optimizer steps. The main training loop iterates over `data_loader` through the `log_every` generator, which prints periodic metrics without materializing the whole epoch.
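The schedule itself is easy to state: the lr multiplier ramps linearly from `warmup_factor` up to 1 over `warmup_iters` iterations. A standalone sketch of that formula (the torchvision-style `warmup_lr_scheduler` wraps the same expression in a `LambdaLR`):

```python
def warmup_multiplier(it, warmup_iters=1000, warmup_factor=1.0 / 1000):
    """Linear warmup: factor the base lr is multiplied by at iteration `it`."""
    if it >= warmup_iters:
        return 1.0
    alpha = it / warmup_iters
    # interpolate from warmup_factor (at it=0) to 1.0 (at it=warmup_iters)
    return warmup_factor * (1 - alpha) + alpha

factors = [warmup_multiplier(i) for i in (0, 500, 1000)]
```

The first iteration therefore trains at lr/1000, and the ramp ends at the full learning rate.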
```python
lr_scheduler = None
if epoch == 0 and warmup:
    # ramp the learning rate up from lr/1000 over the first (up to) 1000 iterations
    warmup_factor = 1.0 / 1000
    warmup_iters = min(1000, len(data_loader) - 1)
    lr_scheduler = utils.warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)

for i, [images, targets] in enumerate(metric_logger.log_every(data_loader, print_freq, header)):
    # forward, loss, backward, optimizer step ...
```

For inference, the model outputs a heatmap tensor of shape (1, 17, 64, 48). The function `get_max_preds` extracts the arg‑max location for each joint, rescales it back to the original image using the inverse affine matrix, and applies a sub‑pixel refinement based on neighboring heatmap values.
```python
def get_max_preds(batch_heatmaps):
    batch_size, num_joints, h, w = batch_heatmaps.shape
    heatmaps_reshaped = batch_heatmaps.view(batch_size, num_joints, -1)
    maxvals, idx = torch.max(heatmaps_reshaped, dim=2)
    preds = torch.zeros(batch_size, num_joints, 2).to(batch_heatmaps)
    preds[..., 0] = idx % w                                   # column (x)
    preds[..., 1] = torch.div(idx, w, rounding_mode="floor")  # row (y)
    # mask out low-confidence points
    pred_mask = (maxvals > 0).float().unsqueeze(-1).repeat(1, 1, 2)
    preds *= pred_mask
    return preds, maxvals
```

After mapping the coordinates back to the original image space, a small offset (±0.25 pixel, chosen from the neighboring heatmap values) is applied to improve precision.
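The ±0.25 rule shifts each prediction a quarter pixel toward whichever horizontal and vertical neighbor has the larger response, on the reasoning that the true (continuous) peak lies between the arg‑max cell and its stronger neighbor. A numpy sketch of that rule, simplified to a single joint and without the border handling of the real code:

```python
import numpy as np

def refine_coord(heatmap, x, y):
    """Shift (x, y) by 0.25 px toward the higher-valued neighbor in each axis."""
    h, w = heatmap.shape
    px, py = int(x), int(y)
    if 1 < px < w - 1 and 1 < py < h - 1:
        # sign of the discrete gradient decides the direction of the nudge
        dx = np.sign(heatmap[py, px + 1] - heatmap[py, px - 1])
        dy = np.sign(heatmap[py + 1, px] - heatmap[py - 1, px])
        return x + 0.25 * dx, y + 0.25 * dy
    return x, y

hm = np.zeros((64, 48), dtype=np.float32)
hm[15, 25] = 1.0
hm[15, 26] = 0.6   # the true peak lies slightly to the right of the arg-max
x, y = refine_coord(hm, 25, 15)
```

Here the x coordinate moves to 25.25 (right neighbor is stronger) while y stays at 15 (vertical neighbors are equal), matching the behavior described above.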
Conclusion
The article concludes that the HRNet source‑code analysis, especially the data‑augmentation pipeline and the training/inference utilities, provides a solid foundation for anyone wishing to experiment with human‑pose estimation or adapt the pipeline to other keypoint‑based tasks.