Mastering Video Point Tracking: Image Registration & SiamMask for the Malan Mountain Cup
This article details the Malan Mountain Cup video point‑tracking challenge: the dataset, the evaluation metric, and a hybrid solution that combines SIFT‑based image registration with SiamMask tracking, along with analysis and practical tricks for improving performance.
1. Problem Analysis
1.1. Problem Introduction
The inaugural “Malan Mountain Cup” international audio‑video algorithm competition, organized by the China Society of Industrial and Applied Mathematics and hosted by Hunan provincial authorities, features three tracks: video point‑of‑interest tracking, video recommendation, and video quality restoration. The author participated in the point‑tracking track, achieving second place in both the preliminary and final rounds using a combination of image registration and tracking.
Video point‑of‑interest tracking involves locating a previously annotated region throughout a video sequence. It underpins dynamic ad insertion, which must blend ads seamlessly into video content without degrading the viewing experience; this requires precise estimation of camera motion and careful handling of lighting, depth, and occlusion. The competition demands a mean‑square error (MSE) below 1 for the four‑corner coordinates of the tracked quadrilateral.
1.2. Data Description
Training set: 2,000 video clips (~100 frames each), each with a first‑frame annotated quadrilateral marking the intended ad region. Validation set: 100 clips with full‑frame trajectories of the four corners. Test set: two leaderboards (A and B) with 200 clips each; participants receive only the first‑frame corner coordinates.
1.3. Evaluation Metrics
The MSE is computed per frame over the four corner coordinates, averaged over the frames of each clip, and then averaged across all clips to produce the final score.
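The exact formulas were given as images in the original post, but the three‑level averaging described above can be reconstructed as a plausible sketch (the function names and the choice to average over all eight scalar coordinates per frame are my assumptions, not the official scoring script):

```python
import numpy as np

def frame_mse(pred_corners, gt_corners):
    """MSE over the 8 scalar coordinates (4 corners x 2) of one frame."""
    pred = np.asarray(pred_corners, dtype=np.float64).reshape(-1)
    gt = np.asarray(gt_corners, dtype=np.float64).reshape(-1)
    return float(np.mean((pred - gt) ** 2))

def video_mse(pred_seq, gt_seq):
    """Average the per-frame MSE over all frames of one clip."""
    return float(np.mean([frame_mse(p, g) for p, g in zip(pred_seq, gt_seq)]))

def dataset_mse(pred_videos, gt_videos):
    """Average the per-video MSE over the whole dataset."""
    return float(np.mean([video_mse(p, g) for p, g in zip(pred_videos, gt_videos)]))
```

Under this reading, a single corner off by 1 pixel in one direction contributes 1/8 to that frame's MSE, which makes the sub‑1 target quite strict.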
2. Proposed Solution
2.1. Review of Common Approaches
Typical tracking pipelines use correlation filters (KCF), MeanShift, or modern track‑by‑detection methods. The competition poses two challenges: (1) the target box is an irregular quadrilateral, while most detectors assume rectangular anchors; (2) the region may be occluded or lack distinctive features.
The quadrilateral shape and high‑precision corner requirement invalidate standard KCF or anchor‑based detectors.
Occlusions and feature‑less targets further complicate tracking.
The problem lies between image registration (common in SLAM and medical imaging) and generic object tracking (VOT, MOT). A hybrid approach was adopted.
2.1.1. Image Registration
Standard registration aligns two images via feature extraction, matching, and transformation. Classical pipelines use SIFT+Nearest‑Neighbor (NN) or ORB for efficiency; deep‑learning alternatives include SuperPoint and SuperGlue. Due to the lack of training data, the author chose SIFT+NN, which, despite occasional mismatches, yields reliable homographies.
2.1.2. Image Tracking
The SiamMask network provides a mask‑aware tracker; its center prediction helps calibrate the quadrilateral, while similarity scoring incorporates tracking knowledge.
2.2. Full Pipeline
The pipeline consists of two stages: (1) position prediction to retrieve candidate boxes; (2) similarity computation to select the optimal box. Position prediction aggregates global, local, and corner‑region features, while similarity scoring evaluates each candidate against the reference.
2.2.1. Position Prediction
Global features use SIFT with NN matching; local features combine SIFT and corner points (optical flow for verification); corner‑region features rely on template matching due to sparse points. Homographies are estimated per region and fused.
2.2.2. Similarity Computation
Similarity is measured by enlarging the diagonal region to amplify coordinate errors, warping to a rectangle, and applying filtering plus SSIM. No deep metric‑learning models were tested due to time constraints.
3. Performance Optimization
3.1. Analysis
Comparison of registration vs. tracking:
Image registration: accurate corner coordinates when the homography is correct, but prone to mismatches in repetitive scenes.
Object tracking: provides a reasonable global location for a single target, yet struggles with irregular quadrilaterals and severe occlusions.
3.2. Effective Improvements
3.2.1. Feature‑Point Partitioning
Dividing features into global, local, and corner regions improves matching precision and focuses attention on the tracking area, especially for videos with heavy blur where foreground dominates.
3.2.2. Similarity Tuning
Expanding the diagonal region and using SSIM mitigates the impact of small coordinate errors; deeper metric‑learning could further improve results.
3.2.3. Optical‑Flow Augmentation
When SIFT yields few points, corner detection plus optical flow adds matches; the reference frame is chosen from the temporally nearest candidate to handle background motion.
3.2.4. Rerank Strategy
After similarity scoring, re‑ranking based on region priority helps select the best box when similarity scores plateau or occlusions cause low confidence.
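One plausible reading of this strategy, sketched below: when the top similarity scores are nearly tied, fall back on region priority instead of raw similarity. The plateau threshold, the priority encoding, and the tie‑break rule are all assumptions about how the re‑ranking might work, not the author's exact logic:

```python
def rerank(candidates, plateau=0.02):
    """candidates: list of (quad, similarity, region_priority), where a
    lower priority number means a more trusted region (e.g. 0 = corner,
    1 = local, 2 = global). When the leading similarity scores lie within
    `plateau` of each other, prefer the higher-priority region."""
    ranked = sorted(candidates, key=lambda c: -c[1])
    best_sim = ranked[0][1]
    # Candidates whose score is on the plateau with the best one.
    tied = [c for c in ranked if best_sim - c[1] <= plateau]
    return min(tied, key=lambda c: c[2])[0]
```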
3.3. Unsuccessful Attempts
3.3.1. Other Traditional Features
ORB and similar descriptors performed worse than SIFT; deep‑learning features may help if fine‑tuned on competition data.
3.3.2. KCF and Similar Trackers
Traditional trackers could not meet the quadrilateral precision requirement and were outperformed by the registration‑centric approach.
4. Recommended Tricks
4.1. Visualize and Quantify Bad Cases
Without ground‑truth, a heuristic similarity metric was used as confidence to enable rapid iteration during the competition.
4.2. Borrow Ideas from Related Papers
Combining SIFT registration with SiamMask provided a global reference that improved early results, though it slightly regressed later.
4.3. SuperPoint + SuperGlue
These methods offer strong matching capabilities; while not used in the final solution due to data constraints, they remain promising for future work.
References
1. DeTone, Malisiewicz, Rabinovich. "SuperPoint: Self‑Supervised Interest Point Detection and Description." CVPR Workshops, 2018.
2. Sarlin et al. "SuperGlue: Learning Feature Matching with Graph Neural Networks." CVPR, 2020.
3. SiamMask documentation.
4. Megvii CVPR 2020 SLAM Challenge.