Artificial Intelligence · 12 min read

Near-Duplicate Video Retrieval: Framework, Feature Extraction, Metric Learning, and Model Optimization

This article presents a comprehensive study of near‑duplicate video retrieval, covering the definition of near‑duplicate videos, motivations for deduplication, challenges, a two‑stage offline/online processing framework, keyframe and VGG16‑based feature extraction, metric‑learning loss functions, training procedures, dataset preparation, evaluation metrics, and model enhancements using LSTM and attention mechanisms.


In recent years, short‑video sharing has become increasingly popular, producing massive amounts of video data, much of which consists of near‑duplicates: videos that differ only in illumination, editing, encoding, length, or other modifications.

What is a near‑duplicate video? It refers to videos that are visually similar but differ in aspects such as color, lighting, inserted logos, encoding parameters, file format, or duration. The definition is not standardized and can be adapted to specific business needs.

Why perform video deduplication? To protect user copyright, respect originality, and support various applications such as video‑database deduplication, search result re‑ranking, and personalized video recommendation.

Challenges include retrieval accuracy (requiring robust and scalable algorithms), retrieval speed (handling ever‑growing data volumes), and the semantic gap (videos with similar low‑level features may have very different meanings).

Framework

The near‑duplicate video retrieval system consists of offline and online processes. Offline processing extracts keyframes from the video database, computes keyframe features, calculates pairwise similarity for self‑deduplication, and stores the features. Online processing extracts keyframes from a query video, computes its features, compares them with stored features, and returns retrieval results.

Keyframe extraction

Short videos (~15 s): extract a fixed number of keyframes (e.g., 10) for a fixed‑length representation, or sample one frame per second for a variable‑length representation.

Long videos (~5 min): use the same method or adopt frame‑difference, global comparison, clustering, or event‑detection based extraction.
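The uniform‑sampling strategies above can be sketched as a small helper that, given a video's frame count and frame rate, returns the indices of the frames to keep. This is a minimal illustration, not the article's exact implementation; the function name and signature are assumptions, and the actual frame decoding (e.g., via a video library) is omitted.

```python
def keyframe_indices(total_frames, fps, fixed_n=None):
    """Return frame indices to sample as keyframes.

    fixed_n: if given, sample that many evenly spaced frames
    (fixed-length representation); otherwise sample one frame
    per second of footage (variable-length representation).
    """
    if total_frames <= 0:
        return []
    if fixed_n is not None:
        # Evenly spaced indices across the whole video.
        step = total_frames / fixed_n
        return [min(int(i * step), total_frames - 1) for i in range(fixed_n)]
    # One frame per second of footage.
    seconds = int(total_frames / fps)
    return [min(int(s * fps), total_frames - 1) for s in range(max(seconds, 1))]
```

For a 15‑second clip at 30 fps (450 frames), `fixed_n=10` yields 10 evenly spaced indices, while the one‑frame‑per‑second mode yields 15.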

Feature extraction

Keyframe features are obtained by aggregating the intermediate convolutional layers of a pre‑trained VGG‑16 network. Each frame is resized to 224×224 and passed through the network; the intermediate feature maps are max‑pooled, concatenated, zero‑centered, and L2‑normalized, yielding a 4096‑dimensional frame vector. The frame vectors of a video (an n×4096 matrix) are then averaged and re‑normalized to produce a single 4096‑dimensional video representation.
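The aggregation step after the CNN forward pass can be sketched with NumPy. This sketch assumes each row of the input is already a 4096‑dimensional vector of max‑pooled VGG‑16 activations for one keyframe (the CNN forward pass itself is omitted); the function names are illustrative, not from the original article.

```python
import numpy as np

def normalize(v, eps=1e-12):
    """Zero-center then L2-normalize a vector (or each row of a matrix)."""
    v = v - v.mean(axis=-1, keepdims=True)
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def video_embedding(frame_features):
    """Aggregate per-frame VGG-16 features (n x 4096) into one 4096-d vector.

    Each row is assumed to hold the concatenated max-pooled activations of
    the intermediate convolutional layers for one keyframe. The rows are
    normalized, averaged, and the mean vector is re-normalized.
    """
    frames = normalize(np.asarray(frame_features, dtype=np.float64))
    return normalize(frames.mean(axis=0))
```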

Similarity measurement

Distance calculation: Euclidean distance between two video embeddings (after a DNN reduces the 4096‑dimensional features to 500 dimensions).

Similarity scoring: a smaller distance indicates higher similarity; for a query video q, each candidate in a set M is ranked by converting its distance to q into a similarity score.
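Distance computation and ranking can be illustrated as follows. The article's exact similarity formula is not reproduced here; this sketch simply ranks candidates by ascending Euclidean distance, which preserves the "smaller distance means more similar" ordering. Function names are assumptions.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(np.asarray(a, dtype=np.float64)
                                - np.asarray(b, dtype=np.float64)))

def rank_candidates(query, candidates):
    """Rank candidate embeddings by ascending distance to the query.

    Returns (candidate_index, distance) pairs; the smallest distance
    (most similar candidate) comes first.
    """
    dists = [(i, euclidean(query, c)) for i, c in enumerate(candidates)]
    return sorted(dists, key=lambda t: t[1])
```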

Training process

A triplet‑based metric learning network is trained using a triplet loss function. Each triplet consists of a query video (anchor), a positive video (near‑duplicate), and a negative video (non‑duplicate). The loss encourages the distance between anchor and positive to be smaller than the distance between anchor and negative by a margin γ.

Optimization is performed with batch gradient descent, with an L2 regularization term (weighted by λ) added to prevent over‑fitting.
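The triplet objective described above can be written as a hinge on the anchor–positive and anchor–negative distances. This is a minimal NumPy sketch; whether the original uses plain or squared Euclidean distance is an assumption (squared distance is used here), and the regularization and gradient steps are omitted.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, gamma=1.0):
    """Triplet margin loss: max(0, d(a,p)^2 - d(a,n)^2 + gamma).

    Penalizes triplets where the anchor-positive distance is not smaller
    than the anchor-negative distance by at least the margin gamma.
    """
    a = np.asarray(anchor, dtype=np.float64)
    d_ap = np.sum((a - np.asarray(positive, dtype=np.float64)) ** 2)
    d_an = np.sum((a - np.asarray(negative, dtype=np.float64)) ** 2)
    return float(max(0.0, d_ap - d_an + gamma))
```

A well‑separated triplet (negative far beyond the margin) contributes zero loss, so training effort concentrates on the hard triplets that still violate the margin.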

Dataset and evaluation

Training set: core dataset (queries and positives) and background dataset (negatives).

Test set: public CC_Web_Video dataset.

Metric: mean Average Precision (mAP).
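mAP averages, over all queries, the average precision (AP) of each ranked result list. A short reference implementation, assuming each result list is given as 0/1 relevance flags in rank order:

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a list of 0/1 flags in rank order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_ranked):
    """mAP: mean of per-query average precision over all queries."""
    return sum(average_precision(r) for r in all_ranked) / len(all_ranked)
```

For example, a query whose relevant videos land at ranks 1 and 3 gets AP = (1/1 + 2/3) / 2 ≈ 0.83.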

Model optimization

To exploit temporal information, an LSTM layer is added to the feature aggregation stage. Additionally, an attention mechanism (based on equations (6)–(8)) is incorporated to focus on salient frame features.
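The attention idea can be illustrated as weighted frame aggregation: score each frame, turn the scores into softmax weights, and take the weighted sum instead of a plain average. The article's attention follows equations (6)–(8) of the original and is not reproduced here; this is a generic dot‑product attention‑pooling sketch with a hypothetical learned weight vector `w`, and the LSTM layer is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def attention_pool(frame_features, w):
    """Attention-weighted aggregation of per-frame features.

    Scores each frame with a learned vector w, converts scores to
    softmax weights, and returns the weighted sum, so salient frames
    dominate the video representation instead of a plain average.
    """
    feats = np.asarray(frame_features, dtype=np.float64)   # n x d
    scores = feats @ np.asarray(w, dtype=np.float64)       # n
    weights = softmax(scores)
    return weights @ feats                                 # d
```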

Results and conclusion

The optimized model achieves 98.5% accuracy, reduces retrieval time from 7 s to 2 s, and identifies that 25% of videos in the internal short‑video library are duplicates.


Tags: attention map, video deduplication, LSTM, VGG16, deep metric learning, near-duplicate retrieval
Written by HomeTech