Artificial Intelligence · 12 min read

Near-Duplicate Video Retrieval: Framework, Feature Extraction, Metric Learning, and Model Optimization

This article presents a comprehensive study of near‑duplicate video retrieval, covering the definition of near‑duplicate videos, motivations for deduplication, challenges, a two‑stage offline/online processing framework, keyframe and VGG16‑based feature extraction, metric‑learning loss functions, training procedures, dataset preparation, evaluation metrics, and model enhancements using LSTM and attention mechanisms.


In recent years, short‑video sharing has become increasingly popular, producing massive amounts of video data, much of which consists of near‑duplicates: videos that differ only in illumination, editing, encoding, length, or other modifications.

What is a near‑duplicate video? It refers to videos that are visually similar but differ in aspects such as color, lighting, inserted logos, encoding parameters, file format, or duration. The definition is not standardized and can be adapted to specific business needs.

Why perform video deduplication? To protect user copyright, respect originality, and support various applications such as video‑database deduplication, search result re‑ranking, and personalized video recommendation.

Challenges include retrieval accuracy (requiring robust and scalable algorithms), retrieval speed (handling ever‑growing data volumes), and the semantic gap (videos with similar low‑level features may have very different meanings).

Framework

The near‑duplicate video retrieval system consists of offline and online processes. Offline processing extracts keyframes from the video database, computes keyframe features, calculates pairwise similarity for self‑deduplication, and stores the features. Online processing extracts keyframes from a query video, computes its features, compares them with stored features, and returns retrieval results.

Keyframe extraction

Short videos (~15 s): extract a fixed number of keyframes (e.g., 10) for a fixed‑length representation, or sample one frame per second for a variable‑length representation.

Long videos (~5 min): use the same method or adopt frame‑difference, global comparison, clustering, or event‑detection based extraction.
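The uniform‑sampling strategies above can be sketched as a small helper that, given a video's frame count and frame rate, returns the indices of the frames to keep. This is a minimal illustration, not the article's exact implementation; the function name and signature are assumptions, and the actual frame decoding (e.g., via a video library) is omitted.

```python
def keyframe_indices(total_frames, fps, fixed_n=None):
    """Return frame indices to sample as keyframes.

    fixed_n: if given, sample that many evenly spaced frames
    (fixed-length representation); otherwise sample one frame
    per second of footage (variable-length representation).
    """
    if total_frames <= 0:
        return []
    if fixed_n is not None:
        # Evenly spaced indices across the whole video.
        step = total_frames / fixed_n
        return [min(int(i * step), total_frames - 1) for i in range(fixed_n)]
    # One frame per second of footage.
    seconds = int(total_frames / fps)
    return [min(int(s * fps), total_frames - 1) for s in range(max(seconds, 1))]
```

For a 15‑second clip at 30 fps (450 frames), `fixed_n=10` yields 10 evenly spaced indices, while the one‑frame‑per‑second mode yields 15.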

Feature extraction

Keyframe features are obtained by aggregating the intermediate convolutional layers of a pre‑trained VGG‑16 network. Each frame is resized to 224×224 and passed through the network; the intermediate feature maps are max‑pooled, concatenated, zero‑centered, and L2‑normalized, yielding a 4096‑dimensional frame vector. The frame vectors of a video (an n×4096 matrix) are then averaged and re‑normalized to produce a single 4096‑dimensional video representation.
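The aggregation step after the CNN forward pass can be sketched with NumPy. This sketch assumes each row of the input is already a 4096‑dimensional vector of max‑pooled VGG‑16 activations for one keyframe (the CNN forward pass itself is omitted); the function names are illustrative, not from the original article.

```python
import numpy as np

def normalize(v, eps=1e-12):
    """Zero-center then L2-normalize a vector (or each row of a matrix)."""
    v = v - v.mean(axis=-1, keepdims=True)
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def video_embedding(frame_features):
    """Aggregate per-frame VGG-16 features (n x 4096) into one 4096-d vector.

    Each row is assumed to hold the concatenated max-pooled activations of
    the intermediate convolutional layers for one keyframe. The rows are
    normalized, averaged, and the mean vector is re-normalized.
    """
    frames = normalize(np.asarray(frame_features, dtype=np.float64))
    return normalize(frames.mean(axis=0))
```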

Similarity measurement

Distance calculation: Euclidean distance between two video embeddings (after a DNN reduces the 4096‑dimensional features to 500 dimensions).

Similarity scoring: a smaller distance indicates higher similarity; for a query video q, each candidate in a set M is ranked by converting its distance to q into a similarity score.
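Distance computation and ranking can be illustrated as follows. The article's exact similarity formula is not reproduced here; this sketch simply ranks candidates by ascending Euclidean distance, which preserves the "smaller distance means more similar" ordering. Function names are assumptions.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(np.asarray(a, dtype=np.float64)
                                - np.asarray(b, dtype=np.float64)))

def rank_candidates(query, candidates):
    """Rank candidate embeddings by ascending distance to the query.

    Returns (candidate_index, distance) pairs; the smallest distance
    (most similar candidate) comes first.
    """
    dists = [(i, euclidean(query, c)) for i, c in enumerate(candidates)]
    return sorted(dists, key=lambda t: t[1])
```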

Training process

A triplet‑based metric learning network is trained using a triplet loss function. Each triplet consists of a query video (anchor), a positive video (near‑duplicate), and a negative video (non‑duplicate). The loss encourages the distance between anchor and positive to be smaller than the distance between anchor and negative by a margin γ.

Optimization is performed with batch gradient descent, with an L2 regularization term (weighted by λ) added to prevent over‑fitting.
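The triplet objective described above can be written as a hinge on the anchor–positive and anchor–negative distances. This is a minimal NumPy sketch; whether the original uses plain or squared Euclidean distance is an assumption (squared distance is used here), and the regularization and gradient steps are omitted.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, gamma=1.0):
    """Triplet margin loss: max(0, d(a,p)^2 - d(a,n)^2 + gamma).

    Penalizes triplets where the anchor-positive distance is not smaller
    than the anchor-negative distance by at least the margin gamma.
    """
    a = np.asarray(anchor, dtype=np.float64)
    d_ap = np.sum((a - np.asarray(positive, dtype=np.float64)) ** 2)
    d_an = np.sum((a - np.asarray(negative, dtype=np.float64)) ** 2)
    return float(max(0.0, d_ap - d_an + gamma))
```

A well‑separated triplet (negative far beyond the margin) contributes zero loss, so training effort concentrates on the hard triplets that still violate the margin.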

Dataset and evaluation

Training set: core dataset (queries and positives) and background dataset (negatives).

Test set: public CC_Web_Video dataset.

Metric: mean Average Precision (mAP).
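mAP averages, over all queries, the average precision (AP) of each ranked result list. A short reference implementation, assuming each result list is given as 0/1 relevance flags in rank order:

```python
def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a list of 0/1 flags in rank order."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at each relevant hit
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_ranked):
    """mAP: mean of per-query average precision over all queries."""
    return sum(average_precision(r) for r in all_ranked) / len(all_ranked)
```

For example, a query whose relevant videos land at ranks 1 and 3 gets AP = (1/1 + 2/3) / 2 ≈ 0.83.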

Model optimization

To exploit temporal information, an LSTM layer is added to the feature aggregation stage. Additionally, an attention mechanism (based on equations (6)–(8)) is incorporated to focus on salient frame features.
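The attention idea can be illustrated as weighted frame aggregation: score each frame, turn the scores into softmax weights, and take the weighted sum instead of a plain average. The article's attention follows equations (6)–(8) of the original and is not reproduced here; this is a generic dot‑product attention‑pooling sketch with a hypothetical learned weight vector `w`, and the LSTM layer is omitted.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def attention_pool(frame_features, w):
    """Attention-weighted aggregation of per-frame features.

    Scores each frame with a learned vector w, converts scores to
    softmax weights, and returns the weighted sum, so salient frames
    dominate the video representation instead of a plain average.
    """
    feats = np.asarray(frame_features, dtype=np.float64)   # n x d
    scores = feats @ np.asarray(w, dtype=np.float64)       # n
    weights = softmax(scores)
    return weights @ feats                                 # d
```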

Results and conclusion

The optimized model achieves 98.5% accuracy, reduces retrieval time from 7 s to 2 s, and identifies that 25% of videos in the internal short‑video library are duplicates.


Tags: attention map, video deduplication, LSTM, VGG16, deep metric learning, near-duplicate retrieval
Written by HomeTech