How to Detect Video Copyright Infringement with Two‑Stage Frame Matching

This article details a two‑stage video copyright detection pipeline that builds a frame‑level feature library, uses Hessian‑Affine + SIFT and Fisher Vectors for robust feature extraction, applies weighted bipartite graph matching and longest increasing subsequence localization, and achieves an F1‑score of 0.9086 on the CCF 2019 competition dataset.

iQIYI Technical Product Team

Mar 13, 2020

Background

With the rapid growth of short‑video platforms, copyright infringement has become increasingly severe because digital media can be easily copied, edited, and redistributed. The 2019 CCF Big Data and Computational Intelligence Competition introduced a "Video Copyright Detection" track that requires participants to link a transformed short video back to its original long video, identify the exact time segment, and do so with robust visual features and high processing speed.

Solution Overview

The top‑three team (referred to as the "Lao Liang" team) built a pipeline with three components: feature extraction, infringing video retrieval, and infringing segment localization. Matching runs in two stages: Stage 1 retrieves the most likely reference video from multiple candidates, and Stage 2 localizes the infringing segment within that video, which improves accuracy.

1. Feature Extraction

Each video frame is sampled and processed with Hessian‑Affine keypoint detection followed by SIFT descriptors to obtain local features. These local features are encoded into binary global descriptors using Fisher Vectors. To mitigate the semantic gap of hand‑crafted features, the team also incorporated deep‑learning‑based RMAC descriptors and fused their retrieval scores.
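As a rough illustration of what a binary global descriptor looks like in practice, the sketch below binarizes a real‑valued vector by sign and compares two codes by Hamming distance. Sign binarization is an assumption here (the article does not say how the Fisher Vectors were binarized), and the function names and toy values are hypothetical.

```python
def binarize(vector):
    """Pack a real-valued descriptor into an int, one bit per dimension.

    Assumption: a dimension maps to 1 when its value is positive, else 0.
    """
    code = 0
    for value in vector:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


# Toy 4-dimensional descriptors standing in for real Fisher Vectors.
frame_a = binarize([0.3, -1.2, 0.0, 2.5])
frame_b = binarize([0.7, 0.4, -0.1, -0.9])
distance = hamming(frame_a, frame_b)
```

Storing each frame as a compact bit string is what makes Hamming‑distance search over a large frame library cheap.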

2. Infringing Video Retrieval (Stage 1)

Using Faiss, a coarse‑grained binary index (1 fps sampling) is built over all reference videos. For each query frame, the k = 10 nearest reference frames by Hamming distance are retrieved. A weighted bipartite‑graph maximum‑matching algorithm then selects the reference video with the highest total matching weight, subject to a one‑to‑one constraint between query frames and reference frames.
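The Stage 1 logic can be sketched in plain Python as follows. This is not the team's code: a brute‑force Hamming search stands in for the Faiss binary index, the edge weight (code length minus Hamming distance) is an assumed choice, and all function names are hypothetical. The matching step uses the Hungarian algorithm to enforce the one‑to‑one constraint.

```python
from collections import defaultdict


def hamming(a, b):
    """Hamming distance between two binary codes stored as Python ints."""
    return bin(a ^ b).count("1")


def max_weight_matching(w):
    """Total weight of a maximum-weight one-to-one matching.

    w is a dense matrix of non-negative weights (0 means "no edge").
    Implemented as the Hungarian algorithm on negated weights.
    """
    if len(w) > len(w[0]):                 # the loop below needs rows <= cols
        w = [list(col) for col in zip(*w)]
    n, m = len(w), len(w[0])
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)    # row and column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = -w[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                          # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sum(w[p[j] - 1][j - 1] for j in range(1, m + 1) if p[j])


def pick_reference_video(query_codes, ref_frames, bits, k=10):
    """ref_frames is a list of (video_id, frame_index, code) tuples."""
    edges = defaultdict(dict)              # video_id -> {(query, frame): weight}
    for qi, qcode in enumerate(query_codes):
        nearest = sorted(ref_frames, key=lambda r: hamming(qcode, r[2]))[:k]
        for video_id, frame_index, code in nearest:
            # Assumed weight: more shared bits means a heavier edge.
            edges[video_id][(qi, frame_index)] = bits - hamming(qcode, code)
    scores = {}
    for video_id, e in edges.items():
        rows = sorted({q for q, _ in e})
        cols = sorted({f for _, f in e})
        matrix = [[e.get((q, f), 0) for f in cols] for q in rows]
        scores[video_id] = max_weight_matching(matrix)
    return max(scores, key=scores.get), scores


# Toy example with 8-bit codes and two reference videos.
refs = [("A", 0, 0b00000000), ("A", 1, 0b11110000),
        ("B", 0, 0b00001111), ("B", 1, 0b10101010)]
best_video, video_scores = pick_reference_video(
    [0b00000001, 0b11110001], refs, bits=8, k=2)
```

Scoring each video by its matching weight rather than by a raw vote count keeps a single reference frame from being counted for several query frames.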

3. Precise Segment Localization (Stage 2)

After identifying the candidate reference video, both query and reference frames are represented by global Fisher‑Vector descriptors. Similarity is computed with the QAGS (Query‑Based Asymmetric Gaussian Skipping) metric, which outperforms plain Hamming distance. For each query frame, the 10 most similar reference frames (5 fps sampling) are collected, and the resulting matching pairs are ordered temporally.

The problem is modeled as finding the longest increasing subsequence (LIS) in a weighted bipartite graph under dense‑matching constraints. Dynamic programming enumerates possible LIS candidates, updating both sequence length and cumulative weight when a longer or heavier match is found. Sparse outlier matches are filtered by enforcing temporal proximity, and a sliding‑window scan further refines the segment by selecting high‑weight matches that satisfy dense ordering.
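The dynamic program described above can be sketched as follows. This is a minimal O(n²) illustration rather than the team's implementation: each match is a hypothetical (query_time, ref_time, weight) triple, chains are compared by length first and cumulative weight second, and the max_gap parameter is an assumed way of expressing the temporal‑proximity filter.

```python
def longest_increasing_matches(pairs, max_gap=None):
    """Return the best chain of matches increasing in both timelines.

    pairs: list of (query_time, ref_time, weight).
    max_gap: optional limit on the reference-time jump between
             consecutive matches (assumed outlier filter).
    """
    pairs = sorted(pairs)
    n = len(pairs)
    if n == 0:
        return []
    best = [(1, p[2]) for p in pairs]   # (length, weight) of chain ending at i
    prev = [-1] * n
    for i in range(n):
        q_i, r_i, w_i = pairs[i]
        for j in range(i):
            q_j, r_j, _ = pairs[j]
            close = max_gap is None or r_i - r_j <= max_gap
            if q_j < q_i and r_j < r_i and close:
                candidate = (best[j][0] + 1, best[j][1] + w_i)
                if candidate > best[i]:  # longer, or equally long but heavier
                    best[i], prev[i] = candidate, j
    end = max(range(n), key=lambda i: best[i])
    chain = []
    while end != -1:
        chain.append(pairs[end])
        end = prev[end]
    return chain[::-1]


# Toy matches: two of them (ref times 50.0 and 3.0) are outliers.
matches = [(0, 10.0, 0.9), (1, 11.0, 0.8), (1, 50.0, 0.95),
           (2, 12.0, 0.7), (3, 3.0, 0.99), (4, 14.0, 0.6)]
chain = longest_increasing_matches(matches, max_gap=5.0)
segment = (chain[0][1], chain[-1][1])   # reference start and end times
```

In this toy input the two high‑weight outliers are dropped because they break the temporal ordering or jump too far in the reference timeline, which is the behavior the dense‑matching constraint is meant to enforce.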

Experimental Setup

All experiments were run on an AWS c5.4xlarge instance (16 × 3.6 GHz Intel Xeon, 32 GB RAM). The dataset comprised 200 reference long videos (MP4), 3 000 training short clips generated from them, and 1 500 test short clips. A validation set of 500 short clips was sampled from the training set.

Results

Stage‑1 retrieval achieved 95.8 % Top‑1 accuracy on the validation set. The full two‑stage pipeline obtained an F1‑score of 0.9086 on the test set with a matching error threshold of 3 seconds. Table 1 shows Top‑1/Top‑3/Top‑5 accuracies, and Table 2 reports F1‑scores and average detection time per short video, highlighting that finer frame sampling dramatically increases runtime.
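For readers unfamiliar with the metric, the sketch below shows one way an F1‑score with a time tolerance could be computed. The matching criterion is an assumption (correct reference video, with both start and end times within the tolerance); the competition's exact rule is not given in this article, and the function name and toy data are hypothetical.

```python
def f1_at_tolerance(predictions, ground_truth, tol=3.0):
    """F1 over dicts mapping query_id -> (ref_video_id, start_s, end_s).

    Assumed rule: a prediction is correct when the video matches and both
    boundaries are within tol seconds of the ground truth.
    """
    true_positives = 0
    for query_id, (video_id, start, end) in predictions.items():
        truth = ground_truth.get(query_id)
        if (truth is not None and truth[0] == video_id
                and abs(truth[1] - start) <= tol
                and abs(truth[2] - end) <= tol):
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: one of two predictions is within tolerance; one query is missed.
truth = {"q1": ("A", 10.0, 20.0), "q2": ("B", 5.0, 9.0), "q3": ("C", 0.0, 4.0)}
preds = {"q1": ("A", 11.5, 21.0), "q2": ("B", 12.0, 16.0)}
score = f1_at_tolerance(preds, truth)
```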

Conclusion

The proposed method leverages highly discriminative, robust frame‑level features and a two‑stage matching strategy to deliver strong detection precision, ranking fourth in the competition. However, the approach suffers from high computational cost, especially during fine‑grained segment localization. Future work could explore more efficient similarity measures and feature fusion techniques to reduce runtime while preserving accuracy.

References

Araujo, A., & Girod, B. (2017). Large‑scale video retrieval using image queries. IEEE Transactions on Circuits and Systems for Video Technology, 28(6), 1406‑1420.

Tolias, G., Sicre, R., & Jégou, H. (2016). Particular object retrieval with integral max‑pooling of CNN activations. In International Conference on Learning Representations (ICLR).

Johnson, J., Douze, M., & Jégou, H. (2019). Billion‑scale similarity search with GPUs. IEEE Transactions on Big Data.

Yang, Y., Tian, Y., & Huang, T. (2019). Multiscale video sequence matching for near‑duplicate detection and retrieval. Multimedia Tools and Applications, 78(1), 311‑336.
