Video Copyright Detection Algorithm: Competition Solution Overview
The Hulu Brothers’ competition solution tackles large‑scale video copyright detection in four stages: it extracts I‑frames, encodes them with ResNet‑18 CNN features, matches query frames to reference frames via approximate nearest‑neighbor search with ORB keypoint re‑ranking, and finally aligns the matched segments in time by linear interpolation over frame correspondences. The approach achieved high precision, recall, and F1 scores.
Video copyright detection is a key algorithm for video retrieval and copyright protection, representing a cutting‑edge research direction. It combines image retrieval, image verification, and video information, posing significant challenges in robustness, speed, and concurrency, especially as large‑scale infringing short videos undergo complex transformations.
This document presents the solution of the "Hulu Brothers" team for the CCF BDCI video copyright detection competition.
Problem Description
The competition evaluates the ability to associate transformed short videos with their original long videos and to locate the corresponding time segments. The data consist of long copyright videos (originating from iQIYI) and short infringing videos generated by applying various transformations to the long videos.
Formally, let A be the set of long videos, B the set of extracted clips, B' the transformed clips, C the short videos composed from B', and D the final infringing video set, with C and A disjoint.
Evaluation Metrics
Submissions are scored with the F1‑score. A prediction counts as a true positive (TP) when the predicted long‑video ID matches and the start/end time error is within 5 seconds; an incorrect ID or a time error above 5 s yields a false positive (FP), and missing or wrong predictions also count as false negatives (FN). Precision, recall, and F1 are computed from these counts.
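The scoring rule above can be sketched as follows. This is an illustrative reconstruction, not the official scorer: the function and field names are invented, and a wrong prediction is counted as both an FP and an FN, matching the stated rule.

```python
# Sketch of the competition metric: TP requires a matching reference-video
# ID and both boundary errors within 5 seconds. Names are illustrative.
TOLERANCE_S = 5.0

def score_predictions(preds, truths):
    """preds/truths: dict mapping query id -> (ref_id, start_s, end_s);
    a query missing from preds counts as a false negative."""
    tp = fp = fn = 0
    for qid, (ref_id, t_start, t_end) in truths.items():
        pred = preds.get(qid)
        if pred is None:
            fn += 1
            continue
        p_ref, p_start, p_end = pred
        if (p_ref == ref_id
                and abs(p_start - t_start) <= TOLERANCE_S
                and abs(p_end - t_end) <= TOLERANCE_S):
            tp += 1
        else:
            fp += 1
            fn += 1  # a wrong prediction also misses the true segment
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```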
Solution Overview
The task is divided into two sub‑tasks: (1) video matching – finding the reference video for each query video; (2) temporal alignment – locating the start and end times in both videos.
The overall pipeline converts video retrieval into image retrieval by extracting key frames and applying image‑level similarity techniques.
3.1 Frame Extraction
Key frames (I‑frames) are extracted from both query and reference videos because they contain complete information and are fewer in number, making them suitable for robust retrieval.
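One common way to dump only I‑frames is ffmpeg's `select` filter. The sketch below builds that invocation; it assumes `ffmpeg` is on the PATH, and the paths and function names are illustrative rather than taken from the team's code.

```python
# Sketch: extract only I-frames from a video using ffmpeg's select filter.
import subprocess

def iframe_cmd(video_path: str, out_pattern: str) -> list:
    """Build the ffmpeg command that keeps only intra-coded (I) frames."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", "select='eq(pict_type,I)'",  # keep only I-frames
        "-vsync", "vfr",                    # one output image per kept frame
        out_pattern,                        # e.g. "frames/%05d.jpg"
    ]

def extract_iframes(video_path: str, out_pattern: str) -> None:
    subprocess.run(iframe_cmd(video_path, out_pattern), check=True)
```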
3.2 Image Feature Extraction
Both global (e.g., color histograms, image hashes, CNN‑based features) and local (e.g., keypoint descriptors) image features are considered. Global CNN features are particularly effective for semantic encoding.
3.3 Query‑Reference Video Matching
1. Extract ResNet‑18 convolutional features from key frames and L2‑normalize them.
2. For each query frame, perform approximate nearest‑neighbor search (using the HNSW library) against reference frames. Two versions are used: a basic version retrieving the most similar frame above a similarity threshold, and an improved version retrieving the top‑100 candidates, followed by ORB keypoint matching for re‑ranking.
3. Aggregate frame‑level matches per query video using frequency and similarity scores to determine the best matching reference video.

This approach achieved a 2,650/3,000 correct match rate on the training set.
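The video‑level aggregation in step 3 can be sketched in pure Python. This is a minimal illustration under assumed semantics (each query frame votes for the reference video of its best‑matching frame; ties in frequency break on accumulated similarity); the ANN search and ORB re‑ranking are not shown, and the function name is invented.

```python
# Sketch of step 3: aggregate per-frame matches into a video-level decision.
from collections import defaultdict

def best_reference(frame_matches):
    """frame_matches: list of (ref_video_id, similarity), one entry per
    query key frame that found a match. Returns the winning reference id."""
    votes = defaultdict(lambda: [0, 0.0])  # ref_id -> [count, sim_sum]
    for ref_id, sim in frame_matches:
        votes[ref_id][0] += 1
        votes[ref_id][1] += sim
    # Rank by match frequency first, then by accumulated similarity.
    return max(votes.items(), key=lambda kv: (kv[1][0], kv[1][1]))[0]
```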
3.4 Temporal Alignment
Given matched key frames, the method assumes consistent frame spacing and similar playback speed (1× or 1.2×). Using the correspondence of consecutive key frames, the start and end times (qstart, qend) in the query video are mapped to the reference video times (rstart, rend) by linear interpolation based on the known speed ratio.
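Under these assumptions the mapping is a one‑dimensional linear interpolation, which can be sketched as below. The function names and the snapping of the speed ratio to {1.0, 1.2} are illustrative, based on the two playback speeds the write‑up mentions.

```python
# Sketch of temporal alignment: estimate the speed ratio from two matched
# key-frame pairs, then map the query segment onto the reference timeline.

def estimate_speed(q_t1, r_t1, q_t2, r_t2):
    """Speed ratio (reference seconds per query second) from two matched
    key-frame pairs, snapped to the assumed playback speeds {1.0, 1.2}."""
    raw = (r_t2 - r_t1) / (q_t2 - q_t1)
    return min((1.0, 1.2), key=lambda s: abs(s - raw))

def align(q_t, r_t, speed, q_start, q_end):
    """Map [q_start, q_end] in the query onto reference times, anchored at
    one matched key-frame pair (q_t, r_t) and the estimated speed ratio."""
    r_start = r_t + (q_start - q_t) * speed
    r_end = r_t + (q_end - q_t) * speed
    return r_start, r_end
```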
3.5 Experimental Results
The solution demonstrated high precision and recall on the competition dataset, with detailed performance curves shown in the original figures.
Experience Sharing
The team emphasizes the importance of extensive data preprocessing and a stable offline validation pipeline. They treat the task as Near‑Duplicate Video Detection, reviewing related academic literature and finding that existing methods fall short on temporal localization, prompting their own efficient approach.
Team Introduction
The "Hulu Brothers" team comprises members from industry and academia, with backgrounds in AI, computer vision, and data science. Members include Liu Yuzhong (JD Retail, algorithm engineer), Chen Jianqiu (University of New South Wales), Shi Jia (University of California), Yang Ye (University of Melbourne), and Miao Shilei (JD Retail).
References
iQIYI Technical Product Team