Beihang Team's Video Copyright Detection Solution: Frame Sampling, Feature Extraction, and Retrieval Matching
The Beihang University team’s video copyright detection solution samples frames every 200 ms, extracts 512‑dimensional ResNet‑18 features, and uses handcrafted cosine‑similarity matching to identify source videos and plagiarized segments, all while operating on limited hardware without training any models.
This article presents the solution of the Beihang University team in the CCF BDCI video copyright detection competition. The approach does not rely on any trained machine‑learning models; deep learning is used only for feature extraction, while the rest of the pipeline is handcrafted to run on limited hardware (500 GB disk and a single 1080 Ti GPU).
The overall workflow consists of three stages: video frame sampling, feature extraction, and retrieval‑matching.
1. Video Frame Sampling – The team first explored the dataset and observed that the query and reference sets have inconsistent video lengths and frame rates (FPS). To unify processing, they adopted a uniform sampling interval of 200 ms: for 25 FPS videos they sample every 5 frames, for 15 FPS every 3 frames, and for 10 FPS every 2 frames. This mitigates FPS differences and reduces storage requirements.
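The fixed 200 ms interval can be turned into a per-video frame stride with a one-line rule. A minimal sketch (the function name `sampling_stride` is illustrative, not from the team's code):

```python
def sampling_stride(fps: float, interval_ms: int = 200) -> int:
    """Return the frame stride that keeps sampled frames ~interval_ms apart."""
    return max(int(round(fps * interval_ms / 1000.0)), 1)

# Reproduces the article's rule: 25 FPS -> every 5th frame,
# 15 FPS -> every 3rd, 10 FPS -> every 2nd.
strides = {fps: sampling_stride(fps) for fps in (25, 15, 10)}
```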
2. Feature Extraction – After frame extraction, each frame is passed through an ImageNet‑pretrained ResNet‑18 to obtain a 512‑dimensional feature vector. Because the competition organizers added black borders as noise, the team first crops these borders and then resizes each frame to 224×224. They note that the small input resolution and the un‑fine‑tuned ResNet‑18 limit discriminative power, and suggest that larger models and task‑specific fine‑tuning would improve results.
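The border-cropping step can be sketched with NumPy by trimming rows and columns whose mean intensity is near black. This is a minimal illustration, not the team's implementation; the intensity threshold `thresh` is an assumed noise tolerance:

```python
import numpy as np

def crop_black_borders(frame: np.ndarray, thresh: int = 10) -> np.ndarray:
    """Crop leading/trailing rows and columns whose mean intensity is
    below `thresh` (i.e. near-black borders). `frame` is 2-D grayscale."""
    rows = np.where(frame.mean(axis=1) > thresh)[0]
    cols = np.where(frame.mean(axis=0) > thresh)[0]
    if rows.size == 0 or cols.size == 0:  # fully black frame: leave as-is
        return frame
    return frame[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

The cropped frame would then be resized to 224×224 and fed to the ResNet‑18 backbone.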
3. Retrieval‑Matching Algorithm Design
After obtaining frame‑level features, the matching process is divided into three sub‑steps:
3.1 Identify the suspected source video – For a query video with m frames, the feature matrix Q_v (m×512) is L2‑normalized. For each reference video with n frames, its matrix R_v (n×512) is normalized the same way. Cosine similarity is then computed via matrix multiplication, yielding an m×n similarity matrix. For each query frame, the most similar reference frame is selected, giving a similarity vector V_similar (size m) and a corresponding vector of best‑matching reference frame indices V_similarframes. The mean of V_similar is computed for each reference video, and the reference video with the highest mean is taken as the suspected source.
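This sub-step maps directly onto a few NumPy operations. A minimal sketch (the function name `best_match` is illustrative):

```python
import numpy as np

def best_match(query_feats: np.ndarray, ref_feats: np.ndarray):
    """L2-normalize both feature matrices, compute the m x n cosine-similarity
    matrix, and keep the best reference frame for each query frame."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    r = ref_feats / np.linalg.norm(ref_feats, axis=1, keepdims=True)
    sim = q @ r.T                          # m x n cosine similarities
    v_similar = sim.max(axis=1)            # best score per query frame
    v_similarframes = sim.argmax(axis=1)   # index of that reference frame
    return v_similar.mean(), v_similar, v_similarframes
```

Running `best_match` against every reference video and keeping the one with the highest mean score yields the suspected source.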
3.2 Locate the plagiarized segment in the query video – The maximum value of V_similar defines a threshold: threshold = max(V_similar) - 0.1. Scanning from the start, the first position at which the next K consecutive similarity scores all exceed the threshold is taken as the segment start, where K = max(int((V_similar > threshold).sum() * 0.04), 1). A symmetric scan from the end determines the segment end.
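The thresholded scan can be sketched in pure Python. This is a simplified illustration of the rule described above (the name `locate_segment` is hypothetical, and no fallback is shown for the case where no run of K frames qualifies):

```python
def locate_segment(v_similar):
    """Return (start, end) query-frame indices of the suspected segment:
    the first/last positions with K consecutive scores above the threshold."""
    threshold = max(v_similar) - 0.1
    above = [s > threshold for s in v_similar]
    k = max(int(sum(above) * 0.04), 1)     # K: 4% of above-threshold frames
    n = len(v_similar)
    start = next(i for i in range(n - k + 1) if all(above[i:i + k]))
    end = next(i for i in range(n - 1, k - 2, -1) if all(above[i - k + 1:i + 1]))
    return start, end
```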
3.3 Locate the corresponding segment in the reference video – Using the start frame of the query segment, the algorithm finds the matching reference frame positions from V_similarframes . By averaging the offsets of several consecutive matched frames (after removing outliers), the start position in the reference video is estimated; the end position is derived analogously.
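The offset-averaging idea can be sketched as follows. This is an assumption-laden illustration: the window size, the drop-min-and-max outlier filter, and the name `map_to_reference` are all choices made for this sketch, not details from the team's writeup:

```python
def map_to_reference(q_start, q_end, v_similarframes, window=5):
    """Estimate the matching segment in the reference video by averaging
    the offsets (ref index - query index) over a few consecutive matches
    near the query segment start, dropping extreme offsets as outliers."""
    idxs = range(q_start, min(q_start + window, q_end + 1))
    offsets = sorted(v_similarframes[i] - i for i in idxs)
    if len(offsets) > 2:                 # crude outlier removal
        offsets = offsets[1:-1]          # drop the min and max offsets
    offset = round(sum(offsets) / len(offsets))
    return q_start + offset, q_end + offset
```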
Challenges and Reflections – The team faced severe hardware constraints (only one 1080 Ti GPU) while processing hundreds of gigabytes of video data. Their mitigation strategies included lowering the sampling rate and avoiding any training phase. They concluded that algorithmic ingenuity, hardware resources, and domain experience are all critical for success in video copyright detection.
Takeaways – Emphasize algorithm design over raw computational power, validate on local datasets, stay adaptable to unexpected changes, and continuously learn from more experienced teams.
iQIYI Technical Product Team