
Design and Optimization of Bilibili's Large‑Scale Video Duplicate Detection System

This article describes the design, algorithmic improvements, and engineering performance optimizations of Bilibili's massive video duplicate detection (collision) system, covering challenges of low‑edit‑degree reposts, two‑stage retrieval, self‑supervised feature extraction, GPU‑accelerated preprocessing, and the resulting gains in accuracy and throughput.

High Availability Architecture

Background: Bilibili receives a high volume of lightly edited duplicate video uploads, which increase moderation workload and degrade the user experience. A large-scale video retrieval system (the "collision system") is needed to detect such duplicates by comparing each new upload against the entire historical video library.

Challenges: The system must achieve high precision and recall while sampling 720p video at one frame per second and completing detection within 10 seconds. Key difficulties include the lack of pre-trained features that represent editing degree, low resolution causing loss of salient content, and the need for a two-stage pipeline to search billions of vectors efficiently.

Overall Architecture: The collision system consists of four subsystems – the main detection pipeline, a timeout fallback pipeline, downstream services (e.g., copyright), and a filtering module. The main pipeline performs video preprocessing, feature extraction, coarse-grained candidate retrieval, and fine-grained segment matching.
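The four stages of the main pipeline can be sketched as a chain of functions. This is an illustrative skeleton only – the stage bodies are stubs and the names (`preprocess`, `extract_features`, `coarse_retrieve`, `fine_match`, `Candidate`) are our own, not Bilibili's actual API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    video_id: str
    score: float

def preprocess(frames: List[list]) -> List[list]:
    # Strip black borders, resize, normalize (stubbed: pass frames through).
    return frames

def extract_features(frames: List[list]) -> List[list]:
    # Per-frame embeddings from the feature extractor (stubbed as a mean).
    return [[sum(f) / len(f)] for f in frames]

def coarse_retrieve(embeddings: List[list], top_k: int = 3) -> List[Candidate]:
    # ANN search over the historical library (stubbed with fixed scores).
    return [Candidate(f"video_{i}", 1.0 - 0.1 * i) for i in range(top_k)]

def fine_match(embeddings: List[list],
               candidates: List[Candidate]) -> List[Candidate]:
    # Segment-level alignment; keep candidates above a score threshold.
    return [c for c in candidates if c.score >= 0.85]

def detect_duplicates(frames: List[list]) -> List[Candidate]:
    frames = preprocess(frames)
    emb = extract_features(frames)
    return fine_match(emb, coarse_retrieve(emb))
```

The fallback pipeline described above would wrap `detect_duplicates` with a deadline, returning a partial or empty result on timeout.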

Algorithm Optimizations: A self-supervised training pipeline builds an embedding extractor (ResNet-50) that captures editing-degree similarity. Image preprocessing removes black borders and isolates the core content using edge detection. Training uses dynamic negative-sample queues with a contrastive loss; additional techniques such as data augmentation, ViT-teacher distillation, and 8-bit quantization further improve accuracy and inference speed.
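A contrastive loss over a dynamic negative queue typically takes the InfoNCE form: pull a frame toward its augmented positive, push it away from a FIFO queue of negatives from earlier batches. The sketch below is a minimal pure-Python version under our own assumptions (L2-normalized vectors, a `deque` standing in for the negative queue); the article does not specify Bilibili's exact formulation:

```python
import math
from collections import deque

def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query against its positive and a list of
    negatives. Vectors are assumed L2-normalized, so a dot product
    equals cosine similarity."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # Logit 0 is the positive pair; the rest come from the negative queue.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

# Dynamic negative queue: oldest embeddings are evicted as new ones arrive.
negative_queue = deque([[0.0, 1.0]], maxlen=4)
```

Low loss means the query sits near its positive and far from everything in the queue; distillation and quantization mentioned above are applied on top of the trained extractor.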

Two-Stage Matching: Coarse retrieval runs approximate nearest-neighbor search with product quantization (PQ32) over more than 10⁹ vectors; fine-grained matching then performs segment-level alignment using Hough-transform-based scoring, longest-match extraction, and non-maximum suppression to produce the final duplicate decisions.
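The Hough-transform alignment can be pictured as voting: each matched frame pair (query time, reference time) votes for the time offset between the two videos, and the offset with the most votes is the dominant diagonal, i.e. the best temporal alignment. A minimal sketch, with our own illustrative thresholds and integer timestamps (the article does not give Bilibili's parameters):

```python
from collections import Counter

def hough_align(frame_matches, min_votes=3):
    """frame_matches: list of (query_t, ref_t) matched frame pairs.
    Returns (offset, (segment_start, segment_end)) in query time,
    or None if no offset collects enough votes."""
    if not frame_matches:
        return None
    # Each match votes for one offset; duplicate detections accumulate.
    votes = Counter(ref_t - query_t for query_t, ref_t in frame_matches)
    offset, count = votes.most_common(1)[0]
    if count < min_votes:
        return None
    # The matched segment at that offset, spanning first to last query frame.
    ts = sorted(q for q, r in frame_matches if r - q == offset)
    return offset, (ts[0], ts[-1])
```

In a full system, longest-match extraction would split the winning offset's frames into contiguous runs, and non-maximum suppression would drop overlapping lower-scoring segments.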

Engineering Performance Optimizations: Model inference is accelerated more than 5× using the in-house InferX framework on NVIDIA GPUs. A custom GPU video decoder built on the NVIDIA Video Codec SDK streams decoded frames directly into CUDA tensors, eliminating CPU-GPU copies. Image preprocessing, black-border removal, and audio feature extraction (log filter bank, MFCC) are all implemented on the GPU, yielding a 3× end-to-end speedup. Vector search uses Faiss with index sharding, product quantization, and optional binary hashing to reduce memory and compute.
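The memory saving from product quantization comes from replacing each float sub-vector with a one-byte centroid index. The toy sketch below shows the encode/decode mechanics with hand-picked centroids; real systems such as Faiss's `IndexIVFPQ` learn the codebooks with k-means, and the helper names here are ours:

```python
def pq_encode(vec, codebooks):
    """Encode vec as one centroid index per sub-vector.
    codebooks[m] is the list of centroids for sub-vector m."""
    m = len(codebooks)
    sub_len = len(vec) // m
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * sub_len:(i + 1) * sub_len]
        # Nearest centroid by squared L2 distance.
        codes.append(min(
            range(len(book)),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, book[c])),
        ))
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector from its centroid indices."""
    out = []
    for code, book in zip(codes, codebooks):
        out.extend(book[code])
    return out

# Two sub-vectors, two centroids each (illustrative, not learned).
books = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
```

With PQ32 as described above, a 32-byte code stands in for each full embedding, which is what makes billion-scale in-memory search feasible.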

Results: Compared with the 2020 baseline, the system increases duplicate-detection volume 7.5×, recall 3.75×, and accuracy 2.2×, with model precision around 88%. The human-review miss rate dropped from 65 to 5 videos per day. The system now supports multiple Bilibili services, including safety review, copyright automation, and recommendation deduplication.

Tags: deep learning, vector search, feature extraction, video deduplication, large-scale retrieval, Bilibili
Written by

High Availability Architecture

Official account for High Availability Architecture.
