Design and Optimization of Bilibili's Large‑Scale Video Duplicate Detection System
This article describes the design, algorithmic improvements, and engineering performance optimizations of Bilibili's massive video duplicate detection (collision) system, covering challenges of low‑edit‑degree reposts, two‑stage retrieval, self‑supervised feature extraction, GPU‑accelerated preprocessing, and the resulting gains in accuracy and throughput.
Background: Bilibili faces a high volume of lightly edited duplicate video uploads that increase moderation workload and degrade the user experience. A large-scale video retrieval system (the "collision system") is needed to detect such duplicates by comparing each new upload against the entire historical video library.
Challenges: The system must achieve high precision and recall while sampling 720p video at one frame per second and returning a decision within 10 seconds. Key difficulties include the lack of pre-trained features that capture editing degree, low resolution causing loss of salient content, and the need for a two-stage pipeline to search billions of vectors efficiently.
Overall Architecture: The collision system consists of four subsystems: the main detection pipeline, a timeout fallback pipeline, downstream services (e.g., copyright), and a filtering module. The main pipeline performs video preprocessing, feature extraction, coarse-grained candidate retrieval, and fine-grained segment matching.
Algorithm Optimizations: A self-supervised training pipeline builds an embedding extractor (ResNet-50) that captures editing-degree similarity. Image preprocessing removes black borders and isolates the core content using edge detection. Training uses dynamic negative-sample queues and a contrastive loss; additional tricks such as data augmentation, ViT-teacher distillation, and 8-bit quantization improve accuracy and inference speed.
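The article does not publish the training code, but the combination of a contrastive loss with a dynamic negative-sample queue is the MoCo-style InfoNCE recipe. The sketch below (pure NumPy, with illustrative function names and a toy queue size) shows the core idea under those assumptions: score a query embedding against one positive (an edited view of the same video) and a queue of negatives, then apply softmax cross-entropy with the positive as the target.

```python
import numpy as np

def info_nce_loss(query, positive, neg_queue, temperature=0.07):
    """InfoNCE loss for one query embedding. The positive is the embedding
    of an augmented/edited view of the same content; negatives come from a
    dynamic queue of embeddings of other videos. All inputs L2-normalized."""
    l_pos = query @ positive                    # scalar similarity to positive
    l_neg = neg_queue @ query                   # (K,) similarities to negatives
    logits = np.concatenate([[l_pos], l_neg]) / temperature
    logits -= logits.max()                      # numerical stability
    # cross-entropy with the positive at index 0
    return -logits[0] + np.log(np.exp(logits).sum())

def enqueue(neg_queue, new_keys, max_size):
    """Dynamic negative queue: append the newest batch of embeddings and
    drop the oldest entries beyond max_size."""
    q = np.concatenate([neg_queue, new_keys], axis=0)
    return q[-max_size:]

# toy usage with random unit vectors (128-d embeddings, 4096 negatives)
rng = np.random.default_rng(0)
def unit(v): return v / np.linalg.norm(v, axis=-1, keepdims=True)
q = unit(rng.normal(size=128))
pos = unit(q + 0.05 * rng.normal(size=128))     # near-duplicate "edited" view
queue = unit(rng.normal(size=(4096, 128)))
aligned_loss = info_nce_loss(q, pos, queue)     # small: positive dominates
random_loss = info_nce_loss(q, unit(rng.normal(size=128)), queue)
```

A true near-duplicate positive drives the loss toward zero, while an unrelated "positive" yields a large loss, which is exactly the gradient signal that teaches the encoder editing-degree similarity.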
Two-Stage Matching: Coarse retrieval uses approximate nearest-neighbor search with product quantization (PQ32) over more than 10⁹ vectors, followed by fine-grained segment-level alignment using Hough-transform-based scoring, longest-match extraction, and non-maximum suppression to produce the final duplicate decisions.
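The Hough-transform scoring in the fine-grained stage can be read as a one-dimensional voting scheme: each matched frame pair (t_query, t_ref) votes for the temporal offset t_ref − t_query, the dominant offset identifies the aligned copy, and the longest run of consecutive inliers at that offset gives the duplicated segment. A minimal sketch under those assumptions (function names and the max_gap tolerance are illustrative, not from the article):

```python
import numpy as np
from collections import Counter

def hough_align(matches):
    """Given matched frame pairs (t_query, t_ref) from coarse retrieval,
    accumulate votes on the offset t_ref - t_query (a 1-D Hough accumulator)
    and return the winning offset plus its supporting query timestamps."""
    votes = Counter(t_ref - t_q for t_q, t_ref in matches)
    offset, _ = votes.most_common(1)[0]
    inliers = sorted(t_q for t_q, t_ref in matches if t_ref - t_q == offset)
    return offset, inliers

def longest_segment(inliers, max_gap=1):
    """Longest-match extraction: the longest run of query timestamps whose
    consecutive gaps are <= max_gap, i.e. the longest contiguously matched
    segment at the winning offset."""
    best = cur = [inliers[0]]
    for t in inliers[1:]:
        cur = cur + [t] if t - cur[-1] <= max_gap else [t]
        if len(cur) > len(best):
            best = cur
    return best[0], best[-1]

# toy example: query frames 10..19 match reference frames 110..119
# (offset 100), plus two spurious matches at other offsets
matches = [(t, t + 100) for t in range(10, 20)] + [(3, 57), (8, 40)]
offset, inliers = hough_align(matches)
start, end = longest_segment(inliers)   # bounds of the duplicated segment
```

In the full pipeline, multiple candidate segments found this way would then be pruned with non-maximum suppression before the final decision.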
Engineering Performance Optimizations: Model inference is accelerated more than 5× using the in-house InferX framework on NVIDIA GPUs. A custom GPU video decoder (NvCodec SDK) streams frames directly to CUDA tensors, eliminating CPU-GPU copies. Image preprocessing, black-border removal, and audio feature extraction (Log-FilterBank, MFCC) are all GPU-implemented, achieving a 3× end-to-end speedup. Vector search leverages Faiss with sharding, PQ, and optional binary hashing to reduce memory and compute.
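The memory savings of PQ come from asymmetric distance computation (ADC), the mechanism behind Faiss's PQ indexes: each database vector is stored as M one-byte codes, and a query is scored by summing per-subspace lookup-table entries instead of touching the original floats. The NumPy sketch below illustrates the idea only; the codebooks here are random for brevity, whereas Faiss trains them with k-means, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 64, 8, 16          # vector dim, subspaces, centroids per subspace
ds = D // M                  # sub-vector dimension
codebooks = rng.normal(size=(M, K, ds))   # illustrative; Faiss trains via k-means

def pq_encode(x):
    """Compress a vector to M one-byte codes: per subspace, nearest centroid."""
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m * ds:(m + 1) * ds]
        codes[m] = np.argmin(((codebooks[m] - sub) ** 2).sum(axis=1))
    return codes

def adc_tables(q):
    """Per-subspace lookup tables of squared distances query -> centroids."""
    return np.stack([((codebooks[m] - q[m * ds:(m + 1) * ds]) ** 2).sum(axis=1)
                     for m in range(M)])          # shape (M, K)

def adc_distance(tables, codes):
    """Approximate squared distance: one table lookup per subspace, summed."""
    return tables[np.arange(M), codes].sum()

# sanity check: ADC distance equals the exact squared distance between the
# query and the vector reconstructed from its codes
x = rng.normal(size=D)
q = rng.normal(size=D)
codes = pq_encode(x)
recon = np.concatenate([codebooks[m][codes[m]] for m in range(M)])
tables = adc_tables(q)
approx = adc_distance(tables, codes)
exact = ((q - recon) ** 2).sum()
```

With M = 32 one-byte codes per vector (the PQ32 configuration mentioned above), each database entry shrinks from thousands of bytes of floats to 32 bytes, which is what makes a >10⁹-vector index tractable.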
Results: Compared with the 2020 baseline, the system increases detected duplicate volume 7.5×, recall 3.75×, and accuracy 2.2×, with model precision around 88%. Daily misses caught in human review dropped from 65 to 5. The system now supports multiple Bilibili services, including safety review, copyright automation, and recommendation deduplication.
Published on the High Availability Architecture official account.