BILIVQA: Bilibili's No-Reference Video Quality Assessment System
BILIVQA is Bilibili’s deep‑learning, no‑reference video quality assessment system that trains on a proprietary 5,000‑video UGC dataset, extracts spatial and temporal features via MobileNet‑V2 and X3D, uses mixed‑dataset regression for strong generalization, and deploys a GPU‑optimized TensorRT pipeline with percentile‑based scoring for reliable quality monitoring and downstream applications.
Video quality assessment (VQA) is crucial for ensuring good user experience on video platforms, as it evaluates perceptual quality of videos across production, transcoding, and consumption stages.
The article distinguishes full‑reference VQA metrics (e.g., PSNR, SSIM, VMAF), which require a pristine reference video, from no‑reference VQA, which is essential when the original is unavailable. Traditional no‑reference methods rely on handcrafted features and SVM regressors, while deep‑learning approaches learn features directly from data.
Bilibili developed BILIVQA, a deep‑learning no‑reference VQA model tailored to its diverse user‑generated content (UGC), which spans many genres, distortion types, and formats. Public datasets such as LIVE‑VQC, KoNViD‑1k, and LSVQ suffer from distribution mismatch with Bilibili's content, prompting Bilibili to build its own UGC dataset of about 5,000 videos annotated with mean opinion scores (MOS).
The model samples each video into clips: one spatial key frame per clip and 32 consecutive frames for temporal information. Spatial features are extracted by a MobileNet‑V2 pretrained on ImageNet, temporal features by an X3D network pretrained on Kinetics‑400. The two feature vectors are concatenated and fed into a prediction network with pooling and regression layers to output a clip score; the final video score is the average of the clip scores.
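The per-clip scoring and averaging described above can be sketched as follows. This is a minimal illustration with NumPy stand-ins, not Bilibili's implementation: the extractor and regressor callables, feature sizes, and the `score_video` helper are all hypothetical.

```python
import numpy as np

def score_video(clips, spatial_extractor, temporal_extractor, regressor):
    """Score a video as the mean of per-clip scores (hypothetical API).

    Each clip contributes one key frame (spatial branch) and a stack of
    32 consecutive frames (temporal branch); the two feature vectors are
    concatenated before regression, as described in the article.
    """
    clip_scores = []
    for key_frame, frame_stack in clips:
        spatial = spatial_extractor(key_frame)      # stand-in for MobileNet-V2
        temporal = temporal_extractor(frame_stack)  # stand-in for X3D
        features = np.concatenate([spatial, temporal])
        clip_scores.append(regressor(features))
    return float(np.mean(clip_scores))

# Toy stand-ins: random "frames", random "features", a linear regressor.
rng = np.random.default_rng(0)
clips = [(rng.random((224, 224, 3)), rng.random((32, 224, 224, 3)))
         for _ in range(3)]
w = rng.random(1280 + 2048)  # assumed feature widths, illustrative only
score = score_video(
    clips,
    spatial_extractor=lambda f: rng.random(1280),
    temporal_extractor=lambda f: rng.random(2048),
    regressor=lambda x: float(x @ w / x.size),
)
```

The averaging step is what makes the model length-agnostic: any number of clips reduces to a single scalar score.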
Training uses a mixed‑dataset strategy: batches contain equal numbers of LSVQ and Bilibili videos, sharing feature extractors but employing separate regression heads. This improves generalization on Bilibili’s test set.
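The mixed-dataset batching idea can be sketched as below. The article only states that batches mix LSVQ and Bilibili samples in equal proportion and that each dataset gets its own regression head; the `mixed_batch` helper and head names are assumptions for illustration.

```python
import random

def mixed_batch(lsvq_items, bili_items, batch_size=8, seed=0):
    """Build a batch with equal numbers of LSVQ and Bilibili samples
    (hypothetical helper sketching the mixed-dataset strategy)."""
    rng = random.Random(seed)
    half = batch_size // 2
    lsvq = rng.sample(lsvq_items, half)
    bili = rng.sample(bili_items, half)
    # Tag each sample so the trainer can route it to the matching
    # dataset-specific regression head while sharing the feature
    # extractors across both datasets.
    return ([(x, "lsvq_head") for x in lsvq] +
            [(x, "bili_head") for x in bili])

batch = mixed_batch(list(range(100)), list(range(100, 200)))
```

Separate heads let each dataset keep its own score scale while the shared backbone learns distortion features common to both, which is where the generalization gain comes from.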
Performance is measured with PLCC (linear correlation) and SROCC (rank correlation). BILIVQA shows high accuracy on Bilibili’s own dataset and strong generalization on public benchmarks.
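Both metrics are standard and easy to compute. A self-contained sketch (no tie handling in the rank step, which standard Spearman implementations do account for):

```python
import numpy as np

def plcc(pred, mos):
    """Pearson linear correlation between predictions and MOS."""
    p, m = np.asarray(pred, float), np.asarray(mos, float)
    return float(np.corrcoef(p, m)[0, 1])

def srocc(pred, mos):
    """Spearman rank-order correlation: Pearson correlation of the
    ranks. (Simple argsort ranking; ties are not averaged here.)"""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return plcc(rank(np.asarray(pred)), rank(np.asarray(mos)))

pred = [2.1, 3.4, 1.2, 4.8, 3.9]
mos  = [2.0, 3.5, 1.0, 5.0, 4.0]
print(round(plcc(pred, mos), 3), round(srocc(pred, mos), 3))
```

PLCC rewards predictions that track MOS linearly; SROCC only cares about ordering, so a model can score perfectly on SROCC while being miscalibrated in absolute terms (as in the toy data above, where the ranking is exact).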
For efficient deployment, Bilibili implemented a fully GPU‑resident pipeline: hardware video decoding, frame extraction, CUDA‑based resizing, and inference with a TensorRT‑optimized model, substantially raising GPU utilization by avoiding CPU round trips.
To enable stable long‑term monitoring, Bilibili introduced the "BILIVQA quality scale" (质量量纲) mapping mechanism. A large unbiased benchmark set of 150,000 popular Bilibili videos is used to convert raw model scores into percentile‑based stable scores, ensuring that monitoring panels reflect true quality trends even across model updates.
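The percentile mapping can be illustrated with a few lines of NumPy. This is a hypothetical implementation of the idea: a raw score becomes the percentage of benchmark videos it outranks, so the stable score's meaning survives model retraining as long as the benchmark set is re-scored with each new model.

```python
import numpy as np

def stable_score(raw, benchmark_scores):
    """Map a raw model score to its percentile (0-100) within a fixed
    benchmark distribution (hypothetical sketch of the mapping)."""
    bench = np.sort(np.asarray(benchmark_scores, float))
    # Fraction of benchmark videos scoring at or below `raw`.
    return float(np.searchsorted(bench, raw, side="right")) / len(bench) * 100

# Stand-in for the 150,000-video benchmark distribution.
bench = np.linspace(0, 5, 1000)
print(stable_score(2.5, bench))
```

Because the output depends only on rank within the benchmark, a model update that shifts every raw score by a constant leaves the stable scores, and hence the monitoring panels, unchanged.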
Subjective experiments linked score intervals to perceived quality (poor, fair, excellent), providing actionable thresholds for low‑quality alerts, recommendation weighting, and quality‑guided processing.
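Downstream consumers then only need a thresholding step. The cutoffs below are illustrative placeholders: the article reports calibrated intervals from subjective experiments but does not publish the exact values.

```python
def quality_label(stable_score, thresholds=(40.0, 75.0)):
    """Map a percentile-based stable score to a coarse quality label.
    The thresholds are assumed for illustration, not Bilibili's actual
    calibrated cutoffs."""
    low, high = thresholds
    if stable_score < low:
        return "poor"       # e.g. candidate for a low-quality alert
    if stable_score < high:
        return "fair"
    return "excellent"      # e.g. eligible for recommendation boosting

labels = [quality_label(s) for s in (10.0, 50.0, 90.0)]
```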
The work concludes with plans to enlarge the UGC dataset, refine sampling strategies, and explore VQA applications in recommendation, encoding control, and video processing pipelines.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.