
Quality‑Controlled Scene‑Adaptive Video Transcoding System at Bilibili

Bilibili’s quality‑controlled scene‑adaptive transcoding system automatically splits videos into shot‑level segments, predicts optimal encoding parameters with a deep‑learning model, and applies two‑pass VMAF‑targeted encoding with ROI‑aware bitrate allocation, achieving stable visual quality, 99% quality‑control accuracy, and a roughly 15% bitrate reduction.

Bilibili Tech

Bilibili receives hundreds of thousands of video uploads daily. Popular videos attract most viewers and consume a large portion of bandwidth. To maintain visual quality while reducing bitrate, Bilibili re‑encodes hot videos with higher‑complexity algorithms, removing redundant data and achieving better compression.

The company has developed a quality‑controlled scene‑adaptive transcoding system. The system automatically splits each video into single‑shot scene segments, predicts the optimal encoding parameters for each segment, and then merges the encoded streams into a final compressed file.
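The split → predict → encode → merge flow can be sketched as a small pipeline. This is a minimal illustration, not the production system: all four callables (`split_scenes`, `predict_rf`, `encode`, `merge`) are hypothetical stand‑ins for the real components described in the sections below.

```python
def transcode(video, split_scenes, predict_rf, encode, merge):
    """Split the video into single-shot scenes, encode each scene with its
    own predicted rate factor, and merge the per-scene streams.

    All four callables are hypothetical stand-ins for the real components.
    """
    streams = []
    for scene in split_scenes(video):
        rf = predict_rf(scene)              # content-aware parameter per scene
        streams.append(encode(scene, rf))   # encode the scene independently
    return merge(streams)                   # concatenate into the final file
```

Because each scene is encoded independently, the per‑scene encodes can also run in parallel before the merge step.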

Key results: the system controls the quality of each scene with 99% accuracy, yielding more stable visual quality, noticeably better picture quality, and a 15% reduction in bitrate, avoiding unnecessary bandwidth waste.

1. Segment Encoding + Quality Control – Instead of using a single set of parameters for the whole video, the system performs scene‑adaptive segment encoding, applying content‑aware parameters to each segment and explicitly targeting a quality goal rather than a fixed bitrate.

2. Scene‑Change Detection – The method compares the rate‑distortion cost of encoding a frame as an I‑frame versus a P‑frame. Within a scene, a P‑frame costs far less than an I‑frame; when the P‑frame cost approaches the I‑frame cost, temporal prediction has broken down, indicating a scene change. A high‑speed pre‑encoding algorithm (>300 fps) computes these costs so changes can be detected accurately.
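A toy version of this comparison, assuming per‑frame I‑frame and P‑frame rate‑distortion costs are already available from the fast pre‑encode (the article does not specify the exact cost metric or threshold, so both are illustrative here):

```python
def detect_scene_changes(i_costs, p_costs, ratio_threshold=0.6):
    """Flag frames where coding as a P-frame is nearly as costly as an I-frame.

    i_costs / p_costs: per-frame rate-distortion costs from a fast pre-encode
    (hypothetical inputs).  A P-frame normally costs far less than an I-frame;
    when the P-cost rises toward the I-cost, temporal prediction is failing,
    which signals a scene change.
    """
    changes = []
    for idx, (i_cost, p_cost) in enumerate(zip(i_costs, p_costs)):
        if idx == 0:
            continue  # the first frame is always an I-frame
        if p_cost >= ratio_threshold * i_cost:
            changes.append(idx)
    return changes

# Frames 0-4 belong to one scene; frame 5 starts a new one.
i = [100, 100, 100, 100, 100, 100]
p = [0, 20, 25, 22, 18, 95]
detect_scene_changes(i, p)  # → [5]
```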

3. Quality‑Controlled Encoding – The system uses VMAF (Video Multimethod Assessment Fusion) as the objective quality metric because it correlates well with human perception. The goal is to keep the encoded video’s VMAF close to a predefined target.

Directly adjusting encoding parameters to meet a VMAF target is computationally expensive, as it normally requires repeated encoding and VMAF calculation. To overcome this, Bilibili treats the parameter search as a prediction problem and employs deep learning to predict the optimal rate‑factor (RF) before encoding.

Feature Extraction – Approximately 600 features are extracted from each video, including:

Spatial texture features based on Gray‑Level Co‑occurrence Matrix (GLCM).

Temporal correlation features using Normalized Correlation Coefficient (NCC).

Codec‑level statistics from the ultra‑fast pre‑encoding pass (e.g., average optimal block size, motion‑vector statistics).
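Two of these feature families can be sketched compactly: a single GLCM contrast statistic for spatial texture and an NCC score for temporal correlation. This is a minimal stand‑in for the real ~600‑feature extractor (which uses many GLCM statistics at multiple offsets plus pre‑encode codec stats); the quantization level and horizontal‑only offset here are simplifying assumptions.

```python
import numpy as np

def glcm_contrast(gray, levels=8):
    """Contrast of a horizontal-offset gray-level co-occurrence matrix."""
    q = (gray.astype(np.float64) / 256 * levels).astype(int)   # quantize
    q = np.clip(q, 0, levels - 1)
    glcm = np.zeros((levels, levels))
    left, right = q[:, :-1], q[:, 1:]          # horizontal neighbor pairs
    np.add.at(glcm, (left, right), 1)          # accumulate co-occurrences
    glcm /= glcm.sum()                         # normalize to probabilities
    i, j = np.indices(glcm.shape)
    return float(((i - j) ** 2 * glcm).sum())  # textured -> high contrast

def ncc(frame_a, frame_b):
    """Normalized correlation coefficient between two frames (temporal feature)."""
    a = frame_a - frame_a.mean()
    b = frame_b - frame_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 1.0
```

A flat frame yields zero contrast, while a frame compared with itself yields an NCC of 1.0; in practice NCC between consecutive frames drops as motion increases.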

Deep Neural Network – The prediction model consists of batch normalization, an attention module for automatic feature weighting, residual blocks for deep non‑linear representation, and a final fully‑connected layer that outputs the predicted RF.

The network is trained on about 20 million UGC videos (≈1 million single‑shot scenes) to minimize the mean‑square error between predicted and true RF values.
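The forward pass of such a network can be illustrated in plain NumPy. This is a shape‑level sketch only: the weights below are random, the attention module is reduced to a sigmoid feature gate, batch normalization is reduced to inference‑time standardization, and the layer widths are assumptions, since the article does not publish the architecture’s dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 600  # article: ~600 handcrafted features per scene

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical "trained" weights -- random here, for shape illustration only.
W_att = rng.normal(0, 0.05, (N_FEATURES, N_FEATURES))
W1 = rng.normal(0, 0.05, (N_FEATURES, N_FEATURES))
W2 = rng.normal(0, 0.05, (N_FEATURES, N_FEATURES))
w_out = rng.normal(0, 0.05, N_FEATURES)

def predict_rf(features, mean, std):
    x = (features - mean) / std            # batch-norm-style standardization
    gate = sigmoid(x @ W_att)              # attention: per-feature weighting
    x = x * gate                           # reweight the features
    x = x + relu(relu(x @ W1) @ W2)        # one residual block (skip + MLP)
    return float(x @ w_out)                # fully connected -> predicted RF
```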

4. Two‑Pass RF Prediction Framework – After the first RF prediction and encoding, VMAF is computed. If the VMAF is within the target range, the result is kept; otherwise, the first‑pass RF and VMAF are fed back as additional features for a second prediction and re‑encoding. This approach achieves 98.8% accuracy (±1 VMAF error) with no quality‑loss cases, requiring on average only 1.55 encodings per segment.
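The control loop above can be sketched as follows, with `predict_rf`, `encode`, and `measure_vmaf` as hypothetical callables standing in for the DNN predictor, the encoder, and the VMAF computation:

```python
def two_pass_encode(segment, target_vmaf, tol, predict_rf, encode, measure_vmaf):
    """Two-pass quality-control loop: predict, encode, verify, refine once."""
    rf = predict_rf(segment)                      # first-pass RF prediction
    stream = encode(segment, rf)
    vmaf = measure_vmaf(segment, stream)
    if abs(vmaf - target_vmaf) <= tol:
        return stream, vmaf                       # within the target band: done
    # Otherwise feed the first-pass (RF, VMAF) back as extra features.
    rf = predict_rf(segment, first_pass=(rf, vmaf))
    stream = encode(segment, rf)
    return stream, measure_vmaf(segment, stream)
```

Since the first pass already lands in the target band for most segments, the average cost stays near one encode per segment (1.55 in the article's measurements).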

5. ROI Encoding for Additional Protection – Because VMAF may underestimate quality loss in small, salient regions (e.g., faces), the system adds a Region‑Of‑Interest (ROI) encoding step that detects human and face areas and allocates extra bitrate to them, preserving subjective quality without significantly increasing overall bitrate.
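One common way to realize ROI protection is a per‑macroblock QP‑offset map: blocks inside detected face/human rectangles get a negative QP offset so the encoder spends more bits there. The function below is a sketch under that assumption; the block size, offset value, and `(x, y, w, h)` box format are illustrative, and the detector itself is out of scope.

```python
import numpy as np

def roi_qp_offsets(width, height, face_boxes, block=16, boost=-3):
    """Build a per-macroblock QP-offset map from ROI rectangles.

    face_boxes: list of (x, y, w, h) pixel rectangles from a (hypothetical)
    face/human detector; negative offsets lower QP inside ROIs, allocating
    extra bitrate to them.
    """
    cols = (width + block - 1) // block
    rows = (height + block - 1) // block
    offsets = np.zeros((rows, cols), dtype=int)
    for x, y, w, h in face_boxes:
        c0, r0 = x // block, y // block
        c1 = (x + w - 1) // block + 1           # exclusive block range
        r1 = (y + h - 1) // block + 1
        offsets[r0:r1, c0:c1] = boost
    return offsets
```

Because ROIs are typically a small fraction of the frame, the extra bits they receive barely move the overall bitrate.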

Conclusion – The quality‑controlled scene‑adaptive transcoding system enables Bilibili to deliver stable, high‑quality video for popular content while reducing bitrate waste by about 15%. It combines accurate scene‑change detection, deep‑learning‑based RF prediction, a two‑pass refinement loop, and ROI‑aware bitrate allocation to meet both objective (VMAF) and subjective quality goals.

Tags: deep learning, quality control, video transcoding, VMAF, parameter prediction, ROI encoding, scene-adaptive encoding
Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
