How Bilibili Cuts Video Bandwidth: Theory and Practice of Transcoding Optimization
This article analyzes the fundamental goals of video transcoding, presents an information‑theoretic framework for bitrate reduction, compares traditional and deep‑learning codecs, and shares Bilibili's practical system design, parameter‑decision strategies, and visual‑quality‑aware optimizations that dramatically lower bandwidth consumption.
01 General Transcoding Framework
Transcoding serves three main purposes: improving stream compatibility, enhancing visual quality, and reducing bitrate to save bandwidth. Modern transcoding pipelines are no longer simple decode‑re‑encode loops; they include pre‑analysis, preprocessing, parameter decision, terminal enhancement, and quality assessment modules, forming a large‑scale system that trades additional compute for higher compression ratios.
02 Information‑Theoretic Principles
The core idea treats transcoding as a source‑coding problem: the goal is to represent video information with the fewest bits while preserving the parts that human eyes care about. Video entropy, defined by the joint probability distribution of all pixel values, is typically far lower than the raw pixel count because natural video contains many predictable patterns.
Two key theorems underpin the framework:
Cross‑entropy (the actual bitrate after entropy coding) is always greater than or equal to the true information entropy; better probability predictions lead to lower cross‑entropy.
Conditioning on previously encoded data reduces the required bits: if the decoder already knows related information, the remaining conditional information is smaller, so the bitrate drops.
These theorems explain why traditional codecs invest heavily in motion‑compensated prediction, intra‑prediction, and reference‑frame structures: they aim to provide as much related information as possible before encoding the residual (conditional information).
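Both theorems can be checked numerically. The sketch below (a minimal illustration, not any codec's actual entropy coder) verifies that coding with an imperfect probability model costs at least the true entropy, that a better model costs less, and that conditioning on known side information reduces the remaining entropy:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits: the lower bound on average code length."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Average bits actually spent when the true source is p but the
    entropy coder's probability model is q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return -np.sum(p[mask] * np.log2(q[mask]))

# Theorem 1: cross-entropy >= entropy; better models cost fewer bits.
p = [0.7, 0.2, 0.1]            # true symbol distribution
q_bad = [1/3, 1/3, 1/3]        # uniform model: knows nothing about p
q_good = [0.65, 0.25, 0.10]    # close to p
assert cross_entropy(p, q_bad) >= entropy(p)
assert cross_entropy(p, q_good) >= entropy(p)
assert cross_entropy(p, q_good) < cross_entropy(p, q_bad)

# Theorem 2: conditioning reduces entropy, H(X|Y) <= H(X).
# Here Y (e.g., the previous frame) is correlated with X.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])          # P(X=i, Y=j)
px, py = joint.sum(axis=1), joint.sum(axis=0)
h_x = entropy(px)
h_x_given_y = sum(py[j] * entropy(joint[:, j] / py[j]) for j in range(2))
assert h_x_given_y <= h_x
```

In this toy joint distribution, knowing Y cuts the entropy of X from 1.0 bit to about 0.72 bits, which is exactly the mechanism that makes inter-frame prediction pay off.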
03 Bilibili Transcoding Optimization Practice
Bilibili entered the transcoding arena later than many peers and adopted a systematic approach to catch up. The strategy begins with identifying the underlying optimization principles rather than jumping straight into technology implementation, because different content types (e.g., OGV, occupationally generated video, vs. UGV, user‑generated video) have distinct requirements.
The practical system consists of a "general" transcoding framework, a scene‑level constant‑quality method, and a two‑pass deep‑learning bitrate‑factor predictor. Key components include:
Scene detection using the x264 scenecut algorithm.
Quality measurement primarily with VMAF; the assumption is that equal VMAF scores correspond to comparable perceived quality.
Parameter search (high‑accuracy, high‑cost) versus single‑pass ML‑based prediction (low‑cost, lower accuracy).
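To make the scene-detection step concrete, here is a deliberately simplified detector. Note that x264's real scenecut heuristic compares per-frame intra-coding cost against inter (predicted) cost; the stand-in below uses a much cheaper normalized frame-difference threshold, and the 0.4 threshold is an illustrative choice, not x264's parameter:

```python
import numpy as np

def detect_scene_cuts(frames, threshold=0.4):
    """Flag frame i as a scene cut when it differs strongly from frame i-1.
    Simplification of x264's scenecut (which compares intra vs. inter cost)."""
    cuts = []
    for i in range(1, len(frames)):
        # Normalized mean absolute difference between consecutive frames.
        diff = np.abs(frames[i].astype(float) - frames[i - 1].astype(float)).mean()
        if diff / 255.0 > threshold:
            cuts.append(i)
    return cuts

# Toy clip: six dark frames followed by four bright frames -> one cut at index 6.
clip = [np.full((4, 4), 20, np.uint8)] * 6 + [np.full((4, 4), 220, np.uint8)] * 4
```

Splitting at detected cuts yields the scene-level segments to which the constant-quality parameter decision is then applied.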
The two‑pass approach first predicts encoding parameters with a lightweight neural network, which hits the VMAF target for about 45 % of segments. For the remaining 55 %, a second prediction incorporates the first‑pass result (the parameter tried and the quality it achieved) as an anchor, boosting overall accuracy to 99 % while keeping average encoding cost close to 1.55 passes.
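The decision loop can be sketched as follows. `predict_crf` and `encode_and_measure` are hypothetical stand-ins for Bilibili's internal predictor network and its encoder-plus-VMAF measurement stage; the linear model inside them is invented purely so the example runs, and the "~2 VMAF points per CRF step" slope is a rough rule of thumb, not a published figure:

```python
def predict_crf(features, anchor=None):
    # Pass 1: a lightweight model maps content features to a CRF guess.
    # Pass 2: the first result (CRF tried, VMAF achieved) anchors a correction.
    base = 28.0 - 0.1 * features["complexity"]   # toy stand-in for the network
    if anchor is not None:
        crf0, vmaf0, target = anchor
        base = crf0 - (target - vmaf0) / 2.0     # ~2 VMAF points per CRF step
    return round(base, 1)

def encode_and_measure(features, crf):
    # Toy encoder model: lower CRF -> higher VMAF (stand-in for x264 + VMAF).
    return min(100.0, 110.0 - 2.0 * crf + 0.05 * features["complexity"])

def choose_crf(features, target_vmaf=93.0, tol=1.0):
    """Return (crf, vmaf, passes): stop after pass 1 when the target is hit,
    otherwise re-predict once with the first result as an anchor."""
    crf = predict_crf(features)
    vmaf = encode_and_measure(features, crf)
    if abs(vmaf - target_vmaf) <= tol:
        return crf, vmaf, 1
    crf2 = predict_crf(features, anchor=(crf, vmaf, target_vmaf))
    return crf2, encode_and_measure(features, crf2), 2
```

Because the anchored second guess starts from a measured (parameter, quality) point on the segment's own rate-quality curve, it lands far more reliably than the blind first pass, which is what pushes overall accuracy toward 99 % at an average cost well under two full passes.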
Additional optimizations include visual‑lossless preprocessing that leverages three human‑vision priors: high sensitivity to structural loss (faces, eyes), low sensitivity to high‑frequency texture loss, and the DCT‑based coding principle that more zero coefficients mean lower bitrate. A custom loss function combining L1, SSIM, and DCT‑energy penalties trains a simple CNN to restore structure while suppressing high‑frequency noise, yielding roughly 15 % bitrate savings when placed before any standard encoder.
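A loss of that shape can be sketched in NumPy. This is not Bilibili's actual training objective: the weights are invented, the SSIM term uses a single global window instead of the usual sliding window, and the DCT penalty is applied blockwise to the output so that suppressed high-frequency coefficients translate into bits saved by the downstream encoder:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis, the transform family used by block codecs."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] *= np.sqrt(1 / n)
    m[1:] *= np.sqrt(2 / n)
    return m

def global_ssim(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Single-window SSIM over the whole image (simplified; real SSIM slides a window)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def preprocess_loss(pred, target, w_ssim=1.0, w_dct=0.01, block=8):
    """L1 fidelity + structural (SSIM) term + high-frequency DCT energy penalty."""
    d = dct_matrix(block)
    l1 = np.abs(pred - target).mean()
    ssim_term = 1.0 - global_ssim(pred, target)
    # Penalize high-frequency DCT energy of the output: driving these
    # coefficients toward zero is what lets the encoder spend fewer bits.
    hf = 0.0
    for i in range(0, pred.shape[0] - block + 1, block):
        for j in range(0, pred.shape[1] - block + 1, block):
            coef = d @ pred[i:i + block, j:j + block] @ d.T
            coef[0, 0] = 0.0                     # keep the DC term untouched
            hf += np.sum(coef ** 2)
    return l1 + w_ssim * ssim_term + w_dct * hf
```

Training a small CNN against such a loss pushes it to keep structure (low L1, high SSIM) while flattening imperceptible texture (low AC energy), which is the mechanism behind the reported ~15 % savings.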
For quality assessment, Bilibili deployed a no‑reference VQA model to monitor platform‑wide video quality, but discovered limitations: dataset bias, temporal‑spatial sampling effects, and inherent human rating variance (±0.5 MOS). Consequently, ROI‑based encoding (prioritizing faces and bodies) and continuous refinement of VMAF‑based assumptions remain essential.
Looking forward, the roadmap emphasizes improving the accuracy of prior conditions (e.g., frame‑level quality control, better perceptual metrics) and exploring large‑model video codecs that encode video as probability‑table indices, effectively turning the codec into a prompt‑driven compression system.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
