AI‑Driven Video Quality Enhancement and Low‑Bitrate High‑Resolution Techniques at Bilibili
Bilibili’s Cloud Multimedia team uses AI‑driven pipelines to cut bandwidth costs while delivering low‑bitrate, high‑quality video, employing a QoE‑based decision engine, real‑time 4K super‑resolution for game streams, low‑rank reconstruction for narrow‑band HD, data‑driven HDR LUTs, and exploratory diffusion‑based restoration for legacy content.
The video streaming industry is entering the second half of its competition, facing a sharp trade‑off between user experience and bandwidth cost, especially against the backdrop of an industry downturn. Cheng Chao from Bilibili’s Cloud Multimedia platform shared experience and ideas accumulated during the rapid growth of Bilibili’s video business.
Cheng introduced himself as a member of the algorithm team that empowers Bilibili live streaming with AI‑based cost‑reduction and efficiency‑boosting solutions.
The core mantra of the talk was “cost reduction and efficiency improvement” (降本增效), with the technical goal of achieving “low‑bitrate, high‑quality” (低码高画) video delivery.
The presentation was divided into four parts: (1) the AI‑driven video‑cloud quality‑enhancement pipeline; (2) a concrete 4K real‑time super‑resolution case for game live streaming; (3) AI‑based low‑rank reconstruction for narrow‑band HD pre‑processing; (4) a brief summary and future outlook.
AI Quality‑Enhancement Chain – The legacy video‑cloud pipeline relied on a static quality‑matrix and manual operation, leading to latency and inefficiency. The new pipeline introduces a QoE‑driven decision module that extracts ~40 video features (information density, noise level, temporal analysis, business indicators, etc.) and feeds them into a decision engine. This engine selects the appropriate quality‑enhancement path (e.g., super‑resolution, frame‑interpolation, restoration) based on the feature vector, allowing arbitrary processing chains to be expressed as a compute‑graph, much like building with LEGO blocks.
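The talk described the decision engine conceptually rather than in code, but the idea can be sketched as a registry of enhancement nodes plus rules over the extracted feature vector. All node names, features, and thresholds below are hypothetical illustrations, not Bilibili’s actual engine:

```python
# Minimal sketch of a feature-driven enhancement pipeline. Each enhancement
# step is a node in a compute graph ("LEGO block"), and the decision engine
# picks a chain of nodes based on extracted video features.
from typing import Callable, Dict, List

import numpy as np

Node = Callable[[np.ndarray], np.ndarray]

# Registry of enhancement nodes; bodies are placeholders for real models.
NODES: Dict[str, Node] = {
    "denoise":      lambda frames: frames,  # e.g. a light denoiser
    "super_res":    lambda frames: frames,  # e.g. the 4K SR model
    "frame_interp": lambda frames: frames,  # e.g. 30 -> 60 fps interpolation
    "restore":      lambda frames: frames,  # e.g. artifact removal
}

def decide_chain(features: Dict[str, float]) -> List[str]:
    """Toy QoE-style decision rules over extracted video features."""
    chain: List[str] = []
    if features.get("noise_level", 0.0) > 0.5:
        chain.append("denoise")
    if features.get("resolution", 1080) < 2160:
        chain.append("super_res")
    if features.get("frame_rate", 60) < 60:
        chain.append("frame_interp")
    return chain or ["restore"]

def run_chain(frames: np.ndarray, chain: List[str]) -> np.ndarray:
    for name in chain:
        frames = NODES[name](frames)
    return frames

frames = np.zeros((8, 1080, 1920, 3), dtype=np.uint8)  # dummy clip
print(decide_chain({"noise_level": 0.7, "resolution": 1080, "frame_rate": 30}))
# -> ['denoise', 'super_res', 'frame_interp']
```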
In practice, the new pipeline dramatically shortened the production cycle for a high‑profile promotion (Jay Chou’s new album). By applying AI‑based 4K up‑conversion, the workflow was reduced from 3–4 hours to about one hour, enabling Bilibili to be the only platform offering 4K live streaming for the event.
4K Real‑Time Super‑Resolution for Game Live Streaming – The team analyzed performance constraints (frame rate, hardware resources) and compared frame‑parallel vs. block‑parallel schemes, noting the latency trade‑offs of each. For game streams, they leveraged the repetitive high‑frequency textures of game video, training a GAN with adversarial loss, pixel loss, and a custom edge loss based on Sobel operators to preserve sharp textures while avoiding typical GAN artifacts. The network architecture replaces the popular ESA attention with a lightweight high‑pass channel attention, and employs re‑parameterization to merge skip, 1×1, and 3×3 convolution branches into a single 3×3 kernel at inference time. A PixelUnshuffle operation (the inverse of PixelShuffle) down‑samples feature maps without information loss, reducing compute in the head and tail of the network. The final model contains only ~106 K parameters (well under 1 M) and runs at 75 fps on a single V100 GPU, delivering 4K 60 fps super‑resolution for live game streams.
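The Sobel‑based edge loss is straightforward to reproduce. A minimal sketch, assuming the edge term is an L1 distance between Sobel gradient magnitudes of the super‑resolved output and the ground truth (the exact weighting Bilibili uses was not disclosed):

```python
# Sobel-based edge loss for SR training (illustrative sketch).
import torch
import torch.nn.functional as F

def sobel_edges(x: torch.Tensor) -> torch.Tensor:
    """x: (N, C, H, W) image in [0, 1]; returns per-channel gradient magnitude."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(-1, -2)
    c = x.shape[1]
    # Depthwise convolution: one Sobel filter per channel.
    gx = F.conv2d(x, kx.expand(c, 1, 3, 3), padding=1, groups=c)
    gy = F.conv2d(x, ky.expand(c, 1, 3, 3), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    return F.l1_loss(sobel_edges(sr), sobel_edges(hr))

# The total generator loss would combine pixel, adversarial, and edge terms:
# loss = l1(sr, hr) + lambda_adv * adv_loss + lambda_edge * edge_loss(sr, hr)
```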
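The re‑parameterization step is also mechanical: because convolution is linear, the 3×3 branch, the 1×1 branch (zero‑padded to 3×3), and the identity skip (a 3×3 kernel with a single centered 1) can be summed into one kernel. A minimal RepVGG‑style sketch, not Bilibili’s exact block:

```python
# Structural re-parameterization: fold three branches into one 3x3 conv.
import torch
import torch.nn.functional as F

def merge_branches(w3, b3, w1, b1, channels):
    """w3: (C, C, 3, 3), w1: (C, C, 1, 1); returns one (C, C, 3, 3) kernel."""
    w1_padded = F.pad(w1, [1, 1, 1, 1])      # center the 1x1 kernel in 3x3
    w_id = torch.zeros_like(w3)              # identity skip as a 3x3 kernel
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0
    return w3 + w1_padded + w_id, b3 + b1

C = 8
w3, b3 = torch.randn(C, C, 3, 3), torch.randn(C)
w1, b1 = torch.randn(C, C, 1, 1), torch.randn(C)
w, b = merge_branches(w3, b3, w1, b1, C)

x = torch.randn(1, C, 16, 16)
y_branches = (F.conv2d(x, w3, b3, padding=1)   # 3x3 branch
              + F.conv2d(x, w1, b1)            # 1x1 branch
              + x)                             # identity skip
y_merged = F.conv2d(x, w, b, padding=1)        # single fused conv
assert torch.allclose(y_branches, y_merged, atol=1e-4)
```

At inference time only the fused kernel runs, so the multi-branch training topology costs nothing at deployment.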
Low‑Rank Reconstruction for Narrow‑Band HD Pre‑Processing – Traditional approaches either rely on DCT‑based coding (which discards high‑frequency information uniformly) or aggressive denoising (which also removes useful signal). The proposed method treats the problem as low‑rank matrix reconstruction: the reconstructed image A must have low rank (few frequency components) while the residual E = D − A remains perceptually small. Perceptual constraints are enforced with a combination of pixel loss, LPIPS (a perceptual loss), and FID. A spectral‑entropy loss is introduced in a two‑stage training pipeline: stage 1 produces a coarse reconstruction, and stage 2 learns an attention‑like weighting over the DCT coefficients of each block to minimize the entropy of the resulting spectrum, effectively approximating the low‑rank constraint. This yields a model with only ~3 K parameters that runs at over 300 fps on a T4 GPU, providing substantial bitrate savings (≈16 % overall, >18 % for resolutions of 1080p and above).
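The spectral‑entropy idea can be made concrete. A minimal sketch, assuming “entropy of the spectrum” means Shannon entropy over the normalized energy of blockwise 8×8 DCT coefficients (the talk did not give the exact formulation):

```python
# Blockwise spectral-entropy loss: lower entropy = energy concentrated in
# fewer DCT coefficients, i.e. closer to the low-rank objective.
import math
import torch

def dct_matrix(n: int = 8) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of size (n, n)."""
    k = torch.arange(n).float()
    m = torch.cos(math.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] *= 1.0 / math.sqrt(2)
    return m * math.sqrt(2.0 / n)

def spectral_entropy(img: torch.Tensor, block: int = 8) -> torch.Tensor:
    """img: (N, C, H, W) with H, W divisible by `block`."""
    d = dct_matrix(block).to(img.device)
    # Split into non-overlapping blocks: (N, C, H/b, W/b, b, b)
    blocks = img.unfold(2, block, block).unfold(3, block, block)
    coeffs = d @ blocks @ d.t()                    # 2D DCT per block
    energy = coeffs.pow(2).flatten(-2)             # (..., b*b)
    p = energy / (energy.sum(-1, keepdim=True) + 1e-8)
    return -(p * (p + 1e-8).log()).sum(-1).mean()  # mean Shannon entropy

x = torch.rand(1, 1, 64, 64)
loss = spectral_entropy(x)  # added to pixel/LPIPS terms in stage-2 training
```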
HDR Pipeline – The HDR conversion consists of three steps: (1) color conversion (global exposure and color adjustment on 8‑bit content), (2) global color mapping (mapping BT.709 to 10‑bit BT.2020), and (3) local enhancement (fine‑grained adjustments). The talk highlighted a data‑driven 3D‑LUT approach in which a network predicts per‑clip LUT weights to achieve scene‑aware color grading.
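This mirrors the image‑adaptive 3D‑LUT literature, where a lightweight network predicts weights that blend a few learnable basis LUTs. A minimal sketch of the blend‑and‑apply step, with hypothetical sizes and random stand‑ins for the learned components:

```python
# Scene-adaptive 3D LUT: blend basis LUTs with predicted weights, then apply
# the result by trilinear lookup via grid_sample.
import torch
import torch.nn.functional as F

N_LUTS, SIZE = 3, 33                                   # 3 basis LUTs, 33^3 grid
basis_luts = torch.randn(N_LUTS, 3, SIZE, SIZE, SIZE)  # learnable in practice
weights = torch.softmax(torch.randn(N_LUTS), dim=0)    # predicted per clip

# Blend basis LUTs into a single clip-specific LUT: (1, 3, S, S, S)
lut = (weights.view(-1, 1, 1, 1, 1) * basis_luts).sum(0, keepdim=True)

def apply_lut(img: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """img: (1, 3, H, W) in [0, 1]. Trilinear LUT lookup via grid_sample."""
    # grid_sample expects a sampling grid in [-1, 1]: (1, 1, H, W, 3), whose
    # last dim (x, y, z) = (R, G, B) indexes the LUT's (W, H, D) axes.
    grid = img.permute(0, 2, 3, 1).unsqueeze(1) * 2 - 1
    out = F.grid_sample(lut, grid, mode='bilinear', align_corners=True)
    return out.squeeze(2)                              # (1, 3, H, W)

img = torch.rand(1, 3, 720, 1280)
graded = apply_lut(img, lut)
```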
Summary & Outlook – The speaker concluded with ongoing work on “high‑blur” restoration, i.e., repairing heavily degraded legacy footage. Large diffusion models (e.g., latent diffusion models) fine‑tuned with LoRA on low‑resolution/high‑resolution pairs show promising texture recovery, though temporal consistency remains a challenge. Continued research on video‑specific degradation models and model‑level optimization is expected to further close the gap between low‑quality legacy content and modern high‑definition standards.
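As a point of reference for the LoRA approach mentioned above, a low‑rank adapter wraps a frozen pretrained layer with a small trainable update. The sketch below is illustrative only and unrelated to Bilibili’s actual training code:

```python
# Minimal LoRA adapter: frozen base layer plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

layer = LoRALinear(nn.Linear(768, 768), rank=8)  # e.g. wrap attention projections
```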