BANG Engine: Multi‑Level Pipelines & GPU Acceleration for Faster Video Transcoding

To meet Bilibili’s demanding live and on‑demand video transcoding needs, the BANG engine combines a multi‑stage pipeline architecture, block‑level and multi‑frame parallelism, SIMD‑based CPU acceleration, and TensorRT/TensorFlow GPU inference. Pipelines are configured with a single string, which substantially increases throughput while simplifying integration.

Bilibili Tech

Nov 8, 2022

Bilibili has built a powerful video image analysis and processing engine called BANG (Bilibili video/image Analyzing and Processing Engine) to improve both video quality and transcoding efficiency for live and on‑demand services.

Multi‑Level Pipeline Architecture

A traditional single‑threaded transcoder decodes, analyzes, processes, and encodes each frame in sequence, which limits throughput. BANG replaces this with a multi‑level pipeline in which decoding, analysis, processing, and encoding each have dedicated buffers and threads, so the stages run in parallel.

Decoder pushes decoded frames into a decode queue.

Analysis threads read from the decode queue, extract ROI, quality metrics, etc., and place results in an analysis queue.

Processing threads consume frames and analysis results, apply configured image operations, and enqueue processed frames.

Encoder reads from the processing queue and outputs the final stream.

This design keeps CPU and GPU resources busy. For the 4K super‑resolution model, end‑to‑end transcoding reaches 75 fps, the same rate the model achieves on its own, so the pipeline adds essentially no overhead on top of the algorithm.
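The four stages above can be sketched with bounded queues and one thread per stage. This is a minimal illustration in Python, not BANG's implementation: the function names, the sentinel convention, and the queue depth are all assumptions made for the example.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker (assumption: real frames are never None)


def stage(fn, inbox, outbox):
    """Apply fn to every item from inbox and push the result to outbox."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(fn(item))


def run_pipeline(frames, analyze, process, encode, depth=4):
    # One bounded queue between each pair of adjacent stages.
    decode_q = queue.Queue(maxsize=depth)
    analysis_q = queue.Queue(maxsize=depth)
    process_q = queue.Queue(maxsize=depth)
    output_q = queue.Queue()  # unbounded, so the encoder never blocks

    workers = [
        threading.Thread(target=stage, args=(analyze, decode_q, analysis_q)),
        threading.Thread(target=stage, args=(process, analysis_q, process_q)),
        threading.Thread(target=stage, args=(encode, process_q, output_q)),
    ]
    for w in workers:
        w.start()

    # The calling thread plays the decoder and feeds the decode queue.
    for frame in frames:
        decode_q.put(frame)
    decode_q.put(SENTINEL)

    for w in workers:
        w.join()

    encoded = []
    while True:
        item = output_q.get()
        if item is SENTINEL:
            break
        encoded.append(item)
    return encoded


# Hypothetical stand-ins for real analysis, processing, and encoding.
def demo_analyze(frame):
    return {"frame": frame, "roi": None}


def demo_process(item):
    return item["frame"] * 2


def demo_encode(frame):
    return f"pkt:{frame}"


if __name__ == "__main__":
    print(run_pipeline(range(5), demo_analyze, demo_process, demo_encode))
```

Because each stage has a single worker, frames leave the pipeline in the order they entered. The bounded queues apply back-pressure, so a slow stage throttles the stages before it instead of letting buffers grow without limit.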

Accelerating Basic Algorithms

Algorithms fall into two groups: traditional CPU‑based methods and deep‑learning models. Deep‑learning models are deployed with NVIDIA TensorRT, which converts them to mixed precision (FP32/FP16) and delivers up to a 100 % speedup (roughly twice the throughput) over TensorFlow inference. TensorFlow remains available for operations that TensorRT does not support, and PyTorch support is planned.
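To give a feel for what FP16 trades away, the sketch below rounds a few values to half and single precision using Python's standard `struct` module and prints the rounding error. It does not use TensorRT or describe how TensorRT chooses precisions; the sample values are arbitrary and chosen only for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to IEEE 754 single precision and return it as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    # Arbitrary sample values; FP16 overflows above 65504, so stay well below.
    for value in [0.1, 1.0, 1234.567, 1e-5]:
        err32 = abs(to_fp32(value) - value)
        err16 = abs(to_fp16(value) - value)
        print(f"{value:>10}: fp32 error {err32:.3e}, fp16 error {err16:.3e}")
```

FP16 halves storage and memory traffic per value at the cost of a coarser grid of representable numbers, which is one reason a mixed‑precision deployment may keep some layers in FP32.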

CPU‑based algorithms are optimized with SIMD instruction sets such as SSE and AVX, accelerating pixel copy, type conversion, and matrix addition, thereby reducing overall processing time.

Block‑Level and Multi‑Frame Parallelism

When a single GPU cannot meet the real‑time requirement (e.g., 1080p enhancement running below 60 fps), BANG adds sub‑threads inside the processing threads. Two parallel strategies are offered:

Block‑level parallelism: each frame is split into multiple blocks, each processed by a separate thread. Suitable for pixel‑level tasks like denoising and enhancement.

Multi‑frame parallelism: multiple frames are processed concurrently, each by its own thread. Beneficial for algorithms that require global image context.

Both strategies can be combined, and each sub‑thread can bind to a specific GPU, enabling flexible multi‑GPU deployment.
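Both strategies can be sketched with a thread pool. The example below is an illustration under stated assumptions: a frame is a list of pixel rows, `enhance_block` is a hypothetical pixel‑level operation, and the round‑robin GPU assignment is one possible binding policy, not necessarily the one BANG uses.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(frame, n_blocks):
    """Split a frame (list of pixel rows) into n_blocks horizontal strips."""
    rows = len(frame)
    bounds = [rows * i // n_blocks for i in range(n_blocks + 1)]
    return [frame[bounds[i]:bounds[i + 1]] for i in range(n_blocks)]


def enhance_block(block):
    """Hypothetical pixel-level op: brighten each pixel and clamp to 255."""
    return [[min(p + 10, 255) for p in row] for row in block]


def process_frame_blocks(frame, block_threads=2):
    """Block-level parallelism: one thread per strip of a single frame."""
    blocks = split_rows(frame, block_threads)
    with ThreadPoolExecutor(max_workers=block_threads) as pool:
        done = list(pool.map(enhance_block, blocks))  # map keeps strip order
    return [row for block in done for row in block]


def process_frames(frames, frame_threads=2, gpu_list=(0,)):
    """Multi-frame parallelism: whole frames are processed concurrently."""
    def work(indexed):
        idx, frame = indexed
        gpu = gpu_list[idx % len(gpu_list)]  # round-robin binding (assumption)
        return gpu, enhance_block(frame)

    with ThreadPoolExecutor(max_workers=frame_threads) as pool:
        return list(pool.map(work, enumerate(frames)))


if __name__ == "__main__":
    frame = [[0, 250], [100, 200], [5, 5]]
    print(process_frame_blocks(frame, block_threads=2))
    print(process_frames([frame, frame, frame], frame_threads=2, gpu_list=(1, 2)))
```

Pure‑Python threads will not speed up this toy operation; the sketch only shows how work is divided. In a real engine the per‑block work runs in native or GPU code, and operations whose output depends on neighboring pixels typically need overlapping block borders so that seams do not appear.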

String‑Based Pipeline Configuration

BANG exposes a factory‑style interface where all analysis and processing modules are registered. Users supply a single configuration string that the engine parses to instantiate the required algorithms and assemble the pipeline.

BANG="process='dnn_sr S12':scale=2x2:output_cuda_device='0'"

This example runs the S12 super‑resolution model, scales the output 2×, and keeps the result on GPU 0 for direct hardware encoding.

BANG="analyze=roi:process='dnn_sr S12|-face bicubic':scale=2x2:output_cuda_device='0'"

Here ROI detection is added, and a simple bicubic up‑scale is applied only to detected face regions to avoid over‑enhancement.

BANG="process='vod_code_process':frame_threads=8"

For on‑demand pre‑processing, eight frame‑processing threads are launched to remove visual redundancy before encoding.

BANG="analyze=roi:process='dnn_enhance KPL|-face bicubic':block_threads=2:gpu_list='1,2'"

This configuration enables ROI protection, block‑level parallelism with two threads per frame, and distributes the workload across two GPUs for KPL esports live streams, achieving stable 60 fps.
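The parsing and factory steps might look like the sketch below. The grammar is inferred from the four examples above (colon‑separated `key=value` fields, with single quotes protecting spaces and special characters) and is an assumption, as are the registry, the factory functions, and their return values. The sketch instantiates only the main algorithm and ignores region clauses such as `-face bicubic`.

```python
def parse_bang(config: str) -> dict:
    """Split 'k=v:k=v' on colons outside single quotes and strip the quotes."""
    options, field, quoted = {}, [], False
    for ch in config + ":":  # the trailing colon flushes the last field
        if ch == "'":
            quoted = not quoted
        elif ch == ":" and not quoted:
            key, _, value = "".join(field).partition("=")
            if key:
                options[key] = value
            field = []
        else:
            field.append(ch)
    if quoted:
        raise ValueError("unterminated quote in BANG config")
    return options


REGISTRY = {}  # hypothetical module registry: name -> factory


def register(name):
    """Decorator that records a factory under the given module name."""
    def deco(factory):
        REGISTRY[name] = factory
        return factory
    return deco


@register("dnn_sr")
def make_dnn_sr(model):
    return f"super-resolution[{model}]"  # stand-in for a real module object


@register("bicubic")
def make_bicubic():
    return "bicubic"  # stand-in for a real module object


def build_process(spec: str):
    """Instantiate the main algorithm named in the first '|' segment."""
    main = spec.split("|")[0]
    name, *args = main.split()
    return REGISTRY[name](*args)


if __name__ == "__main__":
    opts = parse_bang(
        "analyze=roi:process='dnn_sr S12|-face bicubic'"
        ":scale=2x2:output_cuda_device='0'"
    )
    print(opts)
    print(build_process(opts["process"]))
```

Keeping the registry separate from the parser means a new algorithm only has to register a factory; the configuration grammar does not change.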

Performance Highlights

In the S12 4K super‑resolution project, the model alone runs at 75 fps; with BANG the end‑to‑end transcoding also reaches 75 fps. In KPL live streaming, enabling block‑level parallelism and dual‑GPU processing doubles throughput from ~45 fps to ~90 fps, comfortably exceeding the 60 fps real‑time target.
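The KPL numbers can be checked with a little arithmetic. The helper names below are invented for the example, and linear scaling across two workers is an idealization; the source does not say how the gain splits between block threads and GPUs.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def ideal_parallel_fps(base_fps: float, workers: int) -> float:
    """Upper bound on throughput if work splits perfectly across workers."""
    return base_fps * workers


if __name__ == "__main__":
    print(f"budget at 60 fps: {frame_budget_ms(60):.2f} ms per frame")
    print(f"cost at 45 fps:   {frame_budget_ms(45):.2f} ms per frame")
    print(f"ideal with 2 workers: {ideal_parallel_fps(45, 2):.0f} fps")
```

At ~45 fps each frame takes longer than the 60 fps budget allows, so a single worker falls behind in real time. Two perfectly parallel workers would reach 90 fps, which matches the reported ~90 fps.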

Conclusion

BANG provides a unified, plug‑and‑play engine that abstracts algorithm deployment, combines multi‑level pipelines, SIMD‑based CPU acceleration, and TensorRT/TensorFlow GPU inference, and offers a concise string‑based configuration method. Together these improve throughput, simplify integration for Bilibili's video services, and position BANG as a scalable solution for future media processing challenges.
