
Technical Overview of Bilibili Vision Toolkit (BVT): Architecture, Features, and FFmpeg Filter Integration

The Bilibili Vision Toolkit (BVT) is a C++ SDK that unifies multimedia AI algorithms through a low‑coupling core, modular dynamic libraries, and a multi‑engine backend, enabling configurable DAG pipelines, asynchronous parallel execution, and seamless FFmpeg filter integration for high‑performance, cross‑platform video processing.

Bilibili Tech

The article introduces Bilibili Vision Toolkit (BVT), a C++‑based SDK that consolidates various multimedia AI algorithms (e.g., super‑resolution, face enhancement, video frame interpolation, narrow‑band HD) and provides a unified API for backend integration. BVT serves as an engineering "base" for AI inference and video processing pipelines, enabling high performance, heterogeneous computing, and multi‑platform support.

2 BVT Technical Analysis

BVT is organized into a low‑coupling core layer and a modular layer, with a backend engine layer that abstracts multiple inference engines. The core layer handles task scheduling and provides C API entry points, while the modular layer implements concrete AI algorithms as dynamic libraries loaded at runtime. This design promotes code reuse, extensibility, and easy configuration of custom task graphs.

2.1 Overall Architecture and Workflow

The system consists of an application layer (e.g., an FFmpeg filter), the core layer, the modular layer, and the engine layer. The application invokes BVT APIs to request tasks such as super‑resolution or face enhancement. The core layer schedules these tasks, loads the appropriate modules, and delegates inference to the engine layer (TensorRT, LibTorch, OpenVINO, etc.).

2.2 Custom Task Flow

BVT allows users to define custom pipelines via configuration files. The pipeline is represented as a Directed Acyclic Graph (DAG) and is executed by a built‑in graph engine combined with a thread pool for parallel processing. An example pipeline processes an input image by detecting a ROI, running face‑enhancement and super‑resolution in parallel, and finally applying color correction.
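A pipeline of this shape could be described by a configuration file along the following lines. The schema and field names here are hypothetical, sketched from the DAG description above; the article does not reproduce BVT's actual configuration format:

```json
{
  "graph": "enhance_pipeline",
  "nodes": [
    { "name": "roi_detect",    "module": "libbvt_roi.so" },
    { "name": "face_enhance",  "module": "libbvt_face.so",  "inputs": ["roi_detect"] },
    { "name": "super_res",     "module": "libbvt_sr.so",    "inputs": ["roi_detect"] },
    { "name": "color_correct", "module": "libbvt_color.so", "inputs": ["face_enhance", "super_res"] }
  ]
}
```

The graph engine can derive execution order from the `inputs` edges: `face_enhance` and `super_res` do not depend on each other, so the thread pool is free to run them in parallel before `color_correct` joins their outputs.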

2.3 Data Representation

Data exchanged between modules is encapsulated in a Packet whose payload is a Tensor. Tensors are abstracted to support various tensor libraries (LibTorch, Eigen) and device buffers (CPU, CUDA). Memory management uses pooling and reference counting to reduce allocation overhead and avoid unnecessary copies.
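The copy avoidance described here rests on reference counting. A minimal single-threaded sketch of the idea follows; the names are illustrative, not the BVT API, and a production implementation would use atomic counters and recycle freed blocks through a pool:

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of a reference-counted buffer: a Packet can hand the same
 * Tensor payload to several downstream modules without copying. */
typedef struct {
    void  *data;     /* host- or device-side payload */
    size_t size;     /* payload size in bytes */
    int    refcount; /* number of live owners */
} Buffer;

Buffer *buffer_create(size_t size) {
    Buffer *b = malloc(sizeof(Buffer));
    b->data = calloc(1, size);
    b->size = size;
    b->refcount = 1;
    return b;
}

Buffer *buffer_ref(Buffer *b) {  /* share without copying */
    b->refcount++;
    return b;
}

void buffer_unref(Buffer *b) {   /* free only when the last owner drops it */
    if (--b->refcount == 0) {
        free(b->data);
        free(b);
    }
}
```

When the DAG fans out (as in the face-enhancement/super-resolution branch above), each branch takes a reference to the same input buffer instead of a copy.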

2.4 Multi‑Inference Engine Support

BVT abstracts inference through a unified interface, supporting engines such as TensorRT, LibTorch, OpenVINO, OnnxRuntime, and TensorFlow. Model files are packaged with a model.json descriptor that specifies the required engine, version, and I/O signatures, enabling runtime engine selection and dynamic plugin loading.
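Such a descriptor might look as follows. The field names are guesses for illustration only, since the article does not reproduce the actual model.json schema:

```json
{
  "engine": "tensorrt",
  "engine_version": ">=8.0",
  "inputs":  [ { "name": "image",  "dtype": "float32", "shape": [1, 3, -1, -1] } ],
  "outputs": [ { "name": "result", "dtype": "float32", "shape": [1, 3, -1, -1] } ]
}
```

At load time the core can read the `engine` field, load the matching backend plugin, and validate that the model's I/O signatures match what the requesting task expects.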

2.5 Module Decoupling

Modules are built as independent dynamic libraries, allowing applications to link only the lightweight core static library. At runtime, the core loads the needed modules and corresponding inference engines, which is especially useful for differentiating VOD (large module set) and live (small, low‑latency module set) scenarios.
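The decoupling hinges on the core knowing only a small module interface while concrete algorithms hide behind function pointers that a dynamic library fills in at load time (via dlopen/dlsym in a real build). A compact sketch of that pattern, with all names illustrative rather than the BVT ABI:

```c
#include <assert.h>
#include <string.h>

/* The only contract the core sees: a name plus a processing entry point. */
typedef struct {
    const char *name;
    int (*process)(const float *in, float *out, int n);
} ModuleVTable;

/* A "super-resolution" stand-in that a plugin might export. */
static int sr_process(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;   /* placeholder for the real algorithm */
    return 0;                    /* 0 = success */
}

static const ModuleVTable sr_module = { "super_res", sr_process };

/* Core-side lookup: with real dynamic loading, this table would be
 * populated from the .so files named in the task configuration. */
static const ModuleVTable *find_module(const char *name) {
    if (strcmp(name, sr_module.name) == 0)
        return &sr_module;
    return NULL;   /* module not loaded for this deployment */
}
```

Because the lookup fails gracefully for absent modules, a low-latency live deployment can ship only the small module set it needs while VOD loads the full catalogue.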

2.6 Asynchronous Parallel Execution

BVT adopts an asynchronous, non‑blocking API. Requests are submitted without waiting for completion; the caller polls request status. The framework also includes a device scheduler that distributes work across multiple GPUs, achieving multi‑device parallelism.
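The submit-then-poll pattern can be shown in miniature. In BVT the work runs on a device in the background; here a countdown stands in for in-flight work, and the names and status enum are illustrative:

```c
#include <assert.h>

typedef enum { REQ_PENDING, REQ_DONE } RequestStatus;

typedef struct {
    RequestStatus status;
    int ticks_left;   /* stand-in for in-flight device work */
} Request;

/* Returns immediately: the caller is never blocked on inference. */
void request_submit(Request *r, int work_ticks) {
    r->status = REQ_PENDING;
    r->ticks_left = work_ticks;
}

/* Cheap status check; the caller interleaves other work between polls. */
RequestStatus request_poll(Request *r) {
    if (r->status == REQ_PENDING && --r->ticks_left <= 0)
        r->status = REQ_DONE;   /* background work finished */
    return r->status;
}
```

Because submission never blocks, a caller can keep several requests in flight at once, which is what lets the device scheduler spread them across multiple GPUs.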

2.7 API Design

The C API revolves around three concepts: session, task, and request. A session can host multiple tasks; each task can submit multiple requests. The API maintains state across these contexts, which is essential for streaming video processing.
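The three-level hierarchy can be sketched as nested handles, with per-task state surviving across requests, which is what streaming workloads (e.g., frame interpolation that needs the previous frame) rely on. The structures and names below are illustrative, not the BVT headers:

```c
#include <assert.h>

#define MAX_TASKS 8

typedef struct {
    int frames_seen;   /* per-task state carried across requests */
} Task;

typedef struct {
    Task tasks[MAX_TASKS];
    int  task_count;
} Session;

/* A session hosts multiple tasks; the returned index acts as a handle. */
int session_create_task(Session *s) {
    if (s->task_count >= MAX_TASKS)
        return -1;
    s->tasks[s->task_count].frames_seen = 0;
    return s->task_count++;
}

/* Each task accepts multiple requests and accumulates state between them. */
void task_submit_request(Session *s, int task) {
    s->tasks[task].frames_seen++;
}
```

Two tasks in the same session keep independent state, so a single session can, for example, run super-resolution and face enhancement streams side by side.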

2.8 FFmpeg Filter Implementation

The article provides a concrete example of integrating BVT as an FFmpeg filter using the activate() callback for asynchronous processing. The typical workflow includes:

init(): call bvt_env_create() to set up the environment and load module libraries.

query_formats(): negotiate pixel formats.

config_props(): parse filter parameters, create a BVT session and task via bvt_session_create() and bvt_task_create().

activate(): for each input frame, wrap buffers with bvt_buffer_create(), submit a request with bvt_request_create(), poll with bvt_request_poll(), and forward completed frames.

uninit(): clean up with bvt_task_destroy(), bvt_session_destroy(), and bvt_env_destroy().
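The per-frame part of the steps above can be sketched with the SDK calls mocked out, so the ordering is visible without FFmpeg or BVT present; in a real filter each mock maps to the API call named in the list:

```c
#include <assert.h>

static int submitted, polled, forwarded;

static void mock_submit(void)  { submitted++; }         /* stands in for bvt_request_create() */
static int  mock_poll(void)    { polled++; return 1; }  /* stands in for bvt_request_poll(); 1 = done */
static void mock_forward(void) { forwarded++; }         /* stands in for forwarding the frame downstream */

/* activate()-style handling of one frame: submit, poll, forward. */
void process_frame(void) {
    mock_submit();
    while (!mock_poll())
        ;   /* a real filter would return from activate() and retry on
               the next invocation rather than spin-wait here */
    mock_forward();
}
```

Keeping submit and poll decoupled like this is what lets the filter hold several frames in flight and preserve FFmpeg's non-blocking activate() contract.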

Filter parameter example:

bvt=module={module_path}:task={task_name}:model={model_dir}:gpus={gpu_list}:cuda_in={bool}:cuda_out={bool}:task_opt='{task_specific_params}'

The article concludes with a summary of BVT’s impact on Bilibili’s VOD and live streaming services, emphasizing improved development efficiency, runtime performance, and extensibility across devices and inference engines.
