
How Bilibili’s Video Cloud Tackles Spatial Video and Audio: From MV‑HEVC to Audio Vivid

This article examines Bilibili Video Cloud’s technical exploration of spatial video and spatial audio, comparing traditional 2D encoding with MultiView (MV‑HEVC), detailing the implementation of MV‑HEVC to SBS conversion, and describing the integration of Audio Vivid technology for immersive 3‑D sound.

Bilibili Tech

Overview

Advances in transmission, display, and compute have raised user expectations for immersive audio‑video experiences, driving the development of spatial video and spatial audio solutions.

Spatial Video

Background

Human depth perception relies on binocular disparity. Traditional 2D video delivers identical frames to both eyes, while spatial video provides slightly different views for each eye, increasing immersion. The video‑cloud team focuses on encoding methods: conventional 2D packing and Apple’s MultiView (MV‑HEVC) encoding.

2D Packing Formats

Left‑ and right‑eye frames are packed into a single 2D frame, allowing reuse of existing 2D codecs. This reduces development cost but cannot exploit inter‑eye redundancy, resulting in 50%‑100% higher bandwidth compared with native 3D coding.

HSBS (Half‑width side‑by‑side): source 1920×1080, per‑eye 960×1080, transmission 1920×1080.

FSBS (Full‑width side‑by‑side): source 1920×1080, per‑eye 1920×1080, transmission 3840×1080.

HOU (Half‑height over‑under): source 1920×1080, per‑eye 1920×540, transmission 1920×1080.

FOU (Full‑height over‑under): source 1920×1080, per‑eye 1920×1080, transmission 1920×2160.
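The four packing geometries above follow one simple rule: either the frame grows along the packing axis, or each eye is halved along it. A short sketch (function name is illustrative) makes the arithmetic explicit:

```python
# Packed-format geometry for a stereo source; names follow the article
# (HSBS/FSBS = side-by-side, HOU/FOU = over-under).

def packed_resolution(src_w, src_h, layout, full_size):
    """Return (per_eye_w, per_eye_h, frame_w, frame_h) for a packed 2D frame.

    layout: "sbs" (side-by-side) or "ou" (over-under)
    full_size: True keeps full per-eye resolution (the frame grows);
               False halves each eye along the packing axis (frame stays source-sized).
    """
    if layout == "sbs":
        eye_w = src_w if full_size else src_w // 2
        return eye_w, src_h, eye_w * 2, src_h
    if layout == "ou":
        eye_h = src_h if full_size else src_h // 2
        return src_w, eye_h, src_w, eye_h * 2
    raise ValueError(layout)

for name, layout, full in [("HSBS", "sbs", False), ("FSBS", "sbs", True),
                           ("HOU", "ou", False), ("FOU", "ou", True)]:
    print(name, packed_resolution(1920, 1080, layout, full))
```

Running this reproduces the table above, e.g. HSBS yields a 960×1080 per-eye view inside a 1920×1080 transmission frame.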

MultiView (MV‑HEVC) Encoding

MV‑HEVC exploits inter‑eye redundancy, achieving 20%‑30% better compression than 2D HEVC while remaining backward compatible (single‑eye playback works as regular 2D video). The ecosystem is limited to Apple devices (iPhone 15 Pro series and Vision Pro for capture).
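As a back-of-envelope check of the savings claim, a tiny helper (all bitrates are illustrative assumptions, not measured figures) converts a packed-2D bitrate into an MV-HEVC estimate:

```python
def mv_hevc_estimate(packed_mbps, savings=0.25):
    """Estimate an MV-HEVC bitrate from an equivalent packed-2D bitrate,
    using the article's 20%-30% compression gain (default is the midpoint)."""
    return packed_mbps * (1 - savings)

# e.g. a hypothetical 12 Mbps packed stereo stream
for s in (0.20, 0.25, 0.30):
    print(f"12 Mbps packed -> {mv_hevc_estimate(12, s):.1f} Mbps MV-HEVC")
```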

Cloud‑Side Transcoding Solution

To support on‑demand playback, the team implemented a two‑step workflow:

Accept MV‑HEVC uploads from iPhone users.

Transcode MV‑HEVC to side‑by‑side (SBS) format in the cloud for broader playback compatibility.

Because no open‑source MV‑HEVC‑to‑SBS converter was available, the team built a custom pipeline.

Spatial Video Recognition

To detect MV‑HEVC streams, the pipeline uses command‑line tools to parse the MP4 boxes defined in Apple's documentation.
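The detection step builds on ordinary ISO BMFF parsing. Below is a minimal box walker; the specific box path that marks a spatial-video track comes from Apple's documentation and is not reproduced here, so `find_box` is shown with a generic path.

```python
def iter_boxes(data, offset=0, end=None):
    """Yield (type, payload_offset, payload_size) for ISO BMFF boxes in data."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size = int.from_bytes(data[offset:offset + 4], "big")
        btype = data[offset + 4:offset + 8].decode("latin-1")
        header = 8
        if size == 1:  # 64-bit largesize follows the type field
            size = int.from_bytes(data[offset + 8:offset + 16], "big")
            header = 16
        elif size == 0:  # box extends to the end of the enclosing container
            size = end - offset
        if size < header:  # malformed box; stop rather than loop forever
            break
        yield btype, offset + header, size - header
        offset += size

def find_box(data, path, offset=0, end=None):
    """Walk a '/'-separated box path, e.g. 'moov/trak/mdia'.
    Returns (payload_offset, payload_size) or None."""
    head, _, rest = path.partition("/")
    for btype, payload_off, payload_size in iter_boxes(data, offset, end):
        if btype == head:
            if not rest:
                return payload_off, payload_size
            return find_box(data, rest, payload_off, payload_off + payload_size)
    return None
```

A production detector would descend to the video sample entry and check for the stereo-view signalling Apple defines, but the traversal mechanics are exactly this.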

HTM Decoder Integration

The JCT‑3V HTM reference software provides MV‑HEVC decoding. A custom filter, mvhevc_mp4toannexb, injects the required lhvC metadata into the bitstream, so that layer 0 (primary view) and layer 1 (secondary view) can be extracted as raw frames.
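The core of any mp4-to-Annex B step is rewriting the MP4's length-prefixed NAL units as start-code-delimited units; a minimal sketch follows (the parameter-set injection from the hvcC/lhvC records that the real filter also performs is omitted here).

```python
def mp4_to_annexb(sample, length_size=4):
    """Rewrite an MP4 sample's length-prefixed NAL units in Annex B form.

    sample: bytes of one MP4 sample (concatenated [length][NAL] records)
    length_size: NAL length field width from the decoder config (commonly 4)
    """
    out = bytearray()
    off = 0
    while off + length_size <= len(sample):
        nal_len = int.from_bytes(sample[off:off + length_size], "big")
        off += length_size
        out += b"\x00\x00\x00\x01" + sample[off:off + nal_len]  # start code + NAL
        off += nal_len
    return bytes(out)
```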

SBS Frame Generation

VPS and SEI information from the bitstream map layers to eye views. The pipeline aligns frames, stitches left/right images, and re‑encodes to the desired SBS format (HSBS or FSBS). The overall flow is illustrated in the diagram below.
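The stitching itself is conceptually simple. A naive luma-plane sketch illustrates HSBS versus FSBS; a real pipeline would use a proper scaler rather than dropping alternate columns.

```python
def stitch_sbs(left, right, width, height, half=True):
    """Stitch two per-eye luma planes (bytes, row-major) into one SBS plane.

    half=True emits HSBS by naively dropping every other column of each eye
    (real scalers filter before decimating); half=False emits FSBS at
    double the output width.
    """
    step = 2 if half else 1
    rows = []
    for y in range(height):
        l = left[y * width:(y + 1) * width:step]
        r = right[y * width:(y + 1) * width:step]
        rows.append(l + r)
    return b"".join(rows)
```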

Figure: spatial video transcoding workflow diagram.

Spatial Audio

Background

Since 2020, the platform has integrated Dolby Atmos and later adopted the AI‑driven three‑dimensional sound standard “Audio Vivid” (UWA alliance), which is referenced in China’s 4K UHD TV implementation guide.

Audio Vivid Technology

Audio Vivid extends channel‑based audio with spatial cues and supports three rendering models:

Channel‑based (e.g., 5.1/7.1).

Bed‑plus‑object: a static “bed” signal plus dynamic objects carrying position, intensity, and size metadata.

Higher‑Order Ambisonics (HOA) for full 3‑D sound fields.

Objects can be combined with bed or channel signals, enabling flexible rendering on arbitrary speaker configurations.
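To make the object model concrete, here is a hypothetical sketch of one dynamic object rendered to stereo with a constant-power pan. The field names and panning law are illustrative only, not the Audio Vivid renderer.

```python
import math
from dataclasses import dataclass

@dataclass
class AudioObject:
    """One dynamic object in a bed-plus-object scene (illustrative fields)."""
    azimuth_deg: float  # -90 (hard left) .. +90 (hard right)
    gain: float         # intensity metadata

def pan_to_stereo(obj, samples):
    """Constant-power pan of a mono object's samples onto L/R channels."""
    theta = (obj.azimuth_deg + 90) / 180 * (math.pi / 2)  # map to 0..pi/2
    gl, gr = math.cos(theta), math.sin(theta)             # gl^2 + gr^2 == 1
    return ([s * obj.gain * gl for s in samples],
            [s * obj.gain * gr for s in samples])
```

An object hard-panned left (`azimuth_deg=-90`) contributes only to the left channel; a centered object splits its power equally.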

Engineering Implementation

The cloud provides a unified Audio Vivid processing chain:

Pass‑through of the original Audio Vivid track for capable endpoints.

Generation of a binaural stereo track for devices lacking Audio Vivid support.

Reference code from the UWA alliance originally handled file‑based streams on Windows. The team refactored it to:

Port the decoder to Linux.

Replace file I/O with in‑memory streams.

Integrate decoding, binaural rendering, and re‑encoding into a single transcoding binary, eliminating intermediate files.

This streaming‑style binary can demux, decode, render, and encode Audio Vivid content with minimal latency and storage overhead, which is critical for live streaming.
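The streaming-style structure can be sketched as a chunked pull pipeline over in-memory streams; the stage callables below stand in for the real demux, decode, binaural-render, and encode steps.

```python
import io

def run_pipeline(src, dst, stages, chunk_size=4096):
    """Pull fixed-size chunks from src through a list of stage callables
    into dst, never touching the filesystem. A stage may buffer internally
    and return b"" when it has nothing to emit yet."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        for stage in stages:
            chunk = stage(chunk)
            if not chunk:
                break
        if chunk:
            dst.write(chunk)

# Usage: identity stages over an in-memory stream (stand-ins for the
# demux -> decode -> render -> encode chain).
src, dst = io.BytesIO(b"pcm-data"), io.BytesIO()
run_pipeline(src, dst, [lambda b: b, lambda b: b])
```

Because every hop is an in-memory buffer, latency and disk usage stay bounded by the chunk size rather than the media duration, which is the property the article highlights for live streaming.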

Modifications to Bento4 enable DASH packaging of Audio Vivid streams.

Figure: Audio Vivid server-side and endpoint processing workflow diagram.

Reference Materials

Multiview High Efficiency Video Coding (MV‑HEVC) – http://hevc.info/mvhevc

H.265/HEVC Specification – https://www.itu.int/rec/T-REC-H.265

Apple HEVC Stereo Video Profile – https://developer.apple.com/av-foundation/HEVC-Stereo-Video-Profile.pdf

Audio Vivid Technical Whitepaper (V1.0)

UWA 009.1‑2022 3‑D Audio Specification – Part 1: Coding, Distribution, and Presentation
