How Bilibili’s Video Cloud Tackles Spatial Video and Audio: From MV‑HEVC to Audio Vivid
This article examines Bilibili Video Cloud’s technical exploration of spatial video and spatial audio, comparing traditional 2D encoding with MultiView (MV‑HEVC), detailing the implementation of MV‑HEVC to SBS conversion, and describing the integration of Audio Vivid technology for immersive 3‑D sound.
Overview
Advances in transmission, display, and compute have raised user expectations for immersive audio‑video experiences, driving the development of spatial video and spatial audio solutions.
Spatial Video
Background
Human depth perception relies on binocular disparity. Traditional 2D video delivers identical frames to both eyes, while spatial video provides a slightly different view for each eye, which creates a sense of depth and increases immersion. The video‑cloud team focuses on two encoding approaches: conventional 2D frame packing, and MultiView HEVC (MV‑HEVC), the multiview extension of HEVC that Apple uses for spatial video.
2D Packing Formats
Left‑ and right‑eye frames are packed into a single 2D frame, allowing reuse of existing 2D codecs. This reduces development cost but cannot exploit inter‑eye redundancy, resulting in 50%‑100% higher bandwidth compared with native 3D coding.
HSBS (Half‑width side‑by‑side): source 1920×1080, per‑eye 960×1080, transmission 1920×1080.
FSBS (Full‑width side‑by‑side): source 1920×1080, per‑eye 1920×1080, transmission 3840×1080.
HOU (Half‑height over‑under): source 1920×1080, per‑eye 1920×540, transmission 1920×1080.
FOU (Full‑height over‑under): source 1920×1080, per‑eye 1920×1080, transmission 1920×2160.
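The four layouts above follow directly from the source resolution. The sketch below only reproduces that arithmetic; the function name packed_layout and its return shape are hypothetical and not part of any Bilibili tooling.

```python
def packed_layout(width: int, height: int, fmt: str) -> dict:
    """Return per-eye and transmitted frame sizes for a 2D packing format.

    width/height describe the source picture for one eye before packing.
    """
    fmt = fmt.upper()
    if fmt == "HSBS":    # half-width side-by-side: each eye squeezed horizontally
        eye, frame = (width // 2, height), (width, height)
    elif fmt == "FSBS":  # full-width side-by-side: frame width doubles
        eye, frame = (width, height), (width * 2, height)
    elif fmt == "HOU":   # half-height over-under: each eye squeezed vertically
        eye, frame = (width, height // 2), (width, height)
    elif fmt == "FOU":   # full-height over-under: frame height doubles
        eye, frame = (width, height), (width, height * 2)
    else:
        raise ValueError(f"unknown packing format: {fmt}")
    return {"per_eye": eye, "transmission": frame}


for name in ("HSBS", "FSBS", "HOU", "FOU"):
    print(name, packed_layout(1920, 1080, name))
```

The half formats keep the transmitted frame at the source size and halve per-eye resolution, while the full formats keep per-eye resolution and double the pixel count of the transmitted frame.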
MultiView (MV‑HEVC) Encoding
MV‑HEVC exploits inter‑eye redundancy, achieving 20%‑30% better compression than 2D HEVC while remaining backward compatible (single‑eye playback works as regular 2D video). The ecosystem is limited to Apple devices (iPhone 15 Pro series and Vision Pro for capture).
Cloud‑Side Transcoding Solution
To support on‑demand services, the following two‑step workflow is implemented:
Accept MV‑HEVC uploads from iPhone users.
Transcode MV‑HEVC to side‑by‑side (SBS) format in the cloud for broader playback compatibility.
Because no open‑source MV‑HEVC to SBS converter was available, the team built a custom pipeline.
Spatial Video Recognition
The pipeline parses MP4 boxes defined in Apple’s documentation using command‑line tools to detect MV‑HEVC streams.
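One way to approximate this detection without external tools is to walk the MP4 box tree and look for the lhvC configuration box inside an HEVC sample entry. The sketch below is an illustration only: it assumes a non‑fragmented file already held in memory, treats the presence of lhvC as the marker (the exact boxes checked by the team's pipeline are not stated), and all function names are hypothetical.

```python
import struct

CONTAINERS = {b"moov", b"trak", b"mdia", b"minf", b"stbl"}
HEVC_ENTRIES = {b"hvc1", b"hev1"}
VISUAL_ENTRY_FIELDS = 78  # fixed fields of a visual sample entry before child boxes


def iter_boxes(buf: bytes, start: int, end: int):
    """Yield (type, body_start, box_end) for each box in buf[start:end]."""
    pos = start
    while pos + 8 <= end:
        size, kind = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:  # 64-bit largesize follows the type field
            if pos + 16 > end:
                return
            size = struct.unpack_from(">Q", buf, pos + 8)[0]
            header = 16
        elif size == 0:  # box extends to the end of the enclosing range
            size = end - pos
        if size < header or pos + size > end:
            return  # malformed box: stop rather than guess
        yield kind, pos + header, pos + size
        pos += size


def has_box(buf: bytes, target: bytes = b"lhvC", start: int = 0, end=None) -> bool:
    """Return True if `target` appears under an HEVC sample entry (assumed marker)."""
    end = len(buf) if end is None else end
    for kind, body, stop in iter_boxes(buf, start, end):
        if kind == target:
            return True
        if kind in CONTAINERS:
            if has_box(buf, target, body, stop):
                return True
        elif kind == b"stsd":
            # Skip version/flags (4 bytes) and entry_count (4 bytes).
            for ekind, ebody, estop in iter_boxes(buf, body + 8, stop):
                if ekind in HEVC_ENTRIES and has_box(
                    buf, target, ebody + VISUAL_ENTRY_FIELDS, estop
                ):
                    return True
    return False


def make_box(kind: bytes, payload: bytes) -> bytes:
    """Build a minimal box, used here only to fabricate a tiny demo input."""
    return struct.pack(">I4s", 8 + len(payload), kind) + payload
```

A production detector would more likely read the file incrementally and also inspect the stereo‑specific boxes from Apple's specification; this sketch only shows the box‑walking idea.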
HTM Decoder Integration
The JCT‑3V HTM codec library provides MV‑HEVC decoding capability. A filter mvhevc_mp4toannexb injects the required lhvC metadata into the bitstream, allowing extraction of layer‑0 (primary view) and layer‑1 (secondary view) raw data.
SBS Frame Generation
VPS and SEI information from the bitstream map layers to eye views. The pipeline aligns frames, stitches left/right images, and re‑encodes to the desired SBS format (HSBS or FSBS). The overall flow is illustrated in the diagram below.
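The stitching step can be sketched on a single 8‑bit plane. This is a simplified illustration, not the team's implementation: it assumes the left and right views have already been identified and aligned, handles one plane at a time, and uses naive column decimation for the half‑width case where a real pipeline would apply a proper scaling filter. The name stitch_sbs is hypothetical.

```python
def stitch_sbs(left: bytes, right: bytes, width: int, height: int, half: bool = False):
    """Pack two row-major 8-bit planes side by side.

    Returns (packed_plane, packed_width). With half=True each view is
    reduced to half width by dropping every other column (HSBS-style);
    otherwise both views keep full width (FSBS-style).
    """
    if len(left) != width * height or len(right) != width * height:
        raise ValueError("plane size does not match width * height")
    if half and width % 2:
        raise ValueError("half-width packing needs an even width")

    rows = []
    for y in range(height):
        l_row = left[y * width:(y + 1) * width]
        r_row = right[y * width:(y + 1) * width]
        if half:
            # Naive decimation; a scaler with low-pass filtering would avoid aliasing.
            l_row, r_row = l_row[::2], r_row[::2]
        rows.append(l_row + r_row)

    packed_width = width if half else width * 2
    return b"".join(rows), packed_width


# Tiny 4x2 example planes so the layout is easy to inspect.
left_plane = bytes([1, 2, 3, 4, 5, 6, 7, 8])
right_plane = bytes([9, 10, 11, 12, 13, 14, 15, 16])
full, full_w = stitch_sbs(left_plane, right_plane, 4, 2)
halfp, half_w = stitch_sbs(left_plane, right_plane, 4, 2, half=True)
print(full_w, list(full))
print(half_w, list(halfp))
```

For YUV 4:2:0 content the same routine would be applied to the luma plane and to each chroma plane with its own dimensions before the packed frame is handed to the encoder.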
Spatial Audio
Background
Since 2020, the platform has integrated Dolby Atmos. It later adopted “Audio Vivid”, the AI‑driven three‑dimensional sound standard published by the UHD World Association (UWA), which is referenced in China’s 4K UHD TV implementation guide.
Audio Vivid Technology
Audio Vivid extends channel‑based audio with spatial cues and supports three signal types:
Channel‑based (e.g., 5.1/7.1).
Bed‑plus‑object: a static “bed” signal plus dynamic objects carrying position, intensity, and size metadata.
Higher‑Order Ambisonics (HOA) for full 3‑D sound fields.
Objects can be combined with bed or channel signals, enabling flexible rendering on arbitrary speaker configurations.
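To make the bed‑plus‑object idea concrete, the sketch below mixes objects onto a stereo bed using constant‑power panning. It is not the Audio Vivid renderer: the AudioObject fields, the azimuth convention (‑90 is hard left, +90 is hard right), and the stereo‑only output are assumptions chosen for illustration, and the size metadata is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    samples: list[float]   # mono object signal
    azimuth_deg: float     # assumed convention: -90 = hard left, +90 = hard right
    gain: float = 1.0      # stands in for the "intensity" metadata


def pan_gains(azimuth_deg: float) -> tuple[float, float]:
    """Constant-power stereo gains: left^2 + right^2 == 1 for any azimuth."""
    az = max(-90.0, min(90.0, azimuth_deg))
    theta = (az + 90.0) / 180.0 * (math.pi / 2)
    return math.cos(theta), math.sin(theta)


def render_stereo(bed_left, bed_right, objects):
    """Mix each object onto a copy of the static bed according to its position."""
    out_l, out_r = list(bed_left), list(bed_right)
    for obj in objects:
        g_l, g_r = pan_gains(obj.azimuth_deg)
        for i, s in enumerate(obj.samples[:len(out_l)]):
            out_l[i] += obj.gain * g_l * s
            out_r[i] += obj.gain * g_r * s
    return out_l, out_r


bed = [0.0] * 4
voice = AudioObject(samples=[1.0, 1.0, 1.0, 1.0], azimuth_deg=0.0, gain=0.5)
left_out, right_out = render_stereo(bed, bed, [voice])
print(left_out[0], right_out[0])
```

Because position lives in metadata rather than in the channel layout, the same object could be rendered to a different speaker set by swapping the gain function, which is the flexibility the object model is meant to provide.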
Engineering Implementation
The cloud provides a unified Audio Vivid processing chain:
Pass‑through of the original Audio Vivid track for capable endpoints.
Generation of a binaural stereo track for devices lacking Audio Vivid support.
Reference code from the UWA alliance originally handled file‑based streams on Windows. The team refactored it to:
Port the decoder to Linux.
Replace file I/O with in‑memory streams.
Integrate decoding, binaural rendering, and re‑encoding into a single transcoding binary, eliminating intermediate files.
This streaming‑style binary can demux, decode, render, and encode Audio Vivid content with minimal latency and storage overhead, which is critical for live streaming.
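The shape of such a single‑process chain can be sketched with generators over in‑memory streams. Every stage below is a placeholder that passes data through unchanged; the real decoder, binaural renderer, and encoder are not shown, and all function names are hypothetical. The point is only that each chunk flows through all stages without touching disk.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def demux(src: BinaryIO, chunk_size: int = 4096) -> Iterator[bytes]:
    """Read the input stream chunk by chunk instead of loading a whole file."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        yield chunk


def decode(packets: Iterable[bytes]) -> Iterator[bytes]:
    """Placeholder: a real stage would call the Audio Vivid decoder here."""
    for packet in packets:
        yield packet


def render_binaural(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Placeholder: a real stage would render decoded audio to two channels."""
    for frame in frames:
        yield frame


def encode(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Placeholder: a real stage would compress the rendered stereo signal."""
    for frame in frames:
        yield frame


def transcode(src: BinaryIO, dst: BinaryIO, chunk_size: int = 4096) -> int:
    """Chain all stages lazily; only one chunk per stage is in flight at a time."""
    written = 0
    for out in encode(render_binaural(decode(demux(src, chunk_size)))):
        dst.write(out)
        written += len(out)
    return written


source = io.BytesIO(b"\x00" * 10000)   # stands in for an uploaded stream
sink = io.BytesIO()                    # stands in for the packager input
print(transcode(source, sink))
```

Because the stages are lazy, memory use is bounded by the chunk size rather than the stream length, and output starts as soon as the first chunk has passed through.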
Modifications to Bento4 enable DASH packaging of Audio Vivid streams.
Reference Materials
Multiview High Efficiency Video Coding (MV‑HEVC) – http://hevc.info/mvhevc
H.265/HEVC Specification – https://www.itu.int/rec/T-REC-H.265
Apple HEVC Stereo Video Profile – https://developer.apple.com/av-foundation/HEVC-Stereo-Video-Profile.pdf
Audio Vivid Technical Whitepaper (V1.0)
UWA 009.1‑2022 3‑D Audio Specification – Part 1: Coding, Distribution, and Presentation
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.
