Evolution of Bilibili's Voice Chat Room Architecture: From Live Streaming to Multi‑User RTC Interaction
The article chronicles Bilibili’s voice‑chat room transformation from a simple one‑to‑one live‑streaming setup to a scalable multi‑host RTC system, detailing the new session‑channel model, server‑driven mic‑seat management, extensive monitoring, state‑synchronization techniques, revenue‑engine integration, domain‑driven design, and continuous‑delivery practices.
This article presents a detailed case study of the architectural evolution of Bilibili's voice chat room (语聊房) over the past year. It begins with an introduction to the authors—senior development engineers at Bilibili—and a preface that frames the system as a continuously evolving product built on the live‑streaming infrastructure.
The narrative first outlines the business background, describing two major development phases: (1) the exploratory stage in July 2022, when the product was limited to one‑to‑one interactions (host‑to‑host video PK), and (2) the strategic upgrade in 2023, which accelerated feature delivery and introduced higher quality requirements. Each phase imposed new constraints on the existing architecture.
To understand the subsequent changes, the article reviews the underlying live‑streaming architecture, defining key terms such as Capture, Encoding, Streaming, Transcoding, Distribution, Playback, and the P0/P1 APIs used by B‑side (broadcaster) and C‑side (viewer) clients. It explains how the original one‑to‑one model could not support multiple simultaneous hosts, prompting the need for a multi‑host interaction capability.
A new interaction model is introduced, based on the concepts of "session", "conversation", and "channel". A session represents a multi‑host activity created by a host; each participant joins a conversation that maps to a unique RTC channel. This abstraction enables multiple hosts to interact simultaneously while preserving the ability for the session to persist after the creator leaves.
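The session/conversation/channel relationship described above can be sketched as follows. This is a minimal illustration, not Bilibili's actual code: the class names, the channel-ID scheme, and the `join`/`leave` operations are all assumptions made for clarity.

```python
# Hypothetical sketch of the session/conversation/channel model; names
# and structure are assumptions, not the article's actual implementation.
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Ties one participant to a unique RTC channel."""
    participant_id: str
    rtc_channel_id: str


@dataclass
class Session:
    """A multi-host activity created by a host; it can outlive its creator."""
    session_id: str
    creator_id: str
    conversations: dict = field(default_factory=dict)  # participant -> Conversation

    def join(self, participant_id: str) -> Conversation:
        # Each participant maps to a dedicated RTC channel within the session.
        conv = Conversation(participant_id, f"rtc-{self.session_id}-{participant_id}")
        self.conversations[participant_id] = conv
        return conv

    def leave(self, participant_id: str) -> None:
        # Removing a participant (even the creator) does not destroy the session.
        self.conversations.pop(participant_id, None)


session = Session("s1", creator_id="hostA")
session.join("hostA")
session.join("hostB")
session.leave("hostA")  # creator leaves; session persists with hostB
print(sorted(session.conversations))  # ['hostB']
```

The key property this models is the decoupling of session lifetime from any one participant: the session object, not the creator's connection, owns the conversation map.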
The article then discusses the migration to a B‑to‑C model, extending real‑time audio/video interaction to mobile clients and allowing audience participation. It outlines the required functional changes: real‑time audio/video support for users, mic‑seat management (mute, kick, status), gifting for interactive users, and a bidirectional data flow between broadcaster and audience.
Mic‑seat management is moved from the client to the server to handle complex allocation logic at scale. The control transfer process is described, emphasizing compatibility between old and new versions and the need for server‑side computation, storage, and push mechanisms.
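A server‑side seat manager of the kind summarized above might look like the sketch below. It is an illustrative assumption, not the article's implementation; the monotonically increasing `version` stands in for the server‑side storage-and-push mechanism that lets old and new client versions reconcile state.

```python
# Minimal sketch of server-side mic-seat allocation; the class, its
# operations, and the versioning scheme are illustrative assumptions.
class MicSeatManager:
    def __init__(self, seat_count: int):
        # Each seat holds {"user": ..., "muted": ...} or None when empty.
        self.seats = [None] * seat_count
        self.version = 0  # bumped on every change so clients can reconcile

    def _bump(self) -> None:
        self.version += 1

    def take_seat(self, user_id: str) -> int:
        """Allocate the first free seat; raise if the room is full."""
        for i, seat in enumerate(self.seats):
            if seat is None:
                self.seats[i] = {"user": user_id, "muted": False}
                self._bump()
                return i
        raise RuntimeError("no free mic seat")

    def mute(self, index: int, muted: bool = True) -> None:
        self.seats[index]["muted"] = muted
        self._bump()

    def kick(self, index: int) -> None:
        self.seats[index] = None
        self._bump()

    def snapshot(self) -> dict:
        # Versioned snapshot pushed to clients (or pulled by older versions).
        return {"version": self.version, "seats": self.seats}


room = MicSeatManager(seat_count=8)
idx = room.take_seat("user42")
room.mute(idx)
print(room.snapshot()["version"])  # 2
```

Centralizing allocation this way is what makes mute/kick authoritative: clients render whatever the latest versioned snapshot says, rather than each deciding seat state locally.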
Operational concerns are addressed through a comprehensive monitoring stack: infrastructure monitoring (CPU, memory, network), request‑level metrics (QPS, latency, SLO), and business‑level metrics (conversion funnels, state changes). The article also details strategies for real‑time state synchronization, including long‑link push, pull‑based polling, gateway caching, and SEI‑based data embedding in video streams.
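One way the push and pull strategies above can be combined is version‑gated synchronization: a client accepts a long‑link push only if it advances its local version contiguously, and otherwise falls back to a full pull. This is a sketch of that general tactic under assumed names, not the article's concrete protocol.

```python
# Illustrative sketch of version-gated state sync combining long-link
# push with pull-based fallback; all names here are assumptions.
class ClientState:
    def __init__(self):
        self.version = 0
        self.data = {}

    def apply_push(self, update: dict) -> bool:
        """Apply a pushed delta only if it advances the version by one."""
        if update["version"] == self.version + 1:
            self.data.update(update["delta"])
            self.version = update["version"]
            return True
        return False  # gap detected: caller should do a full pull

    def full_pull(self, snapshot: dict) -> None:
        """Pull-based fallback: replace local state with a server snapshot."""
        self.data = dict(snapshot["data"])
        self.version = snapshot["version"]


client = ClientState()
client.apply_push({"version": 1, "delta": {"seat0": "user42"}})
# A push is lost; version 3 arrives out of order and is rejected.
accepted = client.apply_push({"version": 3, "delta": {"seat1": "user7"}})
print(accepted)  # False
client.full_pull({"version": 3, "data": {"seat0": "user42", "seat1": "user7"}})
print(client.version)  # 3
```

The same versioning idea carries over to the other channels the article mentions, such as gateway caches serving stale-but-versioned reads, or SEI payloads embedding a state version in the video stream.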
Further sections cover the development of a revenue‑related interaction engine, the implementation of a full‑stack diagnostic platform for end‑to‑end issue tracing, and the adoption of domain‑driven design principles (bounded contexts, subdomains) to keep the model coherent across safety, interaction, and other concerns.
Finally, the article outlines best practices for continuous delivery: automated code quality checks in CI/CD, extensive automated testing, performance monitoring of critical APIs, and daily health checks. It concludes with a list of reference articles on Marxist philosophy, enterprise architecture, evolutionary architecture, domain‑driven design, and continuous delivery.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.