
Applying CQRS Architecture to Live Streaming Room Service: Design, Evolution, and Operational Practices

The live‑streaming room service was re‑architected around CQRS: read‑heavy viewer functions were separated from write‑intensive broadcaster operations, the monolith was split into focused Go microservices, and multi‑level caching, event‑driven synchronization, extensive observability, and automated incident response were added to achieve massive scalability and rapid fault recovery.

Bilibili Tech

The room system is the cornerstone of live‑streaming services, supporting both broadcast creation and audience consumption. Over time its architecture has evolved to handle massive traffic peaks (e.g., S11 reaching tens of millions of concurrent users) through multi‑active reads, chaos traffic governance, hotspot detection, and multi‑level caching.

To meet growing scalability demands, the team adopted a CQRS (Command‑Query Responsibility Segregation) approach, separating high‑read user scenarios (Query) from write‑intensive, strongly consistent broadcast‑creation scenarios (Command). This enables stateless, unlimited horizontal scaling for read‑heavy services while preserving strong consistency for write operations.
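As a rough illustration of this separation, the sketch below splits the room model into a Query side (stateless, freely replicable reads) and a Command side (strongly consistent writes). All type and service names here are hypothetical; in the real system the two sides are kept consistent through domain events rather than a shared map.

```go
package main

import "fmt"

// RoomView is a hypothetical denormalized read model served to viewers.
type RoomView struct {
	RoomID int64
	Title  string
}

// QueryService handles read-heavy viewer traffic; because it holds no
// authoritative state, instances can be scaled horizontally without limit.
type QueryService struct {
	views map[int64]RoomView // stands in for a cache/replica populated by events
}

func (q *QueryService) GetRoom(id int64) (RoomView, bool) {
	v, ok := q.views[id]
	return v, ok
}

// CommandService handles write-intensive broadcaster operations and enforces
// strong consistency on the authoritative store.
type CommandService struct {
	rooms map[int64]RoomView // stands in for the authoritative database
}

// CreateRoom is a Command: it validates and mutates state, returning only
// success or failure, never query data.
func (c *CommandService) CreateRoom(id int64, title string) error {
	if _, exists := c.rooms[id]; exists {
		return fmt.Errorf("room %d already exists", id)
	}
	c.rooms[id] = RoomView{RoomID: id, Title: title}
	return nil
}
```

In this toy version the Query side reads the Command side's map directly; the article's design instead propagates writes to the read side via domain events, trading immediacy for independent scaling.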

From a microservice perspective, the monolithic room service was first split into several Go services: xroom/daoanchor (core room service), xanchor (anchor business), xroom-management (room control), and xroom-extend (room extensions). The split aligns with organizational domains (B‑side for broadcasters, C‑side for viewers) and reduces coupling across business logic, read/write load, and the B‑side/C‑side boundary.

Key architectural insights include:

Applying CQRS when a single data model cannot efficiently satisfy both read and write patterns.

Using domain events to synchronize B‑side (write‑heavy) and C‑side (read‑heavy) services, ensuring eventual consistency while allowing independent scaling.

Implementing multi‑level caching (process‑local cache plus a Redis cluster) to sustain normal traffic above 350k QPS, with single‑interface peaks of 180k QPS.

Designing gray‑scale release controls via distributed KV switches to gradually roll out feature flags and traffic splits.

Observability is built around tracing, metrics, and logging, with dedicated CQRS dashboards monitoring publish latency, network transfer, and subscriber processing times. System robustness is enhanced by dual data‑sync mechanisms (message queue + direct service calls) and automated retry/recovery logic for write failures.

Data verification jobs compare binlog streams or perform cross‑database reconciliation within a 30‑second window, triggering alerts and self‑healing scripts when inconsistencies are detected.
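The core of such a reconciliation job can be sketched as a diff between write‑side and read‑side snapshots. This is a simplified assumption: in the real system the snapshots come from binlog streams or cross‑database reads within the 30‑second window, and any reported drift would trigger an alert and a self‑healing script.

```go
package main

import "sort"

// reconcile compares the B-side (authoritative write) snapshot against the
// C-side (read replica) snapshot and returns the keys whose values diverge
// or are missing on the read side, sorted for stable reporting.
func reconcile(writeSide, readSide map[string]string) []string {
	var drift []string
	for k, want := range writeSide {
		if got, ok := readSide[k]; !ok || got != want {
			drift = append(drift, k)
		}
	}
	sort.Strings(drift)
	return drift
}
```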

An incident‑response SOP defines alert routing, severity classification, and step‑by‑step remediation procedures, leveraging tracing and automated diagnostics to minimize downtime.

Production support follows a “diagnostic‑desk” model, providing rapid fault isolation and resolution, improving mean‑time‑to‑repair by roughly 80%.

Finally, the article discusses technical project management practices: baseline alignment with business growth, milestone tracking, formal announcements, and post‑mortem reviews to ensure that architectural investments deliver measurable value.

References include Microsoft’s CQRS pattern guide, a detailed CQRS article by Kislay Verma, and the GRASP object‑oriented design wiki.

Tags: backend architecture, live streaming, microservices, observability, Domain-Driven Design, system reliability, CQRS
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
