
How Bilibili Built a High‑Availability Multi‑Active Architecture for SRE

This article describes how Bilibili's SRE team designed and implemented a high‑availability multi‑active architecture. It covers zone types, same‑city and cross‑region deployments, traffic routing, cache consistency, message handling, governance, and practical lessons learned from real‑world incidents.

dbaplus Community

High‑Availability Multi‑Active Architecture

A multi‑active (active‑active) architecture replicates services across multiple zones in one or more data centers, so that traffic can be scheduled to any available zone. It reduces the impact of an outage compared with traditional single‑active disaster recovery.

Zone Types (CRG)

GZone : Data shared among users (e.g., video playback, live streams). Suitable for platform‑wide services.

RZone : Unit‑level data such as comments, bullet screens, payments. Each unit is isolated.

CZone : Hybrid of GZone and RZone; supports read‑write within a local zone while tolerating limited latency and inconsistency.

Same‑City Multi‑Active

Network latency between same‑city zones is kept below 5 ms. Traffic enters through a DCDN layer that hashes the user MID or device ID to a specific zone. A custom Picker module supports dynamic weight adjustments between zones (e.g., 99:1 or 50:50). After the DCDN, traffic passes through a Layer‑7 SLB and an API Gateway (APIGW). The APIGW is deployed on a PaaS platform with HPA and provides API degradation, circuit breaking, and rate limiting.
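The hashing and weighting step can be illustrated with a minimal sketch. This is not the Picker module itself: the function name `pick_zone`, the zone names, and the choice of SHA‑256 are assumptions made for illustration only.

```python
import hashlib


def pick_zone(user_key: str, weights: dict[str, int]) -> str:
    """Map a user MID or device ID to a zone, honoring integer weights.

    Hypothetical helper: the real Picker's algorithm is not described
    in the article.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one zone needs a positive weight")
    # Stable hash: the same key always falls into the same bucket.
    digest = hashlib.sha256(user_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk the zones in a fixed order and find the range holding the bucket.
    for zone in sorted(weights):
        if bucket < weights[zone]:
            return zone
        bucket -= weights[zone]
    raise AssertionError("unreachable: bucket is always below the total weight")


# Example: a 99:1 split between two hypothetical zones.
zone = pick_zone("mid:10086", {"zone-a": 99, "zone-b": 1})
```

Because the bucket depends only on the key and the total weight, a given user stays in one zone until the weights change. Changing the weights moves some users between zones, which is the intended effect of a traffic shift.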

Requests reach caches and databases through a unified Proxy layer, which serves reads and writes locally where possible and routes other writes back to the primary zone. An Invoker component provides global traffic control and service publishing.
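A minimal sketch of that routing decision, covering only the primary/replica case. The `Topology` type, the `route` function, and the `force_primary` flag are hypothetical names, not the Proxy's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical view of one data store: a primary zone plus replicas."""
    primary_zone: str
    replica_zones: frozenset[str]


def route(op: str, local_zone: str, topo: Topology, force_primary: bool = False) -> str:
    """Return the zone that should serve the operation."""
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op!r}")
    # Writes, and reads that need strong consistency, go to the primary.
    if op == "write" or force_primary:
        return topo.primary_zone
    # Plain reads stay local when the local zone holds a copy of the data.
    if local_zone == topo.primary_zone or local_zone in topo.replica_zones:
        return local_zone
    return topo.primary_zone


topo = Topology(primary_zone="zone-a", replica_zones=frozenset({"zone-b"}))
read_target = route("read", "zone-b", topo)    # served by the local replica
write_target = route("write", "zone-b", topo)  # sent back to the primary
```

Keeping this decision in the proxy means application code does not need to know which zone is primary at any given moment.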

Cache Consistency

A long‑lived connection proxy abstracts cache access for sidecar and proxy‑less SDKs. Consistency follows the Cache‑Aside pattern: database/KV changes are captured by Canal, published to a message queue, and processed by background jobs that update or invalidate the cache.
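The flow can be sketched with in‑memory stand‑ins. The `Store` class below is an illustration under stated assumptions: a dict replaces the database and the cache, and a deque replaces the message queue that Canal would feed. None of these names come from the article.

```python
from collections import deque


class Store:
    """Toy Cache-Aside setup with asynchronous invalidation."""

    def __init__(self):
        self.db = {}
        self.cache = {}
        self.queue = deque()  # stand-in for the message queue fed by change capture

    def read(self, key):
        # Cache-Aside read: try the cache, fall back to the database, then fill.
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # The writer only touches the database; a change event is emitted
        # in place of the captured binlog entry.
        self.db[key] = value
        self.queue.append({"key": key})

    def run_invalidation_job(self):
        # Background job: drop cached entries for every changed key.
        processed = 0
        while self.queue:
            event = self.queue.popleft()
            self.cache.pop(event["key"], None)
            processed += 1
        return processed


store = Store()
store.write("video:1", "title-v1")
store.run_invalidation_job()
first = store.read("video:1")   # fills the cache from the database
store.write("video:1", "title-v2")
stale = store.read("video:1")   # still the old value until the job runs
store.run_invalidation_job()
fresh = store.read("video:1")
```

The sketch makes the trade‑off visible: between the database write and the job run, readers can see the old value, so this approach gives eventual rather than strong cache consistency.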

Message Multi‑Active Modes

Zone‑local production/consumption (no cross‑zone sync).

Global full consumption: topics are bidirectionally synced across zones.

Custom global consumption with selective zone disabling.

Modes are configured per business without hard‑binding topics.
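As a rough sketch, the three modes can be expressed as a function that returns which zones should consume a message. The `Mode` enum, the `consuming_zones` function, and the zone names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    ZONE_LOCAL = "zone_local"        # produce and consume in the same zone
    GLOBAL_FULL = "global_full"      # topic synced to every zone
    GLOBAL_CUSTOM = "global_custom"  # global, minus explicitly disabled zones


def consuming_zones(
    mode: Mode,
    produced_in: str,
    all_zones: frozenset[str],
    disabled: frozenset[str] = frozenset(),
) -> set[str]:
    """Return the zones whose consumers should receive the message."""
    if produced_in not in all_zones:
        raise ValueError(f"unknown zone: {produced_in!r}")
    if mode is Mode.ZONE_LOCAL:
        return {produced_in}
    if mode is Mode.GLOBAL_FULL:
        return set(all_zones)
    if mode is Mode.GLOBAL_CUSTOM:
        return set(all_zones) - set(disabled)
    raise ValueError(f"unsupported mode: {mode!r}")


zones = frozenset({"zone-a", "zone-b", "zone-c"})
targets = consuming_zones(Mode.GLOBAL_CUSTOM, "zone-a", zones, frozenset({"zone-c"}))
```

Passing the mode as per‑business configuration, rather than encoding it in the topic, mirrors the point that topics are not hard‑bound to a mode.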

Data Access / Storage Layer

MySQL, TiDB and Taishan (KV) are exposed via a unified Proxy that routes reads/writes based on zone, discovers topology automatically, and handles failover.

GZone : Master‑slave replication; reads can be served from local replicas, writes are routed back to the master. Strong consistency can be forced via SQL hints or connection‑string flags.

RZone : Requires sharding and bidirectional sync; each shard is read‑write within its zone.

DTS : Sub‑10‑second bidirectional replication with conflict detection; conflicts can pause sync or be delivered to a message queue for business‑level handling.
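The two conflict policies (pause sync, or hand the conflict to the business) can be sketched as follows. The article does not say how DTS detects conflicts, so the version comparison used here is an assumption, as are all names in the block.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    key: str
    value: str
    base_version: int  # row version the writer saw before modifying it


@dataclass
class Replica:
    rows: dict = field(default_factory=dict)  # key -> (version, value)


def apply_change(target: Replica, change: Change, conflicts: list, policy: str = "queue") -> bool:
    """Apply a replicated change, or handle it as a conflict.

    Assumed rule: a change conflicts when the target row has moved past
    the version the writer started from.
    """
    current_version, _ = target.rows.get(change.key, (0, None))
    if current_version != change.base_version:
        if policy == "pause":
            raise RuntimeError(f"sync paused: conflict on {change.key!r}")
        # "queue" policy: hand the change to the business for resolution.
        conflicts.append(change)
        return False
    target.rows[change.key] = (current_version + 1, change.value)
    return True


replica = Replica()
pending = []
applied = apply_change(replica, Change("order:1", "paid", 0), pending)
# A second writer that also started from version 0 now conflicts.
rejected = apply_change(replica, Change("order:1", "cancelled", 0), pending)
```

Pausing favors safety for data that must never diverge, while queueing keeps replication flowing and leaves the resolution rule to the owning service.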

Business Multi‑Active Evolution

Services progress from single‑active → read‑only multi‑active → full same‑city multi‑active. The Invoker platform and APIGW manage service publishing and traffic orchestration. Cross‑region multi‑active is being validated using a distant South‑China zone for read‑only traffic.

Multi‑Active Governance

Metadata Rule Governance

Legacy CDN regex rules were consolidated into prefix‑based routing in APIGW, providing a unified traffic control plane while still allowing custom front‑end rules.
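Prefix‑based routing is commonly implemented as a longest‑prefix match. The sketch below assumes that approach; the function name, the paths, and the policy labels are invented for illustration and do not come from the APIGW.

```python
def match_route(path: str, routes: dict[str, str], default: str) -> str:
    """Return the policy of the longest prefix that matches the path."""
    best = None
    for prefix in routes:
        if path.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    return routes[best] if best is not None else default


# Hypothetical rule table: path prefix -> multi-active policy.
routes = {
    "/x/": "single-active",
    "/x/feed/": "multi-active",
    "/x/feed/live/": "read-only-multi-active",
}
policy = match_route("/x/feed/live/room", routes, default="single-active")
```

Compared with regex rules, prefixes are easier to audit because at most one rule wins for any path and the winner can be predicted from length alone.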

Invoker Platform Resilience

Core dependencies are recorded in the CMDB. The platform itself runs in GZone mode and is covered by fault‑injection drills, degradation plans, and an emergency super‑admin account for use when normal login fails.

Traffic Orchestration

The platform supports declarative traffic orchestration, pre‑checks (capacity, monitoring, DB pool, SLB limits), and automated cut‑over visualisation. SLO monitoring is integrated to observe end‑to‑end link health.
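A pre‑check stage can be sketched as a set of named checks that must all pass before a cut‑over proceeds. Everything below is hypothetical: the function names, the thresholds, and the check list are illustrative and not taken from the Invoker platform.

```python
from typing import Callable

# A check returns (passed, human-readable detail).
Check = Callable[[], tuple[bool, str]]


def capacity_check(current_qps: int, shifted_qps: int, max_qps: int) -> Check:
    """Build a check that the target zone can absorb the shifted traffic."""
    def check() -> tuple[bool, str]:
        projected = current_qps + shifted_qps
        return projected <= max_qps, f"projected {projected} QPS vs limit {max_qps}"
    return check


def run_prechecks(checks: dict[str, Check]) -> list[str]:
    """Run every check and return a description of each failure."""
    failures = []
    for name, check in checks.items():
        ok, detail = check()
        if not ok:
            failures.append(f"{name}: {detail}")
    return failures


failures = run_prechecks({
    "capacity": capacity_check(current_qps=6000, shifted_qps=5000, max_qps=10000),
    "db_pool": lambda: (True, "connection pool below limit"),
})
can_cut_over = not failures
```

Running all checks instead of stopping at the first failure gives the operator the full list of blockers in one pass.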

Effectiveness Validation

Dependency visualization using tracing.

Tagging of strong vs. weak dependencies.

Traffic‑closed‑loop verification.

Automated fault‑drill SDK that discovers dependencies, runs drills, and generates compliance reports.
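The drill idea can be sketched as follows: fail one dependency at a time and compare the outcome with its strong or weak tag. This is an illustration under stated assumptions, not the SDK described above; the handler, the dependency names, and the report format are hypothetical.

```python
def run_drill(handler, dependencies: dict[str, str]) -> dict[str, bool]:
    """Fail each dependency in turn; True means behavior matched the tag."""
    report = {}
    for name, tag in dependencies.items():
        try:
            handler(failed={name})
            survived = True
        except Exception:
            survived = False
        # A weak dependency must not break the request. A strong one is
        # expected to; if the request survives, the tag is out of date.
        report[name] = survived if tag == "weak" else not survived
    return report


def handle_request(failed: set[str]) -> dict:
    """Toy request handler with one strong and one weak dependency."""
    if "account" in failed:
        raise RuntimeError("account service unavailable")
    # Recommendations degrade to an empty list instead of failing.
    recs = [] if "recommend" in failed else ["v1", "v2"]
    return {"user": "u1", "recommendations": recs}


report = run_drill(handle_request, {"account": "strong", "recommend": "weak"})
```

A `False` entry in the report flags either a weak dependency that can take the request down or a strong tag that no longer reflects reality, both of which are worth fixing before a real zone failure.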

Key Q&A

Q1: Is dual‑active infrastructure still required when the application layer already supports multi‑active?

A1: Yes. All components (authentication, approval, etc.) must be designed for dual‑active operation to ensure high availability during zone failures.

Q2: How is data consistency guaranteed?

A2: Same‑city uses primary‑write/replica‑read with optional forced master reads for strong consistency. Cross‑region adopts sharding, dual‑write, and tolerates latency; cache consistency follows the Cache‑Aside approach described above.

Q3: How are cost and ROI managed?

A3: Benefits are measured by reduced outage impact. Costs are controlled by leveraging elastic scaling, shared same‑city resources, and platform automation (Invoker) to replace manual operations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
