How Bilibili Built a High‑Availability Multi‑Active Architecture for SRE
This article details Bilibili's SRE team's design and implementation of a high‑availability multi‑active architecture, covering zone types, same‑city and cross‑region deployments, traffic routing, cache consistency, message handling, governance, and practical lessons learned from real‑world incidents.
High‑Availability Multi‑Active Architecture
Multi‑active (active‑active) architecture replicates services across two or more data centers so that traffic can be scheduled to any available zone. Compared with traditional single‑active disaster recovery, it reduces the impact of an outage.
Zone Types (CRG)
GZone: Data shared among all users (e.g., video playback, live streams). Suitable for platform‑wide services.
RZone: Unit‑level data such as comments, bullet comments (danmaku), and payments. Each unit is isolated.
CZone: A hybrid of GZone and RZone; supports reads and writes within the local zone while tolerating limited latency and inconsistency.
Same‑City Multi‑Active
Latency between same‑city zones is kept below 5 ms. Traffic enters through a DCDN layer that hashes the user MID or device ID to a specific zone. A custom Picker module supports dynamic weight adjustments between zones (e.g., 99:1 or 50:50). After the DCDN, traffic passes through a Layer‑7 SLB and an API Gateway (APIGW) deployed on a PaaS platform with HPA. The APIGW provides API degradation, circuit breaking, and rate limiting.
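The hash‑plus‑weight routing described above can be sketched in a few lines. This is an illustrative sketch only, not Bilibili's Picker module: the function name pick_zone, the zone names, and the choice of SHA‑256 are all assumptions made for the example.

```python
import hashlib


def pick_zone(user_key: str, weights: dict) -> str:
    """Map a user key (MID or device ID) to a zone in proportion to weights.

    `weights` maps a hypothetical zone name to a non-negative integer weight,
    e.g. {"zone-a": 99, "zone-b": 1}.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one zone must have a positive weight")
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so different nodes would disagree on the mapping.
    digest = hashlib.sha256(user_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk zones in a fixed order so every node computes the same answer.
    for zone in sorted(weights):
        if bucket < weights[zone]:
            return zone
        bucket -= weights[zone]
    raise AssertionError("unreachable: bucket is always below total")
```

Because the bucket depends only on the key and the total weight, a given user keeps landing in the same zone until the weights change. Changing the weights moves some users between zones, which is the intended effect of a gradual cut‑over.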
Requests reach caches and databases through a unified Proxy layer, which serves reads and writes locally where possible and routes the remaining writes back to the primary zone. An Invoker component provides global traffic control and service publishing.
Cache Consistency
A long‑lived connection proxy abstracts cache access for sidecar and proxy‑less SDKs. Consistency follows the Cache‑Aside pattern: database/KV changes are captured by Canal, published to a message queue, and processed by background jobs that update or invalidate the cache.
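A minimal in‑memory sketch of this flow follows. It assumes plain dictionaries in place of the real database and cache, and a deque in place of the message queue fed by Canal. The class and method names are hypothetical. It shows invalidation only, although the text above also allows updating the cache.

```python
from collections import deque


class CachedStore:
    """Cache-Aside reads plus asynchronous invalidation from change events."""

    def __init__(self):
        self.db = {}                 # stand-in for MySQL / KV storage
        self.cache = {}              # stand-in for the cache cluster
        self.change_queue = deque()  # stand-in for the change-event queue

    def read(self, key):
        # Cache-Aside read: try the cache, fall back to the database,
        # then populate the cache for later readers.
        if key in self.cache:
            return self.cache[key]
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        # The writer only touches the database. The appended event models
        # what change capture would publish after the commit.
        self.db[key] = value
        self.change_queue.append({"op": "upsert", "key": key})

    def run_invalidation_job(self) -> int:
        # Background job: drain change events and drop the affected keys,
        # so the next read repopulates the cache from the database.
        processed = 0
        while self.change_queue:
            event = self.change_queue.popleft()
            self.cache.pop(event["key"], None)
            processed += 1
        return processed
```

Between a write and the next run of the job, readers can still see the old cached value. That window is the consistency trade‑off of asynchronous invalidation.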
Message Multi‑Active Modes
Zone‑local production/consumption (no cross‑zone sync).
Global full consumption: topics are bidirectionally synced across zones.
Custom global consumption with selective zone disabling.
Modes are configured per business without hard‑binding topics.
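One way to picture the three modes is as a per‑business policy that a consumer checks before processing a message. The sketch below is an interpretation, not the actual implementation: the names Mode, TopicPolicy, and should_consume are hypothetical, and it assumes that "selective zone disabling" means listed zones skip consumption.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ZONE_LOCAL = "zone_local"  # produce and consume inside one zone
    GLOBAL = "global"          # every zone consumes every message
    CUSTOM = "custom"          # global, minus explicitly disabled zones


@dataclass(frozen=True)
class TopicPolicy:
    mode: Mode
    disabled_zones: frozenset = frozenset()


def should_consume(policy: TopicPolicy, produced_in: str, consumer_zone: str) -> bool:
    """Decide whether a consumer in consumer_zone should process a message."""
    if policy.mode is Mode.ZONE_LOCAL:
        # No cross-zone sync: only the producing zone consumes.
        return produced_in == consumer_zone
    if policy.mode is Mode.GLOBAL:
        # Topics are synced both ways, so every zone sees everything.
        return True
    # CUSTOM: like GLOBAL, but selected zones are switched off.
    return consumer_zone not in policy.disabled_zones
```

Keeping the policy separate from the topic name reflects the point above: the mode is a per‑business setting rather than something bound to the topic itself.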
Data Access / Storage Layer
MySQL, TiDB and Taishan (KV) are exposed via a unified Proxy that routes reads/writes based on zone, discovers topology automatically, and handles failover.
GZone: Master‑slave replication; reads can be served from local replicas, while writes are routed back to the master. Strong consistency can be forced via SQL hints or connection‑string flags.
RZone: Requires sharding and bidirectional sync; each shard is read‑write within its zone.
DTS: Bidirectional replication with sub‑10‑second lag and conflict detection; a conflict can either pause sync or be delivered to a message queue for business‑level handling.
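The GZone and RZone routing rules above can be summarised as a small decision function. This is a sketch under stated assumptions rather than the Proxy's real logic: the Request fields, the route function, and the zone names are hypothetical, and the force_primary flag stands in for the SQL hint or connection‑string flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    zone_type: str                    # "gzone" or "rzone"
    is_write: bool
    local_zone: str                   # zone where the request arrived
    shard_home: Optional[str] = None  # rzone only: zone that owns the shard
    force_primary: bool = False       # models a strong-consistency hint


def route(req: Request, primary_zone: str) -> str:
    """Return the zone whose storage should serve this request."""
    if req.zone_type == "gzone":
        # Writes and forced strong reads go to the master;
        # ordinary reads are served by the local replica.
        if req.is_write or req.force_primary:
            return primary_zone
        return req.local_zone
    if req.zone_type == "rzone":
        # Each shard is read-write only in the zone that owns it.
        if req.shard_home is None:
            raise ValueError("rzone requests need a shard_home")
        return req.shard_home
    raise ValueError("unknown zone type: " + req.zone_type)
```

A GZone read that does not set force_primary may observe replication lag. Setting it trades the local‑read latency for a consistent read from the master.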
Business Multi‑Active Evolution
Services progress from single‑active → read‑only multi‑active → full same‑city multi‑active. The Invoker platform and APIGW manage service publishing and traffic orchestration. Cross‑region multi‑active is being validated using a distant South‑China zone for read‑only traffic.
Multi‑Active Governance
Metadata Rule Governance
Legacy CDN regex rules were consolidated into prefix‑based routing in APIGW, providing a unified traffic control plane while still allowing custom front‑end rules.
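Prefix‑based rules are simpler to reason about than regexes because at most one rule wins for any path. The sketch below assumes a longest‑prefix‑wins policy, which the source does not specify. The function name match_route and the rule table in the usage example are hypothetical.

```python
from typing import Dict, Optional


def match_route(path: str, rules: Dict[str, str]) -> Optional[str]:
    """Return the target of the longest rule prefix that matches path.

    `rules` maps a URL path prefix to a routing target (for example a
    zone policy name). Returns None when no prefix matches.
    """
    best_prefix = None
    for prefix in rules:
        if path.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix = prefix
    return rules[best_prefix] if best_prefix is not None else None


# Hypothetical rule table: a broad default plus a more specific override.
EXAMPLE_RULES = {
    "/x/": "default-policy",
    "/x/reply/": "reply-policy",
}
```

With this table, a path under /x/reply/ resolves to the more specific rule, and every other path under /x/ falls back to the default.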
Invoker Platform Resilience
Core dependencies are recorded in the CMDB. The platform itself runs in GZone mode and is backed by fault‑injection drills, degradation plans, and an emergency super‑admin account for use when login fails.
Traffic Orchestration
The platform supports declarative traffic orchestration, pre‑checks (capacity, monitoring, DB pool, SLB limits), and automated cut‑over visualisation. SLO monitoring is integrated to observe end‑to‑end link health.
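The pre‑checks listed above amount to a gate that must pass before any traffic is shifted. The following sketch shows the idea only. The ZoneStatus fields, the precheck function, and the default threshold are illustrative assumptions, not the platform's actual checks.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ZoneStatus:
    capacity_qps: float       # what the zone's services can sustain
    current_qps: float        # traffic the zone already carries
    slb_limit_qps: float      # load-balancer ceiling
    db_pool_free: int         # free database connections
    monitoring_healthy: bool  # alerting and dashboards are working


def precheck(target: ZoneStatus, shifted_qps: float,
             min_db_pool_free: int = 10) -> List[str]:
    """Return the names of failed checks; an empty list means safe to cut over."""
    failures = []
    projected = target.current_qps + shifted_qps
    if projected > target.capacity_qps:
        failures.append("capacity")
    if projected > target.slb_limit_qps:
        failures.append("slb_limit")
    if target.db_pool_free < min_db_pool_free:
        failures.append("db_pool")
    if not target.monitoring_healthy:
        failures.append("monitoring")
    return failures
```

Returning the list of failed checks rather than a single boolean lets the orchestration step report exactly what blocked a cut‑over.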
Effectiveness Validation
Dependency visualization using tracing.
Tagging of strong vs. weak dependencies.
Traffic‑closed‑loop verification.
Automated fault‑drill SDK that discovers dependencies, runs drills, and generates compliance reports.
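Strong versus weak tagging can be checked empirically: fail one dependency at a time and see whether the request still succeeds. The sketch below illustrates that idea and is not the drill SDK itself. The function classify_dependencies and the handler contract (a callable that receives the set of failed dependency names and raises if it cannot serve) are assumptions.

```python
from typing import Callable, Dict, Iterable, Set


def classify_dependencies(handler: Callable[[Set[str]], object],
                          dependencies: Iterable[str]) -> Dict[str, str]:
    """Tag each dependency as "strong" or "weak" by injecting one fault at a time.

    A dependency is "weak" if the handler still returns when only that
    dependency is failed, and "strong" if the handler raises.
    """
    report = {}
    for name in dependencies:
        try:
            handler({name})
            report[name] = "weak"
        except Exception:
            report[name] = "strong"
    return report


def example_handler(failed: Set[str]) -> dict:
    # Hypothetical service: it cannot work without "account",
    # but degrades to an empty list when "recommend" is down.
    if "account" in failed:
        raise RuntimeError("account service unavailable")
    return {"recommendations": [] if "recommend" in failed else ["a", "b"]}
```

Comparing such a report with the tags declared by the service owner is one way to produce the compliance result mentioned above.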
Key Q&A
Q1: Is dual‑active infrastructure still required when the application layer already supports multi‑active?
A1: Yes. All components (authentication, approval, etc.) must be designed for dual‑active operation to ensure high availability during zone failures.
Q2: How is data consistency guaranteed?
A2: Same‑city uses primary‑write/replica‑read with optional forced master reads for strong consistency. Cross‑region adopts sharding, dual‑write, and tolerates latency; cache consistency follows the Cache‑Aside approach described above.
Q3: How are cost and ROI managed?
A3: Benefits are measured by reduced outage impact. Costs are controlled by leveraging elastic scaling, shared same‑city resources, and platform automation (Invoker) to replace manual operations.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
