
How Bilibili Built a 5‑Year SRE Journey: High‑Availability, Multi‑Active, and Capacity Management

This article chronicles Bilibili's five‑year evolution of Site Reliability Engineering, detailing the introduction of SRE culture, the construction of high‑availability and multi‑active architectures, capacity management with Kubernetes, VPA/HPA, incident case studies, and the ongoing transformation of SRE practices across the organization.


Bilibili SRE Development Over 5 Years

Before 2017, Bilibili had no SRE team; the focus was on operational efficiency and rapid response to change. In 2018, SRE culture was introduced, laying the groundwork for understanding business architecture, multi-active deployment, and the adoption of on-call and post-mortem practices.

2019–2021 Milestones

2019 saw the rollout of task automation, self-service change approvals, and the exploration of Service Level Objectives (SLOs) alongside service tiering, emergency response, and incident management frameworks. By 2020 the stability system was largely in place, incorporating pre-incident response, post-mortem analysis, chaos engineering, capacity planning, and containerized deployments. From 2021 onward the focus shifted from pure on-call duty toward an SRE business-partner (BP) model, while continuing to optimize multi-active, service tiering, and SLO implementations.

Stability Assurance

The core stability stack covers high availability, multi-active, capacity management, and activity (event) protection. High availability is achieved through a layered architecture: CDN → access layer (LB/SLB/API gateway) → service layer (BFF, services, jobs, admin). Middleware such as MQ, Canal, Notify, and caching, together with storage (relational DB, KV, object storage, Elasticsearch), underpin observability and efficiency.

Access Layer High‑Availability

Typical failures include network outages, data‑center failures, component failures (e.g., SLB), and service failures. Mitigations involve DNS fallback, multiple CDN nodes, edge‑node degradation, cross‑region traffic routing, and automatic SLB failover. SLB rebuild time was reduced from one hour to five minutes after the 7.13 incident.
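The automatic failover described above can be sketched as a client that walks an ordered list of SLB endpoints and takes the first one that passes a health probe. This is a minimal illustration under assumptions: the endpoint names and the probe function are hypothetical, and a real setup would use HTTP health checks and DNS fallback.

```go
package main

import "fmt"

// pickHealthy returns the first endpoint that passes the health probe,
// falling back through the ordered list (primary SLB, backup SLB, remote zone).
func pickHealthy(endpoints []string, healthy func(string) bool) (string, bool) {
	for _, ep := range endpoints {
		if healthy(ep) {
			return ep, true
		}
	}
	return "", false // everything down: trigger DNS fallback / edge degradation
}

func main() {
	// Hypothetical endpoints; in practice the probe is an HTTP health check.
	eps := []string{"slb-primary:80", "slb-backup:80", "slb-remote-zone:80"}
	down := map[string]bool{"slb-primary:80": true}
	ep, ok := pickHealthy(eps, func(e string) bool { return !down[e] })
	fmt.Println(ep, ok) // slb-backup:80 true
}
```

The ordering encodes the escalation path: only when every SLB endpoint fails does traffic fall back to DNS-level or edge-node degradation.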

Service Layer Resilience

- Service discovery via an internal discovery service.
- Multi-active deployment across availability zones.
- Load balancing with the P2C (power of two choices) algorithm, scoring backends by CPU usage and latency.
- Circuit breaking, plus two types of degradation: zone-level and content-quality.
- Global and dynamic rate limiting, built on the Golang Kratos framework and Google's BBR algorithm.

Middleware Reliability

Critical dependencies include caches (Redis Cluster, standalone Redis, Memcache), Kafka, and database proxies. Issues such as short-connection storms and cache node overloads were mitigated by introducing unified proxy layers and sidecar deployments.

2021‑07‑13 Incident Case Study

At 10:52 PM, users reported a Bilibili outage; the SLB failure was identified at 10:57 PM. After a series of investigations, a new SLB cluster was provisioned and traffic was switched over, restoring core services by 1:40 AM. The incident exposed insufficient isolation between user-facing and internal networks, delayed multi-active failover, and limited on-call staffing.

Multi‑Active Architecture

Business services are classified as Gzone (global shared, read‑write across zones), Rzone (sharded per zone), and Czone (read‑write in all zones with relaxed consistency). Logical availability zones were reorganized in Shanghai and Jiangsu to support same‑city dual‑active and cross‑region deployments.
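The Gzone/Rzone/Czone classification determines where a write may land. The routing rule can be sketched as below; the zone names and the function shape are illustrative assumptions, not Bilibili's real topology or API.

```go
package main

import "fmt"

type zoneType int

const (
	Gzone zoneType = iota // single global write point, reads in every zone
	Rzone                 // user traffic sharded: each user pinned to one zone
	Czone                 // read-write in every zone, relaxed consistency
)

// routeWrite decides which availability zone takes a write, given the
// service's zone type, the user's home shard, and the caller's local zone.
func routeWrite(t zoneType, userShardZone, localZone, globalZone string) string {
	switch t {
	case Gzone:
		return globalZone // all writes converge on the one global zone
	case Rzone:
		return userShardZone // follow the user's shard assignment
	default: // Czone
		return localZone // write locally, reconcile asynchronously
	}
}

func main() {
	// Illustrative zone names for a same-city dual-active layout.
	fmt.Println(routeWrite(Rzone, "shanghai-az2", "jiangsu-az1", "shanghai-az1"))
	// → shanghai-az2
}
```

The trade-off is consistency versus failover speed: Gzone services are simplest to reason about but concentrate risk, while Czone services survive zone loss with no routing change at the cost of eventual consistency.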

Capacity Management and VPA

Capacity management focuses on bandwidth (access layer), compute (application layer), and storage resources. Kubernetes serves as the foundation, with VPA for vertical scaling and HPA for horizontal scaling, alongside resource-pool sharing and quota management. VPA strategies are adjusted based on CPU usage ratios, enabling efficient resource reclamation and cost reduction.
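The HPA side of this is driven by the standard Kubernetes scaling formula, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), which for CPU-based scaling looks like:

```go
package main

import (
	"fmt"
	"math"
)

// desiredReplicas applies the standard Kubernetes HPA formula:
//   desired = ceil(current * currentUtilization / targetUtilization)
func desiredReplicas(current int, currentUtil, targetUtil float64) int {
	return int(math.Ceil(float64(current) * currentUtil / targetUtil))
}

func main() {
	// 10 pods running at 80% CPU against a 50% target → scale out to 16.
	fmt.Println(desiredReplicas(10, 0.80, 0.50)) // 16
}
```

The same formula scales in as well (10 pods at 25% against a 50% target yields 5), subject to the HPA's min/max replica bounds and stabilization windows, which are omitted here.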

Activity Protection

For large events (e.g., live streams with millions of concurrent users), a workflow of activity understanding, capacity estimation, stress testing, rehearsal, checklist review, technical safeguards, and post‑mortem is followed. High‑availability mechanisms such as multi‑active, zone‑level degradation, circuit breaking, HPA, and VPA are key components.
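The capacity-estimation step in that workflow boils down to simple arithmetic: peak QPS from the activity estimate, per-pod throughput from stress testing, and a safety headroom factor. The numbers below are illustrative, not from the article.

```go
package main

import (
	"fmt"
	"math"
)

// podsNeeded estimates application-layer capacity for an event:
// peak QPS from the activity forecast, per-pod QPS from stress tests,
// plus a headroom multiplier for safety margin.
func podsNeeded(peakQPS, qpsPerPod, headroom float64) int {
	return int(math.Ceil(peakQPS * headroom / qpsPerPod))
}

func main() {
	// Hypothetical event: 500k peak QPS, 800 QPS per pod, 1.5x headroom.
	fmt.Println(podsNeeded(500_000, 800, 1.5)) // 938
}
```

The stress-test and rehearsal stages exist precisely to validate the qpsPerPod figure before the event, since an optimistic per-pod number silently undersizes the whole estimate.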

SLO Practice and Service Grading

Following Google's model, services are graded from L0 to L3, with a business → application → API hierarchy. SLOs are defined per service tier, and metrics are aggregated from the API level up to the business level. Reflections identified several challenges: the cost of grading services, metadata latency, limited SLI coverage, and alerts that were generated but never acted on.
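The API-to-service aggregation can be sketched as summing request and error counts across APIs into one availability SLI, then comparing it against the tier's SLO target as an error budget. The API names, counts, and the 99.9% target below are illustrative assumptions.

```go
package main

import "fmt"

type apiStat struct {
	name   string
	total  int64
	errors int64
}

// availability aggregates API-level counts into a service-level SLI:
// successful requests over total requests across all APIs.
func availability(apis []apiStat) float64 {
	var total, errs int64
	for _, a := range apis {
		total += a.total
		errs += a.errors
	}
	if total == 0 {
		return 1.0
	}
	return 1.0 - float64(errs)/float64(total)
}

// errorBudgetLeft returns the fraction of the error budget remaining
// for a given SLO target, e.g. 0.999 for a hypothetical L0 service.
func errorBudgetLeft(avail, slo float64) float64 {
	return 1.0 - (1.0-avail)/(1.0-slo)
}

func main() {
	apis := []apiStat{
		{"play.view", 1_000_000, 200}, // hypothetical API counters
		{"play.info", 500_000, 100},
	}
	a := availability(apis)
	fmt.Printf("availability=%.4f budgetLeft=%.2f\n", a, errorBudgetLeft(a, 0.999))
}
```

Rolling APIs up this way means a high-traffic API dominates the service SLI, which is usually desirable but is one reason per-tier SLO targets still matter.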

Incident Classification and SRE Evolution

Incident severity now considers business loss, service tier, and a primary‑scene impact coefficient, simplifying post‑mortem analysis. SRE roles are divided into daily operations, cross‑team coordination, and core platform support, requiring skills in operations, development, collaboration, and continuous learning.

SRE Training and Transformation

Training focuses on cultivating SRE culture, methodology (e.g., reading "Site Reliability Engineering" books), discussion forums, and development transition using Golang. The goal is to embed SRE practices throughout the organization and improve overall system reliability.

Tags: operations, high availability, Kubernetes, SRE, multi-active, capacity management
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
