Building a Lightweight Disaster‑Recovery Drill System at Bilibili: Architecture, Practices, and Lessons
Bilibili’s infrastructure team created a lightweight, multi‑layered disaster‑recovery drill platform that combines an atomic fault library, scenario catalogs, chaos‑experiment orchestration, real‑time observation, and a product‑level interface, all backed by standardized governance and CI‑integrated automation. The result: drill preparation cut from weeks to days, and resilience testing now run weekly across the organization.
Bilibili’s infrastructure team faces increasing stability challenges as digital transformation drives higher complexity and risk of service outages. To ensure business continuity while keeping costs low, they have built a lightweight disaster‑recovery (DR) drill system that supports high‑availability, multi‑active deployments and rapid fault recovery.
The system is organized into several layers. At the bottom is an Atomic Fault Library that provides basic fault injections such as CPU spikes, network latency, and packet loss. On top of this, two categories of fault scenarios are offered: generic scenarios (e.g., CPU overload, container failure) and specialized scenarios tailored to Bilibili’s internal services (e.g., dual‑active, multi‑active strategies).
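To make the atomic fault layer concrete, here is a minimal sketch of what such a library might look like: each primitive maps a fault type to the shell command that would inject it on a target host. The tool names (`stress-ng`, `tc netem`) and all parameters are illustrative assumptions, not Bilibili’s actual implementation.

```python
# Hypothetical atomic fault library: fault type -> injection command template.
ATOMIC_FAULTS = {
    # Pin N workers spinning on CPU for the drill duration.
    "cpu_spike": "stress-ng --cpu {workers} --timeout {duration}s",
    # Add fixed latency to egress traffic on a network interface.
    "net_latency": "tc qdisc add dev {iface} root netem delay {delay_ms}ms",
    # Drop a percentage of packets on a network interface.
    "packet_loss": "tc qdisc add dev {iface} root netem loss {loss_pct}%",
}

def build_injection(fault: str, **params) -> str:
    """Render the injection command for one atomic fault primitive."""
    if fault not in ATOMIC_FAULTS:
        raise KeyError(f"unknown atomic fault: {fault}")
    return ATOMIC_FAULTS[fault].format(**params)

print(build_injection("net_latency", iface="eth0", delay_ms=200))
# → tc qdisc add dev eth0 root netem delay 200ms
```

Generic and specialized scenarios would then compose these primitives, e.g. a "dual‑active failover" scenario chaining packet loss in one region with latency in another.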
Above the scenario layer lies the Chaos Experiment Layer, which handles experiment design, fault injection, and result collection across different execution environments. Real‑time Observation monitors recovery effectiveness and business‑level stability metrics, while Experience Accumulation creates reusable templates for complex business‑level drills (e.g., gateway, cross‑region, cross‑service).
The topmost Product Layer provides orchestration, fault generation, result analysis, and report generation, enabling users to manage the entire drill lifecycle from a single interface.
To promote widespread adoption, Bilibili has standardized the organizational structure and processes. A “Safety Production Committee” defines rules, baselines, and red lines, while white papers, training programs, and a “Safety Production Month” raise awareness. The team also runs an “Isolated Island” program that simulates real network cuts at the data‑center level.
Operationally, the drill workflow is split into two core modules: the Drill Execution Process (pre‑drill planning, execution, and recovery) and the Full‑Lifecycle Process (continuous improvement, automated scheduling, and post‑drill analysis). The execution process includes scenario selection, environment scoping, fault injection, metric monitoring, and safeguard mechanisms to prevent unintended production impact.
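The execution steps above can be sketched as a small orchestrator: inject, observe, and fall back to rollback the moment a metric breaches its limit. All class and metric names here are hypothetical, intended only to illustrate the inject/monitor/safeguard loop the article describes.

```python
from dataclasses import dataclass, field

@dataclass
class DrillRun:
    scenario: str
    scope: list                       # hosts in scope, never production-wide
    events: list = field(default_factory=list)

    def inject(self):
        self.events.append(f"inject:{self.scenario}")

    def monitor(self, error_rate: float, abort_above: float = 0.05) -> bool:
        """Return True if the drill may continue, False to trigger rollback."""
        self.events.append(f"observe:error_rate={error_rate}")
        return error_rate <= abort_above

    def recover(self):
        self.events.append("rollback")

def run_drill(drill: DrillRun, observed_error_rate: float) -> str:
    drill.inject()
    if not drill.monitor(observed_error_rate):
        drill.recover()               # safeguard: undo the fault immediately
        return "aborted"
    drill.recover()                   # normal post-drill recovery
    return "completed"
```

A run that observes a 10% error rate aborts and rolls back; one that stays under the 5% threshold completes and then recovers normally.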
Automation is a key focus. The platform supports automatic validation of drill outcomes, scheduled execution, orchestration, and guard‑rail mechanisms that combine real‑time observability with rule‑based checks. Integration with CI pipelines enables “integration‑test drills” that run fault injections automatically after code commits, ensuring that new changes do not introduce hidden dependencies.
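A guard‑rail of the kind described, combining live metrics with rule‑based checks, might reduce to something like the following sketch. The metric names and limits are assumptions for illustration; a real deployment would read them from the observability stack.

```python
# Hypothetical rule table: (metric name, upper limit). Any breach halts injection.
GUARD_RAILS = [
    ("p99_latency_ms", 800),
    ("error_rate", 0.05),
    ("success_qps_drop_pct", 30),
]

def check_guard_rails(metrics: dict) -> list:
    """Return the names of all breached rules; an empty list means safe to proceed."""
    return [name for name, limit in GUARD_RAILS
            if metrics.get(name, 0) > limit]

print(check_guard_rails({"p99_latency_ms": 950, "error_rate": 0.01}))
# → ['p99_latency_ms']
```

Wired into a CI pipeline, the same check could run after each post‑commit fault injection, failing the build whenever a new change lets a guard‑rail metric breach its limit.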
Through iterative improvements (Drill 3.0), Bilibili has reduced the planning‑to‑execution cycle from weeks to days, expanded participation from a few core teams to the entire organization, and increased drill frequency from quarterly to weekly. Automated acceptance and reusable test cases now provide a compounding effect on reliability engineering.
Overall, the lightweight DR drill system demonstrates how a combination of standardized fault libraries, modular architecture, clear organizational governance, and deep automation can achieve cost‑effective, high‑frequency resilience testing in a large‑scale internet company.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.