How Bilibili’s ChangePilot Platform Reduces Production Risk with Structured Change Management
This article explains Bilibili’s approach to change management, defining change concepts, outlining a technical framework, detailing control levels, and describing the ChangePilot platform’s architecture, integration, and future directions to improve stability in large-scale cloud‑native environments.
As enterprises grow and systems become more complex, production stability increasingly depends on rigorous change management. Google’s SRE book reports that roughly 70% of outages are due to changes in a live system. The article introduces Bilibili’s change control platform, ChangePilot, by first defining change and its lifecycle, then describing the challenges that traditional ITIL-style processes face in cloud-native, microservice environments.
Background and Problem
Current change practices focus on pre‑change reviews and post‑change records, lacking control over the execution phase. This leads to issues such as reliance on human experience, limited review capacity, inability to enforce execution policies, fragmented change definitions across platforms, and insufficient data for root‑cause analysis.
Defining Change
Change is any action that affects a service’s runtime state, causing a transition from a stable to an unstable state. The lifecycle includes initiation, execution (often in batches or traffic‑based rollouts), and verification.
Technical Framework
The authors treat change as an independent technical domain, creating a unified change information model with two parts: basic information (type, object, environment, time, personnel) and control information (purpose, scenario, scope, impact). This model supports multiple platforms (applications, databases, servers) and enables standardized APIs for change ingestion, perception, control, and analysis.
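As a rough illustration of the two-part model described above, the following Go sketch groups the basic and control fields into one record. The field names, the example values, and the choice of which fields are mandatory are assumptions made for this example; the source lists only the categories, not a concrete schema.

```go
package main

import (
	"fmt"
	"time"
)

// BasicInfo answers "what, where, when, who" for a change.
// Field names are hypothetical; the source names only the categories.
type BasicInfo struct {
	Type        string    // e.g. "app-release" (illustrative value)
	Object      string    // the thing being changed
	Environment string    // e.g. "prod"
	StartTime   time.Time // when the change begins
	Operator    string    // who runs the change
}

// ControlInfo carries what the control layer needs to assess risk.
type ControlInfo struct {
	Purpose  string
	Scenario string   // which change scenario applies
	Scope    []string // affected zones, clusters, or instances
	Impact   string
}

// ChangeInfo is the unified record a source platform would submit.
type ChangeInfo struct {
	Basic   BasicInfo
	Control ControlInfo
}

// Validate rejects records that lack the fields this sketch treats as mandatory.
func (c ChangeInfo) Validate() error {
	required := []struct{ name, value string }{
		{"basic.type", c.Basic.Type},
		{"basic.object", c.Basic.Object},
		{"basic.environment", c.Basic.Environment},
		{"basic.operator", c.Basic.Operator},
		{"control.scenario", c.Control.Scenario},
	}
	for _, f := range required {
		if f.value == "" {
			return fmt.Errorf("missing required field %s", f.name)
		}
	}
	return nil
}

func main() {
	c := ChangeInfo{
		Basic: BasicInfo{
			Type: "app-release", Object: "svc-a", Environment: "prod",
			StartTime: time.Now(), Operator: "alice",
		},
		Control: ControlInfo{Purpose: "bugfix", Scenario: "container-release", Scope: []string{"zone-1"}},
	}
	fmt.Println("validation error:", c.Validate())
}
```

Keeping one record shape for applications, databases, and servers is what would let ingestion, control, and analysis share the same APIs.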
Control Levels (G0‑G4)
Based on the AlterShield taxonomy, five control levels are defined:
G0 – Event sync only, no control.
G1 – Single‑node pre‑ and post‑checks.
G2 – Full change ticket with batch control.
G3 – Adds pre‑ticket submission checks for non‑technical users.
G4 – Introduces unattended decision making for automated execution.
Each level specifies the number of change nodes, lifecycle stages, and suitable scenarios.
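The levels can be pictured as an ordered enumeration. The sketch below assumes that each level includes the capabilities of the levels beneath it, which the wording "adds" and "introduces" suggests but the source does not state outright; the capability names are invented for this example.

```go
package main

import "fmt"

// Level is a control level from G0 to G4.
type Level int

const (
	G0 Level = iota // event sync only
	G1              // single-node pre- and post-checks
	G2              // full change ticket with batch control
	G3              // pre-ticket submission checks
	G4              // unattended decision making
)

// String renders the level as "G0" through "G4".
func (l Level) String() string { return fmt.Sprintf("G%d", int(l)) }

// Capabilities lists what a level enables. Names are hypothetical.
type Capabilities struct {
	EventSync       bool
	NodeChecks      bool
	BatchControl    bool
	PreSubmitChecks bool
	Unattended      bool
}

// Capabilities assumes levels are cumulative: each one keeps
// everything the lower levels provide.
func (l Level) Capabilities() Capabilities {
	return Capabilities{
		EventSync:       l >= G0,
		NodeChecks:      l >= G1,
		BatchControl:    l >= G2,
		PreSubmitChecks: l >= G3,
		Unattended:      l >= G4,
	}
}

func main() {
	for l := G0; l <= G4; l++ {
		fmt.Printf("%s: %+v\n", l, l.Capabilities())
	}
}
```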
ChangePilot Architecture
The platform provides:
Standardized change ingestion via the unified model.
Broad change perception covering servers, networks, databases, middleware, and business data.
Risk‑based control using the defined levels and configurable defense items.
Change analytics, including association analysis using trace and CMDB topology.
Subscription mechanisms (IM, webhook, MQ) for downstream consumers.
Key interfaces include ChangeScene (defining a change scenario) and ChangeControl (methods InitChange, StartChangeStep, EndChangeStep, FinishChange) that platforms call at each node.
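To make the call sequence concrete, here is a minimal in-memory sketch of the control interface. Only the four method names and the ChangeScene name come from the source; the parameters, return values, struct fields, and the memControl type are assumptions for illustration and are not ChangePilot's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// ChangeScene describes a change scenario. Fields are hypothetical.
type ChangeScene struct {
	Name  string
	Level string
}

// ChangeControl uses the method names from the article; signatures are assumed.
type ChangeControl interface {
	InitChange(scene ChangeScene, object string) (changeID string, err error)
	StartChangeStep(changeID, step string) error
	EndChangeStep(changeID, step string, success bool) error
	FinishChange(changeID string) error
}

// memControl is a toy implementation that only enforces call ordering.
type memControl struct {
	next   int
	active map[string]string // changeID -> running step ("" when idle)
}

func newMemControl() *memControl { return &memControl{active: map[string]string{}} }

func (m *memControl) InitChange(scene ChangeScene, object string) (string, error) {
	if scene.Name == "" || object == "" {
		return "", errors.New("scene name and object are required")
	}
	m.next++
	id := fmt.Sprintf("chg-%d", m.next)
	m.active[id] = ""
	return id, nil
}

func (m *memControl) StartChangeStep(id, step string) error {
	cur, ok := m.active[id]
	switch {
	case !ok:
		return errors.New("unknown change")
	case step == "":
		return errors.New("step name is required")
	case cur != "":
		return fmt.Errorf("step %q is still running", cur)
	}
	m.active[id] = step // a real platform would run pre-checks here
	return nil
}

func (m *memControl) EndChangeStep(id, step string, success bool) error {
	if cur, ok := m.active[id]; !ok || cur != step {
		return fmt.Errorf("step %q is not running", step)
	}
	m.active[id] = ""
	if !success {
		return fmt.Errorf("step %q failed; rollout should halt", step)
	}
	return nil
}

func (m *memControl) FinishChange(id string) error {
	cur, ok := m.active[id]
	if !ok {
		return errors.New("unknown change")
	}
	if cur != "" {
		return fmt.Errorf("step %q is still running", cur)
	}
	delete(m.active, id)
	return nil
}

func main() {
	var ctl ChangeControl = newMemControl()
	id, _ := ctl.InitChange(ChangeScene{Name: "app-release", Level: "G2"}, "svc-a")
	for _, batch := range []string{"batch-1", "batch-2"} {
		if err := ctl.StartChangeStep(id, batch); err != nil {
			fmt.Println("blocked:", err)
			return
		}
		// The source platform would deploy the batch here.
		if err := ctl.EndChangeStep(id, batch, true); err != nil {
			fmt.Println("halted:", err)
			return
		}
	}
	fmt.Println("finish error:", ctl.FinishChange(id))
}
```

In this shape, the source platform brackets each batch with a start and end call, which gives the control side a hook to block or halt the rollout between batches.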
Implementation Details
The original article uses Go structs to illustrate the scene definition and the control interface. The platform enforces checks at each stage: pre-flight validation, verification before and after each batch, and a final integrity check. Built-in checks include change policy validation and SLO monitoring, and third parties can add their own checks through a Navigation interface.
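One way to picture stage-bound checks is a small registry in which each check declares the stages it applies to. Everything in this sketch (the Stage constants, the Check and Metrics types, the two example checks, and the thresholds) is hypothetical, and it does not reproduce the Navigation interface, whose shape the source does not describe.

```go
package main

import "fmt"

// Stage is a point in the change lifecycle where checks run.
type Stage string

const (
	PreFlight Stage = "pre-flight"
	BatchPre  Stage = "batch-pre"
	BatchPost Stage = "batch-post"
	Final     Stage = "final"
)

// Metrics is the (hypothetical) input that checks evaluate.
type Metrics struct {
	ErrorRate      float64 // fraction of failed requests
	InFreezeWindow bool    // true during a change freeze
}

// Check is one configurable defense item.
type Check struct {
	Name   string
	Stages []Stage
	Run    func(m Metrics) error
}

// sloCheck fails when the error rate exceeds the given budget.
func sloCheck(maxErrorRate float64) Check {
	return Check{
		Name:   "slo-error-rate",
		Stages: []Stage{BatchPost, Final},
		Run: func(m Metrics) error {
			if m.ErrorRate > maxErrorRate {
				return fmt.Errorf("error rate %.3f exceeds %.3f", m.ErrorRate, maxErrorRate)
			}
			return nil
		},
	}
}

// freezeCheck is a simple policy check that blocks changes during a freeze.
func freezeCheck() Check {
	return Check{
		Name:   "change-freeze",
		Stages: []Stage{PreFlight},
		Run: func(m Metrics) error {
			if m.InFreezeWindow {
				return fmt.Errorf("changes are frozen")
			}
			return nil
		},
	}
}

// RunStage runs every check registered for the stage and collects failures.
func RunStage(stage Stage, checks []Check, m Metrics) []error {
	var failures []error
	for _, c := range checks {
		for _, s := range c.Stages {
			if s == stage {
				if err := c.Run(m); err != nil {
					failures = append(failures, fmt.Errorf("%s: %w", c.Name, err))
				}
				break
			}
		}
	}
	return failures
}

func main() {
	checks := []Check{freezeCheck(), sloCheck(0.01)}
	fmt.Println(RunStage(PreFlight, checks, Metrics{InFreezeWindow: true}))
	fmt.Println(RunStage(BatchPost, checks, Metrics{ErrorRate: 0.05}))
}
```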
Practical Adoption
ChangePilot is applied to Bilibili’s application releases (on both containers and physical machines) and to business gateway configuration changes; both scenarios run at the G2 level. An SDK is planned to simplify integration and reduce the intrusive API calls that onboarding currently requires.
Operational Reflections
The authors discuss the trade‑off between risk reduction and efficiency, noting the need for emergency escape routes. Two exemption mechanisms are described: a short‑term “green channel” for rapid mitigation and a longer‑term whitelist requiring approval.
Future Directions
Plans include expanding coverage, adding more precise and intelligent checks (time‑series anomaly detection, log similarity analysis), and exploring LLM‑based change perception and analysis.
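The source does not say which anomaly detection method is planned. As a stand-in, the sketch below shows one of the simplest options: a z-score test that compares a post-change metric sample against a pre-change baseline. The function names and the threshold of three standard deviations are assumptions for this example.

```go
package main

import (
	"fmt"
	"math"
)

// zScore returns how many standard deviations x lies from the baseline mean.
// ok is false when the baseline is empty or has zero variance.
func zScore(baseline []float64, x float64) (z float64, ok bool) {
	if len(baseline) == 0 {
		return 0, false
	}
	var sum float64
	for _, v := range baseline {
		sum += v
	}
	mean := sum / float64(len(baseline))

	var sq float64
	for _, v := range baseline {
		sq += (v - mean) * (v - mean)
	}
	std := math.Sqrt(sq / float64(len(baseline))) // population standard deviation
	if std == 0 {
		return 0, false
	}
	return (x - mean) / std, true
}

// IsAnomalous flags x when its absolute z-score exceeds the threshold.
func IsAnomalous(baseline []float64, x, threshold float64) bool {
	z, ok := zScore(baseline, x)
	return ok && math.Abs(z) > threshold
}

// explain formats a short verdict for a post-change sample.
func explain(baseline []float64, x, threshold float64) string {
	return fmt.Sprintf("value=%.1f anomalous=%t", x, IsAnomalous(baseline, x, threshold))
}

func main() {
	latencyMs := []float64{10, 11, 9, 10, 10} // pre-change baseline (made-up data)
	fmt.Println(explain(latencyMs, 30, 3))
	fmt.Println(explain(latencyMs, 10.5, 3))
}
```

A fixed z-score threshold ignores seasonality and trend, which is one reason a production system would likely need the more precise methods the authors allude to.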
Overall, the article provides a comprehensive guide to designing, implementing, and operating a unified change management platform for large‑scale, cloud‑native services.
dbaplus Community