
ChangePilot: Bilibili’s Unified Change Management Platform and Practices

ChangePilot is Bilibili’s unified change‑management platform that standardizes change definition, lifecycle, and risk governance through a platform‑scenario model and five control levels (G0‑G4), offering built‑in checks, searchable records, subscription alerts, intelligent correlation, and emergency channels to boost production stability while maintaining operational efficiency.

Bilibili Tech

As enterprises grow and their technology stacks become more complex, production-environment stability has become a critical concern, especially for large Internet companies. Changes are the primary cause of production incidents: Google's SRE book reports that roughly 70% of outages are due to changes in a live system. Effective change management is therefore essential to mitigate risk and ensure reliability.

This article introduces Bilibili's change-control platform, ChangePilot, by first defining the core concept of a change, then describing the logical framework of change governance, and finally presenting the platform's architecture, models, and practical experiences.

What is a change? A change is any activity that impacts the runtime state of a service, originating from development, SRE, product, or operations. It represents a transition from a stable state to an unstable one (entropy increase).

The change lifecycle comprises initiation, execution, and completion. The traditional steps (plan, pre-review, review, execution, and verification) are extended with batch-wise gray-release mechanisms that limit the impact of a faulty change.

The change platform / scenario model abstracts a change into a "platform" (the source of the change) and a "scenario" (the specific operation, e.g., database creation or instance scaling). For each scenario, ChangePilot records both basic information (content schema) and control information (step schema).

The platform defines a unified scenario model, ChangeScene:

    type ChangeScene struct {
        // Owning change platform
        Platform
        // Associated change resource type
        SourceType
        // Schema for basic information (change content)
        ContentSchema
        // Schema for control information (batch content)
        StepSchema
        // Check items
        []Navigation
        ...
    }

and a control interface, ChangeControl:

    type ChangeControl interface {
        // Initialize the change
        InitChange()
        // Finish the change
        FinishChange()
        // Start a change batch
        StartChangeStep()
        // End a change batch
        EndChangeStep()
    }

Each node in the change flow invokes these methods to perform admission checks, batch pre-checks, post-checks, and final verification.
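How a change flow might call these four hooks around each batch can be sketched as follows. This is a minimal illustration, not ChangePilot's implementation: the method parameters and error returns are assumptions (the source shows the methods without signatures), and `runChange`, `recordingControl`, and the `apply` callback are hypothetical names introduced here.

```go
package main

import (
	"errors"
	"fmt"
)

// ChangeControl mirrors the four hooks named in the article. The parameters
// and error returns are assumptions; the source shows the methods bare.
type ChangeControl interface {
	InitChange(changeID string) error
	FinishChange(changeID string) error
	StartChangeStep(changeID string, batch int) error
	EndChangeStep(changeID string, batch int) error
}

// runChange drives one change through its batches and stops at the first
// failing hook. apply is a hypothetical callback that rolls out one batch.
func runChange(ctl ChangeControl, changeID string, batches int, apply func(batch int) error) error {
	if err := ctl.InitChange(changeID); err != nil {
		return fmt.Errorf("admission check: %w", err)
	}
	for b := 1; b <= batches; b++ {
		if err := ctl.StartChangeStep(changeID, b); err != nil {
			return fmt.Errorf("batch %d pre-check: %w", b, err)
		}
		if err := apply(b); err != nil {
			return fmt.Errorf("batch %d rollout: %w", b, err)
		}
		if err := ctl.EndChangeStep(changeID, b); err != nil {
			return fmt.Errorf("batch %d post-check: %w", b, err)
		}
	}
	if err := ctl.FinishChange(changeID); err != nil {
		return fmt.Errorf("final verification: %w", err)
	}
	return nil
}

// recordingControl is a toy implementation: it records every hook call and
// can be told to reject the post-check of one batch (0 = never reject).
type recordingControl struct {
	calls       []string
	failAtBatch int
}

func (r *recordingControl) InitChange(string) error {
	r.calls = append(r.calls, "init")
	return nil
}

func (r *recordingControl) FinishChange(string) error {
	r.calls = append(r.calls, "finish")
	return nil
}

func (r *recordingControl) StartChangeStep(_ string, batch int) error {
	r.calls = append(r.calls, fmt.Sprintf("start-%d", batch))
	return nil
}

func (r *recordingControl) EndChangeStep(_ string, batch int) error {
	r.calls = append(r.calls, fmt.Sprintf("end-%d", batch))
	if batch == r.failAtBatch {
		return errors.New("post-check failed")
	}
	return nil
}

func main() {
	ctl := &recordingControl{failAtBatch: 2}
	err := runChange(ctl, "chg-1", 3, func(int) error { return nil })
	fmt.Println(ctl.calls, err)
}
```

In this sketch a failed post-check halts the rollout before the next batch starts, which is the point of batch-wise gray release: a bad change is caught while only part of the fleet is affected.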

Control levels (G0‑G4) categorize changes by risk and required governance:

G0 – Event sync only, no checks.

G1 – Single‑node pre‑admission and post‑integrity checks.

G2 – Full change ticket with batch execution, comprehensive pre‑, intra‑, and post‑checks.

G3 – Adds change‑request awareness before G2 steps.

G4 – Adds unattended decision capability after G3.

Each level adds corresponding nodes to the flow model, enabling fine‑grained risk control.
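One way to picture how each level contributes nodes to the flow is a small mapping from level to an ordered node list. The sketch below is an assumption-laden illustration: the node names are invented for readability, and the exact positions of the G3 change-request node and the G4 unattended-decision node are guesses based on the level descriptions above.

```go
package main

import "fmt"

// Level is the control level of a change scenario.
type Level int

const (
	G0 Level = iota
	G1
	G2
	G3
	G4
)

// flowNodes returns the ordered nodes for a level. Node names and the
// position of the G3/G4 nodes are illustrative assumptions.
func flowNodes(l Level) []string {
	switch {
	case l < G0 || l > G4:
		return nil // unknown level
	case l == G0:
		return []string{"event-sync"}
	case l == G1:
		return []string{"admission-check", "execute", "integrity-check"}
	}
	// G2 and above share the full batch flow.
	var nodes []string
	if l >= G3 {
		nodes = append(nodes, "change-request")
	}
	nodes = append(nodes,
		"admission-check",
		"batch-pre-check",
		"batch-execute",
		"batch-in-flight-check",
		"batch-post-check",
		"final-verification",
	)
	if l >= G4 {
		nodes = append(nodes, "unattended-decision")
	}
	return nodes
}

func main() {
	for l := G0; l <= G4; l++ {
		fmt.Printf("G%d: %v\n", l, flowNodes(l))
	}
}
```

Modeling levels as an ordered type makes "G3 is G2 plus one node" a simple comparison, so a scenario can be promoted to a stricter level without redefining its flow.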

The platform also integrates built-in check items, modeled by the Navigation interface:

    type Navigation interface {
        ParamProvider() func(context.Context, Info) (Param, error)
        SpecProvider() func(context.Context, Info) (Spec, error)
        Execute(context.Context, Param, Spec) (Result, error)
    }

The built-in set can be extended with third-party checks. Examples include mandatory gray-release validation, SLO compliance checks, and custom business-specific checks.

Platform capabilities include:

Change information ingestion via the unified model.

Rich change search (full‑text, CMDB‑based).

Subscription mechanisms (IM, webhook, MQ).

Intelligent analysis using trace data and CMDB topology to correlate changes across services.

Emergency escape channels (green channel for 1‑hour fast‑track, whitelist for longer exemptions).

Practice highlights:

Application release (container and bare‑metal) integrated at G2 level, with detailed batch information captured.

Business gateway configuration changes also managed at G2.

Future plans include SDKs to simplify integration, broader coverage, and AI-enhanced anomaly detection (time-series, log similarity), as well as LLM-driven change perception.

In summary, ChangePilot provides a standardized, multi-level change governance framework that unifies change definition, lifecycle management, risk control, and intelligent analysis, helping Bilibili improve stability while balancing efficiency.
