
How Bilibili’s ChangePilot Platform Reduces Production Risk with Structured Change Management

This article explains Bilibili’s approach to change management, defining change concepts, outlining a technical framework, detailing control levels, and describing the ChangePilot platform’s architecture, integration, and future directions to improve stability in large-scale cloud‑native environments.

dbaplus Community

As enterprises grow and systems become more complex, production stability increasingly depends on rigorous change management; Google’s Site Reliability Engineering book reports that roughly 70% of outages are due to changes in a live system. This article introduces Bilibili’s change control platform, ChangePilot, by first defining change and its lifecycle, then examining the challenges that traditional ITIL-style processes face in cloud-native, microservice environments.

Background and Problem

Current change practices focus on pre‑change reviews and post‑change records, lacking control over the execution phase. This leads to issues such as reliance on human experience, limited review capacity, inability to enforce execution policies, fragmented change definitions across platforms, and insufficient data for root‑cause analysis.

Defining Change

Change is any action that affects a service’s runtime state, causing a transition from a stable to an unstable state. The lifecycle includes initiation, execution (often in batches or traffic‑based rollouts), and verification.

Technical Framework

The authors treat change as an independent technical domain, creating a unified change information model with two parts: basic information (type, object, environment, time, personnel) and control information (purpose, scenario, scope, impact). This model supports multiple platforms (applications, databases, servers) and enables standardized APIs for change ingestion, perception, control, and analysis.
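As a rough illustration of the two-part model, the sketch below groups the listed fields into two Go structs and adds a basic ingestion check. The struct names, field names, and the Validate rule are assumptions made for this example; the summary does not give ChangePilot's actual schema.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BasicInfo covers the descriptive fields the article lists:
// type, object, environment, time, and personnel.
// Field names are assumptions, not ChangePilot's schema.
type BasicInfo struct {
	Type        string
	Object      string
	Environment string
	StartTime   time.Time
	Operator    string
}

// ControlInfo covers purpose, scenario, scope, and impact.
type ControlInfo struct {
	Purpose  string
	Scenario string
	Scope    []string
	Impact   string
}

// ChangeRecord is the unified record every source platform would submit.
type ChangeRecord struct {
	Basic   BasicInfo
	Control ControlInfo
}

// Validate rejects records that lack the fields this sketch treats as required.
func (c ChangeRecord) Validate() error {
	switch {
	case c.Basic.Type == "":
		return errors.New("missing change type")
	case c.Basic.Object == "":
		return errors.New("missing change object")
	case c.Basic.Environment == "":
		return errors.New("missing environment")
	case c.Control.Scenario == "":
		return errors.New("missing change scenario")
	}
	return nil
}

func main() {
	rec := ChangeRecord{
		Basic:   BasicInfo{Type: "app_release", Object: "svc-a", Environment: "prod", StartTime: time.Now(), Operator: "alice"},
		Control: ControlInfo{Purpose: "ship v1.2", Scenario: "container_release", Scope: []string{"zone-1"}, Impact: "rolling restart"},
	}
	if err := rec.Validate(); err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("accepted %s change on %s in %s\n", rec.Basic.Type, rec.Basic.Object, rec.Basic.Environment)
}
```

Keeping the two halves separate means a database platform and a release platform can share the same basic fields while filling in control information that suits their own scenarios.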

Control Levels (G0‑G4)

Based on the AlterShield taxonomy, five control levels are defined:

G0 – Event sync only, no control.

G1 – Single‑node pre‑ and post‑checks.

G2 – Full change ticket with batch control.

G3 – Adds pre‑ticket submission checks for non‑technical users.

G4 – Introduces unattended decision making for automated execution.

Each level specifies the number of change nodes, lifecycle stages, and suitable scenarios.
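The five levels can be sketched as an ordered Go enumeration. This example assumes the capabilities are cumulative, so each level includes everything below it; that reading follows the "adds" and "introduces" wording above but is not confirmed by the source. The method names are hypothetical.

```go
package main

import "fmt"

// ControlLevel orders the five levels so they can be compared.
type ControlLevel int

const (
	G0 ControlLevel = iota // event sync only, no control
	G1                     // single-node pre- and post-checks
	G2                     // full change ticket with batch control
	G3                     // adds pre-ticket submission checks
	G4                     // adds unattended decision making
)

func (l ControlLevel) String() string { return fmt.Sprintf("G%d", int(l)) }

// The predicates below assume cumulative capabilities (an assumption).
func (l ControlLevel) NodeChecks() bool          { return l >= G1 }
func (l ControlLevel) BatchControl() bool        { return l >= G2 }
func (l ControlLevel) PreSubmitChecks() bool     { return l >= G3 }
func (l ControlLevel) UnattendedDecisions() bool { return l >= G4 }

func main() {
	for l := G0; l <= G4; l++ {
		fmt.Printf("%s node=%t batch=%t presubmit=%t unattended=%t\n",
			l, l.NodeChecks(), l.BatchControl(), l.PreSubmitChecks(), l.UnattendedDecisions())
	}
}
```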

ChangePilot Architecture

The platform provides:

Standardized change ingestion via the unified model.

Broad change perception covering servers, networks, databases, middleware, and business data.

Risk‑based control using the defined levels and configurable defense items.

Change analytics, including association analysis using trace and CMDB topology.

Subscription mechanisms (IM, webhook, MQ) for downstream consumers.
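One plausible shape for the association analysis mentioned above is: given an incident on a service, walk the dependency topology for a few hops and return recent changes on any service reached. The sketch below is only that idea under stated assumptions. The Topology map stands in for a CMDB export, and all type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Change is a finished change event on some object (hypothetical shape).
type Change struct {
	ID      string
	Object  string
	EndedAt time.Time
}

// Topology maps a service to the services it depends on (assumed CMDB export).
type Topology map[string][]string

// related returns the service and every dependency reachable within maxHops,
// mapped to its hop distance, using a breadth-first walk.
func related(topo Topology, service string, maxHops int) map[string]int {
	dist := map[string]int{service: 0}
	frontier := []string{service}
	for hop := 1; hop <= maxHops; hop++ {
		var next []string
		for _, s := range frontier {
			for _, dep := range topo[s] {
				if _, seen := dist[dep]; !seen {
					dist[dep] = hop
					next = append(next, dep)
				}
			}
		}
		frontier = next
	}
	return dist
}

// SuspectChanges keeps changes on related services that ended within
// the window before the incident.
func SuspectChanges(topo Topology, service string, incidentAt time.Time,
	window time.Duration, maxHops int, changes []Change) []Change {
	dist := related(topo, service, maxHops)
	var out []Change
	for _, c := range changes {
		if _, ok := dist[c.Object]; !ok {
			continue
		}
		age := incidentAt.Sub(c.EndedAt)
		if age >= 0 && age <= window {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	now := time.Now()
	topo := Topology{"web": {"api"}, "api": {"db"}}
	changes := []Change{
		{ID: "c1", Object: "db", EndedAt: now.Add(-10 * time.Minute)},
		{ID: "c2", Object: "cache", EndedAt: now.Add(-5 * time.Minute)},
	}
	for _, c := range SuspectChanges(topo, "web", now, time.Hour, 2, changes) {
		fmt.Println("suspect:", c.ID, "on", c.Object)
	}
}
```

A production version would presumably also use trace data to weight the candidates, which this sketch leaves out.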

Key interfaces include ChangeScene (defining a change scenario) and ChangeControl (methods InitChange, StartChangeStep, EndChangeStep, FinishChange) that platforms call at each node.
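To make the call sequence concrete, here is a minimal sketch of how a platform might drive those four methods during a two-batch rollout. Only the names ChangeScene, ChangeControl, InitChange, StartChangeStep, EndChangeStep, and FinishChange come from the article. The parameters, return types, struct fields, and the in-memory memControl implementation are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// ChangeScene describes a registered scenario. Fields are assumptions.
type ChangeScene struct {
	Name  string
	Level string // e.g. "G2"
}

// ChangeControl uses the four method names from the article.
// Parameters and return values are guesses for illustration.
type ChangeControl interface {
	InitChange(scene ChangeScene, object string) (string, error)
	StartChangeStep(changeID string, step int) error
	EndChangeStep(changeID string, step int) error
	FinishChange(changeID string) error
}

// state tracks one change: the running step (0 if none) and the last finished step.
type state struct{ running, done int }

// memControl is an in-memory stand-in that only enforces call ordering.
type memControl struct {
	nextID  int
	changes map[string]*state
}

func newMemControl() *memControl {
	return &memControl{changes: map[string]*state{}}
}

func (m *memControl) InitChange(scene ChangeScene, object string) (string, error) {
	if scene.Name == "" || object == "" {
		return "", errors.New("scene name and object are required")
	}
	m.nextID++
	id := fmt.Sprintf("chg-%d", m.nextID)
	m.changes[id] = &state{}
	return id, nil
}

func (m *memControl) StartChangeStep(id string, step int) error {
	s, ok := m.changes[id]
	switch {
	case !ok:
		return errors.New("unknown change")
	case s.running != 0:
		return errors.New("previous step still running")
	case step != s.done+1:
		return errors.New("steps must run in order")
	}
	s.running = step
	return nil
}

func (m *memControl) EndChangeStep(id string, step int) error {
	s, ok := m.changes[id]
	if !ok || s.running == 0 || s.running != step {
		return errors.New("step is not in progress")
	}
	s.running, s.done = 0, step
	return nil
}

func (m *memControl) FinishChange(id string) error {
	s, ok := m.changes[id]
	if !ok {
		return errors.New("unknown change")
	}
	if s.running != 0 {
		return errors.New("a step is still running")
	}
	delete(m.changes, id)
	return nil
}

func main() {
	var ctl ChangeControl = newMemControl()
	id, err := ctl.InitChange(ChangeScene{Name: "container_release", Level: "G2"}, "svc-a")
	if err != nil {
		fmt.Println("init failed:", err)
		return
	}
	for step := 1; step <= 2; step++ {
		if err := ctl.StartChangeStep(id, step); err != nil {
			fmt.Println("blocked before batch:", err)
			return
		}
		// The platform would deploy the batch here.
		if err := ctl.EndChangeStep(id, step); err != nil {
			fmt.Println("blocked after batch:", err)
			return
		}
	}
	fmt.Println("finish error:", ctl.FinishChange(id))
}
```

In a real integration each call would be a place where the control platform can run its checks and refuse to proceed. The stand-in above enforces only step ordering.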

Implementation Details

The original article uses Go structs to illustrate the scene definition and the control interface. At each node the platform enforces checks: pre-flight validation before the change starts, pre- and post-verification around every batch, and a final integrity check at the end. Built-in checks include change policy validation and SLO monitoring, and third parties can add their own checks through a Navigation interface.
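A staged check pipeline of this kind could look roughly like the sketch below. The Check interface, the stage names, and both sample checks are hypothetical; the summary does not show the signature of the Navigation interface, so the example does not try to reproduce it.

```go
package main

import "fmt"

// Stage names the point in the lifecycle where checks run (assumed names).
type Stage string

const (
	PreFlight Stage = "pre_flight"
	BatchPre  Stage = "batch_pre"
	BatchPost Stage = "batch_post"
	Final     Stage = "final"
)

// Result is the outcome of one check.
type Result struct {
	Pass   bool
	Reason string
}

// Check is a hypothetical plug-in point for built-in and third-party checks.
type Check interface {
	Name() string
	Run(stage Stage, object string) Result
}

// freezeWindow is a toy policy check that blocks frozen objects at pre-flight.
type freezeWindow struct{ frozen map[string]bool }

func (f freezeWindow) Name() string { return "freeze_window" }
func (f freezeWindow) Run(stage Stage, object string) Result {
	if stage == PreFlight && f.frozen[object] {
		return Result{Pass: false, Reason: "object is under change freeze"}
	}
	return Result{Pass: true}
}

// errorRateSLO is a toy SLO check evaluated after each batch and at the end.
type errorRateSLO struct {
	threshold float64
	observe   func(object string) float64 // stand-in for a metrics query
}

func (e errorRateSLO) Name() string { return "error_rate_slo" }
func (e errorRateSLO) Run(stage Stage, object string) Result {
	if stage != BatchPost && stage != Final {
		return Result{Pass: true}
	}
	if rate := e.observe(object); rate > e.threshold {
		return Result{Pass: false, Reason: fmt.Sprintf("error rate %.3f exceeds %.3f", rate, e.threshold)}
	}
	return Result{Pass: true}
}

// RunChecks runs every check for a stage and collects the failures.
func RunChecks(stage Stage, object string, checks []Check) (bool, []string) {
	var failures []string
	for _, c := range checks {
		if r := c.Run(stage, object); !r.Pass {
			failures = append(failures, c.Name()+": "+r.Reason)
		}
	}
	return len(failures) == 0, failures
}

func main() {
	checks := []Check{
		freezeWindow{frozen: map[string]bool{"svc-b": true}},
		errorRateSLO{threshold: 0.01, observe: func(string) float64 { return 0.002 }},
	}
	for _, stage := range []Stage{PreFlight, BatchPre, BatchPost, Final} {
		ok, failures := RunChecks(stage, "svc-a", checks)
		fmt.Println(stage, "passed:", ok, failures)
	}
}
```

Collecting all failures rather than stopping at the first gives the operator the full picture in one pass. A real platform might prefer to stop at the first failure to reduce latency.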

Practical Adoption

ChangePilot is applied to Bilibili’s application releases (container and physical machine) and to business gateway configuration changes, both at the G2 level. Integration currently requires platforms to embed API calls directly in their own workflows, which is intrusive; an SDK is planned to simplify this.

Operational Reflections

The authors discuss the trade‑off between risk reduction and efficiency, noting the need for emergency escape routes. Two exemption mechanisms are described: a short‑term “green channel” for rapid mitigation and a longer‑term whitelist requiring approval.
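The two exemption mechanisms can be modeled as records with an expiry, where only the whitelist also needs an approval flag. This is a minimal sketch under those assumptions. The type names, field names, and durations are illustrative and not taken from ChangePilot.

```go
package main

import (
	"fmt"
	"time"
)

// Kind distinguishes the two exemption mechanisms.
type Kind string

const (
	GreenChannel Kind = "green_channel" // short-term, for rapid mitigation
	Whitelist    Kind = "whitelist"     // longer-term, requires approval
)

// Exemption lets one object skip control until it expires (hypothetical shape).
type Exemption struct {
	Object    string
	Kind      Kind
	ExpiresAt time.Time
	Approved  bool // only consulted for whitelist entries
}

// Active reports whether the exemption applies at the given time.
func (e Exemption) Active(now time.Time) bool {
	if !now.Before(e.ExpiresAt) {
		return false
	}
	switch e.Kind {
	case GreenChannel:
		return true
	case Whitelist:
		return e.Approved
	}
	return false
}

// ShouldBypass reports whether any active exemption covers the object.
func ShouldBypass(object string, now time.Time, exemptions []Exemption) bool {
	for _, e := range exemptions {
		if e.Object == object && e.Active(now) {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	exemptions := []Exemption{
		{Object: "svc-a", Kind: GreenChannel, ExpiresAt: now.Add(30 * time.Minute)},
		{Object: "svc-b", Kind: Whitelist, ExpiresAt: now.Add(30 * 24 * time.Hour), Approved: false},
	}
	fmt.Println("svc-a bypass:", ShouldBypass("svc-a", now, exemptions))
	fmt.Println("svc-b bypass:", ShouldBypass("svc-b", now, exemptions))
}
```

Giving every exemption an expiry, including whitelist entries, keeps the escape routes from turning into permanent gaps in coverage.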

Future Directions

Plans include expanding coverage, adding more precise and intelligent checks (time‑series anomaly detection, log similarity analysis), and exploring LLM‑based change perception and analysis.

Overall, the article provides a comprehensive guide to designing, implementing, and operating a unified change management platform for large‑scale, cloud‑native services.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
