
Safe Change Management in Bilibili's Cloud‑Native Container Platform Caster

This article describes Bilibili's Caster platform, which implements standardized workflows, left‑shifted pre‑checks, tiered release checkpoints, and an emergency green channel to safely manage containerized application changes. Together with real‑time observability, automated rollback, and capacity‑aware scaling, these capabilities cut change‑induced incidents and improve production stability.

Bilibili Tech

Cloud‑native technologies such as containers, immutable infrastructure and declarative APIs separate business workloads from underlying hardware and architecture, improving portability, environment consistency and operational efficiency. However, the migration of many applications to containerized, micro‑service architectures and the shift of release and operation processes to PaaS increase system complexity and raise new stability requirements for production environments.

Stability is a universal concern for Internet companies. At Bilibili, more than 70% of incidents since 2022 have been caused by changes and coding issues, often aggravated by missing observability metrics, absent gray‑release monitoring, and incomplete rollback mechanisms. As business scale grows, organizational complexity further increases exposure to change risk.

This article presents the design and implementation of several key capabilities built on Bilibili's container platform Caster, focusing on safe change management for containerized applications.

Overall Design

The design follows four principles: (1) standardize the development workflow, (2) shift control points left, (3) enforce tiered release checkpoints, and (4) provide emergency escape mechanisms. Specific measures include environment‑order validation, build and release log control, branch governance, rollback plan management, and diff‑based change awareness for configurations, images, middleware and capacity.
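One of those measures, diff‑based change awareness, can be illustrated with a short sketch. The struct and field names below are hypothetical, not Caster's actual schema; the idea is that every dimension a change touches (image, configuration, capacity) is surfaced as an explicit diff before release.

```go
package main

import "fmt"

// ChangeSpec captures the dimensions a pre-release diff compares between
// the running state and the proposed release. Field names are illustrative.
type ChangeSpec struct {
	Image    string
	Config   map[string]string
	Replicas int
}

// Diff returns a human-readable list of differences between the previous
// and next specs, so reviewers see every dimension a change touches.
func Diff(prev, next ChangeSpec) []string {
	var out []string
	if prev.Image != next.Image {
		out = append(out, fmt.Sprintf("image: %s -> %s", prev.Image, next.Image))
	}
	if prev.Replicas != next.Replicas {
		out = append(out, fmt.Sprintf("replicas: %d -> %d", prev.Replicas, next.Replicas))
	}
	for k, v := range next.Config {
		if prev.Config[k] != v {
			out = append(out, fmt.Sprintf("config %s: %q -> %q", k, prev.Config[k], v))
		}
	}
	return out
}

func main() {
	prev := ChangeSpec{Image: "app:v1", Replicas: 10, Config: map[string]string{"timeout": "5s"}}
	next := ChangeSpec{Image: "app:v2", Replicas: 12, Config: map[string]string{"timeout": "3s"}}
	for _, d := range Diff(prev, next) {
		fmt.Println(d)
	}
}
```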

Tiered release introduces application grades (L0, L1, L2/L3) with staged rollout percentages and observation windows (e.g., 5 min per stage). The system monitors business SLOs, capacity metrics (QPS, CPU, memory) and reports real‑time statistics (instance ratios, capacity, release status) to a centralized change‑control platform.
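A sketch of how such tier rules might be encoded follows. The grades and the 5‑minute observation window come from the article; the stage percentages and all type names are illustrative assumptions.

```go
package main

import (
	"fmt"
	"time"
)

// TierRule defines staged rollout percentages and a per-stage observation
// window for one application grade. Percentages here are illustrative.
type TierRule struct {
	Grade  string
	Stages []int // percentage of instances on the new version per stage
	Window time.Duration
}

var rules = map[string]TierRule{
	"L0": {Grade: "L0", Stages: []int{1, 10, 50, 100}, Window: 5 * time.Minute},
	"L1": {Grade: "L1", Stages: []int{10, 50, 100}, Window: 5 * time.Minute},
	"L2": {Grade: "L2", Stages: []int{50, 100}, Window: 5 * time.Minute},
}

// stageReplicas converts a stage's percentage into a concrete target
// instance count, rounding up so every stage makes progress.
func stageReplicas(total, percent int) int {
	n := (total*percent + 99) / 100
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	r := rules["L0"]
	for i, p := range r.Stages {
		fmt.Printf("stage %d: %d%% -> %d of 20 instances, observe %s\n",
			i+1, p, stageReplicas(20, p), r.Window)
	}
}
```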

Technical Architecture

The architecture consists of four layers: product layer (exposes change‑awareness, subscription, search, defense strategy configuration, and green‑channel features), change‑defense layer (window control, constraint checks, SLO & saturation metrics, multi‑level observation points, adaptive scaling), change‑analysis layer (impact, risk and observability analysis with pre‑check reports), and integration with surrounding platforms (change‑control platform, quality platform, observability platform).

Core Processes

Release pre‑check moves risk detection to the start of the release by checking image change logs, branch policies, rollback plans, known‑bug versions, dependency versions, capacity configurations and SLO risks. The pre‑check responsibilities are split between the container release platform and the change‑control platform.
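A minimal sketch of how such a pre‑check stage could aggregate its items, assuming hypothetical names (the article lists the check categories but not their API): blocking failures gate the release while advisory ones only warn.

```go
package main

import "fmt"

// CheckResult is one pre-check item's outcome. Names are illustrative.
type CheckResult struct {
	Name     string
	Passed   bool
	Blocking bool // blocking failures stop the release, others only warn
	Detail   string
}

// runPreChecks evaluates each check and decides whether the release may
// proceed: any blocking failure gates it at the very start of the flow.
func runPreChecks(checks []func() CheckResult) (ok bool, results []CheckResult) {
	ok = true
	for _, c := range checks {
		r := c()
		results = append(results, r)
		if !r.Passed && r.Blocking {
			ok = false
		}
	}
	return ok, results
}

func main() {
	checks := []func() CheckResult{
		func() CheckResult { return CheckResult{Name: "rollback-plan", Passed: true, Blocking: true} },
		func() CheckResult {
			return CheckResult{Name: "known-bug-version", Passed: false, Blocking: true,
				Detail: "image built from a version with a known bug"}
		},
		func() CheckResult {
			return CheckResult{Name: "capacity-config", Passed: false, Blocking: false,
				Detail: "HPA max close to current replicas"}
		},
	}
	ok, results := runPreChecks(checks)
	fmt.Println("release allowed:", ok)
	for _, r := range results {
		fmt.Printf("  %s passed=%v %s\n", r.Name, r.Passed, r.Detail)
	}
}
```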

Tiered release enforces mandatory observation windows at each stage to limit fault propagation. Pre‑release conditions include fine‑grained SLO metrics per version, emergency escape capabilities, and green‑channel fast‑track for critical bugs or traffic spikes.

Release logic adapts to different deployment patterns (single‑zone, multi‑zone, iterative vs. blue‑green) and supports both batch and streaming release modes. The following code snippet shows the initialization and parameter calculation for a phased publish:

// Phased publish initialization
... // based on the dep type, deployment strategy, whether this is part of a combined release, etc., decide whether forced waiting is supported and whether pre-check information should be reported
var phase int
targetReplicas := newGrapherTargetInstance
dep := newGrapher.Deployment()
if dep.PhasedPublish() {
  if oldGrapher != nil {
    err = oldGrapher.Refresh()
    if err != nil {
      deployStepInfo.Paused = false
      deployStepInfo.ErrorMsg = err.Error()
      return deployStepInfo, err
    }
  }
  // fetch the tier rule and wait time
  waitRule := newGrapher.WaitRule(logger)
  waitTime := newGrapher.WaitTime(logger)
  // fetch any open green channel
  greenChannel := newGrapher.GreenChannel()
  // target-version instance count for the current stage, combining the green channel, batch calculation and tier rules
  targetReplicas, phase = calcParamsByWaitRule(logger, waitRule, waitTime, newGrapher.QualifiedReplicas(), newGrapherTargetInstance, greenChannel) // overwrite newGrapherTargetInstance based on the wait rule
  newGrapherTargetInstance, oldGrapherTargetInstance = adjustGrapherTargetInstance(newGrapherTargetInstance, oldGrapherTargetInstance, targetReplicas)
  if oldGrapher == nil { // no previous version exists
    oldGrapherTargetInstance = 0
  }
}
// report pre-check information for the change
reportErr := reportStepInfo4PreCheck(newGrapher, oldGrapher, targetReplicas, phase, batchSize)
if reportErr != nil {
  beego.Error(reportErr)
}
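The article does not show the body of calcParamsByWaitRule. The following is a simplified, hypothetical sketch of the behavior its inputs suggest: staged quotas cap the target replica count, already‑qualified instances advance the phase, and an open green channel skips the ramp entirely. The real function also consumes wait times and a logger, both omitted here.

```go
package main

import "fmt"

// calcStage is a simplified, hypothetical stand-in for the article's
// calcParamsByWaitRule: given a tier's per-stage percentages, the count
// of new-version instances already qualified, the overall target, and
// whether a green channel is open, it returns the capped target replica
// count and the stage index for this iteration.
func calcStage(stagePercents []int, qualified, target int, greenChannel bool) (replicas, phase int) {
	if greenChannel {
		// Green channel skips the staged ramp entirely.
		return target, len(stagePercents)
	}
	for i, p := range stagePercents {
		want := (target*p + 99) / 100 // round up so each stage makes progress
		if qualified < want {
			// Still filling this stage: cap the target at its quota.
			return want, i
		}
	}
	return target, len(stagePercents)
}

func main() {
	// 20 instances, stages of 10%/50%/100%, 2 instances already qualified:
	// the 10% stage is full, so the 50% stage becomes the current quota.
	r, p := calcStage([]int{10, 50, 100}, 2, 20, false)
	fmt.Printf("phase %d, target replicas %d\n", p, r)
}
```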

The platform also introduces a release‑time HPA that decouples autoscaling decisions from the release flow, applying capacity factors in a unidirectional, closed‑loop manner while preserving existing release pipelines and tiered controls.
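A minimal sketch of the unidirectional rule, assuming hypothetical names: during a rollout the HPA's desired count may raise the release target but never lower it, so the tiered plan stays monotonic.

```go
package main

import "fmt"

// applyCapacityFactor sketches the release-time HPA idea: the
// autoscaler's desired count is folded into the release as a capacity
// factor, applied unidirectionally. It may raise the target mid-rollout
// to absorb traffic, but scale-down signals are deferred until the
// release completes. Names are illustrative.
func applyCapacityFactor(planned, hpaDesired int) int {
	if hpaDesired > planned {
		return hpaDesired // traffic spike: follow the autoscaler up
	}
	return planned // ignore scale-down until the release completes
}

func main() {
	fmt.Println(applyCapacityFactor(10, 14)) // spike during release
	fmt.Println(applyCapacityFactor(10, 6))  // scale-down deferred
}
```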

Green Channel

The green channel provides an end‑to‑end emergency escape path for critical incidents, known bugs, or sudden traffic spikes. It integrates with multiple upstream platforms to ensure a single, coordinated fast‑track process, while respecting SLA differences across systems.
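A sketch of that authorization decision, with hypothetical trigger and field names: the channel opens only when an upstream platform has issued an authorizing ticket, keeping every bypass coordinated and auditable.

```go
package main

import "fmt"

// GreenChannelRequest models an emergency fast-track request. The
// trigger values mirror the scenarios named in the article; the field
// names and ticket mechanism are illustrative assumptions.
type GreenChannelRequest struct {
	Trigger  string // "incident", "known-bug", or "traffic-spike"
	TicketID string // reference into the authorizing upstream platform
}

// greenChannelOpen decides whether the staged observation windows may be
// bypassed: only recognized triggers backed by an upstream ticket qualify.
func greenChannelOpen(req GreenChannelRequest) (bool, string) {
	if req.TicketID == "" {
		return false, "green channel requires an authorizing ticket"
	}
	switch req.Trigger {
	case "incident", "known-bug", "traffic-spike":
		return true, "bypassing observation windows for " + req.Trigger
	}
	return false, "unknown trigger: " + req.Trigger
}

func main() {
	ok, msg := greenChannelOpen(GreenChannelRequest{Trigger: "incident", TicketID: "INC-123"})
	fmt.Println(ok, msg)
}
```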

Implementation Results

Visual dashboards demonstrate release metrics, pre‑check outcomes, and tiered release behavior. The system has reduced change‑induced incidents and improved observability during releases.

Conclusion and Outlook

Future work includes customizable tiered release rules per department, extended pre‑check indicators for core business metrics, and AI‑driven risk analysis based on application and user behavior profiles.

