
Beidou Container Operations Management Platform: Architecture, Automation, and Capabilities

The Beidou Operations Management Platform, created by vivo’s Internet Server team, unifies management of over twenty Kubernetes clusters and tens of thousands of nodes, automates scaling, inspections, event collection, and Helm‑based application deployment, achieving more than 90% UI‑driven operations and dramatically improving stability and operational efficiency.

vivo Internet Technology

The Beidou Operations Management Platform was developed by the vivo Internet Server team to address the growing difficulty of managing over 20 production Kubernetes clusters and tens of thousands of physical nodes. The platform provides resource management, cluster scaling, inspection, event handling, and monitoring to improve stability and operational efficiency.

Challenges in the early stage of container platform construction

Complex black‑screen (command‑line) operations: Manual procedures relied heavily on individual engineers' experience and were error‑prone.

Time‑consuming manual inspections: Essential cluster inspections required significant manual effort.

Difficulty managing multiple clusters: As the business grew, the number of clusters increased, raising operational complexity.

Complexity of self‑developed components: Managing an expanding set of custom components became a challenge.

Historical event query difficulty: Large‑scale clusters generated massive volumes of events that were hard to store and query quickly.

To overcome these issues, the team pursued white‑screen (UI‑driven) and automation approaches, converting manual black‑screen steps into programmatic workflows.

Beidou platform solution

Achieve >90% white‑screen rate for high‑frequency operations.

Unified multi‑cluster resource and configuration management.

Automated inspection framework with customizable scripts.

Application center for standardized component installation, upgrade, and configuration.

Event collection and monitoring pipeline storing billions of events in Elasticsearch.

1. Node scaling tool

The platform introduces a kubeops-controller that processes a custom Operation resource (defined by a CRD) to perform automatic cluster node scaling. An example Operation resource:

apiVersion: vcluster.caas.xxxx.com/v1alpha1
kind: Operation
metadata:
  name: scaleup-2024
  namespace: beidou-system
spec:
  clusterName: product-cluster
  operationType: ScaleUp
  operationFlow:
    - preCheck
    - scaleUp
    - postCheck
  operationMachines:
    - ip: 127.0.0.1
      role: Compute
      user: admin

Scaling consists of three steps (pre‑check, scale‑up, and post‑check), each executed as a Kubernetes Job that runs Ansible scripts based on a customized Kubespray. The reverse operation, scale‑down, follows a similar two‑step process: pre‑check, then node removal.
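The ordered flow described above can be sketched in Go. This is only an illustration under stated assumptions: StepFunc, runFlow, and demoSteps are hypothetical names, and each step function stands in for launching a Kubernetes Job and waiting for it to finish. The article does not show how kubeops-controller actually sequences its Jobs.

```go
package main

import (
	"errors"
	"fmt"
)

// StepFunc is a hypothetical stand-in for "create one Kubernetes Job that
// runs the Ansible playbook for this step and wait for its result".
type StepFunc func(machines []string) error

// runFlow executes the named steps in order and stops at the first failure.
// It returns the steps that completed successfully before any error.
func runFlow(flow []string, steps map[string]StepFunc, machines []string) ([]string, error) {
	done := make([]string, 0, len(flow))
	for _, name := range flow {
		step, ok := steps[name]
		if !ok {
			return done, fmt.Errorf("unknown step %q", name)
		}
		if err := step(machines); err != nil {
			return done, fmt.Errorf("step %q failed: %w", name, err)
		}
		done = append(done, name)
	}
	return done, nil
}

// demoSteps returns placeholder steps for the three flow names used in the
// Operation example; failAt names a step that should simulate a failed Job.
func demoSteps(failAt string) map[string]StepFunc {
	mk := func(name string) StepFunc {
		return func(machines []string) error {
			if name == failAt {
				return errors.New("simulated job failure")
			}
			return nil
		}
	}
	return map[string]StepFunc{
		"preCheck":  mk("preCheck"),
		"scaleUp":   mk("scaleUp"),
		"postCheck": mk("postCheck"),
	}
}

func main() {
	flow := []string{"preCheck", "scaleUp", "postCheck"}
	done, err := runFlow(flow, demoSteps(""), []string{"127.0.0.1"})
	fmt.Println(done, err)
}
```

Stopping at the first failed step means a node that fails pre‑check is never handed to the scale‑up step, which is the usual reason for splitting the flow this way.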

2. Full‑process automation

To further reduce manual effort, an AutoOperationTask CRD was introduced, orchestrating the entire scaling workflow:

apiVersion: vcluster.caas.xxxx.com/v1alpha1
kind: AutoOperationTask
metadata:
  name: scaleup-xxxxxx
  namespace: beidou-system
spec:
  cluster: cluster-example
  clusterType: native
  operationIds:
    - scale-up
  operationTaskMachines:
    - name: xx.xx.xx.xx
    - name: xx.xx.xx.xx
  operationTaskStep:
    - step: ScaleUpPreCheck
    - step: ScaleUpWorkprocess
    - step: ScaleUp
    - step: UncordonNodes
    - step: ScaleUpSubWorkprocess
  operationTaskType: ScaleUp

The AutoOperationTask‑controller watches these resources, creates the corresponding Operation objects, and triggers the scaling jobs, reducing a 20‑node expansion from 60 minutes to about 10 minutes.
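One way to picture the "creates the corresponding Operation objects" step is as a small translation function. The sketch below is an assumption, not the controller's real logic. The struct types mirror only a subset of the manifest fields. The field mapping (cluster to clusterName, machine names to machine addresses) and the defaultFlows table are hypothetical, and only the ScaleUp flow is listed because the article does not name the scale‑down steps.

```go
package main

import "fmt"

// AutoOperationTaskSpec mirrors a subset of the AutoOperationTask fields.
type AutoOperationTaskSpec struct {
	Cluster           string
	OperationTaskType string
	Machines          []string
}

// OperationSpec mirrors a subset of the Operation fields from section 1.
type OperationSpec struct {
	ClusterName   string
	OperationType string
	OperationFlow []string
	Machines      []string
}

// defaultFlows is a hypothetical lookup from task type to Operation flow.
var defaultFlows = map[string][]string{
	"ScaleUp": {"preCheck", "scaleUp", "postCheck"},
}

// buildOperation derives the Operation a controller might create for a task.
func buildOperation(task AutoOperationTaskSpec) (OperationSpec, error) {
	flow, ok := defaultFlows[task.OperationTaskType]
	if !ok {
		return OperationSpec{}, fmt.Errorf("unsupported task type %q", task.OperationTaskType)
	}
	if len(task.Machines) == 0 {
		return OperationSpec{}, fmt.Errorf("task for cluster %q lists no machines", task.Cluster)
	}
	return OperationSpec{
		ClusterName:   task.Cluster,
		OperationType: task.OperationTaskType,
		// Copy the slices so later edits to the task do not alter the Operation.
		OperationFlow: append([]string(nil), flow...),
		Machines:      append([]string(nil), task.Machines...),
	}, nil
}

func main() {
	op, err := buildOperation(AutoOperationTaskSpec{
		Cluster:           "cluster-example",
		OperationTaskType: "ScaleUp",
		Machines:          []string{"10.0.0.1", "10.0.0.2"},
	})
	fmt.Println(op, err)
}
```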

3. Cluster inspection tool (kube‑doctor)

The inspection framework defines an InspectionInterface in Go:

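type InspectionInterface interface {
	// RunInspectionTask runs the inspection against the given clusters.
	RunInspectionTask(ctx context.Context, cluster []ScheduledCluster) (error, []string, *AlertInfo)
	// StoreReport persists the inspection result for one cluster.
	StoreReport(result interface{}, cluster ScheduledCluster, alertMessages *SyncMessages) error
}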

Implementations cover custom scripts, data metrics (via Prometheus), and node‑problem‑detector metrics. The system records thousands of inspection reports, helping operators quickly locate configuration or resource issues.
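To make the interface concrete, here is a minimal sketch of one implementation. Everything except the two method names and their parameter and return types is an assumption: the article does not show ScheduledCluster, AlertInfo, or SyncMessages, so the fields below are placeholders, and nodeReadyInspector and inspect are hypothetical. StoreReport is read here as taking three parameters (result, cluster, alertMessages).

```go
package main

import (
	"context"
	"fmt"
)

// Placeholder types; the real definitions are not shown in the article.
type ScheduledCluster struct {
	Name       string
	ReadyNodes int
	TotalNodes int
}

type AlertInfo struct {
	Level   string
	Summary string
}

type SyncMessages struct {
	Messages []string
}

type InspectionInterface interface {
	RunInspectionTask(ctx context.Context, cluster []ScheduledCluster) (error, []string, *AlertInfo)
	StoreReport(result interface{}, cluster ScheduledCluster, alertMessages *SyncMessages) error
}

// nodeReadyInspector is a hypothetical inspection that flags clusters whose
// share of ready nodes falls below minReadyRatio.
type nodeReadyInspector struct {
	minReadyRatio float64
	reports       map[string]interface{}
}

func (n *nodeReadyInspector) RunInspectionTask(ctx context.Context, clusters []ScheduledCluster) (error, []string, *AlertInfo) {
	var findings []string
	for _, c := range clusters {
		// Stop early if the caller cancelled the inspection.
		if err := ctx.Err(); err != nil {
			return err, findings, nil
		}
		if c.TotalNodes == 0 {
			continue
		}
		ratio := float64(c.ReadyNodes) / float64(c.TotalNodes)
		if ratio < n.minReadyRatio {
			findings = append(findings, fmt.Sprintf("%s: %d/%d nodes ready", c.Name, c.ReadyNodes, c.TotalNodes))
		}
	}
	if len(findings) == 0 {
		return nil, nil, nil
	}
	alert := &AlertInfo{Level: "warning", Summary: fmt.Sprintf("%d cluster(s) below ready threshold", len(findings))}
	return nil, findings, alert
}

func (n *nodeReadyInspector) StoreReport(result interface{}, cluster ScheduledCluster, alertMessages *SyncMessages) error {
	if n.reports == nil {
		n.reports = map[string]interface{}{}
	}
	n.reports[cluster.Name] = result // in-memory only; a real store would persist this
	if alertMessages != nil {
		alertMessages.Messages = append(alertMessages.Messages, "report stored for "+cluster.Name)
	}
	return nil
}

// inspect runs any inspector with a background context and reorders the
// results into the conventional Go shape with the error last.
func inspect(i InspectionInterface, clusters []ScheduledCluster) ([]string, *AlertInfo, error) {
	err, findings, alert := i.RunInspectionTask(context.Background(), clusters)
	return findings, alert, err
}

func main() {
	insp := &nodeReadyInspector{minReadyRatio: 0.9}
	findings, alert, err := inspect(insp, []ScheduledCluster{
		{Name: "a", ReadyNodes: 10, TotalNodes: 10},
		{Name: "b", ReadyNodes: 5, TotalNodes: 10},
	})
	fmt.Println(findings, alert, err)
}
```

Because every inspection type satisfies the same interface, a scheduler can hold a slice of InspectionInterface values and run script‑based, Prometheus‑based, and node‑problem‑detector checks through one code path.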

4. Application Center

To manage the growing number of self‑developed components, the platform provides a Helm‑based Application Center. It defines helmapp and helmappversion CRDs for chart storage, and a release CRD representing an installed instance. The workflow:

Upload a Helm chart to vivo object storage.

Create a helmapp CRD (metadata) and a helmappversion CRD (version & storage address).

Instantiate a release CRD; the beidou‑release‑controller fetches the chart, applies values, and installs it into the target cluster.
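The last step can be sketched as two small functions: one resolves the chart location for a release from the stored versions, and one overlays release values on chart defaults. This is a simplified illustration under assumptions. The struct fields, resolveChart, and mergeValues are hypothetical, the chart location is a made‑up placeholder string, and the actual beidou‑release‑controller presumably delegates installation to Helm rather than merging values itself.

```go
package main

import "fmt"

// HelmAppVersion and Release are simplified, hypothetical stand-ins for the
// helmappversion and release custom resources.
type HelmAppVersion struct {
	App      string
	Version  string
	ChartURL string // storage address of the uploaded chart
}

type Release struct {
	App           string
	Version       string
	TargetCluster string
	Values        map[string]interface{}
}

// resolveChart finds the storage address for the chart a release refers to.
func resolveChart(versions []HelmAppVersion, r Release) (string, error) {
	for _, v := range versions {
		if v.App == r.App && v.Version == r.Version {
			return v.ChartURL, nil
		}
	}
	return "", fmt.Errorf("no helmappversion for %s@%s", r.App, r.Version)
}

// mergeValues overlays overrides on defaults, recursing into nested maps so
// that overriding one nested key keeps its sibling keys.
func mergeValues(defaults, overrides map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(defaults))
	for k, v := range defaults {
		out[k] = v
	}
	for k, v := range overrides {
		dm, dok := out[k].(map[string]interface{})
		om, ook := v.(map[string]interface{})
		if dok && ook {
			out[k] = mergeValues(dm, om)
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	versions := []HelmAppVersion{{App: "demo", Version: "1.0.0", ChartURL: "charts/demo-1.0.0.tgz"}}
	rel := Release{
		App:           "demo",
		Version:       "1.0.0",
		TargetCluster: "cluster-example",
		Values:        map[string]interface{}{"image": map[string]interface{}{"tag": "v2"}},
	}
	url, err := resolveChart(versions, rel)
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	defaults := map[string]interface{}{
		"replicas": 1,
		"image":    map[string]interface{}{"repo": "demo", "tag": "v1"},
	}
	fmt.Println(url, mergeValues(defaults, rel.Values))
}
```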

Over 50 applications and 200+ instances are now managed through this center.

5. Event collection and monitoring

The beidou‑event component streams Kubernetes events from the API server and writes them to a file system; the vivo logging system then ingests these files and stores the events in Elasticsearch. The beidou‑api queries Elasticsearch to present events in the UI. The system currently holds over 3 billion events, enabling rapid historical troubleshooting.
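The "events to files" stage can be illustrated with a small writer that emits one JSON object per line, a format that log shippers commonly ingest. This is a sketch under assumptions: the Event struct and its fields, writeEvents, and encodeToString are hypothetical, and the article does not state which on‑disk format beidou‑event actually uses or how it watches the API server.

```go
package main

import (
	"bufio"
	"encoding/json"
	"io"
	"os"
	"strings"
)

// Event is a trimmed, hypothetical record of a Kubernetes event.
type Event struct {
	Cluster   string `json:"cluster"`
	Namespace string `json:"namespace"`
	Kind      string `json:"kind"`
	Name      string `json:"name"`
	Reason    string `json:"reason"`
	Message   string `json:"message"`
	Timestamp string `json:"timestamp"`
}

// writeEvents writes each event as one JSON object followed by a newline.
func writeEvents(w io.Writer, events []Event) error {
	bw := bufio.NewWriter(w)
	enc := json.NewEncoder(bw) // Encode appends a newline after each value
	for _, e := range events {
		if err := enc.Encode(e); err != nil {
			return err
		}
	}
	return bw.Flush()
}

// encodeToString renders events to a string; handy for demos and tests.
func encodeToString(events []Event) (string, error) {
	var sb strings.Builder
	err := writeEvents(&sb, events)
	return sb.String(), err
}

func main() {
	events := []Event{{
		Cluster:   "cluster-example",
		Namespace: "default",
		Kind:      "Pod",
		Name:      "demo-pod",
		Reason:    "Started",
		Message:   "Started container demo",
		Timestamp: "2024-01-01T00:00:00Z",
	}}
	// A real collector would write to a rotating log file instead of stdout.
	if err := writeEvents(os.Stdout, events); err != nil {
		os.Exit(1)
	}
}
```

Including the cluster name in every record is what lets a single Elasticsearch index serve queries across many clusters.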

Summary and Outlook

More than 90% of high‑frequency operations have been white‑screened.

Cluster installation and scaling have processed tens of thousands of nodes with one‑click automation.

The Operations Center provides comprehensive resource and health monitoring.

Inspections have generated thousands of reports, and event collection has stored over 3 billion events, greatly improving issue detection.

The Application Center standardizes component lifecycle management.

Future work will integrate AI for automatic problem detection and resolution, moving toward an intelligent, fully automated operations platform.
