
Beidou Container Operations Management Platform: Architecture, Automation, and Capabilities

The Beidou Operations Management Platform, created by vivo’s Internet Server team, unifies management of more than twenty Kubernetes clusters and tens of thousands of nodes. It automates node scaling, inspections, event collection, and Helm‑based application deployment, and makes more than 90% of high‑frequency operations UI‑driven, improving stability and operational efficiency.

vivo Internet Technology

Mar 5, 2025


The Beidou Operations Management Platform was developed by the vivo Internet Server team to address the growing difficulty of managing over 20 production Kubernetes clusters and tens of thousands of physical nodes. The platform provides resource management, cluster scaling, inspection, event handling, and monitoring to improve stability and operational efficiency.

Challenges in the early stage of container platform construction

Complex black‑screen (command‑line) operations: Manual procedures relied heavily on individual engineers' experience and were error‑prone.

Time‑consuming manual inspections: Essential cluster inspections required significant manual effort.

Difficulty managing multiple clusters: As the business grew, the number of clusters increased, raising operational complexity.

Complexity of self‑developed components: Managing an expanding set of custom components became a challenge.

Difficulty querying historical events: Large‑scale clusters generated massive volumes of events that were hard to store and query quickly.

To overcome these issues, the team pursued white‑screen (UI‑driven) and automation approaches, converting manual black‑screen steps into programmatic workflows.

Beidou platform solution

Achieve >90% white‑screen rate for high‑frequency operations.

Unified multi‑cluster resource and configuration management.

Automated inspection framework with customizable scripts.

Application center for standardized component installation, upgrade, and configuration.

Event collection and monitoring pipeline storing billions of events in Elasticsearch.

1. Node scaling tool

The platform introduces a kubeops-controller that processes a custom Operation resource (defined by a CRD) to scale cluster nodes automatically. An example Operation resource:

apiVersion: vcluster.caas.xxxx.com/v1alpha1
kind: Operation
metadata:
  name: scaleup-2024
  namespace: beidou-system
spec:
  clusterName: product-cluster
  operationType: ScaleUp
  operationFlow:
    - preCheck
    - scaleUp
    - postCheck
  operationMachines:
    - ip: 127.0.0.1
      role: Compute
      user: admin

Scaling up consists of three steps (pre‑check, scale‑up, and post‑check), each executed as a Kubernetes Job that runs Ansible scripts based on a customized Kubespray. Scaling down follows a similar two‑step process: pre‑check, then node removal.
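The ordered flow can be sketched in Go as below. This is a minimal illustration, not the kubeops-controller's code: `StepFunc` and `RunFlow` are hypothetical names, each step is a stub standing in for a Kubernetes Job, and stopping at the first failed step is an assumption rather than documented behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// StepFunc stands in for launching one Kubernetes Job and waiting for it.
// In the platform each step runs Ansible; here it is only a stub (assumption).
type StepFunc func(machines []string) error

// RunFlow executes the named steps in order and stops at the first failure.
// It returns the names of the steps that completed successfully.
func RunFlow(flow []string, steps map[string]StepFunc, machines []string) ([]string, error) {
	done := make([]string, 0, len(flow))
	for _, name := range flow {
		step, ok := steps[name]
		if !ok {
			return done, fmt.Errorf("unknown step %q", name)
		}
		if err := step(machines); err != nil {
			return done, fmt.Errorf("step %q failed: %w", name, err)
		}
		done = append(done, name)
	}
	return done, nil
}

func main() {
	ok := func([]string) error { return nil }
	steps := map[string]StepFunc{
		"preCheck":  ok,
		"scaleUp":   func([]string) error { return errors.New("ansible exited non-zero") },
		"postCheck": ok,
	}
	// The step names match the operationFlow entries in the example resource.
	done, err := RunFlow([]string{"preCheck", "scaleUp", "postCheck"}, steps, []string{"127.0.0.1"})
	fmt.Println(done, err)
}
```

Returning the completed steps alongside the error lets a caller record how far an operation got before it failed.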

2. Full‑process automation

To further reduce manual effort, an AutoOperationTask CRD was introduced, orchestrating the entire scaling workflow:

apiVersion: vcluster.caas.xxxx.com/v1alpha1
kind: AutoOperationTask
metadata:
  name: scaleup-xxxxxx
  namespace: beidou-system
spec:
  cluster: cluster-example
  clusterType: native
  operationIds:
    - scale-up
  operationTaskMachines:
    - name: xx.xx.xx.xx
    - name: xx.xx.xx.xx
  operationTaskStep:
    - step: ScaleUpPreCheck
    - step: ScaleUpWorkprocess
    - step: ScaleUp
    - step: UncordonNodes
    - step: ScaleUpSubWorkprocess
  operationTaskType: ScaleUp

The AutoOperationTask‑controller watches these resources, creates the corresponding Operation objects, and triggers the scaling jobs. This cut a 20‑node expansion from 60 minutes to about 10 minutes.
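How a task might be turned into an Operation can be sketched as follows. The Go types mirror only the YAML fields shown in the two examples; the real type definitions are not published, so every struct here, the `BuildOperation` function, and the fixed flow, role, and user values are assumptions made for illustration.

```go
package main

import "fmt"

// Machine mirrors one operationMachines entry from the Operation example.
type Machine struct {
	IP   string
	Role string
	User string
}

// OperationSpec mirrors the spec fields of the Operation example (assumed shape).
type OperationSpec struct {
	ClusterName       string
	OperationType     string
	OperationFlow     []string
	OperationMachines []Machine
}

// AutoOperationTaskSpec keeps only the task fields this sketch needs.
type AutoOperationTaskSpec struct {
	Cluster           string
	OperationTaskType string
	Machines          []string // operationTaskMachines[].name
}

// BuildOperation derives the Operation a controller might create for a task.
// The flow, role, and user are placeholders copied from the example resource.
func BuildOperation(task AutoOperationTaskSpec) (OperationSpec, error) {
	if task.OperationTaskType != "ScaleUp" {
		return OperationSpec{}, fmt.Errorf("unsupported task type %q", task.OperationTaskType)
	}
	op := OperationSpec{
		ClusterName:   task.Cluster,
		OperationType: task.OperationTaskType,
		OperationFlow: []string{"preCheck", "scaleUp", "postCheck"},
	}
	for _, name := range task.Machines {
		op.OperationMachines = append(op.OperationMachines, Machine{IP: name, Role: "Compute", User: "admin"})
	}
	return op, nil
}

func main() {
	op, err := BuildOperation(AutoOperationTaskSpec{
		Cluster:           "cluster-example",
		OperationTaskType: "ScaleUp",
		Machines:          []string{"10.0.0.1", "10.0.0.2"},
	})
	fmt.Printf("%+v %v\n", op, err)
}
```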

3. Cluster inspection tool (kube‑doctor)

The inspection framework defines an InspectionInterface in Go:

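type InspectionInterface interface {
	// RunInspectionTask runs the inspection against the scheduled clusters.
	RunInspectionTask(ctx context.Context, cluster []ScheduledCluster) (error, []string, *AlertInfo)
	// StoreReport saves the inspection result for one cluster.
	StoreReport(result interface{}, cluster ScheduledCluster, alertMessages *SyncMessages) error
}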

Implementations cover custom scripts, data metrics (via Prometheus), and node‑problem‑detector metrics. The system records thousands of inspection reports, helping operators quickly locate configuration or resource issues.
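A minimal implementation of this interface might look like the sketch below. The article names `ScheduledCluster`, `AlertInfo`, and `SyncMessages` but does not show their fields, so the fields here are placeholders; `NodeCountInspection` and `Inspect` are hypothetical and exist only to show how an inspection plugs into the interface.

```go
package main

import (
	"context"
	"fmt"
)

// Placeholder types: named in the article, fields assumed for this sketch.
type ScheduledCluster struct{ Name string }
type AlertInfo struct{ Summary string }
type SyncMessages struct{ Messages []string }

type InspectionInterface interface {
	RunInspectionTask(ctx context.Context, cluster []ScheduledCluster) (error, []string, *AlertInfo)
	StoreReport(result interface{}, cluster ScheduledCluster, alertMessages *SyncMessages) error
}

// NodeCountInspection is a hypothetical check that flags clusters whose
// ready-node count is below a threshold. Counts are injected for the demo.
type NodeCountInspection struct {
	ReadyNodes map[string]int
	MinReady   int
	Reports    map[string]interface{} // in-memory stand-in for report storage
}

func (n *NodeCountInspection) RunInspectionTask(ctx context.Context, clusters []ScheduledCluster) (error, []string, *AlertInfo) {
	var findings []string
	for _, c := range clusters {
		if err := ctx.Err(); err != nil {
			return err, findings, nil
		}
		if got := n.ReadyNodes[c.Name]; got < n.MinReady {
			findings = append(findings, fmt.Sprintf("%s: %d ready nodes, want >= %d", c.Name, got, n.MinReady))
		}
	}
	if len(findings) == 0 {
		return nil, nil, nil
	}
	return nil, findings, &AlertInfo{Summary: fmt.Sprintf("%d cluster(s) below threshold", len(findings))}
}

func (n *NodeCountInspection) StoreReport(result interface{}, cluster ScheduledCluster, alertMessages *SyncMessages) error {
	if n.Reports == nil {
		n.Reports = make(map[string]interface{})
	}
	n.Reports[cluster.Name] = result
	return nil
}

// Compile-time check that the sketch satisfies the interface.
var _ InspectionInterface = (*NodeCountInspection)(nil)

// Inspect runs one inspection and stores the findings for every cluster.
func Inspect(insp InspectionInterface, clusters []ScheduledCluster) ([]string, error) {
	err, findings, _ := insp.RunInspectionTask(context.Background(), clusters)
	if err != nil {
		return nil, err
	}
	for _, c := range clusters {
		if err := insp.StoreReport(findings, c, &SyncMessages{Messages: findings}); err != nil {
			return findings, err
		}
	}
	return findings, nil
}

func main() {
	insp := &NodeCountInspection{ReadyNodes: map[string]int{"a": 5, "b": 1}, MinReady: 3}
	findings, err := Inspect(insp, []ScheduledCluster{{Name: "a"}, {Name: "b"}})
	fmt.Println(findings, err)
}
```

Because callers depend only on the interface, script-based, Prometheus-based, and node-problem-detector inspections can share one scheduling and reporting path.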

4. Application Center

To manage the growing number of self‑developed components, the platform provides a Helm‑based Application Center. It defines helmapp and helmappversion CRDs for chart storage, and a release CRD representing an installed instance. The workflow:

Upload a Helm chart to vivo object storage.

Create a helmapp resource (application metadata) and a helmappversion resource (version and storage address).

Create a release resource; the beidou‑release‑controller fetches the chart, applies the values, and installs it into the target cluster.

Over 50 applications and 200+ instances are now managed through this center.
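The lookup step of that workflow can be sketched as below. The struct fields, the `InstallPlan` type, the `Resolve` function, and the example URL are all hypothetical; the article does not publish the CRD schemas, and fetching the chart and invoking Helm are deliberately left out.

```go
package main

import "fmt"

// HelmAppVersion is an assumed, simplified shape for a helmappversion resource.
type HelmAppVersion struct {
	App      string // owning helmapp
	Version  string
	ChartURL string // object-storage address of the packaged chart
}

// Release is an assumed, simplified shape for a release resource.
type Release struct {
	Name          string
	App           string
	Version       string
	TargetCluster string
	Values        map[string]string
}

// InstallPlan is what a release controller would need before installing.
type InstallPlan struct {
	ChartURL      string
	TargetCluster string
	Values        map[string]string
}

// Resolve finds the chart location for the app and version a release names.
func Resolve(r Release, versions []HelmAppVersion) (InstallPlan, error) {
	for _, v := range versions {
		if v.App == r.App && v.Version == r.Version {
			return InstallPlan{ChartURL: v.ChartURL, TargetCluster: r.TargetCluster, Values: r.Values}, nil
		}
	}
	return InstallPlan{}, fmt.Errorf("no helmappversion %s/%s", r.App, r.Version)
}

func main() {
	versions := []HelmAppVersion{
		// Placeholder address, not a real vivo object-storage URL.
		{App: "kube-doctor", Version: "1.2.0", ChartURL: "https://objects.example.com/charts/kube-doctor-1.2.0.tgz"},
	}
	plan, err := Resolve(Release{
		Name: "kube-doctor-prod", App: "kube-doctor", Version: "1.2.0",
		TargetCluster: "product-cluster", Values: map[string]string{"replicas": "2"},
	}, versions)
	fmt.Printf("%+v %v\n", plan, err)
}
```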

5. Event collection and monitoring

The beidou‑event component watches Kubernetes events from the API server and writes them to the file system. The vivo logging system then ingests these files and stores the events in Elasticsearch, and beidou‑api queries Elasticsearch to present events in the UI. The system currently holds over 3 billion events, enabling rapid historical troubleshooting.
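One common way to hand events to a log pipeline is to write one JSON object per line, which the sketch below illustrates. The article does not describe the on-disk format beidou‑event uses, so the `EventRecord` fields and the `FormatLine` and `ParseLine` helpers are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// EventRecord is a hypothetical flattened view of a Kubernetes event.
type EventRecord struct {
	Cluster   string    `json:"cluster"`
	Namespace string    `json:"namespace"`
	Kind      string    `json:"kind"`
	Name      string    `json:"name"`
	Type      string    `json:"type"`
	Reason    string    `json:"reason"`
	Message   string    `json:"message"`
	Timestamp time.Time `json:"timestamp"`
}

// FormatLine renders one event as a single newline-terminated JSON line,
// a shape that file-tailing log collectors can ingest.
func FormatLine(e EventRecord) (string, error) {
	b, err := json.Marshal(e)
	if err != nil {
		return "", err
	}
	return string(b) + "\n", nil
}

// ParseLine is the inverse, as an indexing step might apply it.
func ParseLine(line string) (EventRecord, error) {
	var e EventRecord
	err := json.Unmarshal([]byte(line), &e)
	return e, err
}

func main() {
	line, err := FormatLine(EventRecord{
		Cluster: "product-cluster", Namespace: "default", Kind: "Pod", Name: "web-0",
		Type: "Warning", Reason: "FailedScheduling", Message: "0/3 nodes are available",
		Timestamp: time.Date(2025, 3, 5, 8, 0, 0, 0, time.UTC),
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(line)
}
```

Keeping the cluster name in every record lets a single index serve queries across all clusters.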

Summary and Outlook

More than 90% of high‑frequency operations have been white‑screened.

Cluster installation and scaling have processed tens of thousands of nodes with one‑click automation.

The Operations Center provides comprehensive resource and health monitoring.

Cluster inspection has generated thousands of reports and event collection has stored over 3 billion events, greatly improving issue detection.

The Application Center standardizes component lifecycle management.

Future work will integrate AI for automatic problem detection and resolution, moving toward an intelligent, fully automated operations platform.


