Beidou Container Operations Management Platform: Architecture, Automation, and Capabilities
The Beidou Operations Management Platform, created by vivo’s Internet Server team, unifies management of over twenty Kubernetes clusters and tens of thousands of nodes, automates scaling, inspections, event collection, and Helm‑based application deployment, achieving more than 90% UI‑driven operations and dramatically improving stability and operational efficiency.
The Beidou Operations Management Platform was developed by the vivo Internet Server team to address the growing difficulty of managing over 20 production Kubernetes clusters and tens of thousands of physical nodes. The platform provides resource management, cluster scaling, inspection, event handling, and monitoring to improve stability and operational efficiency.
Challenges in the early stage of container platform construction
Complex black-screen (command-line) operations: Manual procedures relied heavily on individual engineers' experience and were error-prone.
Time-consuming manual inspections: Essential cluster inspections required significant manual effort.
Difficulty managing multiple clusters: As the business grew, the number of clusters increased, raising operational complexity.
Complexity of self-developed components: Managing an expanding set of custom components became a challenge.
Difficulty querying historical events: Large-scale clusters generated massive volumes of events that were hard to store and query quickly.
To overcome these issues, the team pursued white‑screen (UI‑driven) and automation approaches, converting manual black‑screen steps into programmatic workflows.
Beidou platform solution
Achieve >90% white‑screen rate for high‑frequency operations.
Unified multi‑cluster resource and configuration management.
Automated inspection framework with customizable scripts.
Application center for standardized component installation, upgrade, and configuration.
Event collection and monitoring pipeline storing billions of events in Elasticsearch.
1. Node scaling tool
The platform introduces a kubeops-controller that reconciles a custom Operation resource (defined by a CRD) to scale cluster nodes automatically. An example Operation resource:
    apiVersion: vcluster.caas.xxxx.com/v1alpha1
    kind: Operation
    metadata:
      name: scaleup-2024
      namespace: beidou-system
    spec:
      clusterName: product-cluster
      operationType: ScaleUp
      operationFlow:
        - preCheck
        - scaleUp
        - postCheck
      operationMachines:
        - ip: 127.0.0.1
          role: Compute
          user: admin

Scaling consists of three steps (pre-check, scale-up, and post-check), each executed as a Kubernetes Job that runs Ansible scripts based on a customized Kubespray. Scale-down follows a similar two-step process: pre-check and removal.
2. Full‑process automation
To further reduce manual effort, an AutoOperationTask CRD was introduced, orchestrating the entire scaling workflow:
    apiVersion: vcluster.caas.xxxx.com/v1alpha1
    kind: AutoOperationTask
    metadata:
      name: scaleup-xxxxxx
      namespace: beidou-system
    spec:
      cluster: cluster-example
      clusterType: native
      operationIds:
        - scale-up
      operationTaskMachines:
        - name: xx.xx.xx.xx
        - name: xx.xx.xx.xx
      operationTaskStep:
        - step: ScaleUpPreCheck
        - step: ScaleUpWorkprocess
        - step: ScaleUp
        - step: UncordonNodes
        - step: ScaleUpSubWorkprocess
      operationTaskType: ScaleUp

The AutoOperationTask-controller watches these resources, creates the corresponding Operation objects, and triggers the scaling jobs, reducing a 20-node expansion from 60 minutes to about 10 minutes.
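One way to think about how a controller could walk through operationTaskStep is as a resumable lookup: on every reconcile it finds the first step that has not succeeded yet. The Go sketch below assumes a per-step status map and a Phase type, both hypothetical. The source does not describe how the AutoOperationTask-controller tracks progress.

```go
package main

import "fmt"

// Phase is a hypothetical per-step status recorded by the controller.
type Phase string

const (
	Succeeded Phase = "Succeeded"
	Failed    Phase = "Failed"
)

// NextStep returns the first step that has not succeeded yet.
// runnable is false when all steps are done (step == "") or when the
// returned step has failed and should not be retried automatically.
func NextStep(steps []string, status map[string]Phase) (step string, runnable bool) {
	for _, s := range steps {
		switch status[s] {
		case Succeeded:
			continue // move on to the next step in the list
		case Failed:
			return s, false
		default: // not started or still pending
			return s, true
		}
	}
	return "", false
}

func main() {
	steps := []string{
		"ScaleUpPreCheck", "ScaleUpWorkprocess", "ScaleUp",
		"UncordonNodes", "ScaleUpSubWorkprocess",
	}
	status := map[string]Phase{
		"ScaleUpPreCheck":    Succeeded,
		"ScaleUpWorkprocess": Succeeded,
	}
	step, runnable := NextStep(steps, status)
	fmt.Println(step, runnable) // prints: ScaleUp true
}
```

Because the function derives the next step from recorded status rather than from in-memory state, a restarted controller would resume where it left off instead of repeating completed steps.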
3. Cluster inspection tool (kube‑doctor)
The inspection framework defines an InspectionInterface in Go:
    type InspectionInterface interface {
        RunInspectionTask(ctx context.Context, cluster []ScheduledCluster) (error, []string, *AlertInfo)
        StoreReport(result interface{}, cluster ScheduledCluster, alertMessages *SyncMessages) error
    }

Implementations cover custom scripts, data metrics (via Prometheus), and node-problem-detector metrics. The system has recorded thousands of inspection reports, helping operators quickly locate configuration or resource issues.
4. Application Center
To manage the growing number of self‑developed components, the platform provides a Helm‑based Application Center. It defines helmapp and helmappversion CRDs for chart storage, and a release CRD representing an installed instance. The workflow:
Upload a Helm chart to vivo object storage.
Create a helmapp CRD (metadata) and a helmappversion CRD (version & storage address).
Instantiate a release CRD; the beidou‑release‑controller fetches the chart, applies values, and installs it into the target cluster.
Over 50 applications and 200+ instances are now managed through this center.
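The "applies values" step in the workflow above can be illustrated with a Helm-style deep merge, where per-release overrides are layered on top of chart defaults. This is a sketch of the general technique, not the beidou-release-controller's code: MergeValues and the sample keys are hypothetical, and a real controller would more likely delegate this to the Helm libraries.

```go
package main

import "fmt"

// MergeValues returns a new map containing base overlaid with override.
// Nested maps are merged recursively; any other value in override replaces
// the value in base. Neither input map is modified.
func MergeValues(base, override map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		if ov, ok := v.(map[string]interface{}); ok {
			if bv, ok := out[k].(map[string]interface{}); ok {
				out[k] = MergeValues(bv, ov)
				continue
			}
		}
		out[k] = v
	}
	return out
}

func main() {
	// Chart defaults (illustrative keys only).
	defaults := map[string]interface{}{
		"replicas": 1,
		"image": map[string]interface{}{
			"repository": "example/component",
			"tag":        "1.0.0",
		},
	}
	// Per-release overrides, as a release resource might carry them.
	overrides := map[string]interface{}{
		"replicas": 3,
		"image":    map[string]interface{}{"tag": "1.1.0"},
	}
	merged := MergeValues(defaults, overrides)
	fmt.Println(merged["replicas"], merged["image"])
}
```

Merging nested maps instead of replacing them lets a release override a single field, such as the image tag, without restating the rest of the chart's defaults.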
5. Event collection and monitoring
The beidou-event component streams Kubernetes events from the API server to the file system. The vivo logging system then ingests these files and stores the events in Elasticsearch, and beidou-api queries Elasticsearch to present them in the UI. The system currently holds over 3 billion events, enabling rapid historical troubleshooting.
Summary and Outlook
More than 90% of high‑frequency operations have been white‑screened.
One-click automation for cluster installation and scaling has handled tens of thousands of nodes.
The Operations Center provides comprehensive resource and health monitoring.
Inspection has produced thousands of reports and event collection has stored over 3 billion events, greatly improving issue detection.
The Application Center standardizes component lifecycle management.
Future work will integrate AI for automatic problem detection and resolution, moving toward an intelligent, fully automated operations platform.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.