How Vivo Scales Multi‑Data‑Center Kubernetes with a Custom Operator
This article describes how Vivo manages thousands of Kubernetes nodes across multiple data centers. The team built a declarative Kubernetes operator, modular Ansible scripts, and a comprehensive CI matrix to automate deployment, scaling, upgrades, and fault recovery while reducing operational risk.
Background
Vivo's rapid migration of services to Kubernetes required deploying clusters in many data centers, which introduced challenges such as manual, error‑prone operations, lack of version control for deployment scripts, missing automated tests, and tangled component parameter management.
Cluster Deployment Practice
The team built an Ansible‑based workflow that prepares the OS and then installs Docker, etcd, Kubernetes, and add‑ons in a defined order:
Bootstrap OS
Pre‑install steps
Install Docker
Install etcd
Install Kubernetes master
Install Kubernetes node
Configure network plugin
Install add‑ons
Post‑install setup
After initial deployment, each component (Docker, etcd, K8s, network plugin, add‑ons) can be managed modularly via separate Ansible entry points, avoiding full‑stack script runs.
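The ordering and the per‑component entry points can be illustrated with a minimal sketch. The stage names follow the list above, but the playbook file names and the `build_commands` helper are hypothetical and not taken from Vivo's repository; only the `ansible-playbook -i <inventory> <playbook>` invocation is standard Ansible usage.

```python
from typing import List, Optional

# Stage order follows the article; playbook file names are hypothetical.
STAGES = [
    ("bootstrap-os", "bootstrap-os.yml"),
    ("preinstall", "preinstall.yml"),
    ("docker", "docker.yml"),
    ("etcd", "etcd.yml"),
    ("kube-master", "kube-master.yml"),
    ("kube-node", "kube-node.yml"),
    ("network-plugin", "network-plugin.yml"),
    ("addons", "addons.yml"),
    ("postinstall", "postinstall.yml"),
]


def build_commands(inventory: str, only: Optional[str] = None) -> List[List[str]]:
    """Return ansible-playbook command lines in stage order.

    With only=None every stage is returned (full deployment); with a stage
    name, just that component's entry point is returned.
    """
    selected = [stage for stage in STAGES if only is None or stage[0] == only]
    if not selected:
        raise ValueError(f"unknown stage: {only}")
    return [["ansible-playbook", "-i", inventory, playbook] for _, playbook in selected]
```

Calling `build_commands("hosts.ini")` yields the full nine‑stage sequence, while `build_commands("hosts.ini", only="etcd")` yields a single command, which mirrors the idea of managing one component without a full‑stack run.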
CI Matrix Testing
To ensure reliability, a CI pipeline built with GitLab, gitlab‑runner, Ansible, and KubeVirt runs extensive tests:
Syntax checks (ansible‑lint, shellcheck, yamllint, etc.)
Cluster lifecycle tests (deployment, scaling, upgrade, parameter changes)
Functional and performance tests (apiserver health, node networking, e2e, conformance)
CI jobs are triggered by pull‑request submissions, creating namespaces, PVCs, and KubeVirt VMs to execute the full test matrix without interference between jobs.
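As a rough illustration of how a test matrix with per‑job isolation might be generated, the sketch below crosses the lifecycle scenarios named above with a hypothetical OS image axis and derives one namespace per job. The axis values, the `ci-` prefix, and both function names are assumptions made for illustration; the article does not say how Vivo names its namespaces.

```python
import itertools
import re
from typing import Dict, List

# Lifecycle scenarios come from the article; the OS image axis is hypothetical.
SCENARIOS = ["deploy", "scale", "upgrade", "param-change"]
OS_IMAGES = ["centos-7", "ubuntu-20.04"]


def namespace_for(pipeline_id: int, scenario: str, os_image: str) -> str:
    """Build a namespace name that is a valid DNS-1123 label (<= 63 chars)."""
    raw = f"ci-{pipeline_id}-{scenario}-{os_image}".lower()
    name = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    return name[:63].rstrip("-")


def build_matrix(pipeline_id: int) -> List[Dict[str, str]]:
    """One job per (scenario, OS image) pair, each with its own namespace."""
    return [
        {
            "scenario": scenario,
            "os_image": os_image,
            "namespace": namespace_for(pipeline_id, scenario, os_image),
        }
        for scenario, os_image in itertools.product(SCENARIOS, OS_IMAGES)
    ]
```

Because every job gets a distinct namespace, the PVCs and KubeVirt VMs created for one job cannot collide by name with those of another job in the same pipeline.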
Kubernetes‑Operator Practice
The custom operator extends the Kubernetes API so that administrators can manage complex applications through custom resources (CRs). It supports deployment, upgrade, scaling, backup, self‑healing, and more.
Operator Custom Resources
ClusterDeployment: Entry point CR that defines all cluster parameters (etcd, K8s version, LB, network, add‑ons).
MachineSet: Collection of machine roles (control plane, workers, etcd).
Machine: Individual machine details and status.
Cluster: Status sub‑resource linked to ClusterDeployment.
Ansible executor: Jobs, ConfigMaps, and Secrets that run Ansible playbooks and store inventories and variables.
Extension controllers: Add‑on installer, cluster installer, remote MachineSet manager, and others for public‑cloud, DNS, and LB integration.
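The relationship between the three core resources can be sketched with plain Python dataclasses. This is only a conceptual model: the field names, the `phase` values, and the helper methods are hypothetical, and the real CRD schemas are not shown in the article.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Machine:
    name: str
    ip: str
    phase: str = "pending"  # hypothetical phases: pending, deployed, failed


@dataclass
class MachineSet:
    role: str  # e.g. control plane, worker, etcd
    machines: List[Machine] = field(default_factory=list)


@dataclass
class ClusterDeployment:
    """Entry point object: cluster-wide parameters plus its machine sets."""
    name: str
    k8s_version: str
    network_plugin: str
    addons: List[str] = field(default_factory=list)
    machine_sets: List[MachineSet] = field(default_factory=list)

    def machines_by_role(self) -> Dict[str, List[str]]:
        """Map each role to the names of its machines."""
        return {ms.role: [m.name for m in ms.machines] for ms in self.machine_sets}

    def is_ready(self) -> bool:
        """Ready only when there is at least one machine and all are deployed."""
        machines = [m for ms in self.machine_sets for m in ms.machines]
        return bool(machines) and all(m.phase == "deployed" for m in machines)
```

The point of the model is the ownership chain: one ClusterDeployment owns several MachineSets, each of which groups Machines by role, and cluster readiness is derived from per‑machine status.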
Operator Architecture
Vivo runs the operator in a metadata cluster that manages multiple business clusters. The architecture leverages K8s scheduling, networking isolation, and API consistency to provide centralized multi‑cloud management, high availability, and disaster recovery.
Scenarios
Scenario 1 – Cluster Expansion: When a capacity request is approved, the PaaS platform creates Machine CRs from a spare pool, generates inventories, and runs Ansible jobs to provision the new nodes. When a job succeeds, the Machine status is updated to deployed and the node becomes ready.
Scenario 2 – Fault Recovery: If a business cluster fails, the operator either takes no action and lets other clusters absorb the load or, when needed, selects spare machines, runs the installation playbook, and migrates workloads to the newly provisioned cluster.
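Both scenarios start the same way: machines are taken from a spare pool and written into an Ansible inventory. The sketch below shows that step under stated assumptions. The `claim_spares` and `render_inventory` helpers, the status strings, and the `kube_node` group name are hypothetical; `ansible_host` is the standard Ansible inventory variable for a host's address.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    name: str
    ip: str
    status: str = "spare"  # hypothetical states: spare, provisioning, deployed


def claim_spares(pool: List[Machine], count: int) -> List[Machine]:
    """Pick `count` spare machines and mark them as being provisioned."""
    spares = [m for m in pool if m.status == "spare"]
    if len(spares) < count:
        raise RuntimeError(f"need {count} spare machines, only {len(spares)} available")
    claimed = spares[:count]
    for machine in claimed:
        machine.status = "provisioning"
    return claimed


def render_inventory(group: str, machines: List[Machine]) -> str:
    """Render an INI-style Ansible inventory with a single host group."""
    lines = [f"[{group}]"]
    lines += [f"{m.name} ansible_host={m.ip}" for m in machines]
    return "\n".join(lines) + "\n"
```

In the flow described above, the rendered inventory would be stored (for example in a ConfigMap) and consumed by the Ansible job, and the machine status would move on to deployed once the job succeeds.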
Execution Flow
Administrator or platform creates a ClusterDeployment CR.
The controller detects the change and creates the associated MachineSet and Machine resources; the ClusterInstall controller then generates the ConfigMaps and Jobs that invoke the appropriate Ansible playbooks.
K8s scheduler places the Job pods.
Kubelet runs the pods, executing the Ansible scripts.
Job controller updates the ClusterDeployment status and cleans up resources.
Node health controller syncs node readiness back to Machine status.
Add‑on controller installs or upgrades add‑ons once the cluster is ready.
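The middle of this flow, where a controller produces a Job that runs Ansible and later reads its result, might look roughly like the sketch below. The manifest uses standard `batch/v1` Job fields, and the `succeeded` and `failed` counters are part of the standard Job status. The function names, the label key, the mount path, and the phase names are assumptions for illustration, not Vivo's actual implementation.

```python
from typing import Any, Dict


def build_ansible_job(cluster: str, playbook: str, image: str, configmap: str) -> Dict[str, Any]:
    """Build a batch/v1 Job manifest that runs one playbook against a
    ConfigMap-provided inventory (mount path and labels are hypothetical)."""
    stem = playbook.rsplit(".", 1)[0]
    inventory_dir = "/etc/ansible/inventory"
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{cluster}-{stem}", "labels": {"cluster": cluster}},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed playbook automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "ansible",
                        "image": image,
                        "command": ["ansible-playbook", "-i", f"{inventory_dir}/hosts.ini", playbook],
                        "volumeMounts": [{"name": "inventory", "mountPath": inventory_dir}],
                    }],
                    "volumes": [{"name": "inventory", "configMap": {"name": configmap}}],
                },
            },
        },
    }


def phase_from_job_status(status: Dict[str, int]) -> str:
    """Map Job status counters to a hypothetical deployment phase."""
    if status.get("succeeded", 0) >= 1:
        return "deployed"
    if status.get("failed", 0) >= 1:
        return "failed"
    return "running"
```

Running Ansible inside a Job lets the existing scheduler and kubelet handle placement and execution, so the controller only has to create the manifest and translate the resulting Job status back into the ClusterDeployment status.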
Conclusion
Vivo's large‑scale K8s operations combine a declarative operator, modular Ansible automation, and a robust CI matrix to achieve safe, repeatable cluster management across many data centers, reducing operational overhead while supporting future multi‑cloud expansion.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
