
How Vivo Scales Multi‑Data‑Center Kubernetes with a Custom Operator

This article details Vivo's approach to managing thousands of Kubernetes nodes across multiple data centers by developing a declarative Kubernetes‑Operator, modular Ansible scripts, and a comprehensive CI matrix to automate deployment, scaling, upgrades, and fault recovery while reducing operational risk.

dbaplus Community

Apr 22, 2023


Background

Vivo's rapid migration of services to Kubernetes required deploying clusters in many data centers, which introduced challenges such as manual, error‑prone operations, lack of version control for deployment scripts, missing automated tests, and tangled component parameter management.

Cluster Deployment Practice

The team built an Ansible‑based workflow that installs OS, Docker, etcd, Kubernetes, and add‑ons in a defined order:

Bootstrap OS

Pre‑install steps

Install Docker

Install etcd

Install Kubernetes master

Install Kubernetes node

Configure network plugin

Install add‑ons

Post‑install setup

After initial deployment, each component (Docker, etcd, K8s, network plugin, add‑ons) can be managed modularly via separate Ansible entry points, avoiding full‑stack script runs.
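The ordered stages and the per-component entry points described above can be sketched as a small planner. This is only an illustration: the stage names follow the list above, while the component tags and the `plan` function are hypothetical and do not reflect Vivo's actual playbook layout.

```python
# Stages in install order, taken from the list above.
# The second element is a hypothetical component tag used to select a subset.
STAGES = [
    ("bootstrap-os", "os"),
    ("preinstall", "preinstall"),
    ("docker", "docker"),
    ("etcd", "etcd"),
    ("kubernetes-master", "k8s"),
    ("kubernetes-node", "k8s"),
    ("network-plugin", "network"),
    ("addons", "addons"),
    ("postinstall", "postinstall"),
]


def plan(component=None):
    """Return stage names in install order.

    With no argument the full stack is returned; with a component tag only
    that component's stages are returned, mimicking a modular entry point.
    """
    return [name for name, tag in STAGES if component is None or tag == component]


if __name__ == "__main__":
    print(plan())        # full-stack run
    print(plan("etcd"))  # manage etcd alone after the initial deployment
```

Keeping the order in one place means a single-component run can never reorder stages relative to the full run.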

CI Matrix Testing

To ensure reliability, a CI pipeline built with GitLab, gitlab‑runner, Ansible, and KubeVirt runs extensive tests:

Syntax checks (ansible‑lint, shellcheck, yamllint, etc.)

Cluster lifecycle tests (deployment, scaling, upgrade, parameter changes)

Functional and performance tests (apiserver health, node networking, e2e, conformance)

CI jobs are triggered by pull‑request submissions, creating namespaces, PVCs, and KubeVirt VMs to execute the full test matrix without interference between jobs.
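As a rough sketch of how such a matrix might be enumerated, the snippet below crosses lifecycle scenarios with OS images and gives every job its own namespace so that jobs cannot interfere with each other. The OS image names, the namespace naming scheme, and the `build_matrix` function are assumptions made for illustration; the article does not describe the real job definitions.

```python
from itertools import product

# Lifecycle scenarios named in the article.
LIFECYCLE = ["deploy", "scale", "upgrade", "param-change"]
# Hypothetical OS images; the article does not list the ones Vivo tests.
OS_IMAGES = ["centos7", "ubuntu2004"]


def build_matrix(pipeline_id):
    """Return one job description per (OS image, scenario) combination."""
    jobs = []
    for os_image, scenario in product(OS_IMAGES, LIFECYCLE):
        jobs.append(
            {
                "name": f"{scenario}-{os_image}",
                # One namespace per job isolates its PVCs and KubeVirt VMs.
                "namespace": f"ci-{pipeline_id}-{scenario}-{os_image}",
                "os_image": os_image,
                "scenario": scenario,
            }
        )
    return jobs


if __name__ == "__main__":
    for job in build_matrix(42):
        print(job["namespace"])
```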

Kubernetes‑Operator Practice

The custom operator extends the Kubernetes API, allowing administrators to manage complex applications through custom resources (CRs). It supports deployment, upgrade, scaling, backup, self‑healing, and more.

Operator Custom Resources

ClusterDeployment: Entry point CR that defines all cluster parameters (etcd, K8s version, LB, network, add‑ons).

MachineSet: Collection of machine roles (control plane, workers, etcd).

Machine: Individual machine details and status.

Cluster: Status sub‑resource linked to ClusterDeployment.

Ansible executor: Jobs, ConfigMaps, and Secrets that run Ansible playbooks and store inventories and variables.

Extension controllers: Add‑on installer, cluster installer, remote MachineSet manager, and others for public‑cloud, DNS, LB integration.
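The relationship between the main resources can be pictured with a few plain data classes. This is a simplified model, not the operator's real schema: the field names, the role strings, and the `pending` default phase are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Machine:
    """One machine and its status (field names are hypothetical)."""
    name: str
    ip: str
    role: str  # e.g. "control-plane", "worker", or "etcd"
    phase: str = "pending"


@dataclass
class MachineSet:
    """A group of machines that share a role."""
    role: str
    machines: List[Machine] = field(default_factory=list)


@dataclass
class ClusterDeployment:
    """Entry-point resource that owns the MachineSets of one cluster."""
    name: str
    kubernetes_version: str
    machine_sets: List[MachineSet] = field(default_factory=list)

    def machines(self):
        """Flatten all MachineSets into a single list of machines."""
        return [m for ms in self.machine_sets for m in ms.machines]


if __name__ == "__main__":
    workers = MachineSet("worker", [Machine("node-1", "10.0.0.11", "worker")])
    cluster = ClusterDeployment("example", "1.23.0", [workers])
    print([m.name for m in cluster.machines()])
```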

Operator Architecture

Vivo runs the operator in a metadata cluster that manages multiple business clusters. The architecture leverages K8s scheduling, network isolation, and API consistency to provide centralized multi‑cloud management, high availability, and disaster recovery.

Scenarios

Scenario 1 – Cluster Expansion: When a capacity request is approved, the PaaS platform creates Machine CRs from a spare pool, generates inventories, and runs Ansible jobs to provision the new nodes. When a job succeeds, the Machine status is updated to deployed and the node becomes ready.

Scenario 2 – Fault Recovery: If a business cluster fails, the operator either relies on other clusters to take over (no action) or, when needed, selects spare machines, runs the installation playbook, and migrates workloads to the newly provisioned cluster.
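Both scenarios start by taking machines from a spare pool and generating an inventory for Ansible. A minimal sketch of that step follows, assuming machines are plain dictionaries with `name` and `ip` keys and that nodes belong to an inventory group called `kube_node`; both are illustrative choices, not details taken from the article.

```python
def claim_spares(spare_pool, count):
    """Split the pool into (claimed, remaining); fail if the pool is too small."""
    if count > len(spare_pool):
        raise ValueError("not enough spare machines")
    return spare_pool[:count], spare_pool[count:]


def render_inventory(machines, group="kube_node"):
    """Render an INI-style Ansible inventory for the given machines."""
    lines = [f"[{group}]"]
    lines += [f"{m['name']} ansible_host={m['ip']}" for m in machines]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    pool = [
        {"name": "spare-1", "ip": "10.0.1.1"},
        {"name": "spare-2", "ip": "10.0.1.2"},
        {"name": "spare-3", "ip": "10.0.1.3"},
    ]
    claimed, pool = claim_spares(pool, 2)
    print(render_inventory(claimed))
```

Returning the remaining pool alongside the claimed machines keeps the function free of side effects, which makes a failed provisioning run easier to roll back.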

Execution Flow

Administrator or platform creates a ClusterDeployment CR.

The controller detects the change and creates the associated MachineSet and Machine resources.

The ClusterInstall controller generates ConfigMaps and Jobs that invoke the appropriate Ansible playbooks.

K8s scheduler places the Job pods.

Kubelet runs the pods, executing the Ansible scripts.

Job controller updates the ClusterDeployment status and cleans up resources.

Node health controller syncs node readiness back to Machine status.

Add‑on controller installs or upgrades add‑ons once the cluster is ready.
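The middle of this flow, where a controller turns a ClusterDeployment into a Job that runs Ansible and later reads the result back, might look roughly like the sketch below. The manifest uses standard `batch/v1` Job fields, but the container image, mount path, ConfigMap key (`hosts.ini`), and phase names are hypothetical, and the real controllers are not described at this level of detail in the article.

```python
def build_ansible_job(cluster, playbook, inventory_configmap):
    """Build a batch/v1 Job manifest that runs one playbook for a cluster."""
    inventory_dir = "/etc/ansible/inventory"  # hypothetical mount path
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{cluster}-install", "labels": {"cluster": cluster}},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed playbook automatically
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "ansible",
                            # Hypothetical image that ships ansible-playbook.
                            "image": "example.local/ansible-runner:latest",
                            "command": [
                                "ansible-playbook",
                                "-i",
                                f"{inventory_dir}/hosts.ini",
                                playbook,
                            ],
                            "volumeMounts": [
                                {"name": "inventory", "mountPath": inventory_dir}
                            ],
                        }
                    ],
                    # The ConfigMap is assumed to hold a key named hosts.ini.
                    "volumes": [
                        {"name": "inventory", "configMap": {"name": inventory_configmap}}
                    ],
                }
            },
        },
    }


def phase_from_job(job_status):
    """Map a Job status dict to a hypothetical ClusterDeployment phase."""
    if job_status.get("succeeded", 0) >= 1:
        return "Installed"
    if job_status.get("failed", 0) >= 1:
        return "Failed"
    return "Installing"


if __name__ == "__main__":
    job = build_ansible_job("example", "cluster.yml", "example-inventory")
    print(job["metadata"]["name"], phase_from_job({"active": 1}))
```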

Conclusion

Vivo's large‑scale K8s operations combine a declarative operator, modular Ansible automation, and a robust CI matrix to achieve safe, repeatable cluster management across many data centers, reducing operational overhead while supporting future multi‑cloud expansion.
