How Vivo Scales Multi‑Data‑Center Kubernetes with a Custom Operator
Vivo describes how it built a Kubernetes‑Operator and CI pipeline to automate large‑scale, multi‑data‑center cluster deployment, modular management, and lifecycle operations using Ansible, kubeadm, and kubevirt, improving reliability, maintainability, and scalability of its Kubernetes fleets.
Background
Vivo's business migration to Kubernetes has required deploying many large clusters across multiple data centers. Managing the OS, Docker, etcd, Kubernetes, and CNI network plugins by hand is error-prone and labor-intensive, which prompted the search for a more efficient and reliable solution.
Challenges with Existing Ansible Workflow
Manual command-line ("black-screen") operations lead to mistakes and configuration drift.
No version control for deployment scripts, hindering upgrades.
Long validation cycles without automated test cases or CI.
Monolithic playbooks lack modularity; component‑level tasks cannot be executed independently.
Deploying components as raw binaries requires maintaining a custom management system, which reduces efficiency.
Component parameters are numerous (over 100) and change frequently across releases.
Introducing the Kubernetes‑Operator
The team built a declarative Kubernetes‑Operator that exposes custom resources (CRs) for administrators to interact with, allowing a single admin to manage thousands of nodes while reducing operational risk.
Cluster Deployment Practice
2.1 Deployment Overview
The deployment is based on Ansible tasks that provision the OS, Docker, etcd, Kubernetes, and add-ons in the following stages:
Bootstrap OS
Pre‑install steps
Install Docker
Install etcd
Install Kubernetes Master
Install Kubernetes node
Configure network plugin
Install add‑ons
Post‑install setup
After initial deployment, the operator supports modular updates (for example, upgrading only Docker or etcd) by exposing a separate Ansible entry point for each component.
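The idea of one entry point per component can be sketched as a small command builder. This is only an illustration: the playbook file names, the `playbooks` directory, and the `build_command` helper are assumptions, not names from Vivo's repository. The `-i` and `--limit` options are standard `ansible-playbook` flags.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from component to its own playbook entry point.
# The file names are illustrative; the article does not list them.
COMPONENT_PLAYBOOKS = {
    "docker": "docker.yml",
    "etcd": "etcd.yml",
    "kubernetes-master": "kube-master.yml",
    "kubernetes-node": "kube-node.yml",
    "network": "network.yml",
    "addons": "addons.yml",
}


def build_command(component, inventory, limit=None, playbook_dir="playbooks"):
    """Return the ansible-playbook argv that runs a single component's playbook."""
    if component not in COMPONENT_PLAYBOOKS:
        raise ValueError(f"unknown component: {component}")
    playbook = PurePosixPath(playbook_dir) / COMPONENT_PLAYBOOKS[component]
    cmd = ["ansible-playbook", "-i", inventory, str(playbook)]
    if limit:
        # --limit restricts the run to a subset of the inventory hosts.
        cmd += ["--limit", limit]
    return cmd


if __name__ == "__main__":
    # Upgrade etcd on one host only, leaving every other component untouched.
    print(" ".join(build_command("etcd", "inventory/dc1.ini", limit="etcd-01")))
```

Keeping the mapping in one table makes it easy to reject unknown components before any host is touched.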
Component Parameter Management
Parameters are handled via the ComponentConfig API, providing:
Maintainability: Structured configuration is easier to maintain than command-line flags once a component has more than 50 parameters.
Upgradeability: Versioned configs simplify upgrades.
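Programmability: Configuration can be changed programmatically, for example through dynamic kubelet configuration, without a manual restart (note that this kubelet feature was deprecated in Kubernetes 1.22 and removed in 1.24).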
Configurability: Supports complex structures beyond simple key‑value pairs.
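To illustrate why structured configuration helps, the sketch below layers per-cluster overrides on top of a base `KubeletConfiguration` and renders the result as JSON (which is also valid YAML). The `apiVersion`, `kind`, `maxPods`, and `evictionHard` fields come from the upstream kubelet config API; the default values and the `deep_merge` and `render_kubelet_config` helpers are assumptions made for this example, not Vivo's tooling.

```python
import copy
import json

# Base config; the values here are illustrative defaults, not Vivo's settings.
BASE_KUBELET_CONFIG = {
    "apiVersion": "kubelet.config.k8s.io/v1beta1",
    "kind": "KubeletConfiguration",
    "maxPods": 110,
    "evictionHard": {"memory.available": "100Mi", "nodefs.available": "10%"},
}


def deep_merge(base, overrides):
    """Return a new dict with overrides applied recursively on top of base."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_kubelet_config(overrides):
    """Serialize the merged configuration as JSON text."""
    return json.dumps(deep_merge(BASE_KUBELET_CONFIG, overrides), indent=2, sort_keys=True)


if __name__ == "__main__":
    # A nested override changes one eviction threshold and keeps the other.
    print(render_kubelet_config({"maxPods": 200, "evictionHard": {"memory.available": "500Mi"}}))
```

A nested structure such as `evictionHard` is awkward to express as a single command-line flag, which is the configurability point made above.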
Planned Migration to kubeadm
Leverage kubeadm for lifecycle management and reduce maintenance overhead.
Use kubeadm’s certificate handling to store certs in Secrets.
Generate admin kubeconfig via kubeadm.
Utilize kubeadm features such as image management, upload‑config, node labeling, and taints.
Install CoreDNS and kube‑proxy add‑ons.
Ansible Usage Guidelines
Prefer built‑in Ansible modules for deployment logic.
Avoid hostvars and delegate_to.
Keep playbooks runnable with --limit so that a run can target a subset of hosts.
2.2 CI Matrix Testing
Extensive CI tests validate syntax, deployment, scaling, upgrades, and performance:
Syntax checks: ansible-lint, shellcheck, yamllint, ansible-playbook --syntax-check, pep8.
Cluster operations: deploy, scale control/compute/etcd nodes, upgrade, modify component parameters.
Functional & performance checks: API health, network connectivity, node health, e2e and conformance tests.
The CI pipeline uses GitLab, gitlab‑runner, Ansible, and kubevirt. The steps are:
Deploy gitlab‑runner in the cluster and connect to the GitLab repository.
Deploy CDI (Containerized Data Importer) to create PVC‑backed VM images.
Deploy kubevirt to run virtual machines.
Create .gitlab-ci.yml to define the test matrix.
When a developer opens a PR (a merge request in GitLab), CI runs the Ansible syntax checks, creates the namespaces, PVCs, and kubevirt VM templates, and then runs the deployment playbooks. Once the cluster is up, the functional and performance tests run, and the resources are cleaned up afterwards.
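A test matrix of this kind is the cross product of the dimensions under test. The sketch below generates one job definition per combination. The OS images and Kubernetes versions are placeholder values, and the job naming scheme and the `cluster-test` stage name are assumptions; only the operation names come from the list above. `stage` and `variables` are standard GitLab CI job keywords.

```python
from itertools import product

# Placeholder matrix dimensions; the article does not list the real values.
OS_IMAGES = ["centos7", "ubuntu2004"]
K8S_VERSIONS = ["v1.17", "v1.18"]
OPERATIONS = ["deploy", "scale", "upgrade"]


def build_matrix(os_images, versions, operations):
    """Return a dict of job name -> job definition, one per combination."""
    jobs = {}
    for os_image, version, operation in product(os_images, versions, operations):
        name = f"{operation}-{os_image}-{version}"
        jobs[name] = {
            "stage": "cluster-test",  # hypothetical stage name
            "variables": {
                "OS_IMAGE": os_image,
                "KUBE_VERSION": version,
                "OPERATION": operation,
            },
        }
    return jobs


if __name__ == "__main__":
    for job_name in sorted(build_matrix(OS_IMAGES, K8S_VERSIONS, OPERATIONS)):
        print(job_name)
```

Generating the jobs from data keeps the matrix complete when a new OS image or Kubernetes version is added to one of the lists.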
Kubernetes‑Operator Details
3.1 What Is an Operator?
An Operator is a controller that extends the Kubernetes API to manage complex applications (e.g., databases, etcd) through custom resources, automating their full lifecycle.
3.2 Custom Resources (CRs)
Key CR types include:
ClusterDeployment : Entry point for all configuration (etcd, Kubernetes, LB, version, network, add‑ons).
MachineSet : Group of machines (control, compute, etcd) with their desired state.
Machine : Individual machine details and status.
Cluster : Status sub‑resource linked to ClusterDeployment.
The operator also uses Ansible-based executors, built from Jobs, ConfigMaps, and Secrets, to run playbooks and track their results.
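The relationship between these resources can be modeled roughly as follows. This is a simplified in-memory sketch, not the operator's actual schema: every field name, and the `cluster_status` helper that stands in for the Cluster status view, is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Machine:
    """One machine and its observed status (fields are hypothetical)."""
    name: str
    role: str
    ready: bool = False


@dataclass
class MachineSet:
    """A group of machines sharing a role, with a desired replica count."""
    role: str
    replicas: int
    machines: List[Machine] = field(default_factory=list)


@dataclass
class ClusterDeployment:
    """Entry point that holds the desired state of one cluster."""
    name: str
    kubernetes_version: str
    machine_sets: List[MachineSet] = field(default_factory=list)


def cluster_status(deployment: ClusterDeployment) -> Dict[str, Dict[str, int]]:
    """Summarize desired versus ready machines per role."""
    status = {}
    for machine_set in deployment.machine_sets:
        ready = sum(1 for machine in machine_set.machines if machine.ready)
        status[machine_set.role] = {"desired": machine_set.replicas, "ready": ready}
    return status


if __name__ == "__main__":
    etcd = MachineSet("etcd", 3, [Machine("etcd-01", "etcd", ready=True)])
    print(cluster_status(ClusterDeployment("dc1", "v1.18", [etcd])))
```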
3.3 Architecture
The operator runs in a metadata cluster that manages multiple business clusters, providing centralized multi‑cloud management, unified scheduling, high availability, and disaster recovery.
3.4 Execution Flow
Admin or platform creates a ClusterDeployment CR.
The controller detects the change.
The controller creates the MachineSet and its associated Machine resources.
The ClusterInstall controller generates ConfigMaps and Jobs that run the appropriate Ansible playbooks for install, scale, or upgrade.
Scheduler binds the Job’s Pod.
Kubelet executes the playbook inside the Pod.
The Job controller updates the ClusterDeployment status and cleans up resources.
The NodeHealthy controller syncs node readiness to the Machine status.
Add‑on controller installs or upgrades add‑ons once the cluster is ready.
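The core of this flow, deciding which playbook to run and wrapping it in a Job, can be sketched as below. It is a minimal illustration under stated assumptions: the `observed` and `desired` dictionaries, the container image, the playbook names, and the inventory ConfigMap naming are all hypothetical. The manifest itself uses standard `batch/v1` Job fields.

```python
def plan_action(observed, desired):
    """Pick the playbook action implied by the difference between states."""
    if observed is None:
        return "install"
    if observed["version"] != desired["version"]:
        return "upgrade"
    if sorted(observed["nodes"]) != sorted(desired["nodes"]):
        return "scale"
    return None  # already converged


def build_job(cluster, action):
    """Build a batch/v1 Job manifest that runs the playbook for one action."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"{cluster}-{action}"},
        "spec": {
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "ansible",
                        # Hypothetical image that ships Ansible and the playbooks.
                        "image": "example.invalid/ansible-runner:latest",
                        "command": ["ansible-playbook", "-i", "/inventory/hosts", f"{action}.yml"],
                        "volumeMounts": [{"name": "inventory", "mountPath": "/inventory"}],
                    }],
                    # The inventory is assumed to live in a per-cluster ConfigMap.
                    "volumes": [{"name": "inventory", "configMap": {"name": f"{cluster}-inventory"}}],
                }
            },
        },
    }


def reconcile(cluster, observed, desired):
    """Return the Job to create, or None when nothing needs to change."""
    action = plan_action(observed, desired)
    return build_job(cluster, action) if action else None


if __name__ == "__main__":
    job = reconcile("dc1", None, {"version": "v1.18", "nodes": ["n1", "n2"]})
    print(job["metadata"]["name"])
```

Returning `None` when the observed and desired states match keeps the loop idempotent, so repeated reconciles do not spawn duplicate Jobs.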
Conclusion
Vivo's large-scale Kubernetes operations combine optimized deployment tooling, extensive CI matrix testing, and a custom operator that delivers Kubernetes as a service on top of Kubernetes. Together these improve safety, stability, and manageability across clusters in many data centers, and they pave the way for future multi-cloud integration.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
