
How Vivo Scales Multi‑Data‑Center Kubernetes with a Custom Operator

Vivo describes how it built a Kubernetes‑Operator and CI pipeline to automate large‑scale, multi‑data‑center cluster deployment, modular management, and lifecycle operations using Ansible, kubeadm, and kubevirt, improving reliability, maintainability, and scalability of its Kubernetes fleets.

ITPUB

Background

Vivo’s business migration to Kubernetes has required deployment of many large clusters across multiple data centers. Managing OS, Docker, etcd, Kubernetes, CNI, and network plugins manually is error‑prone and labor‑intensive, prompting the need for a more efficient, reliable solution.

Challenges with Existing Ansible Workflow

Manual command-line ("black-screen") operations lead to mistakes and configuration drift.

No version control for deployment scripts, hindering upgrades.

Long validation cycles without automated test cases or CI.

Monolithic playbooks lack modularity; component‑level tasks cannot be executed independently.

Binary‑only deployment requires a custom management system, reducing efficiency.

Component parameters are numerous (over 100) and change frequently across releases.

Introducing the Kubernetes‑Operator

The team built a declarative Kubernetes‑Operator that exposes custom resources (CRs) for administrators to interact with, allowing a single admin to manage thousands of nodes while reducing operational risk.

Cluster Deployment Practice

2.1 Deployment Overview

The deployment is based on Ansible tasks that provision OS, Docker, etcd, Kubernetes, and add‑ons.

Bootstrap OS

Pre‑install steps

Install Docker

Install etcd

Install Kubernetes Master

Install Kubernetes node

Configure network plugin

Install add‑ons

Post‑install setup

After initial deployment, the operator supports modular updates, such as upgrading only Docker or etcd, by exposing a separate Ansible entry point for each component.
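The per-component entry points can be pictured with a minimal sketch. The playbook file names, inventory path, and variable name below are hypothetical, since the article does not name them; only the ansible-playbook flags -i and -e are real.

```python
import shlex

# Hypothetical playbook names: the article says each component has its own
# Ansible entry point but does not name the files.
COMPONENT_PLAYBOOKS = {
    "docker": "docker.yml",
    "etcd": "etcd.yml",
    "kubernetes-master": "kube-master.yml",
    "kubernetes-node": "kube-node.yml",
    "network-plugin": "network.yml",
    "addons": "addons.yml",
}


def playbook_command(component, inventory, extra_vars=None):
    """Build the ansible-playbook argv for a single component."""
    if component not in COMPONENT_PLAYBOOKS:
        raise ValueError(f"unknown component: {component}")
    argv = ["ansible-playbook", "-i", inventory, COMPONENT_PLAYBOOKS[component]]
    # -e passes extra variables as key=value pairs.
    for key, value in sorted((extra_vars or {}).items()):
        argv += ["-e", f"{key}={value}"]
    return argv


if __name__ == "__main__":
    # Upgrade only Docker, leaving every other component untouched.
    cmd = playbook_command(
        "docker", "inventory/prod.ini", {"docker_version": "19.03.15"}
    )
    print(shlex.join(cmd))
```

Keeping the mapping in one place means a component-level task never has to run the full deployment playbook.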

Component Parameter Management

Parameters are handled via the ComponentConfig API, providing:

Maintainability: A structured config file stays manageable once a component has more than 50 parameters.

Upgradeability: Versioned configs simplify upgrades.

Programmability: Dynamic kubelet config changes take effect without restarts.

Configurability: Supports complex structures beyond simple key‑value pairs.
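As an illustration of the last point, the sketch below builds a kubelet ComponentConfig document. The apiVersion, kind, maxPods, and evictionHard fields belong to the upstream KubeletConfiguration API; the values are illustrative assumptions, not Vivo's settings. The result is printed as JSON, which is also valid YAML.

```python
import json


def kubelet_config(max_pods, eviction_hard):
    """Return a KubeletConfiguration document as a plain dict."""
    return {
        "apiVersion": "kubelet.config.k8s.io/v1beta1",
        "kind": "KubeletConfiguration",
        "maxPods": max_pods,
        # A nested map like this is awkward to express as a single
        # command-line flag but natural in a config file.
        "evictionHard": dict(eviction_hard),
    }


# Illustrative values only.
config = kubelet_config(
    110, {"memory.available": "200Mi", "nodefs.available": "10%"}
)
print(json.dumps(config, indent=2))
```

Because the document carries an apiVersion, a config written for one release can be recognized and converted when the schema changes, which is the upgradeability benefit listed above.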

Planned Migration to kubeadm

Leverage kubeadm for lifecycle management and reduce maintenance overhead.

Use kubeadm’s certificate handling to store certs in Secrets.

Generate admin kubeconfig via kubeadm.

Utilize kubeadm features such as image management, upload‑config, node labeling, and taints.

Install CoreDNS and kube‑proxy add‑ons.
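The kubeadm features listed above correspond roughly to existing kubeadm subcommands. The sketch below only assembles the command lines; whether Vivo calls the phases one by one or runs a full kubeadm init is not stated in the article, and the config path in the usage line is a placeholder.

```python
# Each entry pairs a step from the list above with the kubeadm subcommand
# that performs it.
KUBEADM_STEPS = [
    ("pull control-plane images", ["kubeadm", "config", "images", "pull"]),
    ("generate certificates", ["kubeadm", "init", "phase", "certs", "all"]),
    ("generate admin kubeconfig",
     ["kubeadm", "init", "phase", "kubeconfig", "admin"]),
    ("upload configuration",
     ["kubeadm", "init", "phase", "upload-config", "all"]),
    ("label and taint the control plane",
     ["kubeadm", "init", "phase", "mark-control-plane"]),
    ("install CoreDNS", ["kubeadm", "init", "phase", "addon", "coredns"]),
    ("install kube-proxy",
     ["kubeadm", "init", "phase", "addon", "kube-proxy"]),
]


def commands(config_path=None):
    """Return the command lines, optionally pointing at a kubeadm config."""
    lines = []
    for _description, argv in KUBEADM_STEPS:
        full = list(argv)
        if config_path is not None:
            full += ["--config", config_path]
        lines.append(" ".join(full))
    return lines


for line in commands("/etc/kubernetes/kubeadm.yaml"):
    print(line)
```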

Ansible Usage Guidelines

Prefer built‑in Ansible modules for deployment logic.

Avoid hostvars and delegate_to.

Keep playbooks usable with --limit, so a run can target a subset of hosts.
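One way to enforce the second guideline is a small check that flags hostvars and delegate_to in playbook text. This is a sketch of such a check, not a tool the article describes, and the sample playbook is made up for the example.

```python
import re

# Constructs the guidelines above ask authors to avoid.
DISCOURAGED = re.compile(r"\b(hostvars|delegate_to)\b")


def find_discouraged(playbook_text):
    """Return (line_number, keyword) for each discouraged construct found."""
    hits = []
    for number, line in enumerate(playbook_text.splitlines(), start=1):
        # Ignore full-line YAML comments.
        if line.lstrip().startswith("#"):
            continue
        for match in DISCOURAGED.finditer(line):
            hits.append((number, match.group(1)))
    return hits


# Made-up playbook fragment used only to exercise the check.
SAMPLE = """\
- name: join node
  command: kubeadm join {{ hostvars[groups['master'][0]].join_cmd }}
  delegate_to: localhost
# delegate_to in a comment is ignored
"""

print(find_discouraged(SAMPLE))  # [(2, 'hostvars'), (3, 'delegate_to')]
```

Both constructs make a task depend on hosts other than the one being processed, which is a common reason a playbook stops working once a run is restricted with --limit.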

2.2 CI Matrix Testing

Extensive CI tests validate syntax, deployment, scaling, upgrades, and performance:

Syntax checks: ansible-lint, shellcheck, yamllint, ansible-playbook --syntax-check, pep8.

Cluster operations: deploy, scale control/compute/etcd nodes, upgrade, modify component parameters.

Functional & performance checks: API health, network connectivity, node health, e2e and conformance tests.

The CI pipeline uses GitLab, gitlab‑runner, Ansible, and kubevirt. The steps are:

Deploy gitlab‑runner in the cluster and connect to the GitLab repository.

Deploy CDI (Containerized Data Importer) to create PVC‑backed VM images.

Deploy kubevirt to run virtual machines.

Create the GitLab CI configuration file (.gitlab-ci.yml by default) to define the test matrix.

When a developer submits a PR, the CI pipeline runs the Ansible syntax checks, creates the namespaces, PVCs, and kubevirt VM templates, and then runs the deployment playbooks. Once the cluster is up, the functional and performance tests run, and the resources are cleaned up afterwards.
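A test matrix of this kind is the cross product of a few dimensions. The sketch below expands one; the OS images, Kubernetes versions, and operation names are hypothetical, as the article does not list the actual matrix.

```python
from itertools import product

# Hypothetical dimensions: the article does not list the real values.
OS_IMAGES = ["centos7", "ubuntu2004"]
K8S_VERSIONS = ["1.18", "1.20"]
OPERATIONS = ["deploy", "scale-compute", "upgrade"]


def build_matrix(os_images, k8s_versions, operations):
    """Expand the dimensions into one CI job description per combination."""
    jobs = []
    for os_image, version, operation in product(
        os_images, k8s_versions, operations
    ):
        name = f"{operation}-{os_image}-k8s{version}"
        jobs.append({
            "name": name,
            # Namespace names cannot contain dots, so replace them.
            "namespace": "ci-" + name.replace(".", "-"),
            "os_image": os_image,
            "kubernetes_version": version,
            "operation": operation,
        })
    return jobs


matrix = build_matrix(OS_IMAGES, K8S_VERSIONS, OPERATIONS)
print(len(matrix))  # 12
```

Giving each job its own namespace keeps the PVCs and VMs of one combination isolated and makes cleanup a single namespace deletion.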

Kubernetes‑Operator Details

3.1 What Is an Operator?

An Operator is a controller that extends the Kubernetes API to manage complex applications (e.g., databases, etcd) through custom resources, automating their full lifecycle.
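At the core of any such controller is a reconcile step that compares desired state with observed state. The function below is a generic sketch of that idea, not Vivo's code; the machine names and specs are placeholders.

```python
def reconcile(desired, observed):
    """Return the actions that move observed state towards desired state.

    Both arguments map a resource name to its spec (any comparable value).
    """
    actions = []
    for name in sorted(desired):
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Placeholder state: one machine missing, one outdated, one left over.
desired = {"node-a": "v1.20", "node-b": "v1.20"}
observed = {"node-b": "v1.18", "node-c": "v1.18"}
print(reconcile(desired, observed))
```

Because the function depends only on the two states, running it again after the actions succeed yields an empty list, which is what makes a controller safe to retry.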

3.2 Custom Resources (CRs)

Key CR types include:

ClusterDeployment: Entry point for all configuration (etcd, Kubernetes, LB, version, network, add‑ons).

MachineSet: Group of machines (control, compute, etcd) with their desired state.

Machine: Individual machine details and status.

Cluster: Status sub‑resource linked to ClusterDeployment.
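The relationship between ClusterDeployment and MachineSet might look like the sketch below. The API group, the spec field names, and the naming scheme are all hypothetical, because the article does not show the CRD schemas.

```python
GROUP_VERSION = "example.com/v1alpha1"  # hypothetical API group and version


def cluster_deployment(name, version, roles):
    """Build a ClusterDeployment-like object; spec fields are assumptions."""
    return {
        "apiVersion": GROUP_VERSION,
        "kind": "ClusterDeployment",
        "metadata": {"name": name},
        "spec": {"kubernetesVersion": version, "machineSets": roles},
    }


def machine_sets(deployment):
    """Derive one MachineSet per role from a ClusterDeployment."""
    cluster = deployment["metadata"]["name"]
    sets = []
    for role, replicas in deployment["spec"]["machineSets"].items():
        sets.append({
            "apiVersion": GROUP_VERSION,
            "kind": "MachineSet",
            "metadata": {"name": f"{cluster}-{role}"},
            "spec": {"role": role, "replicas": replicas},
        })
    return sets


cd = cluster_deployment(
    "dc1-prod", "1.20.4", {"control": 3, "etcd": 3, "compute": 50}
)
for machine_set in machine_sets(cd):
    print(machine_set["metadata"]["name"], machine_set["spec"]["replicas"])
```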

The operator also relies on Ansible-based executors, built from Jobs, ConfigMaps, and Secrets, to run playbooks and track their results.
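A Job that runs a playbook with its inventory mounted from a ConfigMap could be assembled as follows. The Job fields are standard batch/v1; the image name, mount path, playbook location, and the assumption that the ConfigMap has a key named inventory are hypothetical.

```python
def playbook_job(cluster, action, image, configmap):
    """Build a batch/v1 Job manifest that runs one Ansible playbook."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"{cluster}-{action}",
            "labels": {"cluster": cluster, "action": action},
        },
        "spec": {
            # Do not retry automatically; a failed run should be inspected.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "ansible",
                        "image": image,
                        # Assumes the ConfigMap has a key named "inventory".
                        "command": [
                            "ansible-playbook",
                            "-i", "/config/inventory",
                            f"/playbooks/{action}.yml",
                        ],
                        "volumeMounts": [
                            {"name": "config", "mountPath": "/config"}
                        ],
                    }],
                    "volumes": [
                        {"name": "config", "configMap": {"name": configmap}}
                    ],
                }
            },
        },
    }


job = playbook_job(
    "dc1-prod", "install",
    "registry.example.com/ansible-runner:latest",  # hypothetical image
    "dc1-prod-inventory",
)
print(job["metadata"]["name"])
```

The Job's completion status then gives the operator a natural place to read whether the playbook succeeded.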

3.3 Architecture

The operator runs in a metadata cluster that manages multiple business clusters, providing centralized multi‑cloud management, unified scheduling, high availability, and disaster recovery.

3.4 Execution Flow

Admin or platform creates a ClusterDeployment CR.

The controller detects the change and creates the MachineSet and its associated Machine resources.

The ClusterInstall controller generates ConfigMaps and Jobs that run the appropriate Ansible playbooks for install, scale, or upgrade.

Scheduler binds the Job’s Pod.

Kubelet executes the playbook inside the Pod.

The Job controller updates the ClusterDeployment status and cleans up resources.

The NodeHealthy controller syncs node readiness to the Machine status.

Add‑on controller installs or upgrades add‑ons once the cluster is ready.
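The readiness sync in the flow above can be sketched as a function over a node list shaped like the output of kubectl get nodes -o json. The Ready condition is part of the standard Node status; the "Ready" and "NotReady" strings written to the Machine are an assumption.

```python
def node_ready(node):
    """Return True if the node reports a Ready condition with status True."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    # No Ready condition reported yet: treat the node as not ready.
    return False


def machine_statuses(node_list):
    """Map each node name to the status string for its Machine."""
    return {
        item["metadata"]["name"]: "Ready" if node_ready(item) else "NotReady"
        for item in node_list.get("items", [])
    }


# Made-up node list for illustration.
nodes = {
    "items": [
        {"metadata": {"name": "node-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "node-2"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
        {"metadata": {"name": "node-3"}, "status": {}},
    ]
}
print(machine_statuses(nodes))
```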

Conclusion

Vivo’s large‑scale Kubernetes operations combine optimized deployment tooling, extensive CI matrix testing, and a custom operator that delivers Kubernetes as a service on top of Kubernetes. Together these improve the safety, stability, and manageability of clusters across many data centers and prepare the ground for future multi‑cloud integration.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
