
How Vivo Built a Scalable Karmada Operator with Ansible for Multi‑Cluster Management

Vivo’s engineering team shares its practical experience building a Karmada‑Operator with the Operator SDK and Ansible, covering the background, deployment challenges, design choices, API and architecture, etcd management, member‑cluster handling, CI pipeline, and performance testing that enable robust multi‑cloud Kubernetes orchestration.


Background

Karmada is an open‑source cloud‑native multi‑cloud container orchestration project that has attracted many enterprises and is running in production. Multi‑cloud has become a foundational infrastructure for data‑center construction, driving rapid development of multi‑region disaster recovery, large‑scale multi‑cluster management, cross‑cloud elasticity, and migration scenarios.

As Vivo migrated its business to Kubernetes, the size and number of clusters grew rapidly, increasing operational difficulty. After an internally built multi‑cluster management solution still fell short, the team evaluated community projects and chose Karmada for the following reasons:

Unified management of multiple Kubernetes clusters, reducing platform complexity.

Cross‑cluster elastic scaling and scheduling to improve resource utilization and cut costs.

Karmada uses native Kubernetes APIs, lowering migration effort.

Disaster recovery: decoupled control plane and member clusters enable resource reallocation on failures.

Extensibility: custom scheduling plugins and OpenKruise interpreter plugins can be added.

Karmada‑Operator Implementation

2.1 Operator SDK Overview

The Operator Framework provides a toolkit for building Kubernetes native applications (Operators) in an automated, scalable way. Operators simplify management of complex, stateful workloads by leveraging Kubernetes extensibility for provisioning, scaling, backup, and recovery.

Writing Operators can be challenging due to low‑level APIs, boilerplate code, and lack of modularity. The Operator SDK mitigates these challenges by offering high‑level APIs, scaffolding, code generation, and extensions for common use cases.

2.2 Solution Selection

Option 1: Go‑based Operator – suited for stateful services on Kubernetes but limited for binary deployments, external etcd, and member‑cluster registration.

Option 2: Ansible‑based Operator – supports both Kubernetes‑based and binary deployments, external etcd, and member‑cluster lifecycle via SSH and Ansible modules.

Option 3: Hybrid Go + Ansible Operator – combines capabilities of Option 2 with Go‑level flexibility.

After evaluating the three options, Vivo selected the Ansible‑based Operator (Option 2) because it provides feature parity with the Go SDK, matches Karmada’s production requirements, is easy to learn for Ansible users, offers strong extensibility, and avoids the need for extensive Go code.
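In an Ansible‑based operator, the SDK wires each watched CRD to an Ansible role or playbook through a `watches.yaml` file, so reconciliation logic lives in Ansible rather than Go. The group, version, and role paths below are illustrative assumptions, not Vivo’s actual project layout:

```yaml
# watches.yaml — maps each watched custom resource to the Ansible role
# that reconciles it. Group/version and role paths are illustrative.
- group: operator.karmada.io
  version: v1alpha1
  kind: KarmadaDeployment
  role: /opt/ansible/roles/karmada
  reconcilePeriod: 60s
- group: operator.karmada.io
  version: v1alpha1
  kind: EtcdBackup
  role: /opt/ansible/roles/etcd-backup
```

On each reconcile, the operator converts the resource’s `spec` into extra variables and invokes the mapped role, which is what makes this option approachable for teams already fluent in Ansible.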

2.3 API Design

The Operator SDK can generate a CRD named `KarmadaDeployment`. Additional CRDs, `EtcdBackup` and `EtcdRestore`, are defined for etcd data management. The `spec` fields are translated into Ansible variables, and the `status` is populated by the Ansible runner or the `k8s_status` module.
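A `KarmadaDeployment` resource might look like the sketch below. The field names are hypothetical, chosen only to show how `spec` maps to Ansible variables and how `status` is written back; Vivo’s published schema may differ:

```yaml
# Hypothetical KarmadaDeployment CR — field names are illustrative,
# not the project's actual schema.
apiVersion: operator.karmada.io/v1alpha1
kind: KarmadaDeployment
metadata:
  name: karmada-demo
spec:
  mode: binary              # assumed switch: "container" or "binary" deployment
  karmadaVersion: v1.4.0
  etcd:
    replicas: 3
    external: false         # whether to manage etcd or point at an external cluster
status:
  phase: Running            # written back by the Ansible runner / k8s_status module
```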

2.4 Architecture Design

The architecture supports both containerized and binary deployments. Containerized deployment relies solely on Kubernetes APIs, while binary deployment uses SSH to manage the Karmada control plane and member clusters. Member clusters are registered via provided kubeconfig and credentials defined in the CR.

2.5 Control Plane Management

Standardized certificate management using OpenSSL, separating etcd and Karmada certificates.

Karmada‑apiserver can use external load balancers instead of Kubernetes Services.

Flexible upgrade strategies supporting component‑wise and full‑cluster upgrades.

Rich global variable definitions to enable component configuration changes.
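The global variables mentioned above would typically live in Ansible `group_vars`, letting operators change component configuration without editing roles. The variable names here are assumptions for illustration:

```yaml
# group_vars/all.yml — illustrative variable names, not Vivo's actual ones.
karmada_version: v1.4.0
karmada_apiserver_lb: 10.0.0.100:5443   # external load balancer instead of a Service
etcd_data_dir: /var/lib/etcd
cert_tool: openssl                      # standardized certificate generation
upgrade_strategy: component             # "component" for per-component, "all" for full-cluster
```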

2.6 etcd Cluster Management

etcd is the metadata store for Karmada and must be highly available in production. The Operator provides Ansible plugins to manage etcd clusters, including adding/removing members, backup (e.g., to CephFS), recovery, and health checks.
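A backup role along these lines could wrap `etcdctl snapshot save` and archive the result to a CephFS mount. The paths, endpoints, and certificate locations are assumptions; only the `etcdctl` invocation itself follows the tool’s documented interface:

```yaml
# Illustrative etcd-backup tasks — paths and endpoints are assumptions.
- name: Take an etcd snapshot
  command: >
    etcdctl snapshot save /var/backups/etcd/snapshot-{{ ansible_date_time.date }}.db
    --endpoints=https://127.0.0.1:2379
    --cacert=/etc/karmada/pki/etcd/ca.crt
    --cert=/etc/karmada/pki/etcd/server.crt
    --key=/etc/karmada/pki/etcd/server.key
  environment:
    ETCDCTL_API: "3"

- name: Archive the snapshot to a CephFS-backed directory
  copy:
    src: "/var/backups/etcd/snapshot-{{ ansible_date_time.date }}.db"
    dest: "/mnt/cephfs/etcd-backups/snapshot-{{ ansible_date_time.date }}.db"
    remote_src: true
```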

2.7 Member Cluster Management

Member clusters are registered and deregistered through dynamic Ansible inventory generation based on the `KarmadaDeployment` spec. Two roles, `add-member` and `del-member`, handle join and unjoin operations, supporting concurrent processing and an optional SSH mode.
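An `add-member` task could delegate to `karmadactl join` for each host in the dynamically generated inventory. The variable names and concurrency limit are assumptions; `--kubeconfig` and `--cluster-kubeconfig` are real `karmadactl join` flags:

```yaml
# Illustrative add-member task — variable names are assumptions.
- name: Join member cluster into the Karmada control plane
  command: >
    karmadactl join {{ inventory_hostname }}
    --kubeconfig {{ karmada_kubeconfig }}
    --cluster-kubeconfig {{ member_kubeconfig }}
  delegate_to: localhost
  throttle: 5    # bound how many members join concurrently
```

A `del-member` counterpart would call `karmadactl unjoin` with the same pattern.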

CI Introduction

To improve developer experience, Vivo built a CI pipeline based on GitHub self‑hosted runners and KubeVirt VMs. The pipeline executes syntax and unit tests, provisions VMs, deploys one host cluster and two member clusters, installs Karmada, and runs e2e and Bookinfo tests. Planned CI matrix tests include linting (ansible‑lint, shellcheck, yamllint, etc.), full deployment validation (karmadactl, charts, binary), member join/unjoin, Karmada upgrades, etcd backup/restore, and performance testing with 2,000‑node simulations.
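The pipeline stages described above could be expressed as a GitHub Actions workflow on a self‑hosted runner. The job layout and helper scripts here are assumptions, not Vivo’s actual workflow:

```yaml
# Illustrative workflow — job names and hack/ scripts are assumptions.
name: e2e
on: [pull_request]
jobs:
  e2e:
    runs-on: self-hosted      # runner with access to the KubeVirt environment
    steps:
      - uses: actions/checkout@v3
      - name: Syntax and unit tests
        run: ansible-lint . && yamllint .
      - name: Provision KubeVirt VMs and deploy host + member clusters
        run: ./hack/provision-vms.sh
      - name: Install Karmada and run e2e / Bookinfo tests
        run: ./hack/run-e2e.sh
```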

Conclusion

Through community research and Vivo’s practice, the Karmada‑Operator design was finalized. The Ansible‑based Operator offers high extensibility, reliability, intuitive logic authoring, and out‑of‑the‑box functionality, providing a robust foundation for managing Karmada at scale. Remaining challenges include adding webhook support and richer CRD scaffolding. Ongoing development will continue to enhance features and stability.

Tags: cloud native, Kubernetes, multi-cluster, Karmada, Ansible, Operator SDK
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
