
Why Multi-Cluster Kubernetes Matters and How Vivo Tackles It

This article examines the motivations, benefits, and existing solutions for Kubernetes multi‑cluster management, then details Vivo's non‑federated and federated approaches, application‑centric continuous delivery, elastic scaling, unified scheduling, gray‑release strategies, and summarizes the current state and challenges.


Why Multi‑Cluster Is Needed

With the rapid growth of Kubernetes and cloud‑native technologies, containerized workloads have become standardized and decoupled from underlying infrastructure, providing a solid foundation for multi‑cluster and hybrid‑cloud deployments.

1. Single‑cluster capacity limits

A single Kubernetes cluster is tested by the community up to 5,000 nodes and 150,000 Pods; the practical node ceiling varies with deployment patterns and workload characteristics.

2. Multi‑cloud usage

Avoid vendor lock‑in and leverage the latest technologies across different clouds for cost or capability reasons.

3. Traffic bursts

During sudden traffic spikes, workloads can be expanded to public‑cloud clusters, requiring IaaS integration for automatic scaling of CPU‑ and memory‑intensive services.

4. High availability

Single clusters cannot survive network or data‑center failures; a primary‑backup model or read‑write separation across clusters ensures continuity.

5. Geo‑distributed active‑active

Real‑time data synchronization enables simultaneous reads and writes across clusters for critical data such as global user accounts.

6. Regional affinity

Placing services in the same region reduces bandwidth costs and balances load locally.

Multi‑Cluster Exploration

2.1 Community Projects

Federation v1 : Retired; it introduced a separate federation API layer that mirrored, but lagged behind and diverged from, native Kubernetes APIs.

Federation v2 (KubeFed) : Also retired; it wraps resources in Federated* CRD types instead of using native APIs, and its scheduling goes little beyond basic resource propagation.

Karmada : Builds on Federation v2 concepts, adding native API support, multi‑level HA, automatic fault‑migration, cross‑cluster autoscaling, and service discovery.

Clusternet : Open‑source platform for multi‑cluster management and cross‑cluster application orchestration, designed for hybrid‑cloud, distributed‑cloud, and edge scenarios.

OCM (Open Cluster Management) : Simplifies multi‑cloud cluster management, supports resource and workload orchestration, and offers an extensible addon framework.

2.2 Vivo’s Exploration

2.2.1 Non‑Federated Cluster Management

Vivo uses a unified web UI to import Kubernetes cluster credentials, view resources, and manage Deployments, Services, and LoadBalancers without adding federation complexity. CI/CD, monitoring, and alerting are integrated, and most workloads remain managed as independent clusters.

2.2.2 Federated Cluster Management

Federation unifies resource management and scheduling across clusters, supporting hybrid‑cloud, private‑cloud, and edge deployments. Although it adds architectural complexity and control‑plane overhead, it enables capabilities such as transparent workload migration and cross‑cluster application orchestration.

Vivo’s federated direction focuses on four areas:

Resource distribution and orchestration

Elastic burst handling

Multi‑cluster scheduling

Service governance and traffic routing

Application‑Oriented Multi‑Cluster Practices

Elasticity : Ensures rapid deployment, scaling, and reliable service delivery.

Usability : Leverages Service Mesh for global governance of micro‑service applications.

Portability : Enables seamless migration across clusters and clouds.

3.1 Continuous Delivery

Vivo registers multiple Kubernetes clusters with Karmada, which handles resource scheduling and fault‑tolerance. The container platform manages K8s resources, Karmada policies, and configurations. CI/CD performs unit tests, security scans, image builds, and generates K8s objects via the platform API for unified delivery.

For complex scenarios such as in‑place upgrades or gray releases, OpenKruise is used. Together with Karmada resources such as PropagationPolicy and OverridePolicy, a single application can involve up to twelve configuration objects.
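As a sketch of how two of these objects fit together, a PropagationPolicy can distribute a Deployment across member clusters while an OverridePolicy adjusts per‑cluster details such as the image registry. The cluster names, weights, and registry below are illustrative assumptions, not Vivo's actual configuration:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-app-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  placement:
    clusterAffinity:
      clusterNames: [member1, member2]   # assumed member-cluster names
    replicaScheduling:
      replicaSchedulingType: Divided     # split replicas across clusters
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          - targetCluster:
              clusterNames: [member1]
            weight: 1
          - targetCluster:
              clusterNames: [member2]
            weight: 2
---
apiVersion: policy.karmada.io/v1alpha1
kind: OverridePolicy
metadata:
  name: demo-app-override
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  overrideRules:
    - targetCluster:
        clusterNames: [member2]
      overriders:
        imageOverrider:
          - component: Registry          # rewrite only the registry part
            operator: replace
            value: registry.member2.example.com   # hypothetical registry
```

With a 1:2 static weight, a Deployment of 9 replicas would land as 3 replicas in member1 and 6 in member2.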

3.2 Elastic Scaling

3.2.1 FedHPA (Cross‑Cluster HPA)

FedHPA uses native HPA objects; Karmada’s FedHpaController distributes min/max replica settings across member clusters and keeps status synchronized.
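Because FedHPA builds on native HPA objects, what a user submits is simply a standard autoscaling/v2 HorizontalPodAutoscaler; the workload name and thresholds below are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app        # assumed workload name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

The controller then splits the min/max bounds across member clusters according to the scheduling weights and aggregates per‑cluster status back into this object.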

3.2.2 CronHPA (Scheduled Scaling)

CronHPA defines time‑based scaling windows. The controller creates a CronHPA resource, which karmada-scheduler translates into per‑cluster replica allocations using the go-cron library.
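A CronHPA object of this kind might look roughly as follows. Note this is a hypothetical sketch: CronHPA is a custom resource in Vivo's platform, and the API group, version, and field names below are assumptions modeled on common CronHPA implementations, not a published schema:

```yaml
# Hypothetical schema for illustration only.
apiVersion: autoscaling.example.com/v1alpha1   # assumed API group/version
kind: CronHPA
metadata:
  name: demo-app-cronhpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-app
  jobs:
    - name: scale-up-peak
      schedule: "0 0 8 * * *"    # go-cron expression (seconds field first)
      targetSize: 20             # total replicas across member clusters
    - name: scale-down-night
      schedule: "0 0 23 * * *"
      targetSize: 4
```

On each trigger, the scheduler would translate the total targetSize into per‑cluster replica allocations, mirroring the weighted division used for normal propagation.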

3.2.3 Manual & Targeted Scaling

Users specify a workload and desired replica count; karmada-scheduler distributes the change across clusters. Targeted scaling can delete specific Pods via ScaleStrategy.PodsToDelete and custom resource interpretation.
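The PodsToDelete mechanism comes from OpenKruise's CloneSet: when replicas is reduced, Pods listed under scaleStrategy.podsToDelete are removed first instead of letting the controller pick. A minimal sketch, with an assumed application name and a hypothetical Pod name:

```yaml
apiVersion: apps.kruise.io/v1alpha1
kind: CloneSet
metadata:
  name: demo-app
spec:
  replicas: 3                 # desired count after the targeted scale-down
  selector:
    matchLabels:
      app: demo-app
  scaleStrategy:
    podsToDelete:
      - demo-app-x7k2p        # hypothetical Pod chosen for deletion
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: app
          image: nginx:1.25
```

In a federated setup, a custom resource interpreter teaches Karmada how to read and patch these CloneSet fields so the targeted deletion reaches the right member cluster.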

3.3 Unified Scheduling

3.3.1 Multi‑Cluster Scheduling

Karmada’s scheduler and emulator estimate resources per cluster. Workloads generate ResourceBinding (RB) objects, which are pre‑selected and then optimally assigned to clusters using static or dynamic strategies.
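The static and dynamic strategies differ in how replica weights are derived. Instead of the fixed staticWeightList, a placement can ask the scheduler to weight clusters by their estimated free capacity; the fragment below shows the dynamic form (workload and policy names assumed):

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: demo-app-dynamic
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-app
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        dynamicWeight: AvailableReplicas   # weight clusters by estimated spare capacity
```

Here the scheduler consults the per‑cluster resource estimators to compute how many replicas each candidate cluster can still hold, then divides the ResourceBinding accordingly.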

3.3.2 Rescheduling

If a cluster fails or RB allocation deviates from expectations, Karmada re‑evaluates and redistributes workloads to healthy clusters.

3.3.3 Single‑Cluster Scheduler Simulation

Vivo’s single‑cluster scheduler simulator currently models four scheduling algorithms using a fake client; further work is needed before its results match production scheduler behavior.

3.4 Gray Release

3.4.1 Application Migration

Non‑federated applications are gradually migrated to Karmada via a whitelist, allowing seamless user experience while both management modes coexist.

3.4.2 Rollback

When migration errors occur, administrators remove the application from the whitelist, annotate workloads, and adjust Karmada interpreters to prevent further replica changes, effectively halting control‑plane actions.

3.4.3 Migration Strategy

Test → Pre‑release → Production

Batch gray rollout with a 1:2:7 ratio for major changes

Platform and application teams both verify and monitor for 5‑10 minutes

Proceed if no anomalies; otherwise trigger rollback

Summary

Vivo currently relies on non‑federated multi‑cluster management combined with CI/CD to provide rolling updates, gray releases, manual and targeted scaling, and elastic scaling. While non‑federated solutions lack unified resource management, fault‑tolerance, and cross‑cluster scheduling, Vivo is actively exploring these capabilities through federated approaches. Federation adds architectural complexity and control‑plane overhead, and the ecosystem is still evolving, so enterprises should align federation adoption with their specific needs and robust operational monitoring.

References

GitHub: kubernetes-retired/federation

GitHub: kubernetes-retired/kubefed

GitHub: karmada-io/karmada

GitHub: clusternet/clusternet

GitHub: open-cluster-management-io/ocm

GitHub: kubernetes-sigs/cluster-api

GitHub: clusterpedia-io/clusterpedia

GitHub: submariner-io/submariner

GitHub: karmada-io/multi-cluster-ingress-nginx

GitHub: istio/istio

GitHub: cilium/cilium

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
