Why Enterprises Need Multi‑Cluster Kubernetes and How to Implement It
This article explains why modern enterprises adopt multiple Kubernetes clusters, covering single‑cluster capacity limits, hybrid‑cloud requirements, fault‑tolerance concerns, the benefits of multi‑cluster setups, architectural models, and community‑driven implementation patterns.
As Kubernetes becomes increasingly adopted in enterprises, many companies operate multiple clusters in production. This article discusses considerations for multi‑cluster Kubernetes, including why to choose it, its benefits, and implementation approaches.
VMware's State of Kubernetes 2020 report noted that 20% of organizations run more than 40 clusters.
Why do enterprises need multiple clusters?
Single‑cluster capacity limits
The official documentation for v1.12 states that a single Kubernetes cluster supports up to 5,000 nodes, 150,000 total pods, 300,000 total containers, and no more than 110 pods per node. These limits remained unchanged through v1.20, suggesting that raising single-cluster capacity is not a community priority.
If a workload requires more than 5,000 nodes, enterprises must consider running multiple clusters.
Hybrid‑cloud or multi‑cloud architectures
Multi‑cloud or hybrid‑cloud setups are common. Global companies may run services across regions, or combine on‑premises data centers with public clouds such as Alibaba Cloud for elastic traffic. Public clouds also have finite resources and require advance provisioning for large promotions.
To avoid vendor lock‑in and control costs, many enterprises adopt multi‑cloud architectures, which naturally lead to multiple clusters.
Don't put all your eggs in one basket
Deploying all workloads to a single cluster creates a single point of failure. If the control plane fails, all services are impacted. Although the control plane is designed to be highly available, production incidents have shown that heavy API‑server traffic can cause outages.
Therefore, production environments need strict API‑server access controls, thorough testing, and possibly separating business workloads from infrastructure.
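As one concrete form of such API-server access control, Kubernetes offers API Priority and Fairness (beta as of v1.20). The sketch below, in which the service-account and priority-level names are illustrative, routes requests from a batch service account into a low-priority queue so bulk traffic cannot starve the control plane:

```yaml
# Sketch only: FlowSchema assigning a batch workload's requests to the
# built-in "workload-low" priority level (subject names are assumptions).
apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
kind: FlowSchema
metadata:
  name: low-priority-batch
spec:
  priorityLevelConfiguration:
    name: workload-low          # queue with limited concurrency shares
  matchingPrecedence: 1000      # evaluated after more specific schemas
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: batch-jobs        # hypothetical heavy API consumer
        namespace: default
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
```

Requests matched by this schema queue behind higher-priority control-plane traffic instead of competing with it for API-server capacity.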
Benefits of multiple clusters
Multiple clusters improve:
Availability
Isolation
Scalability
Multi‑cluster application architectures
Two common models:
Replica model: Deploy full copies of an application to several availability zones or data centers. Smart DNS or global load balancers route traffic to the nearest healthy cluster, providing low latency and failover.
Service‑based partitioning: Deploy services to different clusters based on business relevance, offering strong isolation at the cost of more complex service division.
Community implementation patterns
Two main approaches are being explored.
Kubernetes‑centric
Extending core Kubernetes primitives to support multi‑cluster use cases, as done by the Kubernetes Cluster Federation (KubeFed) project. Federation provides a logical control plane above the member clusters, coordinating their individual control planes to enable cross‑cluster resource distribution and multi‑cluster service discovery.
Federation achieves:
Cross‑cluster resource propagation using Templates, Placement, and Overrides, allowing Deployments to be distributed and scaled across clusters.
Multi‑cluster service discovery for Services and Ingresses (currently alpha).
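To make the Template/Placement/Overrides model concrete, a KubeFed FederatedDeployment might look like the following sketch (cluster names, namespace, and image are illustrative):

```yaml
# Sketch only: distribute a Deployment to two member clusters,
# overriding the replica count in one of them.
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: demo
  namespace: demo
spec:
  template:                      # the Deployment to propagate
    metadata:
      labels:
        app: demo
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: demo
      template:
        metadata:
          labels:
            app: demo
        spec:
          containers:
          - name: demo
            image: nginx:1.19
  placement:                     # which clusters receive it
    clusters:
    - name: cluster-beijing
    - name: cluster-shanghai
  overrides:                     # per-cluster customization
  - clusterName: cluster-shanghai
    clusterOverrides:
    - path: "/spec/replicas"
      value: 5
```

The host cluster's KubeFed controllers render a plain Deployment per target cluster, applying each cluster's overrides before propagation.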
Network‑centric
This approach focuses on establishing network connections between clusters so that workloads can communicate directly. Service‑mesh solutions such as Istio, Linkerd, and Consul Mesh provide multi‑cluster traffic management. Cilium’s Cluster Mesh offers pod‑IP routing across clusters via tunnels or direct routing, without requiring gateways.
Each cluster maintains its own etcd; states never mix.
Etcd proxies expose cluster state; Cilium agents in other clusters watch changes and replicate relevant state.
Cross‑cluster access is read‑only, preventing fault propagation.
Configuration is done via a simple Kubernetes Secret containing the etcd proxy address, the cluster name, and certificates.
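A rough sketch of that Secret, modeled on Cilium's etcd-based Cluster Mesh setup (cluster name, endpoint, and file paths are illustrative):

```yaml
# Sketch only: one entry per remote cluster, keyed by cluster name.
# Cilium agents read these files to watch the remote etcd proxy.
apiVersion: v1
kind: Secret
metadata:
  name: cilium-clustermesh
  namespace: kube-system
stringData:
  cluster2: |
    endpoints:
    - https://cluster2.mesh.example.com:2379
    trusted-ca-file: /var/lib/cilium/clustermesh/cluster2-ca.crt
    cert-file: /var/lib/cilium/clustermesh/cluster2.crt
    key-file: /var/lib/cilium/clustermesh/cluster2.key
```

Mounting this Secret into the Cilium agents on each cluster gives them read-only access to the other cluster's state, in line with the isolation properties described above.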
Reflection
The two patterns are not mutually exclusive. In practice, many companies combine them: federation for deployment and release, and a service mesh for cross‑cluster traffic. In such a setup, workloads, the mesh control plane, and gateways must all integrate with an external service registry. The diagram below illustrates such a combined architecture.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.