How Ant Financial Automates Tens of Thousands of Kubernetes Nodes with Operators
Ant Financial tackles the challenge of managing dozens of Kubernetes clusters and over a hundred thousand worker nodes by employing a meta‑cluster with Kube‑on‑Kube and Node Operators, enabling automated lifecycle management, scaling, upgrades, and fault recovery for both master components and worker nodes.
1. Prerequisite Knowledge
This section briefly introduces the Kubernetes architecture for readers new to the platform. A Kubernetes cluster consists of a set of Master nodes and many Worker nodes. Masters run etcd, apiserver, scheduler, and controller‑manager as static Pods, typically three replicas each for high availability. Workers run Pods and require on‑host components such as kubelet, a container runtime (Docker, Pouch, etc.), and CNI plugins.
2. Background
Ant Financial needed to operate dozens of Kubernetes clusters with more than 100,000 Worker nodes. The operational workload was split into two parts:
Managing the Master components of each cluster (etcd, apiserver, controller‑manager, scheduler, etc.).
Managing the Worker nodes.
Key challenges included rapid creation and deletion of clusters, version management of Master components, automated fault handling, unified status view, and large‑scale Worker node lifecycle operations such as onboarding, upgrades, gray‑release, and fault recovery.
3. Implementation
Ant Financial adopted a combination of Kube‑on‑Kube‑Operator and Node‑Operator to address the challenges.
Kube‑on‑Kube‑Operator watches Cluster CRD resources. When a user submits a Cluster CRD describing a business cluster, the operator creates the required Master components inside a meta‑cluster. The meta‑cluster hosts the Master components for all business clusters, and its Worker nodes serve as the Masters of the business clusters.
Node‑Operator watches Machine CRD resources. When a Machine CRD describing a Worker node is submitted, the operator installs the necessary on‑host software (docker, kubelet, CNI, etc.) and brings the node to a Ready state in the target business cluster.
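The Node‑Operator behaviour described above can be pictured as an idempotent reconcile step. The following is only a minimal sketch under stated assumptions: the Machine class, its fields, and the reconcile_machine function are hypothetical names invented for illustration, and the "install" action is a stand-in for whatever the real operator does on the host.

```python
from dataclasses import dataclass, field
from typing import List

# On-host components named in the article; the tuple itself is illustrative.
REQUIRED_COMPONENTS = ("docker", "kubelet", "cni")


@dataclass
class Machine:
    """Hypothetical in-memory stand-in for a Machine custom resource."""
    name: str
    cluster: str                                  # target business cluster
    installed: set = field(default_factory=set)   # software already on the host
    ready: bool = False


def reconcile_machine(machine: Machine) -> List[str]:
    """Drive the machine toward Ready and return the actions taken.

    Calling it again on an already reconciled machine returns an empty list,
    which is the property that makes a reconcile loop safe to repeat.
    """
    actions = []
    for component in REQUIRED_COMPONENTS:
        if component not in machine.installed:
            machine.installed.add(component)      # stand-in for a real install
            actions.append(f"install {component}")
    if not machine.ready:
        machine.ready = True                      # stand-in for joining the cluster
        actions.append(f"join {machine.cluster}")
    return actions
```

Running the function twice on the same Machine shows the intent: the first call installs what is missing and joins the node, the second call does nothing.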
Example etcd‑operator CRD:
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: xxx-etcd-cluster
spec:
  size: 5
Setting spec.size=5 creates a five‑node etcd cluster; changing it to 3 would create a three‑node cluster.
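To make the effect of spec.size concrete, here is a rough sketch of how an operator might turn a desired size into membership changes. It is not etcd‑operator's actual code: the function plan_resize, the member naming scheme, and the one-member-per-step policy are assumptions made for this illustration.

```python
from typing import List, Tuple


def plan_resize(members: List[str], desired_size: int) -> List[Tuple[str, str]]:
    """Return the ordered ("add" | "remove", member_name) steps needed to
    move from the current member list to desired_size members."""
    members = list(members)            # work on a copy, leave the input untouched
    steps = []
    counter = 0
    while len(members) < desired_size:
        name = f"member-{counter}"     # hypothetical naming scheme
        counter += 1
        if name in members:
            continue                   # name already taken, try the next one
        members.append(name)
        steps.append(("add", name))
    while len(members) > desired_size:
        steps.append(("remove", members.pop()))
    return steps
```

With three existing members and a desired size of 5 the plan contains two "add" steps; going from five back to 3 yields two "remove" steps; when the sizes already match the plan is empty.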
Kube‑on‑Kube‑Operator Details
The Cluster CRD captures information such as business cluster name, deployment mode (standard or minimal), Master node selector, component versions, etcd volume configuration (ClaimTemplate or VolumeSource), certificate expiration handling, extra user kubeconfig, and status generated by the operator.
Master components are realized as Deployments, Services, Pods, PVCs, and Secrets within the meta‑cluster. For example, the apiserver uses a Deployment plus a headless Service for intra‑cluster communication and a DNS‑RR Service for external access.
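As an illustration of the apiserver layout just described, the sketch below renders a Deployment and a headless Service as plain dictionaries. The function name, object names, labels, and image registry are hypothetical, and the externally facing DNS‑RR Service is left out because the article does not say how it is configured.

```python
from typing import Dict, List


def render_apiserver_objects(cluster: str, version: str, replicas: int = 3) -> List[Dict]:
    """Build manifests for one business cluster's apiserver in the meta-cluster."""
    labels = {"cluster": cluster, "component": "apiserver"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{cluster}-apiserver", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "kube-apiserver",
                        # Hypothetical registry; only the version comes from the spec.
                        "image": f"registry.example.com/kube-apiserver:{version}",
                    }],
                },
            },
        },
    }
    headless_service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{cluster}-apiserver", "labels": labels},
        # clusterIP "None" is what makes a Kubernetes Service headless.
        "spec": {"clusterIP": "None", "selector": labels},
    }
    return [deployment, headless_service]
```

Because every object carries the business cluster's name in its labels, one meta‑cluster can host the Master components of many business clusters without their selectors overlapping.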
Node‑Operator Details
The Machine CRD records metadata (IP, hostname, IDC), SSH login method and credentials, on‑host software versions, and generated status. The operator monitors node conditions, performs automated recovery, synchronizes component versions based on a ClusterPackageVersion CRD, and supports gray‑release of software to selected nodes.
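The version synchronization and gray‑release behaviour can be sketched as a comparison between the versions each node reports and the target versions. This is a simplified illustration, not the operator's real logic: the function plan_upgrades, the dictionary shapes, and the gray_nodes parameter are assumptions, with target_versions standing in for the contents of a ClusterPackageVersion resource.

```python
from typing import Dict, Optional, Set


def plan_upgrades(
    machines: Dict[str, Dict[str, str]],
    target_versions: Dict[str, str],
    gray_nodes: Optional[Set[str]] = None,
) -> Dict[str, Dict[str, str]]:
    """Return {node: {component: target_version}} for nodes that have drifted.

    machines maps a node name to its installed component versions.
    If gray_nodes is given, only those nodes are considered (gray release);
    if it is None, every node is eligible.
    """
    plan = {}
    for name, installed in sorted(machines.items()):
        if gray_nodes is not None and name not in gray_nodes:
            continue                   # node is outside the current gray batch
        drift = {
            component: version
            for component, version in target_versions.items()
            if installed.get(component) != version
        }
        if drift:
            plan[name] = drift
    return plan
```

Restricting gray_nodes to a small set first and then widening it to None mirrors the usual gray‑release progression: a few selected nodes are upgraded, observed, and only then is the change rolled out everywhere.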
4. Conclusion
By replacing traditional manual processes with Operator‑based automation, Ant Financial achieved “Kubernetes as a Service” for both cluster lifecycle and Worker node management. Cluster creation, scaling, upgrades, and fault recovery are now handled programmatically via the apiserver, simplifying operations at massive scale.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
