Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System
This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.
Kubernetes has become the leading container orchestration platform, and many companies, including Alibaba and Ant Financial, run it in production. Operating a highly available, large‑scale cluster nevertheless remains challenging.
The management system must support convenient lifecycle actions (cluster creation, upgrade, and node management) while keeping every change controlled and observable. It also needs to handle frequent hardware failures and component anomalies in clusters that can exceed 10,000 nodes.
Ant Financial adopts an end‑state‑driven design inspired by negative‑feedback control: a periodic loop checks the current cluster state against a desired target, and Operators trigger actions to drive the cluster toward that target, providing resilience against external disturbances.
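The negative-feedback idea can be illustrated with a minimal sketch. It is not Ant Financial's implementation: the ClusterState fields, the action names, and the in-memory apply_action function are all hypothetical stand-ins for what a real Operator would do against the Kubernetes API.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    # Hypothetical, heavily simplified view of a cluster.
    version: str
    ready_nodes: int


def diff(current, desired):
    """Return the actions needed to move current toward desired."""
    actions = []
    if current.version != desired.version:
        actions.append(("upgrade", desired.version))
    if current.ready_nodes < desired.ready_nodes:
        actions.append(("add_nodes", desired.ready_nodes - current.ready_nodes))
    elif current.ready_nodes > desired.ready_nodes:
        actions.append(("remove_nodes", current.ready_nodes - desired.ready_nodes))
    return actions


def apply_action(current, action):
    """Simulate one action; a real operator would call cluster APIs here."""
    kind, arg = action
    if kind == "upgrade":
        return ClusterState(arg, current.ready_nodes)
    if kind == "add_nodes":
        return ClusterState(current.version, current.ready_nodes + arg)
    if kind == "remove_nodes":
        return ClusterState(current.version, current.ready_nodes - arg)
    raise ValueError(f"unknown action: {kind}")


def reconcile(current, desired, max_rounds=10):
    """Repeat observe -> compare -> act until no difference remains."""
    for _ in range(max_rounds):
        actions = diff(current, desired)
        if not actions:
            break
        for action in actions:
            current = apply_action(current, action)
    return current


desired = ClusterState("v2", 5)
state = reconcile(ClusterState("v1", 3), desired)
# A disturbance (one node lost) is corrected by simply reconciling again.
state = reconcile(ClusterState(state.version, state.ready_nodes - 1), desired)
```

Because the loop only ever compares observed state with the target, it does not need to know what caused a deviation, which is where the resilience to external disturbances comes from.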
The architecture consists of a high‑availability meta‑cluster that manages N business clusters, a SigmaBoss UI for user interaction, and a “Kubernetes‑on‑Kubernetes” (KOK) approach where a Cluster‑Operator in the meta‑cluster creates, deletes, and upgrades business clusters via a Cluster CRD.
Core components include the Cluster‑Operator, which watches Cluster CRDs, and the ClusterPackageVersion CRD that records master component images and parameters, enabling seamless upgrades and rollbacks of business‑cluster control planes.
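One way to picture why a separate version resource makes upgrades and rollbacks simple is the sketch below. It assumes, purely for illustration, that a Cluster refers to a ClusterPackageVersion by name and keeps a history of earlier references; the field names, the history list, and the image strings are hypothetical and not taken from the actual CRD schemas.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterPackageVersion:
    # Hypothetical shape: a named bundle of master component images.
    name: str
    images: dict


@dataclass
class Cluster:
    name: str
    package_version: str                          # name of the referenced bundle
    history: list = field(default_factory=list)   # earlier references (assumed)


def set_version(cluster, new_version):
    """Upgrade by pointing the cluster at a different bundle."""
    if new_version != cluster.package_version:
        cluster.history.append(cluster.package_version)
        cluster.package_version = new_version


def rollback(cluster):
    """Roll back by restoring the previous reference."""
    if not cluster.history:
        raise RuntimeError("no previous version to roll back to")
    cluster.package_version = cluster.history.pop()


def desired_images(cluster, catalog):
    """Resolve the images the operator should converge the masters to."""
    return dict(catalog[cluster.package_version].images)


# Placeholder image names, for illustration only.
catalog = {
    "pkg-a": ClusterPackageVersion("pkg-a", {"apiserver": "example.invalid/apiserver:a"}),
    "pkg-b": ClusterPackageVersion("pkg-b", {"apiserver": "example.invalid/apiserver:b"}),
}
cluster = Cluster("business-1", "pkg-a")
set_version(cluster, "pkg-b")
rollback(cluster)
```

In this sketch an upgrade or rollback is only a change of reference; the Cluster‑Operator's reconcile loop would then do the actual work of converging the control plane to the resolved images.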
Node management relies on Machine CRDs that describe the desired state of each worker node, MachinePackageVersion CRDs that specify component versions, and a Machine‑Operator that reconciles these resources to install, upgrade, and maintain node software.
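The node-level reconciliation can be sketched as a comparison between what is installed on a node and what its MachinePackageVersion asks for. The function below is a simplified assumption about how such a plan might be computed; the component names and version strings are hypothetical.

```python
def plan_node_changes(installed, desired):
    """Compare installed component versions with the desired ones.

    Both arguments map component name -> version string. Returns a list of
    (action, component, version) tuples in a stable, sorted order.
    """
    plan = []
    for component, version in sorted(desired.items()):
        have = installed.get(component)
        if have is None:
            plan.append(("install", component, version))
        elif have != version:
            plan.append(("upgrade", component, version))
    return plan


# Hypothetical node: kubelet is outdated and the runtime is missing.
plan = plan_node_changes(
    installed={"kubelet": "1.12"},
    desired={"kubelet": "1.14", "runtime": "1.0"},
)
```

An empty plan means the node already matches its desired state, so repeated runs are harmless.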
To coordinate multiple operators, the system introduces a full list of ReadinessGates that each node must satisfy, along with ConditionConfigMaps through which operators report their sub‑state. The workflow is as follows: external operators report conditions, Machine‑Operator aggregates them, and only nodes that meet every condition are marked schedulable.
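The aggregation step can be sketched in a few lines. The gate names below are hypothetical, and representing the reported conditions as a plain dictionary is an assumption made for illustration rather than the real ConditionConfigMap format.

```python
def unmet_gates(readiness_gates, reported):
    """Return the gates that have not been reported as True, in gate order."""
    return [gate for gate in readiness_gates if reported.get(gate) is not True]


def is_schedulable(readiness_gates, reported):
    """A node is schedulable only when every gate has been reported True."""
    return not unmet_gates(readiness_gates, reported)


# Hypothetical gates, each owned by a different external operator.
gates = ["NetworkReady", "StorageReady", "MonitorAgentReady"]
reported = {"NetworkReady": True, "StorageReady": False}
pending = unmet_gates(gates, reported)
```

Treating a missing report the same as a failed one is a deliberately conservative choice in this sketch: a node stays unschedulable until every operator has explicitly signed off.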
For fault self‑healing, a closed‑loop system combines agent reports and active monitoring to detect failures, stores events centrally, isolates faulty nodes, tags pods for migration, performs hardware repair or OS reinstall, and finally restores scheduling, with manual intervention for unrecoverable cases.
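The closed loop described above can be sketched as a small sequence of phases. The phase names, the idea of passing repair steps as callables, and the order in which they are tried are assumptions for illustration; a real system would persist events centrally and drive each phase through Kubernetes resources.

```python
from enum import Enum


class Phase(Enum):
    DETECTED = "detected"        # failure reported by agent or monitoring
    ISOLATED = "isolated"        # node taken out of scheduling
    PODS_TAGGED = "pods_tagged"  # pods marked for migration
    REPAIRING = "repairing"      # hardware repair or OS reinstall in progress
    SCHEDULABLE = "schedulable"  # node restored to scheduling
    MANUAL = "manual"            # handed over to a human


def heal(node, repair_steps):
    """Walk a faulty node through the loop and return the phases visited.

    repair_steps is a list of callables taking the node name and returning
    True on success; they are tried in order until one succeeds.
    """
    trace = [Phase.DETECTED, Phase.ISOLATED, Phase.PODS_TAGGED]
    for step in repair_steps:
        trace.append(Phase.REPAIRING)
        if step(node):
            trace.append(Phase.SCHEDULABLE)
            return trace
    trace.append(Phase.MANUAL)
    return trace


# Hypothetical steps: the first repair fails, the second (a reinstall) succeeds.
trace = heal("node-1", [lambda n: False, lambda n: True])
```

The important property is that every path ends in one of two states: the node is schedulable again, or it has been explicitly escalated for manual intervention.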
Risk mitigation is achieved through cluster‑level gray‑scale changes, automated risk assessment using health checks and business metrics, and a unified throttling service that rate‑limits high‑risk operations such as node deletion or OS reinstall, automatically circuit‑breaking unsafe changes.
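A throttling service of this kind could be approximated by a sliding-window rate limiter combined with a circuit breaker. The sketch below is one possible shape under stated assumptions: the class name, the limits, and tripping the breaker on a failed health check are all hypothetical, and time is passed in explicitly so the behavior is deterministic.

```python
from collections import deque


class Throttle:
    """Allow at most `limit` operations per `window_seconds`, unless tripped."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.events = deque()   # timestamps of recently allowed operations
        self.tripped = False    # circuit breaker state

    def allow(self, now):
        """Return True if a high-risk operation may run at time `now`."""
        if self.tripped:
            return False
        # Drop timestamps that have fallen out of the window.
        while self.events and now - self.events[0] >= self.window:
            self.events.popleft()
        if len(self.events) >= self.limit:
            return False
        self.events.append(now)
        return True

    def report_health(self, healthy):
        """Trip the breaker when a risk assessment fails (assumed policy)."""
        if not healthy:
            self.tripped = True

    def reset(self):
        """Close the breaker again, for example after human review."""
        self.tripped = False


# Hypothetical policy: at most 2 node deletions per 60 seconds.
node_delete = Throttle(limit=2, window_seconds=60)
decisions = [node_delete.allow(t) for t in (0, 1, 2, 60)]
```

Keeping the limiter in a single shared service means that every operator performing node deletion or OS reinstall is counted against the same budget, instead of each one enforcing its own limit.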
The article concludes by sharing the current Ant Financial Kubernetes management design, emphasizing operator‑based end‑state patterns, and outlining future work to extend these patterns to cluster‑scale changes for fully automated, observable, and rollback‑capable operations.
AntTech
Technology is the core driver of Ant's future creation.