Scaling Kubernetes at Ant Financial: The Kube‑on‑Kube Architecture Explained
Ant Financial’s article details how its large‑scale Kubernetes management system, built on a meta‑cluster, terminal‑state operators, and a Kube‑on‑Kube design, ensures reliable creation, upgrade, and self‑healing of thousands of nodes, while providing gray‑scale changes, risk assessment, and fault‑tolerant automation.
Kubernetes has dramatically lowered the barrier for containerized application deployment, and its advanced design has made it the leading solution for container orchestration. Ant Financial (蚂蚁金服) shares how it reliably manages massive Kubernetes clusters in production.
System Overview
The cluster management system must support convenient lifecycle operations such as cluster creation, upgrade, and worker‑node management. At large scale, how well cluster changes are controlled directly affects stability, so the system emphasizes monitoring, gray‑scale rollout, and rollback. It must also cope with the hardware failures and component anomalies that occur frequently in clusters with tens of thousands of nodes.
Design Pattern
Based on these requirements, a terminal‑state‑oriented management system was designed. Operators periodically check the current cluster state, compare it with the desired target state, and trigger a series of actions to drive the cluster toward the target. This follows a negative‑feedback closed‑loop control model, effectively resisting external disturbances such as node hardware failures.
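The closed loop described above (observe, compare, act) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Ant Financial's implementation: the ClusterState type, the diff and reconcile_once functions, and the component names and versions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterState:
    """Hypothetical state model: component name -> version string."""
    components: dict = field(default_factory=dict)


def diff(current: ClusterState, desired: ClusterState) -> list:
    """Compare current with desired and return the actions that close the gap."""
    actions = []
    for name, version in sorted(desired.components.items()):
        if current.components.get(name) != version:
            actions.append(("install", name, version))
    for name in sorted(current.components):
        if name not in desired.components:
            actions.append(("remove", name, None))
    return actions


def reconcile_once(current: ClusterState, desired: ClusterState) -> int:
    """Run one pass of the loop and return how many actions were applied."""
    actions = diff(current, desired)
    for op, name, version in actions:
        if op == "install":
            current.components[name] = version
        else:
            del current.components[name]
    return len(actions)
```

Calling reconcile_once periodically is what gives the negative‑feedback behaviour: if an external disturbance changes the current state, the next pass detects the difference and corrects it, and a pass over an already converged cluster does nothing.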
Architecture
A high‑availability meta‑cluster manages the master nodes of multiple business clusters, and the business clusters run production workloads. SigmaBoss serves as the management entry point, offering a user‑friendly UI and a controlled change workflow. The meta‑cluster runs a Cluster‑Operator that provides creation, deletion, and upgrade capabilities for business clusters. Because a Kubernetes cluster is used to manage other Kubernetes clusters, this approach is called Kube‑on‑Kube (KOK).
Within each business cluster, a Machine‑Operator and node‑fault self‑healing components manage worker nodes, providing node addition, deletion, upgrade, and fault handling.
Core Components
In the meta‑cluster, a Cluster CRD describes the desired terminal state of a business cluster. Creating, deleting, or updating a Cluster resource triggers the corresponding actions. Cluster‑Operator watches these resources and drives the master components to reach the defined state. The master component versions are stored in a ClusterPackageVersion CRD, which records images and default parameters for components such as api‑server, controller‑manager, scheduler, and operators. Updating the ClusterPackageVersion in a Cluster resource performs a rollout or rollback.
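To make the relationship between a Cluster and its ClusterPackageVersion concrete, here is a small Python sketch. The source does not show the CRD schemas, so the field names (package_version, images), the image strings, and the helper functions are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ClusterPackageVersion:
    """Hypothetical shape: a named set of master component images."""
    name: str
    images: dict  # component name -> image reference


@dataclass
class Cluster:
    """Hypothetical shape: a business cluster pointing at one package version."""
    name: str
    package_version: str  # name of the ClusterPackageVersion to run


def desired_images(cluster: Cluster, catalog: dict) -> dict:
    """Resolve the images the cluster should run from its package version."""
    return dict(catalog[cluster.package_version].images)


def plan_master_changes(running: dict, cluster: Cluster, catalog: dict) -> dict:
    """Return only the components whose running image differs from the desired one."""
    want = desired_images(cluster, catalog)
    return {comp: img for comp, img in want.items() if running.get(comp) != img}
```

In this sketch a rollout and a rollback are the same operation: the package_version reference on the Cluster is changed, and the operator works out which master components need a different image.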
Worker nodes are described by a Machine CRD. Each Machine represents a node and contains the desired component list and versions. The MachinePackageVersion CRD holds rpm versions, configurations, and installation methods for each component. Machine‑Operator watches Machine resources, resolves the associated MachinePackageVersion, and executes the necessary operations on the node to achieve and maintain the terminal state.
Node Terminal‑State Management
Node management tasks include system configuration, kernel patching, docker/kubelet lifecycle, schedulability control, and fault self‑healing. To coordinate multiple operators, the system introduces:
Full‑list ReadinessGates that record the conditions a node must satisfy before becoming schedulable.
Condition ConfigMap where external operators report sub‑state data.
The coordination flow is:
External operators detect and write their sub‑state to the corresponding Condition ConfigMap.
Machine‑Operator aggregates all condition ConfigMaps into the Machine status.
Based on ReadinessGates, the operator checks whether the node has reached the terminal state; if not, scheduling remains disabled.
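The coordination flow above can be sketched as follows. This is a simplified model, assuming each Condition ConfigMap is a dictionary with a "data" field of condition name to "True"/"False" strings; the condition names and both function names are hypothetical and are not taken from the original system.

```python
def aggregate_conditions(condition_configmaps: list) -> dict:
    """Merge the data of every Condition ConfigMap into one status dict."""
    status = {}
    for cm in condition_configmaps:
        status.update(cm.get("data", {}))
    return status


def node_is_schedulable(readiness_gates: list, status: dict) -> tuple:
    """Return (schedulable, unmet_gates).

    A gate counts as met only when its condition is reported as "True";
    a condition that no operator has reported yet is treated as unmet.
    """
    unmet = [gate for gate in readiness_gates if status.get(gate) != "True"]
    return (not unmet, unmet)
```

Treating a missing condition as unmet is a deliberate choice in this sketch: a node stays unschedulable until every operator listed in the gates has explicitly reported success.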
Node Fault Self‑Healing
With thousands of nodes, hardware failures and component anomalies are common. A closed‑loop fault‑healing system was built:
Fault detection combines agent reporting (real‑time) and active monitoring (covers agent failures).
All fault events are stored in a central event hub, allowing any interested component to subscribe.
Repair workflows (e.g., hardware maintenance, OS reinstall) are automatically triggered, isolating the faulty node, labeling its pods for migration, and attempting recovery. Nodes that cannot be auto‑repaired are escalated for manual intervention.
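A toy version of this detect, publish, and repair loop is shown below. It is only a sketch: the EventHub and RepairController classes, the event kinds, and the auto_repair callback are invented for illustration, and the real event hub and repair workflows are not described in enough detail in the source to reproduce.

```python
from collections import defaultdict


class EventHub:
    """In-memory stand-in for the central event hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)
        self.events = []  # every fault event is stored

    def subscribe(self, kind, handler):
        self._subscribers[kind].append(handler)

    def publish(self, kind, node):
        self.events.append((kind, node))
        for handler in self._subscribers[kind]:
            handler(node)


class RepairController:
    """Isolates a faulty node, marks its pods for migration, then tries to repair it."""

    def __init__(self, auto_repair, pods_by_node):
        self.auto_repair = auto_repair      # callable(node) -> bool, hypothetical
        self.pods_by_node = pods_by_node    # node name -> list of pod names
        self.isolated = set()
        self.pods_marked_for_migration = set()
        self.needs_manual = []

    def handle(self, node):
        self.isolated.add(node)
        self.pods_marked_for_migration.update(self.pods_by_node.get(node, []))
        if self.auto_repair(node):
            self.isolated.discard(node)     # repaired: node returns to service
        else:
            self.needs_manual.append(node)  # escalate for manual intervention
```

Because the hub only fans events out to subscribers, other components (for example an audit log) can listen to the same fault events without the repair controller knowing about them.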
Risk Mitigation
On top of the terminal‑state operators, the system provides cluster‑level gray‑scale changes and rollback. Before any real change, a risk assessment is performed: health checks of components and business metrics (e.g., pod creation success rate) are collected from the event hub and monitoring system. If anomalies are detected, the change is automatically circuit‑broken. High‑risk operations such as node deletion or OS reinstall are routed through a unified rate‑limiting center that can throttle or abort the change.
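The combination of gray‑scale rollout and automatic circuit breaking can be illustrated with a short Python sketch. The function name, the batch size, and the 0.99 threshold are assumptions chosen for the example; the source only states that metrics such as pod creation success rate are checked and that the change is stopped when anomalies appear.

```python
def gray_rollout(nodes, apply_change, pod_creation_success_rate,
                 batch_size=2, min_success_rate=0.99):
    """Apply a change batch by batch, checking a health metric before each batch.

    Returns (changed_nodes, completed). If the metric drops below the
    threshold, the rollout stops before touching the next batch.
    """
    changed = []
    for start in range(0, len(nodes), batch_size):
        # Risk assessment before any real change in this batch.
        if pod_creation_success_rate() < min_success_rate:
            return changed, False  # circuit broken
        for node in nodes[start:start + batch_size]:
            apply_change(node)
            changed.append(node)
    return changed, True
```

The returned list of changed nodes is what a rollback would operate on in this sketch, since it records exactly which nodes received the change before the circuit opened.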
Q&A
Q1: How can Docker‑based applications be migrated to Kubernetes?
A1: Ant Financial’s migration path uses Kubernetes as a PaaS framework; existing Docker workloads can be gradually shifted, with adapters translating legacy commands into Kubernetes resources.
Q2: Does running on Kubernetes affect performance for big‑data tasks?
A2: Docker adds minimal overhead; large‑scale batch jobs can share idle cluster resources, reducing data‑center costs.
Q3: How can Kubernetes be combined with traditional operations?
A3: Ant built an “Adapter” that converts traditional container commands into Kubernetes resource updates, enabling a unified control plane.
Q4: How do node monitoring and pod migration work?
A4: Hardware, system, and component metrics are collected via agents and exporters; when a node fails, pods are automatically migrated, and long‑lived pods can implement custom operators for graceful migration.
Q5: Will Kubernetes become transparent to developers?
A5: Future plans involve DSL‑based deployment on top of Kubernetes, making the platform the underlying infrastructure.
Q6: What are the advantages of kube‑on‑kube over kube‑to‑kube?
A6: Kube‑on‑kube treats business clusters like regular apps, simplifying management; performance bottlenecks in massive node scaling are addressed by optimizing apiserver list/watch traffic.
Q7: How many business clusters can a meta‑cluster manage?
A7: One meta‑cluster can manage tens of thousands of nodes, theoretically supporting over 3,000 business clusters.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
