How Ant Financial Scales Kubernetes: Inside Their Kube‑on‑Kube Management System

This article explains how Ant Financial designs and operates a large-scale, highly available Kubernetes management platform. Using end-state driven operators, custom resources, self-healing mechanisms, and risk-mitigation strategies, the platform reliably runs tens of thousands of nodes across dozens of business clusters in production.

Alibaba Cloud Native

Background

Kubernetes simplifies containerized application deployment, but operating a production‑grade, highly available cluster at massive scale remains challenging. Ant Financial needed a management system capable of handling tens of thousands of nodes across dozens of business clusters while guaranteeing stability, upgradeability, and self‑recovery.

System Overview (Kube‑on‑Kube)

The solution adopts a Kube‑on‑Kube (KOK) architecture, where a highly available meta‑cluster manages multiple independent business clusters. SigmaBoss serves as the management entry point, providing a UI and a controlled change workflow. The meta‑cluster runs the management operators, while each business cluster runs its own workloads.

Design Pattern – End‑State Driven Control Loop

The system follows an end-state driven design inspired by negative-feedback control loops. Operators continuously observe the current state of a cluster, compare it with the desired target state, and trigger actions to converge the two. Because the comparison runs continuously, the loop also corrects for external disturbances such as hardware or software faults.
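The observe, compare, act cycle can be illustrated with a minimal sketch. This is not Ant Financial's operator code: `NodeState`, `diff`, `apply`, and `reconcile`, along with the component names and version strings, are hypothetical and exist only to show the shape of the loop.

```python
from dataclasses import dataclass, field


@dataclass
class NodeState:
    """Hypothetical snapshot of a node: component name -> installed version."""
    components: dict = field(default_factory=dict)


def diff(current: NodeState, desired: NodeState) -> list:
    """Return the actions needed to move `current` toward `desired`."""
    actions = []
    for name, version in desired.components.items():
        if current.components.get(name) != version:
            actions.append(("install", name, version))
    for name in current.components:
        if name not in desired.components:
            actions.append(("remove", name, None))
    return actions


def apply(current: NodeState, action: tuple) -> None:
    """Apply one action in place (stands in for real node operations)."""
    kind, name, version = action
    if kind == "install":
        current.components[name] = version
    else:
        current.components.pop(name, None)


def reconcile(current: NodeState, desired: NodeState, max_rounds: int = 10) -> int:
    """Observe, compare, act; stop once no difference remains.

    Returns the number of rounds that performed work.
    """
    rounds = 0
    for _ in range(max_rounds):
        actions = diff(current, desired)
        if not actions:
            break
        for action in actions:
            apply(current, action)
        rounds += 1
    return rounds


desired = NodeState({"kubelet": "1.14", "docker": "18.09"})
current = NodeState({"kubelet": "1.12", "legacy-agent": "0.3"})
rounds = reconcile(current, desired)

# A disturbance (a component disappears) is corrected by the next pass.
current.components.pop("docker")
rounds_after_fault = reconcile(current, desired)
```

The same function handles both the initial rollout and the later fault, which is the property the negative-feedback framing is meant to capture.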

Core Architecture

Two custom operators run in the meta‑cluster:

Cluster-Operator – manages the lifecycle of business clusters. Each business cluster is represented by a Cluster custom resource (an instance of the Cluster CRD). The desired versions of the master components are stored in a ClusterPackageVersion custom resource. Updating the ClusterPackageVersion triggers a rollout or rollback of the master components.

Machine-Operator – manages worker nodes. A Machine custom resource describes the desired end-state of a node (installed components, kernel version, configuration, etc.). The component versions and installation parameters are defined in a MachinePackageVersion custom resource. The operator watches Machine resources, resolves the associated package versions, and performs the necessary operations to bring the node to its desired state.
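The relationship between a Machine and its MachinePackageVersion can be sketched as follows. The article does not publish the CRD schemas, so the field names (`package_version_ref`, `packages`, `installed`) and the `plan` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MachinePackageVersion:
    """Hypothetical model: a named set of component versions."""
    name: str
    packages: dict  # component name -> version


@dataclass
class Machine:
    """Hypothetical model: a node that points at one package version."""
    name: str
    package_version_ref: str
    installed: dict = field(default_factory=dict)


def plan(machine: Machine, catalog: dict) -> list:
    """Resolve the referenced package version and list the steps to reach it."""
    target = catalog[machine.package_version_ref].packages
    steps = []
    for component in sorted(target):
        have = machine.installed.get(component)
        if have is None:
            steps.append(f"install {component}={target[component]}")
        elif have != target[component]:
            steps.append(f"change {component} {have}->{target[component]}")
    for component in sorted(set(machine.installed) - set(target)):
        steps.append(f"remove {component}")
    return steps


catalog = {
    "v1": MachinePackageVersion("v1", {"docker": "18.06", "kubelet": "1.12"}),
    "v2": MachinePackageVersion("v2", {"docker": "18.09", "kubelet": "1.14"}),
}
machine = Machine("node-1", "v2", {"docker": "18.06", "legacy-agent": "0.3"})
steps_forward = plan(machine, catalog)

# Pointing the Machine back at v1 yields a rollback plan from the same code.
machine.installed = dict(catalog["v2"].packages)
machine.package_version_ref = "v1"
steps_back = plan(machine, catalog)
```

Keeping versions in a separate object means a rollout or rollback is a change to one reference, and the operator derives the concrete steps from the difference.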

Node End‑State Management

Node lifecycle tasks include:

System configuration and kernel patching.

Installation, upgrade, or removal of Docker, kubelet, and other components.

Readiness gating – a node becomes schedulable only after all required conditions are satisfied.

Fault self‑healing.

Two coordination mechanisms are used:

ReadinessGates – a list of conditions that must be true before the node is marked Ready.

Condition ConfigMap – external operators report sub‑state information to a ConfigMap; the Machine‑Operator aggregates these reports into the node’s status.

Workflow:

External operators detect sub‑states (e.g., DaemonSet completion) and write them to their dedicated Condition ConfigMap.

The Machine‑Operator collects all related ConfigMaps, merges the data, and updates the Machine resource’s status.conditions.

The operator evaluates the full set of ReadinessGates; nodes that have not satisfied every gate remain unschedulable.
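The three workflow steps above can be sketched in a few lines. The ConfigMap layout (condition type as key, `"True"` or `"False"` as value) and the gate names are assumptions; the article does not specify the actual format.

```python
def merge_conditions(configmaps: list) -> dict:
    """Merge per-operator condition reports into one conditions mapping."""
    merged = {}
    for cm in configmaps:
        for condition_type, status in cm["data"].items():
            merged[condition_type] = status
    return merged


def is_schedulable(readiness_gates: list, conditions: dict) -> bool:
    """Every gate must be reported "True"; a missing report counts as not ready."""
    return all(conditions.get(gate) == "True" for gate in readiness_gates)


# Hypothetical gates and reports for one node.
gates = ["KubeletReady", "NetworkReady", "LogAgentReady"]
reports = [
    {"metadata": {"name": "kubelet-conditions"}, "data": {"KubeletReady": "True"}},
    {"metadata": {"name": "network-conditions"}, "data": {"NetworkReady": "True"}},
]
before = is_schedulable(gates, merge_conditions(reports))

# The last external operator reports in, and the node can be opened for scheduling.
reports.append(
    {"metadata": {"name": "logagent-conditions"}, "data": {"LogAgentReady": "True"}}
)
after = is_schedulable(gates, merge_conditions(reports))
```

Treating a missing report as "not ready" is a deliberate choice in this sketch: a node should stay unschedulable if an external operator has not run yet.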

Node Fault Self‑Healing

Faults are detected through a combination of agent reports and active monitoring probes. All events are stored in a central event hub, allowing any interested component to subscribe. When a fault is identified, a repair workflow is launched:

Isolate the faulty node by disabling scheduling.

Tag the node’s Pods for migration; a migration controller moves the Pods to healthy nodes.

Execute the appropriate repair procedure (hardware replacement, OS reinstall, component rollback).

If automatic repair succeeds, the node is re‑enabled for scheduling; otherwise, the incident is escalated for manual intervention.
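The repair workflow reads naturally as a short sequence of state changes. The sketch below is a simplified, in-memory illustration: `Node`, `heal`, and the `repair` callback are hypothetical, and the tag-then-migrate step handled by a separate migration controller in the article is collapsed into a direct move.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a cluster node."""
    name: str
    schedulable: bool = True
    pods: list = field(default_factory=list)
    log: list = field(default_factory=list)


def heal(node: Node, healthy: Node, repair) -> str:
    """Isolate, migrate, repair, then re-enable or escalate."""
    node.schedulable = False          # step 1: stop new Pods from landing here
    node.log.append("isolated")
    healthy.pods.extend(node.pods)    # step 2: move workloads to a healthy node
    node.pods.clear()
    node.log.append("pods migrated")
    if repair(node):                  # step 3: run the repair procedure
        node.schedulable = True       # step 4a: back in service
        node.log.append("repaired")
        return "recovered"
    node.log.append("escalated")      # step 4b: hand over to a human
    return "escalated"


spare = Node("node-b")
faulty = Node("node-a", pods=["web-1", "web-2"])
outcome_ok = heal(faulty, spare, repair=lambda n: True)

broken = Node("node-c", pods=["db-1"])
outcome_failed = heal(broken, spare, repair=lambda n: False)
```

Note that a node whose repair fails stays unschedulable, so an escalated incident cannot receive new workloads while it waits for manual intervention.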

Risk Mitigation and Safe Change

On top of the core operators, the platform implements:

Cluster‑level gray‑scale rollouts and automated rollback based on health checks.

Pre‑change health verification of critical components (e.g., pod creation success rate).

A unified rate‑limiting service that throttles high‑risk operations (node deletion, OS reinstall) and can circuit‑break changes when thresholds are exceeded.

Continuous metric monitoring; abnormal metrics trigger automatic change abort.
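As an illustration of the rate-limiting and circuit-breaking idea, here is a minimal sketch that combines a sliding-window limit with a failure threshold. The class name `ChangeGuard`, its parameters, and the thresholds are assumptions; the article does not describe how the actual service is built.

```python
from collections import deque


class ChangeGuard:
    """Throttle risky operations and stop all changes after repeated failures."""

    def __init__(self, max_ops: int, window_seconds: float, max_failures: int):
        self.max_ops = max_ops
        self.window_seconds = window_seconds
        self.max_failures = max_failures
        self.recent = deque()   # timestamps of admitted operations
        self.failures = 0
        self.open = False       # True once the circuit breaker has tripped

    def allow(self, now: float) -> bool:
        """Admit an operation at time `now` if neither limit is exceeded."""
        if self.open:
            return False
        # Drop admissions that have aged out of the window.
        while self.recent and now - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.max_ops:
            return False
        self.recent.append(now)
        return True

    def record_result(self, ok: bool) -> None:
        """Count failures; trip the breaker when the threshold is reached."""
        if not ok:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open = True


# At most 2 node deletions per 60 seconds; 2 failures abort further changes.
guard = ChangeGuard(max_ops=2, window_seconds=60, max_failures=2)
admitted = [guard.allow(t) for t in (0, 10, 20, 70)]
guard.record_result(False)
guard.record_result(False)
after_trip = guard.allow(200)
```

Timestamps are passed in explicitly so the behavior is deterministic; a real service would read a clock and would also need a policy for resetting the breaker.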

Scalability

A single meta‑cluster can comfortably manage over 3,000 business clusters, each with up to ten thousand nodes. To handle the API‑server load during massive node scaling, the system reduces the number of full list/watch operations by caching package versions and aggregating condition reports.
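The effect of caching package versions can be shown with a small sketch. The cache class, the `fetch` callback standing in for an API-server read, and the node counts are all hypothetical; the article only states that package versions are cached, not how.

```python
class PackageVersionCache:
    """Serve repeated package-version lookups from memory."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable that performs the real read
        self._cache = {}

    def get(self, name: str) -> dict:
        """Return the cached object, reading through on the first request."""
        if name not in self._cache:
            self._cache[name] = self._fetch(name)
        return self._cache[name]

    def invalidate(self, name: str) -> None:
        """Forget one entry, e.g. after the object is known to have changed."""
        self._cache.pop(name, None)


api_calls = []


def fetch_from_api(name: str) -> dict:
    """Stand-in for a read against the API server; records each call."""
    api_calls.append(name)
    return {"name": name, "packages": {"kubelet": "1.14"}}


cache = PackageVersionCache(fetch_from_api)

# 1,000 machines that reference only two distinct package versions.
refs = ["v1" if i % 2 == 0 else "v2" for i in range(1000)]
resolved = [cache.get(ref) for ref in refs]
calls_before_invalidate = len(api_calls)

cache.invalidate("v1")
cache.get("v1")
calls_after_invalidate = len(api_calls)
```

Because many nodes share a few package versions, the number of reads grows with the number of distinct versions instead of the number of nodes.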

Conclusion

The end‑state‑driven Kubernetes management system demonstrated stable performance during the Double 11 shopping festival. Future work will focus on improving overall resource utilization, such as increasing node online rates and reducing idle resources.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
