How Ant Financial Scales Kubernetes: Inside Their Cloud‑Native Cluster Management System
This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—detailing its architecture, core operators, desired‑state controllers, fault‑self‑healing mechanisms, risk mitigation, and practical Q&A for production environments.
System Overview
Kubernetes has lowered the barrier for containerized application deployment, yet operating a production‑grade, highly available cluster remains challenging. Ant Financial shares how it reliably manages massive Kubernetes clusters, focusing on lifecycle management, upgradeability, and fault tolerance.
Core Components
Cluster Desired‑State Keeper
Each business cluster is represented by a Cluster resource, an instance of a custom Cluster CRD stored in a meta‑cluster. The Cluster‑Operator watches these resources and drives the master components of the business cluster toward the desired state defined in ClusterPackageVersion, which enables creation, deletion, upgrade, and rollback.
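The reconcile loop described above can be sketched as follows. This is a minimal illustration, not Ant Financial's implementation: the `Cluster` dataclass, the `PACKAGE_VERSIONS` table, the component names, and the version strings are all hypothetical stand-ins for the real CRDs.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a Cluster resource in the meta-cluster."""
    name: str
    desired_version: str                          # name of a ClusterPackageVersion
    observed: dict = field(default_factory=dict)  # component -> running version
    deleted: bool = False


# Hypothetical ClusterPackageVersion contents (version numbers are made up).
PACKAGE_VERSIONS = {
    "v1": {"apiserver": "1.12", "scheduler": "1.12", "controller-manager": "1.12"},
    "v2": {"apiserver": "1.14", "scheduler": "1.14", "controller-manager": "1.14"},
}


def reconcile(cluster: Cluster) -> list[str]:
    """Return the actions that move the master components to the desired state."""
    if cluster.deleted:
        actions = [f"remove {c}" for c in sorted(cluster.observed)]
        cluster.observed.clear()
        return actions

    desired = PACKAGE_VERSIONS[cluster.desired_version]
    actions = []
    for component, version in sorted(desired.items()):
        current = cluster.observed.get(component)
        if current is None:
            actions.append(f"create {component}@{version}")
        elif current != version:
            # Upgrade and rollback take the same path: only the target differs.
            actions.append(f"update {component} {current}->{version}")
        cluster.observed[component] = version
    return actions
```

Because the function only compares desired and observed state, calling it again after convergence returns no actions, and a rollback is simply a change of `desired_version` back to an earlier package.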
Node Desired‑State Keeper
Node management covers four groups of tasks:
System configuration and kernel patch management
Docker/kubelet installation, upgrade, and removal
Readiness gating (e.g., enabling scheduling only after critical DaemonSets are ready)
Node fault self‑healing
A Machine CRD describes the desired state of each worker node, and a MachinePackageVersion CRD records component versions, configurations, and installation methods. The Machine‑Operator watches Machine resources, applies the specified packages, and continuously ensures that nodes reach and stay in the target state.
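The package handling performed by the Machine‑Operator can be pictured as a diff between what a node has installed and what its MachinePackageVersion asks for. The sketch below assumes both sides are plain name-to-version mappings; the function name `plan_node_changes` and the sample package names and versions are hypothetical.

```python
def plan_node_changes(installed: dict[str, str],
                      desired: dict[str, str]) -> list[tuple[str, str]]:
    """Diff installed packages against the desired package set for one node."""
    plan = []
    for name in sorted(set(installed) | set(desired)):
        have, want = installed.get(name), desired.get(name)
        if have == want:
            continue  # already in the target state
        if have is None:
            plan.append(("install", f"{name}={want}"))
        elif want is None:
            plan.append(("remove", name))
        else:
            # A version change in either direction; a rollback looks the same.
            plan.append(("upgrade", f"{name}:{have}->{want}"))
    return plan


if __name__ == "__main__":
    # Hypothetical node: outdated docker, no kubelet yet, one leftover agent.
    print(plan_node_changes(
        {"docker": "18.09", "legacy-agent": "1.0"},
        {"docker": "19.03", "kubelet": "1.14"},
    ))
```

Running the diff on every reconcile, rather than once at install time, is what lets the operator keep nodes in the target state after drift.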
Node Desired‑State Management
To avoid tight coupling between the Machine‑Operator and other operators, a coordination mechanism was introduced:
A full ReadinessGates list records every condition a node must satisfy before it becomes schedulable, and Condition ConfigMaps store the sub‑state reports from external operators.
External operators report their sub‑state to the corresponding Condition ConfigMap. Machine‑Operator aggregates these ConfigMaps into the node’s status.conditions.
It then checks the full ReadinessGates list; nodes failing any condition remain unschedulable.
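A minimal sketch of this coordination, assuming each Condition ConfigMap carries condition names mapped to the strings "True" or "False" under a `data` key. The gate names and both function names are hypothetical; the source does not list the real conditions.

```python
# Hypothetical gate names; the real list is not given in the source.
READINESS_GATES = ["NetworkReady", "StorageReady", "MonitorAgentReady"]


def aggregate_conditions(condition_configmaps: list[dict]) -> dict[str, bool]:
    """Merge the sub-states reported by external operators into one view."""
    conditions: dict[str, bool] = {}
    for configmap in condition_configmaps:
        for cond_type, status in configmap.get("data", {}).items():
            conditions[cond_type] = (status == "True")
    return conditions


def is_schedulable(conditions: dict[str, bool],
                   gates: list[str] = READINESS_GATES) -> bool:
    """A node is schedulable only if every gate is reported and true."""
    return all(conditions.get(gate, False) for gate in gates)
```

A gate that no operator has reported yet counts as unmet, so a node stays unschedulable until every external operator has written its sub‑state. The Machine‑Operator only needs the gate names, not any knowledge of the operators behind them.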
Node Fault Self‑Healing
Given the high probability of hardware failures in large clusters, a closed‑loop self‑healing system detects, isolates, and repairs faulty nodes. Fault detection combines agent reports with active monitoring, storing events in a central event hub. Different repair workflows (hardware maintenance, OS reinstall, etc.) are triggered based on fault type.
During repair, the faulty node is cordoned, its Pods are labeled for migration, and after successful restoration the node is uncordoned. Nodes that cannot be automatically repaired are escalated for manual intervention.
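The closed loop can be sketched as a small function. The fault types, the mapping to workflows, the `Node` dataclass, and the `run_workflow` callback are hypothetical; the source only states that different workflows are chosen by fault type and that unrepairable nodes are escalated.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical mapping from fault type to repair workflow.
REPAIR_WORKFLOWS = {
    "disk_failure": "hardware_maintenance",
    "kernel_panic": "os_reinstall",
}


@dataclass
class Node:
    name: str
    cordoned: bool = False
    steps: list = field(default_factory=list)  # audit trail of what was done


def heal(node: Node, fault_type: str,
         run_workflow: Callable[[str, Node], bool]) -> str:
    """Isolate a faulty node, attempt repair, and restore or escalate it."""
    node.cordoned = True
    node.steps.append("cordon")
    node.steps.append("label pods for migration")

    workflow = REPAIR_WORKFLOWS.get(fault_type)
    if workflow is not None and run_workflow(workflow, node):
        node.cordoned = False
        node.steps += [f"run {workflow}", "uncordon"]
        return "repaired"

    # Unknown fault type or failed repair: leave the node cordoned.
    node.steps.append("escalate to manual intervention")
    return "escalated"
```

Note that the node is uncordoned only after a workflow reports success; every other path leaves it isolated for a human to inspect.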
Risk Mitigation
On top of the atomic capabilities provided by the Machine‑Operator, the system implements cluster‑wide gray‑scale changes and rollback. Before any real change, operators perform risk assessment; high‑risk actions (e.g., node deletion, OS reinstall) go through a unified rate‑limiting service that can circuit‑break the operation.
Health checks are run before and after changes, and business‑level metrics (e.g., pod creation success rate) are monitored. If anomalies are detected, the change is automatically aborted.
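One way to picture how these safeguards combine is a batched rollout that consults a rate limiter before each high‑risk action and runs health checks around each batch. Everything named here is an assumption for illustration: `ChangeRateLimiter` is a simple budget counter standing in for the unified rate‑limiting service, and `healthy` stands in for both the health checks and the business‑metric monitoring.

```python
from typing import Callable


class ChangeRateLimiter:
    """Hypothetical stand-in for the unified rate-limiting service."""

    def __init__(self, max_high_risk_ops: int):
        self.max_high_risk_ops = max_high_risk_ops
        self.used = 0

    def allow(self) -> bool:
        if self.used >= self.max_high_risk_ops:
            return False  # circuit-break: refuse further high-risk actions
        self.used += 1
        return True


def gray_rollout(nodes: list[str], batch_size: int,
                 apply_change: Callable[[str], None],
                 healthy: Callable[[], bool],
                 limiter: ChangeRateLimiter) -> tuple[list[str], str]:
    """Apply a change batch by batch, aborting on the first bad signal."""
    changed: list[str] = []
    for start in range(0, len(nodes), batch_size):
        if not healthy():                      # check before the batch
            return changed, "aborted: pre-check failed"
        for node in nodes[start:start + batch_size]:
            if not limiter.allow():
                return changed, "aborted: rate limit reached"
            apply_change(node)
            changed.append(node)
        if not healthy():                      # check after the batch
            return changed, "aborted: post-check failed"
    return changed, "completed"
```

The returned list of changed nodes is what a rollback would need to act on. A production limiter would more likely bound operations per time window than use a fixed budget; the counter keeps the sketch short.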
Conclusion
The presented design demonstrates a production‑grade, operator‑centric, desired‑state management platform that has withstood the performance and stability demands of Ant Financial’s Double‑11 traffic peak. By ensuring cluster stability, operational efficiency, and higher resource utilization, the system paves the way for further improvements in node online rates and reduced idle resources.
Q&A
Q1: How can Docker‑only workloads migrate to Kubernetes?
A1: Since Docker containers already meet cloud‑native criteria, migration is relatively smooth. Ant Financial added custom enhancements to Kubernetes to accommodate legacy requirements and ensure seamless transition.
Q2: Does running workloads in Kubernetes affect performance, especially for big‑data tasks?
A2: Docker adds minimal overhead. Ant Financial runs big‑data and AI workloads on Kubernetes, leveraging idle capacity to lower data‑center costs without noticeable performance loss.
Q3: How can traditional ops environments be combined with Kubernetes?
A3: Ant Financial built an “Adapter” that translates legacy container creation commands into Kubernetes resource updates, providing a bridge between the two worlds.
Q4: How is node monitoring handled and are Pods migrated automatically on node failure?
A4: Monitoring spans hardware, system, and component levels. On node anomalies, Pods are migrated automatically. For stateful workloads, custom operators handle the migration; otherwise, Pods may be terminated after a timeout.
Q5: Will Kubernetes become transparent to developers, allowing code‑first cluster programming?
A5: While direct code‑first deployment is still complex, a DSL‑based approach on top of Kubernetes is anticipated as the future trend.
Q6: What are the advantages of kube‑on‑kube versus kube‑to‑kube, and how are API‑server performance bottlenecks addressed?
A6: Kube‑on‑kube lets business clusters be managed like regular apps. Optimizations focus on reducing massive list/watch traffic from new nodes and improving API‑server scalability.
Q7: What benefits does Kubernetes bring to organizations that have not yet adopted it?
A7: Its desired‑state model simplifies complex operations, enabling smoother upgrades and more reliable management.
Q8: Are cluster operators run as Pods while machine operators run on physical machines?
A8: All operators run inside Pods; the cluster operator launches the machine operator Pods as needed.
Q9: How many business clusters can a meta‑cluster manage, and what optimizations support large‑scale watch traffic?
A9: One meta‑cluster can manage tens of thousands of nodes, roughly 3K+ business clusters, with API‑server performance tuning and reduced list/watch overhead.
Q10: How does the system maximize node reliability when encountering kernel, Docker, or K8s failures?
A10: Nodes perform health checks and self‑evict; Kubernetes detects the failure and reschedules workloads on healthy nodes.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
