
How Ant Financial Scales Kubernetes: Inside Their Cloud‑Native Cluster Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform. It covers the architecture, core operators, desired‑state controllers, fault self‑healing mechanisms, risk mitigation, and a practical Q&A for production environments.

dbaplus Community

System Overview

Kubernetes has lowered the barrier for containerized application deployment, yet operating a production‑grade, highly available cluster remains challenging. Ant Financial shares how it reliably manages massive Kubernetes clusters, focusing on lifecycle management, upgradeability, and fault tolerance.

Core Components

Cluster Desired‑State Keeper

A custom Cluster CRD is defined in a meta‑cluster, and each business cluster is represented there by a Cluster resource. The Cluster‑Operator watches these resources and drives the master components of the business cluster toward the desired state defined in ClusterPackageVersion. This single mechanism covers cluster creation, deletion, upgrade, and rollback.
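
As a rough illustration of the desired‑state idea, the sketch below computes the steps needed to move a business cluster's master components to the versions named by a ClusterPackageVersion. It is not Ant Financial's implementation: the `ClusterSpec` fields, the `plan_cluster_actions` function, the action strings, and the version numbers are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ClusterSpec:
    """Hypothetical, simplified view of a Cluster resource's desired state."""
    name: str
    package_version: str   # name of the desired ClusterPackageVersion
    deleting: bool = False  # True once deletion has been requested


def plan_cluster_actions(
    spec: ClusterSpec,
    packages: Dict[str, Dict[str, str]],
    observed: Optional[Dict[str, str]],
) -> List[str]:
    """Return the steps that move the master components to the desired state.

    packages maps a ClusterPackageVersion name to {component: version}.
    observed is {component: running version}, or None if the business
    cluster does not exist yet.
    """
    if spec.deleting:
        return [] if observed is None else [f"delete cluster {spec.name}"]

    desired = packages[spec.package_version]
    actions: List[str] = []
    for component, version in sorted(desired.items()):
        running = None if observed is None else observed.get(component)
        if running is None:
            actions.append(f"create {component}@{version}")
        elif running != version:
            # The same step serves upgrade and rollback; only the target differs.
            actions.append(f"update {component} {running} -> {version}")
    return actions
```

Under these assumptions, an upgrade and a rollback are the same operation: the Cluster resource is pointed at a different ClusterPackageVersion, and the planner emits whatever updates close the gap. When observed and desired state already match, the plan is empty.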

Node Desired‑State Keeper

Node management tasks include system configuration, kernel patching, component installation (docker/kubelet), readiness gating, and fault self‑healing. A Machine CRD describes the desired state of each worker node; a MachinePackageVersion CRD records component versions, configurations, and install methods. The Machine‑Operator watches Machine resources, applies the specified packages, and continuously ensures nodes reach and stay in the target state.

System configuration and kernel patch management

Docker/kubelet installation, upgrade, removal

Readiness gating (e.g., enable scheduling only after critical DaemonSets are ready)

Node fault self‑healing
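
The node‑level loop can be pictured in a similar way. The following minimal sketch assumes a MachinePackageVersion can be reduced to a map of component name to version and install method, and compares it with what is installed on a node. The `Package` class, the `reconcile_node` function, the install method values, and the component versions are hypothetical and only illustrate the install, upgrade, and removal cases listed above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Package:
    """Hypothetical entry of a MachinePackageVersion."""
    version: str
    install_method: str  # illustrative values such as "rpm" or "binary"


def reconcile_node(desired: Dict[str, Package], installed: Dict[str, str]) -> List[str]:
    """Return the actions that bring a node's components to the desired state."""
    actions: List[str] = []
    # Components that are installed but no longer desired are removed first.
    for name in sorted(set(installed) - set(desired)):
        actions.append(f"remove {name}")
    for name, pkg in sorted(desired.items()):
        current = installed.get(name)
        if current is None:
            actions.append(f"install {name}@{pkg.version} via {pkg.install_method}")
        elif current != pkg.version:
            actions.append(
                f"upgrade {name} {current} -> {pkg.version} via {pkg.install_method}"
            )
    return actions
```

Running this function repeatedly is what "continuously ensures" means in practice: once the node matches the desired packages, the function returns an empty list, and any later drift produces new actions.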

Node Desired‑State Management

To avoid tight coupling between the Machine‑Operator and other operators, a coordination mechanism was introduced:

Two pieces of state support this. A full ReadinessGates list records every condition a node must satisfy before it becomes schedulable, and Condition ConfigMaps store the sub‑state reported by external operators.

External operators report their sub‑state to the corresponding Condition ConfigMap. Machine‑Operator aggregates these ConfigMaps into the node’s status.conditions.

It then checks the full ReadinessGates list; nodes failing any condition remain unschedulable.
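
The aggregation and gating steps above can be sketched as a single function. This is an assumption‑laden simplification: the Condition ConfigMaps are modeled as one dictionary of condition type to status, and the `evaluate_readiness` function and the condition names used with it are invented for illustration.

```python
from typing import Dict, List, Tuple


def evaluate_readiness(
    readiness_gates: List[str],
    reported: Dict[str, str],
) -> Tuple[Dict[str, str], bool]:
    """Aggregate sub-state reports and decide whether the node may be scheduled.

    readiness_gates lists every condition type the node must satisfy.
    reported maps condition type to "True" or "False", as collected from
    the Condition ConfigMaps written by external operators.
    """
    # Start from everything that was reported, then mark missing gates as Unknown.
    conditions = dict(reported)
    for gate in readiness_gates:
        conditions.setdefault(gate, "Unknown")
    # The node is schedulable only if every gate is explicitly "True".
    schedulable = all(conditions[gate] == "True" for gate in readiness_gates)
    return conditions, schedulable
```

A gate that no operator has reported yet is treated as "Unknown" and therefore blocks scheduling, which matches the rule that a node failing any condition stays unschedulable. Because external operators only write their own report, the Machine‑Operator needs no knowledge of how each condition is produced.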

Node Fault Self‑Healing

Given the high probability of hardware failures in large clusters, a closed‑loop self‑healing system detects, isolates, and repairs faulty nodes. Fault detection combines agent reports with active monitoring, storing events in a central event hub. Different repair workflows (hardware maintenance, OS reinstall, etc.) are triggered based on fault type.

During repair, the faulty node is cordoned, its Pods are labeled for migration, and after successful restoration the node is uncordoned. Nodes that cannot be automatically repaired are escalated for manual intervention.

Node fault self‑healing architecture
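
The repair loop described above can be sketched as a small workflow function. The fault types, workflow names, and the `heal_node` function are hypothetical. The real system selects workflows from events in its central event hub, which this sketch replaces with a plain lookup table and a caller‑supplied repair function.

```python
from typing import Callable, Dict, List

# Hypothetical mapping from fault type to repair workflow; names are illustrative.
REPAIR_WORKFLOWS: Dict[str, str] = {
    "disk_failure": "hardware_maintenance",
    "kernel_panic": "os_reinstall",
}


def heal_node(
    node: str,
    fault_type: str,
    run_workflow: Callable[[str, str], bool],
) -> List[str]:
    """Return the ordered steps taken for one faulty node.

    run_workflow(node, workflow) is supplied by the caller and returns
    True when the repair succeeded.
    """
    # Isolate the node before any repair is attempted.
    steps = [f"cordon {node}", f"label pods on {node} for migration"]
    workflow = REPAIR_WORKFLOWS.get(fault_type)
    if workflow is None:
        # No automated repair is known for this fault type.
        steps.append(f"escalate {node} for manual intervention")
        return steps
    steps.append(f"run {workflow} on {node}")
    if run_workflow(node, workflow):
        steps.append(f"uncordon {node}")
    else:
        steps.append(f"escalate {node} for manual intervention")
    return steps
```

The node is uncordoned only after a repair reports success, so a failed or unknown repair leaves it isolated until someone intervenes.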

Risk Mitigation

On top of the atomic capabilities provided by the Machine‑Operator, the system implements cluster‑wide gray‑scale changes and rollback. Before any real change, operators perform risk assessment; high‑risk actions (e.g., node deletion, OS reinstall) go through a unified rate‑limiting service that can circuit‑break the operation.

Health checks are run before and after changes, and business‑level metrics (e.g., pod creation success rate) are monitored. If anomalies are detected, the change is automatically aborted.

Risk assessment and rate limiting
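
To make the combination of rate limiting, health checks, and gray‑scale rollout concrete, here is a minimal sketch. It assumes a sliding‑window limiter and a batch‑by‑batch rollout, and it routes every change through the limiter, whereas the article only says high‑risk actions are limited. The `HighRiskLimiter` class, the `gray_rollout` function, and the status strings are invented for this example and do not describe the actual unified rate‑limiting service.

```python
import time
from typing import Callable, List, Sequence, Tuple


class HighRiskLimiter:
    """Allow at most `limit` operations per `window` seconds (sliding window)."""

    def __init__(self, limit: int, window: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.stamps: List[float] = []

    def allow(self) -> bool:
        now = self.clock()
        # Keep only the operations that are still inside the window.
        self.stamps = [t for t in self.stamps if now - t < self.window]
        if len(self.stamps) >= self.limit:
            return False  # circuit-break: refuse the operation
        self.stamps.append(now)
        return True


def gray_rollout(
    batches: Sequence[Sequence[str]],
    apply_change: Callable[[str], None],
    healthy: Callable[[], bool],
    limiter: HighRiskLimiter,
) -> Tuple[List[str], str]:
    """Apply a change batch by batch, aborting on a limiter refusal or a failed check."""
    done: List[str] = []
    if not healthy():  # health check before the change
        return done, "aborted: unhealthy before change"
    for batch in batches:
        for node in batch:
            if not limiter.allow():
                return done, "aborted: rate limit hit"
            apply_change(node)
            done.append(node)
        if not healthy():  # health check after each batch
            return done, "aborted: unhealthy after change"
    return done, "completed"
```

The returned list of changed nodes is what a rollback would need to act on. In a real deployment the `healthy` callback would consult business‑level metrics such as the pod creation success rate.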

Conclusion

The presented design demonstrates a production‑grade, operator‑centric, desired‑state management platform that has withstood the performance and stability demands of Ant Financial’s Double‑11 traffic peak. By ensuring cluster stability, operational efficiency, and higher resource utilization, the system paves the way for further improvements in node online rates and reduced idle resources.

Q&A

Q1: How can Docker‑only workloads migrate to Kubernetes?

A1: Since Docker containers already meet cloud‑native criteria, migration is relatively smooth. Ant Financial added custom enhancements to Kubernetes to accommodate legacy requirements and ensure seamless transition.

Q2: Does running workloads in Kubernetes affect performance, especially for big‑data tasks?

A2: Docker adds minimal overhead. Ant Financial runs big‑data and AI workloads on Kubernetes, leveraging idle capacity to lower data‑center costs without noticeable performance loss.

Q3: How can traditional ops environments be combined with Kubernetes?

A3: Ant Financial built an “Adapter” that translates legacy container creation commands into Kubernetes resource updates, providing a bridge between the two worlds.
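
The article gives no detail on the Adapter's interface, so the following is only a guess at the general shape of such a translation. It assumes a legacy "create container" request arrives as a dictionary and maps it onto a standard Kubernetes Pod manifest. The request fields (`name`, `app`, `image`, `cpu`, `memory_mb`) and the `legacy_create_to_pod` function are hypothetical.

```python
from typing import Any, Dict


def legacy_create_to_pod(cmd: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a hypothetical legacy 'create container' request into a Pod manifest."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": cmd["name"],
            "labels": {"app": cmd["app"]},
        },
        "spec": {
            "containers": [
                {
                    "name": cmd["name"],
                    "image": cmd["image"],
                    "resources": {
                        # Kubernetes quantities are strings, e.g. "2" CPUs and "4096Mi".
                        "limits": {
                            "cpu": str(cmd["cpu"]),
                            "memory": f'{cmd["memory_mb"]}Mi',
                        }
                    },
                }
            ]
        },
    }
```

The resulting dictionary could then be submitted to the API server by whatever client the legacy tooling already uses, so callers keep issuing familiar commands while Kubernetes holds the desired state.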

Q4: How is node monitoring handled and are Pods migrated automatically on node failure?

A4: Monitoring spans the hardware, system, and component levels. When a node becomes abnormal, its Pods are migrated automatically. For stateful workloads, custom operators handle the migration; otherwise, Pods may be terminated after a timeout.

Q5: Will Kubernetes become transparent to developers, allowing code‑first cluster programming?

A5: While direct code‑first deployment is still complex, a DSL‑based approach on top of Kubernetes is anticipated as the future trend.

Q6: What are the advantages of kube‑on‑kube versus kube‑to‑kube, and how are API‑server performance bottlenecks addressed?

A6: Kube‑on‑kube lets business clusters be managed like regular apps. Optimizations focus on reducing massive list/watch traffic from new nodes and improving API‑server scalability.

Q7: What benefits does Kubernetes bring to organizations that have not yet adopted it?

A7: Its desired‑state model simplifies complex operations, enabling smoother upgrades and more reliable management.

Q8: Are cluster operators run as Pods while machine operators run on physical machines?

A8: All operators run inside Pods; the cluster operator launches the machine operator Pods as needed.

Q9: How many business clusters can a meta‑cluster manage, and what optimizations support large‑scale watch traffic?

A9: One meta‑cluster can manage tens of thousands of nodes, roughly 3K+ business clusters, with API‑server performance tuning and reduced list/watch overhead.

Q10: How does the system maximize node reliability when encountering kernel, Docker, or K8s failures?

A10: Nodes perform health checks and self‑evict; Kubernetes detects the failure and reschedules workloads on healthy nodes.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
