
Large‑Scale Kubernetes Deployment and Cloud‑Native Practices at Ant Financial

Ant Financial’s Kubernetes team built one of the world’s largest Kubernetes deployments, spanning hundreds of thousands of nodes. By applying cloud‑native operators, GitOps, and extensive performance optimizations, the team ran large‑scale workloads quickly, automatically, and reliably during the 2019 Tmall 618 promotion.

AntTech

In June 2019, Ant Financial ran its scheduling system on Kubernetes for the Tmall 618 promotion. The deployment scaled to hundreds of thousands of nodes across dozens of data centers, which makes it one of the world’s largest Kubernetes deployments.

The Kubernetes team was small (about a dozen engineers) and built the platform from scratch within a year, extending Kubernetes to match the functions of the legacy scheduling system and to introduce new cloud‑native capabilities.

They created the Kube‑on‑Kube Operator, which runs the master components of business clusters (“service clusters”) as workloads inside a meta‑cluster. New clusters can be created within minutes, and failed master components are recovered automatically.
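The reconcile idea behind such an operator can be sketched as follows. This is a minimal illustration, not Ant Financial's implementation: the component list, the `ServiceCluster` type, and the `reconcile` and `apply` helpers are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field

# Assumption: the usual Kubernetes control-plane components. The article
# does not list which master components the operator actually manages.
MASTER_COMPONENTS = (
    "etcd",
    "kube-apiserver",
    "kube-controller-manager",
    "kube-scheduler",
)


@dataclass
class ServiceCluster:
    """Observed state of one service cluster, as seen from the meta-cluster."""
    name: str
    # Maps component name -> True if healthy, False if failed.
    components: dict = field(default_factory=dict)


def reconcile(cluster):
    """Return the actions needed to reach the desired master state."""
    actions = []
    for component in MASTER_COMPONENTS:
        if component not in cluster.components:
            actions.append(("create", component))
        elif not cluster.components[component]:
            actions.append(("restart", component))
    return actions


def apply(cluster, actions):
    """Pretend every action succeeds; a real operator would manage workloads."""
    for _, component in actions:
        cluster.components[component] = True


# Creating a cluster and recovering a failed master use the same loop.
new_cluster = ServiceCluster(name="service-cluster-a")
bootstrap_actions = reconcile(new_cluster)
apply(new_cluster, bootstrap_actions)

new_cluster.components["kube-scheduler"] = False  # simulate a crash
recovery_actions = reconcile(new_cluster)
```

Because creation and recovery are both expressed as "compare desired state with observed state", the same loop covers bootstrapping a new service cluster and healing an existing one.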

The Node‑Operator was developed to manage the full lifecycle of worker nodes—certificate generation, component installation, upgrades, and fault remediation—integrating with Node Problem Detector for automated fixes.
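A node lifecycle of this kind is often modeled as a small state machine. The sketch below is an assumption-laden illustration: the phase names, the `step` function, and the remediation actions are hypothetical, and the two condition names are used only as examples of what a problem detector might report.

```python
from enum import Enum


class Phase(Enum):
    PENDING = "pending"
    CERTS_ISSUED = "certs_issued"
    COMPONENTS_INSTALLED = "components_installed"
    READY = "ready"


# Provisioning order follows the steps named in the text above.
NEXT_PHASE = {
    Phase.PENDING: Phase.CERTS_ISSUED,
    Phase.CERTS_ISSUED: Phase.COMPONENTS_INSTALLED,
    Phase.COMPONENTS_INSTALLED: Phase.READY,
}

# Hypothetical mapping from a reported node condition to an automated fix.
REMEDIATIONS = {
    "KernelDeadlock": "reboot_node",
    "ReadonlyFilesystem": "remount_or_replace_disk",
}


def step(phase, conditions):
    """Return (next_phase, action) for one reconcile pass over a node."""
    if phase is not Phase.READY:
        next_phase = NEXT_PHASE[phase]
        return next_phase, f"advance_to_{next_phase.value}"
    for condition in conditions:
        if condition in REMEDIATIONS:
            return Phase.READY, REMEDIATIONS[condition]
    return Phase.READY, "none"


# Drive a fresh node to READY, then react to a reported fault.
phase = Phase.PENDING
history = []
while phase is not Phase.READY:
    phase, action = step(phase, [])
    history.append(action)

_, fix = step(Phase.READY, ["KernelDeadlock"])
```

Keeping each pass to a single step means an interrupted operator can resume from whatever phase the node last reached.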

An automated CI/CD pipeline uses these operators to spin up sandbox clusters, run end‑to‑end tests, and perform unattended releases. The team also adopted GitOps: resource manifests live in Git repositories, changes go through PR review, and kustomize builds the final manifests, which keeps them versioned, transparent, and deployable.
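The base-plus-overlay model that kustomize encourages can be illustrated with a simplified merge. This is a stand-in, not kustomize itself: real kustomize patches have richer semantics (for example when merging lists), and the `merge_patch` helper, the `demo-app` manifest, and both overlays are hypothetical.

```python
import copy


def merge_patch(base, patch):
    """Recursively overlay `patch` onto a copy of `base`.

    Simplification: nested dicts are merged key by key; any other value,
    including lists, is replaced wholesale.
    """
    result = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge_patch(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


# One reviewed base manifest, kept in Git.
base_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo-app", "labels": {"app": "demo-app"}},
    "spec": {"replicas": 1},
}

# Small per-environment overlays, also kept in Git and changed via PRs.
sandbox_overlay = {"metadata": {"labels": {"env": "sandbox"}}}
production_overlay = {
    "metadata": {"labels": {"env": "production"}},
    "spec": {"replicas": 20},
}

sandbox = merge_patch(base_deployment, sandbox_overlay)
production = merge_patch(base_deployment, production_overlay)
```

The point of the pattern is that each environment differs from the base only by a small, reviewable patch, so a PR diff shows exactly what will change in a cluster.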

To sustain performance at this scale, the team identified bottlenecks in the API Server, DaemonSets, and webhooks. Their measures included prioritizing API Server resources, load‑balancing upgrades, enabling the NodeLease feature, fixing missing Context handling, requiring clients to read through informers, and contributing Priority‑and‑Fairness controls.
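Of these measures, informer-based access is the easiest to show in miniature. The sketch below is a toy model under stated assumptions: `FakeAPIServer`, `DirectClient`, and `InformerCache` are hypothetical classes, and a real informer also handles resource versions, resyncs, and reconnects, which are omitted here.

```python
class FakeAPIServer:
    """Stand-in for the API Server that counts the requests it serves."""

    def __init__(self, pods):
        self._pods = dict(pods)
        self.requests = 0

    def list_pods(self):
        self.requests += 1
        return dict(self._pods)


class DirectClient:
    """Anti-pattern: every read is a fresh LIST against the API Server."""

    def __init__(self, server):
        self._server = server

    def get(self, name):
        return self._server.list_pods().get(name)


class InformerCache:
    """One initial LIST, then a local store kept current by watch events."""

    def __init__(self, server):
        self._store = server.list_pods()

    def on_event(self, kind, name, obj=None):
        if kind in ("ADDED", "MODIFIED"):
            self._store[name] = obj
        elif kind == "DELETED":
            self._store.pop(name, None)

    def get(self, name):
        # Served from memory; the API Server is not contacted.
        return self._store.get(name)


server = FakeAPIServer({"pod-a": "Running"})
direct = DirectClient(server)
for _ in range(100):
    direct.get("pod-a")
direct_requests = server.requests

server.requests = 0
informer = InformerCache(server)
informer.on_event("MODIFIED", "pod-a", "Succeeded")
informer.on_event("ADDED", "pod-b", "Pending")
for _ in range(100):
    informer.get("pod-a")
informer_requests = server.requests
```

In this toy run the direct client issues one request per read, while the informer issues a single initial LIST and serves every later read locally. That difference is why a rule like this matters once many components read the same objects.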

These practices standardized delivery, boosted developer productivity, and improved resource utilization, and many of the optimizations have been contributed back to the open‑source Kubernetes community.

Tags: cloud-native, Operator, GitOps, performance-optimization, large-scale