
Large‑Scale etcd Cluster Performance Optimization and Pod Data Splitting in Ant Group’s Sigma

This article describes how Ant Group tackled the performance ceiling of its massive Sigma Kubernetes clusters by horizontally splitting etcd storage for Pods, Leases and Events, redesigning watch handling to avoid component restarts, and using snapshot‑based migration to preserve data integrity while reducing latency.


To support rapid business iteration, Ant Group launched the Gzone cloud‑migration project. The new Gzone cluster had to be co‑located with the already‑cloudified Rzone, leaving a single Sigma cluster managing more than ten thousand nodes under a dramatically increased workload.

The scale introduced three major challenges: (1) a massive number of short‑lived Pods created every minute, generating tens of thousands of etcd requests per day; (2) a mix of List, Watch, Create, Update, and Delete operations whose latency grows sharply with etcd data size, frequently causing OOMs or timeouts; (3) a surge in request volume that amplifies the impact of etcd compaction and defragmentation, occasionally taking components down and rendering the cluster unavailable.

Based on prior experience, the team identified horizontal splitting of etcd data as an effective optimization. Separating Pods, Leases, and Events into four independent etcd clusters reduces per‑cluster data volume and request QPS. However, the traditional make‑mirror migration rewrites the revision field, which invalidates resourceVersion and risks data loss for Pods, a resource with strict consistency requirements.
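To see why replaying keys breaks resourceVersion, consider a toy model (not etcd itself) of etcd's global revision counter. A make‑mirror‑style copy replays each key as a fresh Put on the target cluster, so the target assigns its own revisions and the original ModRevision is lost:

```python
class ToyEtcd:
    """Minimal stand-in for etcd's global revision counter."""
    def __init__(self):
        self.rev = 0
        self.store = {}  # key -> (value, mod_revision)

    def put(self, key, value):
        self.rev += 1
        self.store[key] = (value, self.rev)

source = ToyEtcd()
for _ in range(500):                          # e.g. lease renewals inflate the revision
    source.put("/registry/leases/node-1", "renew")
source.put("/registry/pods/default/web", "pod-spec")

target = ToyEtcd()
for key, (value, _) in source.store.items():  # make-mirror style: replay values as fresh Puts
    target.put(key, value)

_, src_rev = source.store["/registry/pods/default/web"]
_, dst_rev = target.store["/registry/pods/default/web"]
print(src_rev, dst_rev)  # 501 2 -- the revision (and hence resourceVersion) is rewritten
```

Since Kubernetes derives a Pod's resourceVersion from its ModRevision, every client that cached a resourceVersion against the old cluster is now inconsistent with the new one.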

Further analysis revealed why component restarts are normally required after a split: the client‑go ListAndWatch flow caches a watchRV (resourceVersion) that can be larger than the new etcd cluster's current revision, so the watch never receives events at those revisions. The kube‑apiserver's watches rely on the same version, and the mismatch silently loses Pod updates.
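A toy simulation (not client‑go) of the stale‑watchRV problem: a watch server can only stream events between the requested resourceVersion and its current revision, so a cached watchRV from the old, larger cluster points into the new cluster's future and nothing ever arrives:

```python
def serve_watch(server_rev, requested_rv):
    """Toy watch server: streams events at revisions in
    (requested_rv, server_rev]; a future requested_rv yields nothing."""
    if requested_rv > server_rev:
        return None   # the watch hangs; no event will ever arrive
    return list(range(requested_rv + 1, server_rev + 1))

cached_watch_rv = 5000   # resourceVersion cached from the old, larger cluster
new_cluster_rev = 1200   # the freshly split Pod etcd sits at a lower revision

print(serve_watch(new_cluster_rev, cached_watch_rv))  # None: Pod updates are silently missed
print(serve_watch(new_cluster_rev, 1195))             # a sane RV streams events normally
```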

The breakthrough came from two angles: (1) on the server side, deliberately returning a TooLargeResourceVersionError forces client‑go to perform a fresh List and obtain a valid resourceVersion, eliminating the need for a full component restart; (2) on the data‑migration side, etcd's snapshot tool preserves the original KeyValue structure (including CreateRevision and ModRevision), and a custom pruning step extracts only the required Pod keys (those prefixed with /registry/pods/), dramatically reducing snapshot size.
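The pruning step can be sketched as a prefix filter over the snapshot's key‑value entries; Kubernetes stores Pods under lowercase /registry/pods/ keys. This is an illustrative model only: a real snapshot is a bbolt file processed with etcd's snapshot tooling, and the key names here are hypothetical. The important property is that CreateRevision and ModRevision pass through untouched, so resourceVersion survives:

```python
POD_PREFIX = "/registry/pods/"

def prune(entries, prefix=POD_PREFIX):
    """Keep only Pod keys; each entry is (key, value, create_rev, mod_rev).
    Revisions are carried through untouched, so resourceVersion survives."""
    return [e for e in entries if e[0].startswith(prefix)]

snapshot = [
    ("/registry/pods/default/web-0",   "pod-spec",   10, 42),
    ("/registry/leases/node-1",        "lease-data",  3, 97),
    ("/registry/events/default/e1",    "event-data",  5, 12),
    ("/registry/pods/kube-system/dns", "pod-spec",    7, 88),
]

pruned = prune(snapshot)
print([k for k, *_ in pruned])
# ['/registry/pods/default/web-0', '/registry/pods/kube-system/dns']
```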

Key configuration snippets include the per‑resource etcd overrides and a MutatingWebhook that denies all Pod write operations during migration:

--etcd-servers-overrides=/events#https://etcd1.events.xxx:2xxx;https://etcd2.events.xxx:2xxx;https://etcd3.events.xxx:2xxx
--etcd-servers-overrides=coordination.k8s.io/leases#https://etcd1.leases.xxx:2xxx;https://etcd2.leases.xxx:2xxx;https://etcd3.leases.xxx:2xxx
--etcd-servers-overrides=/pods#https://etcd1.pods.xxx:2xxx;https://etcd2.pods.xxx:2xxx;https://etcd3.pods.xxx:2xxx
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: deny-pods-write
webhooks:
- admissionReviewVersions: ["v1beta1"]
  clientConfig:
    url: https://extensions.xxx/always-deny
  failurePolicy: Fail
  name: always-deny.extensions.k8s
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["*"]
    resources: ["pods", "pods/status", "pods/binding"]
    scope: "*"
  sideEffects: NoneOnDryRun
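The --etcd-servers-overrides values above follow kube-apiserver's group/resource#servers format: the part before # is the API group and resource (the core group is the empty string, hence /pods and /events), and the servers after it are semicolon‑separated URLs. A minimal parser sketch, using hypothetical endpoints on etcd's default client port:

```python
def parse_override(value):
    """Split one --etcd-servers-overrides value into its parts:
    "<group>/<resource>#<server1>;<server2>;..." (core group is "")."""
    group_resource, servers = value.split("#", 1)
    group, resource = group_resource.rsplit("/", 1)
    return {"group": group, "resource": resource,
            "servers": servers.split(";")}

override = parse_override(
    "coordination.k8s.io/leases#https://etcd1.leases.xxx:2379;https://etcd2.leases.xxx:2379"
)
print(override["group"], override["resource"], len(override["servers"]))
# coordination.k8s.io leases 2
```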

After applying these changes, only writes to Pod resources are blocked; reads and all other resources continue to work normally. The migration completes in a few minutes, with no component restarts required beyond reloading the apiserver configuration.

In summary, by combining watch‑error‑driven relisting, snapshot‑based selective data extraction, and precise webhook denial, the team achieved a safe, low‑downtime, and performance‑effective data split for a >10k‑node Kubernetes fleet, demonstrating a practical cloud‑native optimization pattern.

Tags: data migration, cloud-native, kubernetes, etcd, operators, cluster performance
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
