Why Kubernetes LIST Requests Can Cripple Your Cluster and How to Fix Them
This article examines how heavy LIST operations in unstructured storage systems like Ceph and etcd consume massive I/O, network and CPU, threaten cluster stability, and offers detailed code analysis, performance testing, and practical tuning recommendations to keep large‑scale Kubernetes clusters reliable.
Introduction
For unstructured data stores, LIST operations are heavyweight: they consume substantial disk I/O, network bandwidth, and CPU, and they can degrade latency‑sensitive requests, which makes them a major stability risk for clusters.
In Ceph object storage, each LIST bucket request scans many disks, slowing down other reads/writes. In etcd, even with modest data sizes, high concurrency (e.g., a 4000‑node Kubernetes cluster) can overwhelm the store unless a caching layer such as the apiserver is used.
apiserver/etcd LIST processing
The apiserver acts as a proxy in front of etcd. It first tries to serve LIST requests from its in‑memory cache; if the cache is not ready or the request forces a direct read (e.g., it does not set resourceVersion=0), it falls back to etcd.
+--------+      +---------------+          +------------+
| Client | ---> | Proxy (cache) | -------> | Data store |
+--------+      +---------------+          +------------+
Key functions include List(), ListPredicate(), and the internal shouldDelegateList() check, which decides whether to read directly from etcd based on resourceVersion, pagination tokens, and limits.
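The delegation decision can be pictured with a small model. This is a minimal sketch, not the apiserver's code: the ListOptions class and its field names are hypothetical, and the rule that a limit is ignored when resourceVersion is "0" is an assumption about typical behavior that may differ across Kubernetes versions and feature gates.

```python
from dataclasses import dataclass


@dataclass
class ListOptions:
    """Hypothetical container for the LIST parameters discussed above."""
    resource_version: str = ""  # "" means the client did not set it
    continue_token: str = ""    # pagination token from a previous page
    limit: int = 0              # 0 means no page-size limit


def should_delegate_list(opts: ListOptions) -> bool:
    """Return True if this simplified model would send the LIST to etcd."""
    unset_rv = opts.resource_version == ""
    has_continuation = opts.continue_token != ""
    # Assumption: a limit only forces a direct read when the client
    # did not explicitly ask for resourceVersion "0".
    has_limit = opts.limit > 0 and opts.resource_version != "0"
    return unset_rv or has_continuation or has_limit


if __name__ == "__main__":
    for opts in (ListOptions(), ListOptions(resource_version="0")):
        target = "etcd" if should_delegate_list(opts) else "cache"
        print(opts, "->", target)
```

In this model, a request with no resourceVersion goes to etcd, while the same request with resourceVersion "0" stays in the cache.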
Request handling paths
If a metadata.name is provided, the apiserver fetches a single object.
Otherwise it retrieves the full dataset and applies label/field selectors in memory.
When the cache is ready, the apiserver filters objects locally, dramatically reducing latency. When the cache is missing or the request forces a direct read, the full dataset (often gigabytes) is transferred from etcd.
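The cache path amounts to filtering objects that are already in memory. The sketch below illustrates that idea with plain dictionaries; list_from_cache and matches_labels are hypothetical names, and only equality-based label matching is modeled, which is a simplification of real selector semantics.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a cached API object (only metadata is modeled).
Obj = Dict[str, Dict[str, object]]


def matches_labels(obj: Obj, selector: Dict[str, str]) -> bool:
    """True if every key=value pair in the selector is present on the object."""
    labels = obj.get("metadata", {}).get("labels", {}) or {}
    return all(labels.get(key) == value for key, value in selector.items())


def list_from_cache(cache: List[Obj], namespace: str = "",
                    selector: Optional[Dict[str, str]] = None) -> List[Obj]:
    """Filter cached objects by namespace and labels without touching the store."""
    selector = selector or {}
    result = []
    for obj in cache:
        meta = obj.get("metadata", {})
        if namespace and meta.get("namespace") != namespace:
            continue
        if matches_labels(obj, selector):
            result.append(obj)
    return result


if __name__ == "__main__":
    cache = [
        {"metadata": {"name": "a", "namespace": "default", "labels": {"app": "web"}}},
        {"metadata": {"name": "b", "namespace": "default", "labels": {"app": "db"}}},
        {"metadata": {"name": "c", "namespace": "kube-system", "labels": {"app": "web"}}},
    ]
    names = [o["metadata"]["name"] for o in list_from_cache(cache, "default", {"app": "web"})]
    print(names)
```

The point of the sketch is that selectors narrow what is returned to the client; they do not by themselves reduce what has to be read when the request bypasses the cache.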
Performance testing
Using curl scripts, the article benchmarks various LIST scenarios. For a 4000‑node, 100K‑pod cluster:
LIST without resourceVersion=0 (etcd read) takes ~10 s and processes ~2 GB of pod data.
LIST with resourceVersion=0 (cache read) returns in ~0.05 s, handling only a few hundred kilobytes.
Results show a >200× speed difference, highlighting the importance of cache‑based LISTs.
Recommendations
Always set resourceVersion=0 for LISTs unless strict consistency is required, so the apiserver serves from cache.
Prefer namespaced APIs to limit the key range scanned in etcd.
Implement restart back‑off for per‑node services (kubelet, cilium‑agent, etc.) to avoid mass restarts that flood the control plane.
Use label or field selectors so the apiserver can filter before returning data.
Monitor etcd for large LIST latency, memory, and bandwidth usage; set alerts for long‑running LISTs.
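Several of these recommendations can be sketched in a few lines. The helper names below (build_pod_list_path, restart_delay) and the default delays are hypothetical; the paths and the resourceVersion, labelSelector, and fieldSelector query parameters follow the standard Kubernetes REST API. The back‑off uses exponential growth with full jitter, which is one common choice rather than anything prescribed by the source.

```python
import random
from urllib.parse import urlencode


def build_pod_list_path(namespace: str = "", label_selector: str = "",
                        field_selector: str = "", from_cache: bool = True) -> str:
    """Build a pod LIST path that prefers the apiserver cache and a narrow scope."""
    path = f"/api/v1/namespaces/{namespace}/pods" if namespace else "/api/v1/pods"
    params = {}
    if from_cache:
        params["resourceVersion"] = "0"  # allow the apiserver to answer from cache
    if label_selector:
        params["labelSelector"] = label_selector
    if field_selector:
        params["fieldSelector"] = field_selector
    query = urlencode(params)
    return f"{path}?{query}" if query else path


def restart_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random) -> float:
    """Seconds to wait before restart number `attempt` (0-based), with full jitter."""
    exponent = min(max(attempt, 0), 32)  # bound the exponent to avoid huge values
    ceiling = min(cap, base * (2 ** exponent))
    # Randomizing over [0, ceiling] spreads restarts so nodes do not LIST in lockstep.
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    print(build_pod_list_path("default", label_selector="app=web"))
    print(round(restart_delay(3), 2))
```

A per‑node agent would sleep for restart_delay(attempt) before reconnecting, then issue the narrowest LIST it can.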
Testing scripts
The article provides Bash scripts (curl-k8s-apiserver.sh and benchmark-list-overheads.sh) to measure request latency, data volume, and to generate size reports for each resource type.
Note: A full LIST of pods can generate multi‑gigabyte JSON files; use the benchmark tool cautiously to avoid overloading the control plane.
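The kind of measurement those scripts perform can be approximated as follows. This is a sketch, not a port of the Bash scripts: measure_list and speedup are hypothetical helpers, and the fetch callable is a placeholder for whatever actually issues the HTTP request against a test cluster.

```python
import time
from typing import Callable, Dict


def measure_list(fetch: Callable[[], bytes]) -> Dict[str, float]:
    """Time one LIST call and record the size of the returned body."""
    start = time.perf_counter()
    body = fetch()  # placeholder: in practice this performs the HTTP GET
    elapsed = time.perf_counter() - start
    return {"seconds": elapsed, "bytes": float(len(body))}


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """Ratio between a slow (etcd) and a fast (cache) LIST latency."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    # Fake fetch so the sketch runs without a cluster.
    result = measure_list(lambda: b"x" * 1024)
    print(result["bytes"], "bytes in", round(result["seconds"], 6), "s")
    # Figures quoted in the performance section: ~10 s versus ~0.05 s.
    print(speedup(10.0, 0.05))
```

Running the real equivalent against a production apiserver carries the same risk as the scripts themselves, so it is best pointed at a test cluster first.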
MaGe Linux Operations
