
Why Kubernetes LIST Requests Can Cripple Your Cluster and How to Fix Them

This article examines how heavy LIST operations in unstructured storage systems like Ceph and etcd consume massive I/O, network and CPU, threaten cluster stability, and offers detailed code analysis, performance testing, and practical tuning recommendations to keep large‑scale Kubernetes clusters reliable.

MaGe Linux Operations

Introduction

For unstructured data stores, LIST operations are heavyweight: they consume large amounts of disk I/O, network bandwidth, and CPU, and they can degrade latency‑sensitive requests sharing the same store. This makes them a major stability risk for clusters.

In Ceph object storage, each LIST bucket request scans many disks, slowing down other reads/writes. In etcd, even with modest data sizes, high concurrency (e.g., a 4000‑node Kubernetes cluster) can overwhelm the store unless a caching layer such as the apiserver is used.

apiserver/etcd LIST processing

The apiserver acts as a proxy in front of etcd. It first tries to serve LIST requests from its in‑memory cache; if the cache is not ready or the request forces a direct read (e.g., it does not set resourceVersion=0), the apiserver falls back to etcd.

+--------+      +---------------+          +------------+
| Client | ---> | Proxy (cache) | -------> | Data store |
+--------+      +---------------+          +------------+

Key functions include List(), ListPredicate(), and the internal shouldDelegateList() check, which decides whether to read directly from etcd based on resourceVersion, pagination tokens, and limits.
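The decision described above can be sketched as a small model. This is a simplified Python rendering based on how the upstream check has looked in some Kubernetes releases, not the apiserver's actual code: the ListOptions class and its field names are hypothetical stand‑ins, and the exact conditions differ between versions.

```python
from dataclasses import dataclass


@dataclass
class ListOptions:
    """Hypothetical stand-in for the options carried by a LIST request."""
    resource_version: str = ""        # "" means unset: most recent data required
    limit: int = 0                    # page size; 0 means no limit
    continue_token: str = ""          # pagination token from a previous page
    resource_version_match: str = ""  # "", "NotOlderThan" or "Exact"


def should_delegate_list(opts: ListOptions) -> bool:
    """Return True when the request bypasses the cache and reads etcd."""
    has_continuation = bool(opts.continue_token)
    # A limit only forces an etcd read when resourceVersion is not "0".
    has_limit = opts.limit > 0 and opts.resource_version != "0"
    # Assumption: only "NotOlderThan" (or no match rule) is cache-servable.
    unsupported_match = opts.resource_version_match not in ("", "NotOlderThan")
    return (
        opts.resource_version == ""
        or has_continuation
        or has_limit
        or unsupported_match
    )
```

In this model a request with no resourceVersion always goes to etcd, while resourceVersion=0 stays on the cache even when a limit is set.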

Request handling paths

If a metadata.name is provided, the apiserver fetches a single object.

Otherwise it retrieves the full dataset and applies label/field selectors in memory.

When the cache is ready, the apiserver filters objects locally, dramatically reducing latency. When the cache is missing or the request forces a direct read, the full dataset (often gigabytes) is transferred from etcd.
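The two handling paths can be illustrated with a toy model. The sketch below assumes the cache is a plain dictionary keyed by object name, which is a simplification; list_from_cache and its parameters are hypothetical names, not apiserver functions.

```python
from typing import Dict, List, Optional


def list_from_cache(
    cache: Dict[str, dict],
    name: Optional[str] = None,
    label_selector: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Serve a LIST from a fully loaded in-memory object set (toy model)."""
    if name is not None:
        # Single-object path: look the object up directly by metadata.name.
        obj = cache.get(name)
        return [obj] if obj is not None else []
    if not label_selector:
        # No selector: the whole dataset is returned.
        return list(cache.values())
    # Selector path: every object is examined, only matches are returned.
    return [
        obj for obj in cache.values()
        if all(obj["metadata"].get("labels", {}).get(key) == value
               for key, value in label_selector.items())
    ]
```

The point of the model is that a selector shrinks the response, not the amount of data that must already be loaded: if the object set has to come from etcd first, the full transfer still happens before filtering.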

Performance testing

Using curl scripts, the article benchmarks various LIST scenarios. For a 4000‑node, 100K‑pod cluster:

LIST without resourceVersion=0 (etcd read) takes ~10 s and processes ~2 GB of pod data.

LIST with resourceVersion=0 (cache read) returns in ~0.05 s, handling only a few hundred kilobytes.

The cache‑served LIST is roughly 200× faster (10 s versus 0.05 s), which highlights the importance of cache‑based LISTs.
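The two request forms being compared differ only in their query string. As a rough illustration, the helper below builds LIST URLs for the core API group using the standard resourceVersion, labelSelector, and fieldSelector query parameters. The function name build_list_url and the host used in the usage note are assumptions for this sketch; the article's own measurements were taken with curl scripts.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_url(
    base: str,
    resource: str = "pods",
    namespace: Optional[str] = None,
    resource_version: Optional[str] = None,
    label_selector: Optional[str] = None,
    field_selector: Optional[str] = None,
) -> str:
    """Build a core-API LIST URL; unset options are left out of the query."""
    if namespace:
        path = f"/api/v1/namespaces/{namespace}/{resource}"
    else:
        path = f"/api/v1/{resource}"
    params = {
        "resourceVersion": resource_version,
        "labelSelector": label_selector,
        "fieldSelector": field_selector,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base}{path}?{query}" if query else f"{base}{path}"
```

For example, build_list_url("https://apiserver.example:6443") produces the cluster‑wide form without resourceVersion, and passing resource_version="0" appends ?resourceVersion=0 to the same path.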

Recommendations

Always set resourceVersion=0 for LISTs unless strict consistency is required, so the apiserver serves from cache.

Prefer namespaced APIs to limit the key range scanned in etcd.

Implement restart back‑off for per‑node services (kubelet, cilium‑agent, etc.) to avoid mass restarts that flood the control plane.

Use label or field selectors so the apiserver can filter before returning data.

Monitor etcd for large LIST latency, memory, and bandwidth usage; set alerts for long‑running LISTs.
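One common way to implement the restart back‑off recommended above is exponential back‑off with full jitter, so that thousands of per‑node agents do not re‑LIST at the same instant. The article does not prescribe a specific algorithm; the function name, defaults, and cap below are assumptions chosen for illustration.

```python
import random
from typing import Optional


def restart_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 300.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a delay in seconds, uniform in [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: spread restarts across the whole window, not just its end.
    return source.uniform(0.0, ceiling)
```

An agent would sleep for restart_delay(n) seconds before its n‑th consecutive restart attempt, which spreads the resulting LIST requests over an increasingly wide window.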

Testing scripts

The article provides Bash scripts (curl-k8s-apiserver.sh and benchmark-list-overheads.sh) to measure request latency, data volume, and to generate size reports for each resource type.

Note: A full LIST of pods can generate multi‑gigabyte JSON files; use the benchmark tool cautiously to avoid overloading the control plane.
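The core measurement behind such scripts, latency and response size per request, can be sketched in a few lines. This is not a port of the article's Bash scripts: measure is a hypothetical helper, and fetch is assumed to be any callable that performs one authenticated LIST and returns the raw response body.

```python
import time
from typing import Callable, Tuple


def measure(fetch: Callable[[], bytes]) -> Tuple[float, int]:
    """Run one request and return (elapsed seconds, response size in bytes)."""
    start = time.perf_counter()
    body = fetch()
    elapsed = time.perf_counter() - start
    return elapsed, len(body)
```

Passing the request in as a callable keeps the timing logic independent of the HTTP client, so the same helper can be tried first against a small, namespaced LIST before being pointed at a cluster‑wide one.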
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
