How PayPal Scaled Kubernetes to 4,100 Nodes and 200k Pods
PayPal’s engineering team detailed their journey of scaling Kubernetes from a few hundred nodes to over 4,100 nodes and 200,000 Pods, describing cluster topology, workload generation, API server bottlenecks, controller manager and scheduler tuning, extensive etcd optimizations, and the resulting performance gains that met Kubernetes SLOs.
Abstract
PayPal evaluated Kubernetes performance, scalability, and control‑plane behavior while migrating workloads from Apache Mesos.
Cluster topology
The production clusters run on Google Cloud Platform with several thousand nodes. Each cluster uses three master nodes, an external three‑node etcd cluster, and a load balancer in front of the control plane. All worker nodes reside in the same region as the control plane.
Workloads
Performance testing used the open‑source k‑bench workload generator, customized to create simple Pods and Deployments in batches of varying size and interval.
Scaling process
Starting with a modest number of Pods and nodes, the cluster was iteratively expanded. Each worker node provides four CPU cores and can host up to 40 Pods. The final scale reached roughly 4,100 nodes supporting 150 k–200 k Pods. Node core counts were increased as needed to accommodate the higher Pod density.
API server bottlenecks
The API server became the primary bottleneck, producing 504 gateway‑timeout errors and exponential client‑side throttling. Sample throttling logs:
I0504 17:54:55.731559 1 request.go:655] Throttling request took 1.005397106s, request: POST:https://<>:443/api/v1/namespaces/kbench-deployment-namespace-14/Pods..
I0504 17:55:05.741655 1 request.go:655] Throttling request took 7.38390786s, request: POST:https://<>:443/api/v1/namespaces/kbench-deployment-namespace-13/Pods..
I0504 17:55:15.749891 1 request.go:655] Throttling request took 13.522138087s, request: POST:https://<>:443/api/v1/namespaces/kbench-deployment-namespace-13/Pods..
I0504 17:55:25.759662 1 request.go:655] Throttling request took 19.202229311s, request: POST:https://<>:443/api/v1/namespaces/kbench-deployment-namespace-20/Pods..
I0504 17:55:35.760088 1 request.go:655] Throttling request took 25.409325008s, request: POST:https://<>:443/api/v1/namespaces/kbench-deployment-namespace-13/Pods..
I0504 17:55:45.769922 1 request.go:655] Throttling request took 31.613720059s, request: POST:https://<>:443/api/v1/namespaces/kbench-deployment-namespace-6/Pods..Queue sizes are controlled by max‑mutating‑requests‑inflight and max‑requests‑inflight. Kubernetes 1.20 introduced a priority‑and‑fairness feature that partitions the total queue across priority classes, allowing leader‑election traffic to outrank Pod requests.
Controller manager tuning
kube‑api‑qps– maximum queries per second the controller manager may issue to the API server. kube‑api‑burst – burst capacity above the QPS limit. concurrent‑deployment‑syncs – concurrency of Deployment/ReplicaSet synchronizations.
Scheduler adjustments
In isolation the scheduler can sustain ~1,000 Pods/s, but in the live cluster throughput dropped because slow etcd instances increased binding latency, inflating the pending queue to thousands of Pods. The pending queue was kept below 100 during tests, and leader‑election parameters were tuned to mitigate transient network partitions.
etcd optimizations
etcd proved to be the most critical component. Initial failures were caused by GCP PD‑SSD disks limited to ~100 MiB/s. Switching to local SSDs increased raw throughput but introduced ext4 write‑barrier latency.
Disabling the write barrier and adjusting WAL settings brought local SSD performance close to PD‑SSD levels.
LOCAL SSD Summary: Total: 8.1841 secs. Slowest: 0.5171 secs. Fastest: 0.0332 secs. Average: 0.0815 secs. Stddev: 0.0259 secs. Requests/sec: 12218.8374
PD SSD Summary: Total: 4.6773 secs. Slowest: 0.3412 secs. Fastest: 0.0249 secs. Average: 0.0464 secs. Stddev: 0.0187 secs. Requests/sec: 21379.7235Further tuning raised local SSD throughput to ~23,910 requests/sec.
etcd’s default MVCC database size is 2 GB and can grow to 8 GB under pressure; with ~60 % utilization the cluster could support 200 k stateless Pods.
A liveness‑probe bug in etcd v1.20 was mitigated by increasing the failure‑threshold count.
Heavy range queries on resources such as Events, Pods, ConfigMaps, Deployments, Leases, and Nodes caused significant backend latency. Sharding etcd per resource type improved stability under high Pod competition.
Results
After tuning API server queues, controller manager parameters, scheduler behavior, and especially etcd storage, latency improved dramatically. With a workload of 150 k Pods (250 replicas per Deployment, 10 concurrent workers), P99 Pod startup latency stayed under 5 seconds, meeting Kubernetes SLOs. At 200 k Pods the API call latency fully satisfied the SLO.
Conclusion
Scaling Kubernetes to tens of thousands of nodes and hundreds of thousands of Pods requires deep insight into the control plane and systematic component tuning. The optimizations described demonstrate that a well‑tuned control plane can sustain large‑scale clusters.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
