
How PayPal Scaled Kubernetes to 4,000 Nodes and 200,000 Pods

PayPal’s engineering team detailed their journey of scaling Kubernetes from a few hundred nodes to over 4,000 nodes and 200,000 pods, describing the cluster topology, workload generation, bottlenecks in the API server, controller manager, scheduler, and etcd, and the optimizations that enabled stable performance at massive scale.

Open Source Linux

Mar 17, 2022


Abstract

PayPal recently began experimenting with Kubernetes, moving workloads that previously ran on Apache Mesos. To assess performance and scalability, they evaluated the Kubernetes control plane, focusing on platform scalability and identifying improvement areas.

Introduction

The article, originally published on the PayPal tech blog, explains why PayPal needed to understand Kubernetes performance as part of its migration from Mesos.

Cluster Topology

Production clusters run on Google Cloud Platform (GCP) with three master nodes and an external three‑node etcd cluster. A load balancer fronts the control plane, and all data nodes reside in the same region as the control plane.

Workload Generation

Performance tests used the open‑source k‑bench workload generator, modified for PayPal’s scenarios. Simple Pods and Deployments were created in batches of varying size and interval.
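As a rough illustration of "batches of varying size and interval", the sketch below plans a creation schedule. It is not k-bench's configuration format or code; the `Batch` type, the `plan_batches` helper, and the numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    start_s: float  # seconds after the start of the test
    count: int      # number of objects created in this batch


def plan_batches(total: int, batch_size: int, interval_s: float) -> List[Batch]:
    """Split `total` creations into batches fired every `interval_s` seconds."""
    if total < 0 or batch_size <= 0 or interval_s < 0:
        raise ValueError("total and interval_s must be >= 0, batch_size > 0")
    batches: List[Batch] = []
    remaining = total
    index = 0
    while remaining > 0:
        count = min(batch_size, remaining)  # the last batch may be smaller
        batches.append(Batch(start_s=index * interval_s, count=count))
        remaining -= count
        index += 1
    return batches


# Hypothetical run: 2,500 Pods in batches of 1,000, one batch every 30 s.
plan = plan_batches(total=2500, batch_size=1000, interval_s=30.0)
```

Varying `batch_size` and `interval_s` between runs is one simple way to probe how the control plane responds to bursty versus steady creation.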

Scaling Experiments

Starting with a small number of Pods and nodes, the team increased scale step by step as performance improved. Each worker node had four CPU cores and could host up to 40 Pods. The cluster grew to roughly 4,100 nodes running 150,000 Pods, and then to 200,000 Pods, which required more CPU cores per node to fit the additional Pods.
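The article does not spell out the arithmetic behind the extra cores, but a back-of-the-envelope check using only the figures above suggests why the 40-Pod limit had to rise. This sketch assumes Pods spread evenly and ignores system Pods; the helper names are illustrative.

```python
NODES = 4_100            # approximate node count from the article
MAX_PODS_PER_NODE = 40   # per-node limit with four cores, from the article


def cluster_pod_capacity(nodes: int, pods_per_node: int) -> int:
    """Upper bound on Pods if every node is filled to its limit."""
    return nodes * pods_per_node


def min_pods_per_node(target_pods: int, nodes: int) -> int:
    """Smallest per-node limit that fits target_pods (integer ceiling)."""
    return (target_pods + nodes - 1) // nodes


capacity = cluster_pod_capacity(NODES, MAX_PODS_PER_NODE)
fits_150k = 150_000 <= capacity
fits_200k = 200_000 <= capacity
needed_for_200k = min_pods_per_node(200_000, NODES)
```

Under these assumptions the original node size covers 150,000 Pods but not 200,000, which is consistent with the need for larger nodes.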

API Server Bottleneck

The API server became a bottleneck: many requests returned 504 gateway timeout errors, and clients throttled their own requests locally, as in the log line below. The number of requests the server handles concurrently is capped by the --max-mutating-requests-inflight and --max-requests-inflight flags. API Priority and Fairness, which reached beta in Kubernetes 1.20, divides that capacity among different classes of requests.

I0504 17:54:55.731559 1 request.go:655] Throttling request took 1.005397106s, request: POST: https://<...>/api/v1/namespaces/kbench-deployment-namespace-14/Pods..
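When hunting for client-side throttling, it can help to extract the reported delay from log lines like the one above. The sketch below is a minimal parser. It assumes the delay is printed in seconds or milliseconds only (other duration formats are not handled), and the regular expression is fitted to the sample line rather than taken from any official log specification.

```python
import re
from typing import Optional, Tuple

# Pattern fitted to the sample line above; an assumption, not a log spec.
THROTTLE_RE = re.compile(
    r"Throttling request took (?P<value>\d+(?:\.\d+)?)(?P<unit>ms|s), "
    r"request: (?P<verb>[A-Z]+):\s*(?P<url>\S+)"
)

# The elided host ("<...>") is kept exactly as it appears in the article.
SAMPLE = (
    "I0504 17:54:55.731559 1 request.go:655] Throttling request took "
    "1.005397106s, request: POST: https://<...>/api/v1/namespaces/"
    "kbench-deployment-namespace-14/Pods.."
)


def parse_throttle(line: str) -> Optional[Tuple[float, str]]:
    """Return (delay in seconds, HTTP verb), or None if the line does not match."""
    match = THROTTLE_RE.search(line)
    if match is None:
        return None
    value = float(match["value"])
    seconds = value / 1000.0 if match["unit"] == "ms" else value
    return seconds, match["verb"]


parsed = parse_throttle(SAMPLE)
```

Aggregating these delays per verb or per namespace would show which clients are being held back the most.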

Controller Manager Tuning

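Three controller manager flags govern how quickly it can sync state with the API server:

kube-api-qps: queries per second the controller manager may send to the API server.
kube-api-burst: burst capacity above the QPS limit.
concurrent-deployment-syncs: concurrency of Deployment sync operations (ReplicaSets are governed by the separate concurrent-replicaset-syncs flag).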
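The interaction between kube-api-qps and kube-api-burst is easiest to see as a token bucket: the bucket holds up to `burst` tokens and refills at `qps` tokens per second. The sketch below is a simplified model under that assumption, not the controller manager's actual rate limiter, and the values 20 and 30 are illustrative.

```python
class TokenBucket:
    """Simplified QPS/burst limiter driven by caller-supplied timestamps."""

    def __init__(self, qps: float, burst: int) -> None:
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # timestamp of the previous call, in seconds

    def try_accept(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.qps)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would have to wait: this is the throttling


# Illustrative values only: 20 QPS with a burst of 30.
bucket = TokenBucket(qps=20.0, burst=30)
accepted_at_start = sum(bucket.try_accept(0.0) for _ in range(40))
accepted_half_second_later = sum(bucket.try_accept(0.5) for _ in range(15))
```

In this model, raising the burst helps with short spikes, while only a higher QPS raises sustained throughput.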

Scheduler Performance

Tested in isolation, the scheduler can sustain up to 1,000 Pods per second. In the live cluster, throughput was lower because slow etcd instances held the scheduler back, and the pending queue grew to thousands of Pods.
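A simple rate model shows how quickly a pending queue can reach the thousands once effective scheduling throughput drops below the creation rate. The rates below are hypothetical and chosen only for illustration; the article does not report them.

```python
def pending_backlog(arrival_rate: float, schedule_rate: float, seconds: float) -> float:
    """Pods left pending after `seconds`, assuming constant rates and an empty queue at start."""
    return max(0.0, (arrival_rate - schedule_rate) * seconds)


# Hypothetical: Pods arrive at 1,000/s but only 600/s get scheduled.
backlog_after_10s = pending_backlog(arrival_rate=1000, schedule_rate=600, seconds=10)

# When the scheduler keeps up, no backlog accumulates.
no_backlog = pending_backlog(arrival_rate=500, schedule_rate=600, seconds=10)
```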

etcd Challenges

etcd proved to be the most critical component. GCP PD‑SSD disks were limited to ~100 MiB/s throughput, becoming a bottleneck despite the small storage needs of etcd. Switching to local SSDs improved raw I/O but introduced write‑barrier latency. Disabling ext4 write barriers and tuning WAL sync reduced latency dramatically.

LOCAL SSD
Summary:
  Total: 8.1841 secs.
  Slowest: 0.5171 secs.
  Fastest: 0.0332 secs.
  Average: 0.0815 secs.
  Stddev: 0.0259 secs.
  Requests/sec: 12218.8374
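Two quantities can be derived from the local SSD summary above. The article does not state the benchmark settings, so the request count and concurrency computed here are inferences from the reported figures, not values it gives.

```python
# Figures copied from the local SSD summary above.
total_secs = 8.1841
requests_per_sec = 12218.8374
average_latency_secs = 0.0815

# Requests completed = throughput x wall-clock duration.
total_requests = round(total_secs * requests_per_sec)

# Little's law: average requests in flight = throughput x average latency.
implied_in_flight = requests_per_sec * average_latency_secs

print(f"requests completed: ~{total_requests}")
print(f"average requests in flight: ~{implied_in_flight:.0f}")
```

The run appears to have issued about 100,000 requests with roughly 1,000 in flight on average, which would be consistent with a benchmark using on the order of 1,000 concurrent clients.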

After disabling write barriers, local SSD performance matched PD‑SSD, and etcd’s MVCC database (default 2 GiB, expandable to 8 GiB) operated at ~60 % utilization, allowing scaling to 200 k stateless Pods.

Results

Post‑optimisation, latency improved significantly. With a workload of 150 k Pods (250 replicas per deployment, 10 concurrent workers), the P99 Pod‑startup latency stayed under 5 seconds, meeting Kubernetes SLOs. At 200 k Pods, API call latency fully satisfied the SLO.

PayPal achieved roughly 5-second P99 startup latency for 200 k Pods while deploying Pods at a rate far higher than the 3,000 Pods/min that Kubernetes cites for its 5 k-node test.
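To make the P99 figures concrete, the sketch below computes a 99th-percentile latency with the nearest-rank method and compares it with the 5-second target mentioned above. The sample latencies are synthetic, and the article does not say which percentile method PayPal's tooling used.

```python
from typing import Sequence

SLO_SECONDS = 5.0  # P99 Pod-startup target referenced in the article


def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling, 1-based rank
    return ordered[rank - 1]


# Synthetic startup latencies in seconds; not measurements from the article.
latencies = [1.2] * 150 + [3.4] * 48 + [9.0] * 2
p99 = percentile_nearest_rank(latencies, 99)
meets_slo = p99 <= SLO_SECONDS
```

Note that two slow outliers out of 200 samples do not move the P99 here, which is why tail percentiles are usually read alongside the maximum.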

Conclusion

Kubernetes is a complex system that requires deep understanding of its control plane to scale each component effectively. PayPal’s experience highlights the importance of tuning the API server, controller manager, scheduler, and especially etcd to achieve stable performance at massive scale.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
