Scaling Kubernetes from 1k to 5k Nodes: Complete Performance Tuning Playbook
This article presents a comprehensive, real‑world guide for expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control‑plane HA, etcd optimization, network and scheduler tuning, monitoring, and automation, with detailed configurations, code snippets, and a step‑by‑step case study of a large‑scale production environment.
Introduction
Kubernetes is the de‑facto container orchestration platform, but when a cluster grows beyond a thousand nodes performance bottlenecks appear in the control plane, etcd, network and scheduler. This guide presents a systematic, production‑grade methodology for scaling a cluster from 1,000 to 5,000 nodes.
Core Challenges in Large‑Scale Clusters
Control‑plane components (API Server, Scheduler, Controller Manager) become saturated.
etcd read/write latency grows with request volume.
Network bandwidth and Service‑mesh traffic cause congestion.
Resource fragmentation leaves many nodes under‑utilized while pods remain pending.
Optimization Dimensions
Control‑plane high availability and parameter tuning.
etcd cluster sizing, hardware upgrades, and configuration tweaks.
Network architecture selection and CNI tuning.
Scheduler policy refinement and resource reservation.
Observability, alerting and automated remediation.
Control‑Plane Optimization
API Server Horizontal Scaling and Load Balancing
Deploy multiple API Server instances per master and place a layer‑4 load balancer (HAProxy or Nginx) in front to distribute traffic.
# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 50000
    nbproc 4
    cpu-map auto:1/1-4 0-3

defaults
    mode tcp
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend kube-apiserver
    bind *:6443
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    balance roundrobin
    option tcp-check
    server master1 10.0.1.10:6443 check inter 2000 rise 2 fall 3
    server master2 10.0.1.11:6443 check inter 2000 rise 2 fall 3
    server master3 10.0.1.12:6443 check inter 2000 rise 2 fall 3
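After reloading HAProxy, verify that requests actually flow through the balancer to a healthy API Server. A minimal probe, assuming the load balancer answers on a VIP such as 10.0.1.100 (illustrative):

# Probe API Server health through the load balancer (VIP is illustrative)
curl -k https://10.0.1.100:6443/healthz
# Expected response: ok

API Server Parameter Tuning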
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --max-requests-inflight=2000
    - --max-mutating-requests-inflight=1000
    - --watch-cache-sizes=nodes#1000,pods#5000,replicasets#1000
    - --default-watch-cache-size=500
    - --enable-aggregator-routing=true
    - --target-ram-mb=8192
    - --event-ttl=1h
    - --enable-priority-and-fairness=true

Kubelet Reporting Frequency
# /var/lib/kubelet/config.yaml
nodeStatusUpdateFrequency: 20s # default 10s
nodeStatusReportFrequency: 5m

etcd Deep Optimization
Hardware and Deployment Architecture
CPU ≥ 16 cores
Memory ≥ 32 GB
NVMe SSD with IOPS > 10,000
10 GbE dedicated management network
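Before promoting a node into the etcd cluster, verify that the disk actually meets these targets. A minimal fsync benchmark with fio (directory and sizes are illustrative; the 2300-byte block size approximates an etcd WAL entry):

# Measure fdatasync latency on the candidate etcd data disk
fio --name=etcd-disk-bench --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-bench --size=22m --bs=2300
# In the output, the fsync/fdatasync 99th percentile should stay below ~10 ms.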
etcd Configuration Parameters
# /etc/etcd/etcd.conf
ETCD_MAX_REQUEST_BYTES=10485760 # 10 MB
ETCD_GRPC_KEEPALIVE_MIN_TIME=5s
ETCD_GRPC_KEEPALIVE_INTERVAL=2h
ETCD_GRPC_KEEPALIVE_TIMEOUT=20s
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h

Monitoring and Alerts
# Example curl to fetch metrics
curl https://10.0.2.11:2379/metrics | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_server_proposals|etcd_network_peer_round_trip"
# Important thresholds:
# - etcd_disk_wal_fsync_duration_seconds p99 < 10 ms
# - etcd_disk_backend_commit_duration_seconds p99 < 25 ms
# - etcd_server_has_leader == 1
# - etcd_mvcc_db_total_size_in_bytes < 8 GB

Scheduler Optimization
Scheduler Parameters
# /etc/kubernetes/manifests/kube-scheduler.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-scheduler
    - --kube-api-qps=200
    - --kube-api-burst=300
    - --bind-address=0.0.0.0
    - --leader-elect=true
    - --feature-gates=PodTopologySpread=true
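On newer Kubernetes releases these client flags are superseded by a KubeSchedulerConfiguration file; a sketch of the equivalent settings (assuming v1.25+), with percentageOfNodesToScore added because it bounds how many feasible nodes are scored per pod, trading placement quality against scheduling latency at this scale:

# /etc/kubernetes/scheduler-config.yaml (passed to kube-scheduler via --config)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  qps: 200
  burst: 300
leaderElection:
  leaderElect: true
# Score only 30% of feasible nodes per pod instead of all of them
percentageOfNodesToScore: 30

Node Affinity & Anti‑Affinity Example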
apiVersion: apps/v1
kind: Deployment
metadata:
  name: business-app
spec:
  replicas: 100
  selector:
    matchLabels:
      app: business-app
  template:
    metadata:
      labels:
        app: business-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - business-app
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 3
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: business-app

Network Performance Optimization
CNI Plugin Choice (Cilium)
# cilium-config ConfigMap (excerpt)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-ipv4: "true"
  enable-ipv6: "false"
  cluster-pool-ipv4-cidr: "10.244.0.0/16"
  cluster-pool-ipv4-mask-size: "24"
  tunnel: "disabled"
  enable-endpoint-routes: "true"
  auto-direct-node-routes: "true"
  enable-bandwidth-manager: "true"
  enable-local-redirect-policy: "true"
  kube-proxy-replacement: "strict"
  bpf-lb-algorithm: "maglev"
  bpf-lb-mode: "dsr"
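After rolling out the agents, confirm that kube-proxy replacement and the datapath features are actually active; a quick check against one of the Cilium pods:

# Verify agent health and datapath features
kubectl -n kube-system exec ds/cilium -- cilium status --brief
kubectl -n kube-system exec ds/cilium -- cilium status | grep -E "KubeProxyReplacement|BandwidthManager"

CoreDNS Tuning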
# CoreDNS ConfigMap (excerpt)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        cache 60 {
            success 10000 3600
            denial 5000 60
        }
        loop
        reload
        loadbalance round_robin
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      containers:
      - name: coredns
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 256Mi
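A quick way to sanity-check resolution latency after the rollout (the dnsutils image is illustrative):

# Resolve a Service name from inside the cluster and report query time
kubectl run dns-check --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- \
  dig kubernetes.default.svc.cluster.local
# "Query time" in the dig output should be in the low single-digit milliseconds.

Service IPVS Mode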
# Edit kube-proxy ConfigMap to enable IPVS
mode: "ipvs"
ipvs:
  scheduler: "rr"
  syncPeriod: 30s
  minSyncPeriod: 5s
  strictARP: false

# Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system

# Verify rules
ipvsadm -Ln | head -20

Practical Case Study
Background
A large internet company expanded from 800 to 1,500 nodes in 2023 and observed severe degradation:
API Server P99 latency grew from ~50 ms to ~2 s.
Pod scheduling time increased from 5 s to 30 s.
etcd database reached 6 GB with frequent leader elections.
Node resource utilization stayed below 30 % while pods could not be scheduled.
Three‑Phase Optimization
Phase 1 – Emergency Fire‑fighting (1 week)
Added a second API Server instance on each master and updated HAProxy backend to balance four instances.
Performed immediate etcd compaction and defragmentation (commands sketched after this list), reducing DB size from 5.8 GB to 2.3 GB and write latency from 120 ms to 15 ms.
Adjusted Kubelet nodeStatusUpdateFrequency to 20 s, cutting API Server request volume by ~40 %.
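A sketch of that compaction and defragmentation step, assuming etcdctl v3 with TLS flags omitted for brevity; defragment one member at a time, since defrag blocks the member while it runs:

# Compact history up to the current revision
rev=$(etcdctl endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"

# Defragment each member individually (endpoints are illustrative)
for ep in https://10.0.2.11:2379 https://10.0.2.12:2379 https://10.0.2.13:2379; do
  etcdctl --endpoints="$ep" defrag
done

# Clear a NOSPACE alarm if one was raised while the DB was oversized
etcdctl alarm disarm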
Phase 2 – Systematic Optimization (1 month)
Expanded etcd to five nodes, upgraded storage to NVMe SSDs, and applied performance parameters, achieving write latency ≈ 8 ms and read latency ≈ 3 ms.
Migrated the CNI from Flannel to Cilium with native routing, bandwidth manager and strict kube‑proxy replacement, reducing pod‑to‑pod latency by 35 % and Service latency by 50 %.
Deployed the Descheduler with the LowNodeUtilization and RemoveDuplicates strategies to eliminate resource fragmentation (a policy sketch follows this list).
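A policy sketch for those two strategies, assuming the Descheduler's v1alpha1 policy format; the utilization thresholds are illustrative:

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:       # nodes below all of these count as underutilized
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds: # nodes above any of these can donate pods
          cpu: 50
          memory: 50
          pods: 50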
Phase 3 – Continuous Expansion to 5,000 Nodes (3 months)
Built a federated Prometheus stack (one master, five workers) and defined alerts for API latency, etcd fsync, and pending pods (example rules follow this list).
Implemented automated node‑health scripts and Node Problem Detector for self‑healing.
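Illustrative Prometheus alert rules for the three signals named above (thresholds are examples; the pending-pods rule assumes kube-state-metrics is scraped):

groups:
- name: cluster-scaling-alerts
  rules:
  - alert: APIServerHighP99Latency
    expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le)) > 1
    for: 5m
    labels:
      severity: critical
  - alert: EtcdSlowWALFsync
    expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.01
    for: 10m
    labels:
      severity: warning
  - alert: PendingPodsPileup
    expr: sum(kube_pod_status_phase{phase="Pending"}) > 100
    for: 15m
    labels:
      severity: warning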
Results
After the three phases the cluster successfully scaled to 5,000 nodes with the following improvements:
API Server P99 latency reduced from 2,000 ms to 300 ms (≈85 % improvement).
Median pod scheduling time dropped from 30 s to 6 s (≈80 % improvement).
etcd write latency P99 fell from 120 ms to 12 ms (≈90 % improvement).
Average node resource utilization increased from 28 % to 65 %.
Network pod‑to‑pod latency improved from 2.5 ms to 1.6 ms.
Cluster failure incidents fell from three per month to 0.2 per month.
Key Takeaways
Adopt a phased approach: emergency fixes → systematic tuning → ongoing scaling to mitigate risk.
Establish comprehensive monitoring before making changes; data‑driven decisions are essential.
Validate major changes in a test cluster and roll out gradually using blue‑green or canary strategies.
Automate operations; manual interventions do not scale to thousands of nodes.
Plan capacity with at least 30 % headroom for future growth.
Conclusion and Outlook
Control‑plane and etcd performance are the primary bottlenecks for ultra‑large clusters. Network stack, scheduler policies, and observability must be engineered together. Future directions include virtual clusters, edge‑computing extensions, AI‑driven scheduling, and alternative key‑value stores to break current limits.