Scaling Kubernetes from 1,000 to 5,000 Nodes: Real‑World Performance Tuning Guide
This article details a step‑by‑step, production‑grade guide for expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control‑plane HA, etcd tuning, network and scheduler optimizations, monitoring, and real‑world case studies to achieve stable, high‑performance large‑scale deployments.
Introduction
In the cloud-native era, Kubernetes has become the de facto standard for container orchestration. As clusters grow from hundreds to thousands of nodes, operations teams run into performance bottlenecks, wasted resources, and scheduling delays. This article shares practical experience from expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control-plane optimization, etcd tuning, network performance, and scheduler improvements.
Technical Background: Core Challenges of Large‑Scale Kubernetes Clusters
Kubernetes Architecture Scalability Bottlenecks
Kubernetes was designed for distributed operation, but in very large clusters the control plane becomes the performance choke point. Once a cluster exceeds 1,000 nodes, the API Server must handle a high volume of requests from kubelets, controllers, schedulers, and users, and etcd's read/write latency directly affects overall responsiveness.
Typical Performance Issues in Large Clusters
API response slowdown: kubectl latency grows from milliseconds to seconds.
Increased scheduling delay: new pod scheduling time expands from seconds to minutes.
etcd storage pressure: many watch requests cause CPU and memory usage to climb.
Network bandwidth bottleneck: traffic from services, service mesh, and log collection leads to congestion.
Severe resource fragmentation: overall cluster resources appear sufficient, yet individual nodes cannot schedule pods.
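The resource fragmentation problem in the last item can be illustrated with a small sketch. The node names and free-capacity figures below are hypothetical; the point is only that the cluster-wide total can exceed a pod's request while no single node can host it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_m: int   # free CPU in millicores
    free_mem_mi: int  # free memory in MiB


def fits_somewhere(nodes, cpu_m, mem_mi):
    """Return the names of nodes that can host a pod with the given requests."""
    return [n.name for n in nodes
            if n.free_cpu_m >= cpu_m and n.free_mem_mi >= mem_mi]


# Hypothetical free capacity on three nodes
nodes = [
    Node("node-a", free_cpu_m=1500, free_mem_mi=1024),
    Node("node-b", free_cpu_m=500, free_mem_mi=6144),
    Node("node-c", free_cpu_m=1000, free_mem_mi=2048),
]

total_cpu = sum(n.free_cpu_m for n in nodes)
total_mem = sum(n.free_mem_mi for n in nodes)
# A pod requesting 2 CPU / 4 GiB is smaller than the cluster-wide total
# (3000m CPU, 9216 MiB) but fits on no individual node.
candidates = fits_somewhere(nodes, cpu_m=2000, mem_mi=4096)
print(total_cpu, total_mem, candidates)
```

In this toy cluster the pod stays pending even though the aggregate numbers look sufficient, which is the situation rebalancing tools are meant to relieve.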
Key Dimensions for Optimization
To achieve a smooth expansion from 1,000 to 5,000 nodes, systematic optimization is required in the following areas:
Control‑plane high availability and performance tuning.
etcd cluster optimization (capacity planning, performance tuning, backup & recovery).
Network architecture optimization (CNI selection, service‑mesh lightening, traffic control).
Scheduler strategy refinement (custom scheduler, resource reservation, pod priority & preemption).
Monitoring and observability for large‑scale clusters.
Core Content: Practical Kubernetes Cluster Performance Tuning
1. Control‑Plane Optimization: Breaking API Server Bottlenecks
1.1 API Server Horizontal Scaling and Load Balancing
In a large cluster a single API Server instance cannot handle all requests. Deploy multiple API Server instances behind a load balancer to distribute load.
3 master nodes, each running 2 API Server static Pods.
Use HAProxy or Nginx as a layer‑4 load balancer.
Configure health checks and automatic failover.
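To make the health-check and failover behavior concrete, here is a simplified model of how a load balancer can decide whether a backend is up. It assumes the common "rise/fall" semantics (a backend is marked down after `fall` consecutive failed checks and up again after `rise` consecutive successful ones); the class name is hypothetical, and this is a sketch of the idea, not HAProxy's actual implementation.

```python
class BackendHealth:
    """Track up/down state from a stream of health-check results."""

    def __init__(self, rise=2, fall=3, up=True):
        self.rise = rise
        self.fall = fall
        self.up = up
        self._streak = 0  # consecutive results that contradict the current state

    def observe(self, check_ok: bool) -> bool:
        if check_ok == self.up:
            # Result agrees with the current state: reset the counter.
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.fall if self.up else self.rise
            if self._streak >= threshold:
                self.up = not self.up
                self._streak = 0
        return self.up


backend = BackendHealth(rise=2, fall=3)
checks = [True, False, False, True, False, False, False, True, True]
history = [backend.observe(ok) for ok in checks]
print(history)
```

Two isolated failures do not remove the backend; three in a row do, and two consecutive successes bring it back. With a 2-second check interval this means roughly 6 seconds to detect a dead API Server instance.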
HAProxy Configuration Example
# /etc/haproxy/haproxy.cfg
global
log /dev/log local0
maxconn 50000
nbthread 4 # nbproc was removed in HAProxy 2.5; threads replace processes
cpu-map auto:1/1-4 0-3
defaults
mode tcp
timeout connect 5000ms
timeout client 50000ms
timeout server 50000ms
frontend kube-apiserver
bind *:6443
default_backend kube-apiserver-backend
backend kube-apiserver-backend
balance roundrobin
option tcp-check
server master1 10.0.1.10:6443 check inter 2000 rise 2 fall 3
server master2 10.0.1.11:6443 check inter 2000 rise 2 fall 3
server master3 10.0.1.12:6443 check inter 2000 rise 2 fall 3
1.2 API Server Critical Parameter Tuning
# Modify /etc/kubernetes/manifests/kube-apiserver.yaml
spec:
  containers:
  - command:
    - kube-apiserver
    - --max-requests-inflight=2000 # increase concurrent non-mutating requests (default 400)
    - --max-mutating-requests-inflight=1000 # increase write request concurrency (default 200)
    - --watch-cache-sizes=nodes#1000,pods#5000,replicasets.apps#1000 # format: resource[.group]#size
    - --default-watch-cache-size=500 # increase watch cache (default 100)
    - --enable-aggregator-routing=true
    - --target-ram-mb=8192 # only sizes internal caches; check that your kube-apiserver version still accepts this flag
    - --event-ttl=1h
    - --enable-priority-and-fairness=true
1.3 Tuning Kubelet‑API Server Communication Frequency
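The kubelet settings in this section control how often each node talks to the API Server. As a rough way to see why the interval matters at this scale, the sketch below estimates an upper bound on request rate, assuming every node sends exactly one request per interval. That is a simplification: real kubelets also renew leases and may skip unchanged status reports, so treat the numbers as an order-of-magnitude guide only.

```python
def heartbeat_qps(nodes: int, interval_s: float) -> float:
    """Average requests per second if every node sends one request per interval."""
    return nodes / interval_s


for nodes in (1000, 5000):
    for interval in (10, 20):
        qps = heartbeat_qps(nodes, interval)
        print(f"{nodes} nodes @ {interval}s -> {qps:.0f} req/s")
```

Under this assumption, 5,000 nodes at a 10 s interval produce as much periodic traffic as 10,000 nodes would at 20 s, which is why doubling the interval is an inexpensive first step.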
# Adjust on each node: /var/lib/kubelet/config.yaml
nodeStatusUpdateFrequency: 20s # extend from 10s to 20s
nodeStatusReportFrequency: 5m
2. etcd Deep Optimization
2.1 etcd Hardware Configuration and Deployment Architecture
etcd is the core data store of Kubernetes; its performance directly determines cluster stability. Recommended hardware for a 5,000‑node cluster:
CPU: 16 cores or more
Memory: 32 GB or more
Storage: NVMe SSD with IOPS > 10,000
Network: 10 GbE dedicated management network
Deployment Architecture
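The preference for an odd member count follows from Raft quorum arithmetic: a cluster of n members needs a majority of floor(n/2) + 1 to commit writes. A minimal sketch of that calculation:

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a quorum is still available."""
    return members - quorum(members)


for n in range(3, 7):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

Going from 3 to 4 members adds no failure tolerance, and neither does going from 5 to 6, while every extra member adds replication work to each write. Five members tolerate two failures, which is the usual choice for a cluster of this size.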
# Deploy a 5‑node etcd cluster (odd number recommended)
# etcd1: 10.0.2.11
# etcd2: 10.0.2.12
# etcd3: 10.0.2.13
# etcd4: 10.0.2.14
# etcd5: 10.0.2.15
etcd --name etcd1 \
--data-dir /var/lib/etcd \
--listen-peer-urls https://10.0.2.11:2380 \
--listen-client-urls https://10.0.2.11:2379,https://127.0.0.1:2379 \
--initial-advertise-peer-urls https://10.0.2.11:2380 \
--advertise-client-urls https://10.0.2.11:2379 \
--initial-cluster-token etcd-cluster-prod \
--initial-cluster etcd1=https://10.0.2.11:2380,etcd2=https://10.0.2.12:2380,etcd3=https://10.0.2.13:2380,etcd4=https://10.0.2.14:2380,etcd5=https://10.0.2.15:2380 \
--initial-cluster-state new \
--heartbeat-interval 200 \
--election-timeout 2000 \
--snapshot-count 20000 \
--max-snapshots 5 \
--max-wals 5 \
--quota-backend-bytes 8589934592 # 8 GB quota
2.2 etcd Performance Tuning Parameters
# Optimize etcd configuration (/etc/etcd/etcd.conf)
ETCD_MAX_REQUEST_BYTES=10485760 # 10 MB request size limit
ETCD_GRPC_KEEPALIVE_MIN_TIME=5s
ETCD_GRPC_KEEPALIVE_INTERVAL=2h
ETCD_GRPC_KEEPALIVE_TIMEOUT=20s
# Snapshot and compaction
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h
# Manual compaction when DB grows large
# Use the revision reported by the first endpoint so exactly one number is passed to compact
etcdctl compact "$(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')"
# Defragmentation
etcdctl defrag --cluster
2.3 etcd Monitoring and Alerting
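The disk latency metrics used in this section are Prometheus histograms, so a p99 value has to be estimated from cumulative bucket counts. The sketch below shows the usual approach (linear interpolation inside the bucket that contains the target rank). The bucket values are invented for illustration, and in practice the calculation is applied to bucket rates over a time window rather than to raw counters.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile (0 < q <= 1) from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound,
    with the last entry being the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical WAL fsync buckets: (le in seconds, cumulative observations)
fsync_buckets = [
    (0.001, 0), (0.002, 400), (0.004, 900), (0.008, 980),
    (0.016, 995), (0.032, 1000), (math.inf, 1000),
]
p99 = histogram_quantile(0.99, fsync_buckets)
p50 = histogram_quantile(0.50, fsync_buckets)
print(f"p50={p50 * 1000:.1f} ms, p99={p99 * 1000:.1f} ms")
```

With these sample numbers the median looks healthy while the p99 lands above 10 ms, which is why the tail latency, not the average, is the value to watch.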
# Key metrics collection
curl https://10.0.2.11:2379/metrics | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_server_proposals|etcd_network_peer_round_trip"
# Healthy ranges (example thresholds); alert when a metric leaves its range
# - etcd_disk_wal_fsync_duration_seconds p99 < 10 ms
# - etcd_disk_backend_commit_duration_seconds p99 < 25 ms
# - etcd_server_has_leader == 1
# - etcd_mvcc_db_total_size_in_bytes < 8 GB (the configured quota)
3. Scheduler Optimization: Improving Pod Scheduling Efficiency
3.1 Scheduler Performance Parameter Tuning
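# Modify /etc/kubernetes/manifests/kube-scheduler.yaml
spec:
  containers:
  - command:
    - kube-scheduler
    - --kube-api-qps=200 # increase API request QPS (default 50)
    - --kube-api-burst=300 # increase burst (default 100)
    - --bind-address=0.0.0.0
    - --leader-elect=true
    # Pod topology spread has been GA since v1.19 and needs no feature gate.
    # A gate name the binary does not recognize can prevent the scheduler from starting,
    # so the flag is left commented out here.
    # - --feature-gates=PodTopologySpread=true
3.2 Configuring Node Affinity and Anti‑Affinity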
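Before looking at the manifest, it may help to see what `maxSkew` measures. The sketch below is a simplified model, assuming skew is the number of matching pods in the candidate zone after placement minus the current minimum across all zones. The zone names and counts are hypothetical.

```python
def skew_if_placed(pods_per_zone: dict, zone: str) -> int:
    """Skew for `zone` if one more matching pod were placed there."""
    return pods_per_zone[zone] + 1 - min(pods_per_zone.values())


def zones_within_skew(pods_per_zone: dict, max_skew: int) -> list:
    """Zones where adding one pod keeps the skew at or below max_skew."""
    return sorted(z for z in pods_per_zone
                  if skew_if_placed(pods_per_zone, z) <= max_skew)


# Hypothetical distribution of a 119-replica workload across three zones
zones = {"zone-a": 40, "zone-b": 38, "zone-c": 41}
preferred = zones_within_skew(zones, max_skew=3)
print(preferred)
```

With `whenUnsatisfiable: DoNotSchedule` the remaining zone would be rejected outright; with `ScheduleAnyway`, as used in this article, the scheduler only gives it a lower score.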
apiVersion: apps/v1
kind: Deployment
metadata:
  name: business-app
spec:
  replicas: 100
  selector: # required by apps/v1; must match the template labels
    matchLabels:
      app: business-app
  template:
    metadata:
      labels:
        app: business-app
    spec:
      # containers omitted for brevity
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - business-app
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 3
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: business-app
3.3 Configuring Resource Reservations and Limits
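Reservations reduce what the scheduler may place on a node: allocatable equals capacity minus kubeReserved, systemReserved, and the hard eviction threshold. A small sketch of that arithmetic, using the reservation values from this section and a hypothetical 16-core / 64 GiB node:

```python
def allocatable(capacity, kube_reserved, system_reserved, eviction_hard=0):
    """Node allocatable = capacity - kubeReserved - systemReserved - evictionHard."""
    return capacity - kube_reserved - system_reserved - eviction_hard


# Hypothetical node: 16 cores (16000m) and 64 GiB (65536 MiB)
cpu_m = allocatable(16000, kube_reserved=1000, system_reserved=1000)
mem_mi = allocatable(64 * 1024, kube_reserved=2048, system_reserved=2048,
                     eviction_hard=1024)
print(f"allocatable CPU: {cpu_m}m, allocatable memory: {mem_mi} MiB")
```

On this example node the reservations cost 2 cores and 5 GiB. Multiplied across thousands of nodes that is a large amount of capacity, so the values are worth sizing per node type rather than copying one setting everywhere.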
# Update kubelet config (/var/lib/kubelet/config.yaml)
systemReserved:
  cpu: 1000m
  memory: 2Gi
  ephemeral-storage: 10Gi
kubeReserved:
  cpu: 1000m
  memory: 2Gi
  ephemeral-storage: 10Gi
evictionHard:
  memory.available: "1Gi"
  nodefs.available: "10%"
  imagefs.available: "10%"
4. Network Performance Optimization
4.1 CNI Plugin Selection and Optimization
For large clusters, Cilium (eBPF‑based) or Calico (IPIP/VXLAN) are recommended.
Cilium Optimization Configuration
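When choosing `cluster-pool-ipv4-cidr` and `cluster-pool-ipv4-mask-size`, it is worth checking how many per-node blocks the pool actually yields, since each node takes one block. The helper names below are hypothetical, and the second CIDR is only an example range.

```python
import ipaddress


def node_blocks(pool_cidr: str, per_node_mask: int) -> int:
    """Number of per-node blocks of size /per_node_mask inside the pool."""
    pool = ipaddress.ip_network(pool_cidr)
    if per_node_mask < pool.prefixlen:
        raise ValueError("per-node mask must not be larger than the pool itself")
    return 2 ** (per_node_mask - pool.prefixlen)


def widest_prefix_for(nodes: int, per_node_mask: int) -> int:
    """Longest pool prefix length that still provides one block per node."""
    bits_needed = (nodes - 1).bit_length()  # ceil(log2(nodes)) for nodes >= 2
    return per_node_mask - bits_needed


print(node_blocks("10.244.0.0/16", 24))   # blocks available from a /16 pool
print(node_blocks("10.224.0.0/11", 24))   # blocks available from a /11 pool
print(widest_prefix_for(5000, 24))        # prefix length needed for 5,000 nodes
```

A /16 pool split into /24 blocks supports only 256 nodes, so a 5,000-node cluster with /24 per node needs a pool of /11 or larger (or a smaller per-node block).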
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-ipv4: "true"
  enable-ipv6: "false"
  # A /16 pool split into /24 blocks yields only 256 per-node blocks.
  # 5,000 nodes need at least a /11 (8,192 blocks); the range below is an example, adjust it to your network.
  cluster-pool-ipv4-cidr: "10.224.0.0/11"
  cluster-pool-ipv4-mask-size: "24"
  tunnel: "disabled"
  enable-endpoint-routes: "true"
  auto-direct-node-routes: "true"
  enable-bandwidth-manager: "true"
  enable-local-redirect-policy: "true"
  kube-proxy-replacement: "strict"
  bpf-lb-algorithm: "maglev"
  bpf-lb-mode: "dsr"
4.2 CoreDNS Configuration Optimization
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        cache 60 {
            success 10000 3600
            denial 5000 60
        }
        loop
        reload
        loadbalance round_robin
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
    }
---
# Partial manifest: only the fields being changed are shown
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      containers:
      - name: coredns
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 256Mi
4.3 Service Traffic Optimization
# Switch kube-proxy to IPVS mode
kubectl edit configmap kube-proxy -n kube-system
# Set mode: "ipvs" and scheduler: "rr"
# Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system
# Verify IPVS rules
ipvsadm -Ln | head -20
Practical Case: 5,000‑Node Cluster Optimization Full Record
Case Background
API Server P99 latency grew from 50 ms to 2 s.
Pod scheduling time increased from 5 s to 30 s.
etcd database reached 6 GB with frequent leader elections.
Node resource utilization stayed below 30 % while pods remained pending.
Optimization Implementation Plan
Phase 1: Emergency Fire‑Fighting (1 week)
Control‑plane expansion
# Original: 3 masters, 1 API Server each
# Optimized: 3 masters, 2 API Server instances per master
# Only the changed flags are shown; the remaining flags (etcd endpoints, certificates, and so on)
# must be copied from the existing kube-apiserver manifest.
cat > /etc/kubernetes/manifests/kube-apiserver-2.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver-2
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.28.4
    command:
    - kube-apiserver
    - --advertise-address=10.0.1.10
    - --secure-port=6444
    - --max-requests-inflight=2000
    - --max-mutating-requests-inflight=1000
EOF
# Update HAProxy backend
server master1-2 10.0.1.10:6444 check
server master2-2 10.0.1.11:6444 check
server master3-2 10.0.1.12:6444 check
etcd emergency compaction & defragmentation
# Check size
etcdctl endpoint status --write-out=table
# Compact (use the revision reported by the first endpoint so a single number is passed)
REVISION=$(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
etcdctl compact "$REVISION"
# Defragment each node, one at a time
for endpoint in 10.0.2.11:2379 10.0.2.12:2379 10.0.2.13:2379; do
  etcdctl defrag --endpoints="$endpoint"
  sleep 60
done
Adjust Kubelet reporting frequency
# Batch modify kubelet config
ansible k8s-nodes -m lineinfile -a "path=/var/lib/kubelet/config.yaml regexp='^nodeStatusUpdateFrequency' line='nodeStatusUpdateFrequency: 20s'"
# Rolling restart
ansible k8s-nodes -m systemd -a "name=kubelet state=restarted" --limit 'batch1'
Phase 2: Systematic Optimization (1 month)
etcd cluster expansion & hardware upgrade
# Add two new members
# (etcd's reconfiguration guidance is to add members one at a time and let each
# become healthy before adding the next; the condensed sequence is shown here)
etcdctl member add etcd4 --peer-urls=https://10.0.2.14:2380
etcdctl member add etcd5 --peer-urls=https://10.0.2.15:2380
# Start etcd on new nodes with existing cluster state
# (listen and advertise URL flags, as in section 2.1, are omitted here)
etcd --name etcd4 \
  --data-dir /var/lib/etcd \
  --initial-cluster-state existing \
  --initial-cluster etcd1=https://10.0.2.11:2380,etcd2=https://10.0.2.12:2380,etcd3=https://10.0.2.13:2380,etcd4=https://10.0.2.14:2380,etcd5=https://10.0.2.15:2380
Network migration from Flannel to Cilium
helm install cilium cilium/cilium --version 1.14.5 \
  --namespace kube-system \
  --set tunnel=disabled \
  --set autoDirectNodeRoutes=true \
  --set kubeProxyReplacement=strict \
  --set bpf.masquerade=true
Scheduler descheduler deployment
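The `LowNodeUtilization` strategy used in the policy for this step sorts nodes into three groups. The sketch below models that classification under the commonly documented rules: a node is underutilized only if it is below `thresholds` for every resource, and overutilized if it is above `targetThresholds` for any resource. The usage figures are hypothetical percentages.

```python
THRESHOLDS = {"cpu": 30, "memory": 30, "pods": 30}
TARGET_THRESHOLDS = {"cpu": 60, "memory": 60, "pods": 60}


def classify(usage: dict) -> str:
    """Classify a node from its usage percentages per resource."""
    if all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS):
        return "underutilized"
    if any(usage[r] > TARGET_THRESHOLDS[r] for r in TARGET_THRESHOLDS):
        return "overutilized"
    return "appropriately utilized"


samples = {
    "node-a": {"cpu": 20, "memory": 25, "pods": 10},
    "node-b": {"cpu": 70, "memory": 40, "pods": 35},
    "node-c": {"cpu": 45, "memory": 20, "pods": 50},
}
for name, usage in samples.items():
    print(name, classify(usage))
```

Pods are evicted from overutilized nodes in the expectation that the scheduler will place them on underutilized ones; nodes in the middle group are left alone.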
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      RemoveDuplicates:
        enabled: true
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 30
              memory: 30
              pods: 30
            targetThresholds:
              cpu: 60
              memory: 60
              pods: 60
      RemovePodsViolatingNodeAffinity:
        enabled: true
      RemovePodsViolatingInterPodAntiAffinity:
        enabled: true
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.29.0
            command:
            - /bin/descheduler
            - --policy-config-file=/policy/policy.yaml
            - --v=3
            volumeMounts:
            - name: policy
              mountPath: /policy
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy
          restartPolicy: Never
Phase 3: Continuous Expansion to 5,000 Nodes (3 months)
Monitoring system construction
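Splitting roughly 5,000 nodes across five worker Prometheus instances needs a stable rule for which worker scrapes which node. One common approach is to hash the target address and take it modulo the number of shards. The sketch below illustrates that idea in the spirit of Prometheus `hashmod` relabeling, without claiming to reproduce its exact hash; the target names are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(target: str, shards: int) -> int:
    """Map a scrape target to a shard index in [0, shards)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % shards


# Hypothetical node-exporter targets for a 5,000-node cluster
targets = [f"node-{i:04d}:9100" for i in range(5000)]
per_shard = Counter(shard_for(t, 5) for t in targets)

for shard in sorted(per_shard):
    print(f"prometheus-worker-{shard + 1}: {per_shard[shard]} targets")
```

Because the assignment depends only on the target name, each worker can compute its own share independently, and adding nodes does not move existing ones between workers as long as the shard count stays the same.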
# Deploy Prometheus federation
# Master Prometheus scrapes control-plane metrics
# Worker Prometheus (5 instances) each collects metrics from ~1,000 nodes
scrape_configs:
- job_name: 'federate'
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job=~"kubernetes-.*"}'
    - '{__name__=~"node_.*"}'
  static_configs:
  - targets:
    - 'prometheus-worker-1:9090'
    - 'prometheus-worker-2:9090'
    - 'prometheus-worker-3:9090'
    - 'prometheus-worker-4:9090'
    - 'prometheus-worker-5:9090'
# Alert rules example
groups:
- name: k8s-cluster
  rules:
  - alert: APIServerHighLatency
    # histogram_quantile needs per-second bucket rates aggregated by le; long-running verbs are excluded
    expr: histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 3
    for: 5m
    annotations:
      summary: "API Server response latency too high"
  - alert: EtcdHighFsyncDuration
    expr: histogram_quantile(0.99, sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))) > 0.1
    for: 5m
    annotations:
      summary: "etcd disk fsync latency too high"
  - alert: SchedulerPendingPods
    expr: sum(scheduler_pending_pods) > 100
    for: 10m
    annotations:
      summary: "Too many pods pending scheduling"
Automated operation toolchain
# Node health-check script (example)
cat > /usr/local/bin/node-health-check.sh <<'EOF'
#!/bin/bash
# Note: field 2 of top's "Cpu(s)" line is user-space CPU only, not total usage
CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1)
MEM_USAGE=$(free | grep Mem | awk '{print ($3/$2) * 100.0}')
# -P keeps each filesystem on one line so the column position is stable
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | cut -d'%' -f1)
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then echo "HIGH_CPU"; exit 1; fi
if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then echo "HIGH_MEMORY"; exit 1; fi
if [ "$DISK_USAGE" -gt 85 ]; then echo "HIGH_DISK"; exit 1; fi
systemctl is-active --quiet kubelet || { echo "KUBELET_DOWN"; exit 1; }
echo "HEALTHY"
EOF
chmod +x /usr/local/bin/node-health-check.sh
# Deploy Node Problem Detector
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml
Key Experience Summary
Phase‑wise implementation: emergency fire‑fighting → systematic optimization → continuous expansion reduces risk.
Monitoring first: a complete monitoring stack drives data‑driven decisions.
Canary verification: test major changes in a staging cluster, then roll them out gradually.
Automation is essential: at 5,000 nodes manual operations become infeasible.
Capacity planning: reserve ~30 % resource headroom six months ahead.
Conclusion and Outlook
Through three phases of systematic optimization, the cluster scaled to 5,000 nodes, with improvements in API Server latency, pod scheduling time, etcd write latency, node utilization, and network latency. Future directions include virtual cluster technologies, edge-computing architectures, AI-driven scheduling, and exploring etcd alternatives such as Kine to break storage bottlenecks.
For operations teams, continuous learning, deep understanding of Kubernetes internals, and disciplined practice are the keys to mastering large‑scale cluster management.
MaGe Linux Operations