
Scaling Kubernetes from 1,000 to 5,000 Nodes: Real‑World Performance Tuning Guide

This article details a step‑by‑step, production‑grade guide for expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control‑plane HA, etcd tuning, network and scheduler optimizations, monitoring, and real‑world case studies to achieve stable, high‑performance large‑scale deployments.

MaGe Linux Operations


Introduction

In the cloud‑native era, Kubernetes has become the de‑facto standard for container orchestration. As clusters grow from hundreds to thousands of nodes, operations teams run into performance bottlenecks, wasted resources, and scheduling delays. This article shares practical experience from expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control‑plane optimization, etcd tuning, network performance, and scheduler improvements.

Technical Background: Core Challenges of Large‑Scale Kubernetes Clusters

Kubernetes Architecture Scalability Bottlenecks

Kubernetes was designed for distributed operation, but in very large clusters the control plane becomes the bottleneck. Beyond 1,000 nodes, the API Server must handle a high volume of requests from kubelets, controllers, schedulers, and users, while etcd’s read/write latency directly affects overall responsiveness.

Typical Performance Issues in Large Clusters

API response slowdown: kubectl latency grows from milliseconds to seconds.

Increased scheduling delay: new pod scheduling time expands from seconds to minutes.

etcd storage pressure: many watch requests cause CPU and memory usage to climb.

Network bandwidth bottleneck: traffic from services, service mesh, and log collection leads to congestion.

Severe resource fragmentation: overall cluster resources appear sufficient, yet individual nodes cannot schedule pods.

Key Dimensions for Optimization

To achieve a smooth expansion from 1,000 to 5,000 nodes, systematic optimization is required in the following areas:

Control‑plane high availability and performance tuning.

etcd cluster optimization (capacity planning, performance tuning, backup & recovery).

Network architecture optimization (CNI selection, service‑mesh lightening, traffic control).

Scheduler strategy refinement (custom scheduler, resource reservation, pod priority & preemption).

Monitoring and observability for large‑scale clusters.

Core Content: Practical Kubernetes Cluster Performance Tuning

1. Control‑Plane Optimization: Breaking API Server Bottlenecks

1.1 API Server Horizontal Scaling and Load Balancing

In a large cluster a single API Server instance cannot handle all requests. Deploy multiple API Server instances behind a load balancer to distribute load.

3 master nodes, each running 2 API Server static Pods on separate ports (6443 and 6444 in the case study).

Use HAProxy or Nginx as a layer‑4 load balancer.

Configure health checks and automatic failover.

HAProxy Configuration Example

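# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 50000
    nbthread 4                   # nbproc was removed in HAProxy 2.5; use threads instead
    cpu-map auto:1/1-4 0-3       # pin threads 1-4 of process 1 to CPUs 0-3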

defaults
    mode tcp
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend kube-apiserver
    bind *:6443
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    balance roundrobin
    option tcp-check
    server master1 10.0.1.10:6443 check inter 2000 rise 2 fall 3
    server master2 10.0.1.11:6443 check inter 2000 rise 2 fall 3
    server master3 10.0.1.12:6443 check inter 2000 rise 2 fall 3
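
The `check inter 2000 rise 2 fall 3` settings determine how quickly a failed API Server leaves the rotation and how quickly a recovered one rejoins. A rough sketch of that arithmetic, assuming checks fire at a fixed interval and ignoring check timeouts (the function names are illustrative, not part of HAProxy):

```python
def time_to_mark_down_ms(inter_ms: int, fall: int) -> int:
    """Approximate time until a server is marked DOWN: `fall` consecutive failed checks."""
    return inter_ms * fall


def time_to_mark_up_ms(inter_ms: int, rise: int) -> int:
    """Approximate time until a server is marked UP again: `rise` consecutive passed checks."""
    return inter_ms * rise


# Values from the backend lines: inter 2000, rise 2, fall 3
print("down after ~%d ms" % time_to_mark_down_ms(2000, 3))
print("up after ~%d ms" % time_to_mark_up_ms(2000, 2))
```

With these values a dead backend keeps receiving new connections for roughly six seconds, so lowering `inter` or `fall` trades faster failover against more false positives.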

1.2 API Server Critical Parameter Tuning

# Modify /etc/kubernetes/manifests/kube-apiserver.yaml
spec:
  containers:
  - command:
    - kube-apiserver
    - --max-requests-inflight=2000      # increase concurrent read requests (default 400)
    - --max-mutating-requests-inflight=1000  # increase write request concurrency (default 200)
    - --watch-cache-sizes=nodes#1000,pods#5000,replicasets.apps#1000  # format: resource[.group]#size
    - --default-watch-cache-size=500   # increase watch cache (default 100)
    - --enable-aggregator-routing=true
    # --target-ram-mb=8192 applies to older releases only; the flag is absent from recent kube-apiserver versions, so check your version before adding it
    - --event-ttl=1h
    - --enable-priority-and-fairness=true
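
With API Priority and Fairness enabled, the two in‑flight limits are summed into one concurrency budget per API Server instance. A minimal sketch of the resulting budget, assuming six instances (three masters with two instances each); the helper name is hypothetical:

```python
def total_inflight(read_only: int, mutating: int, instances: int = 1) -> int:
    """Concurrency budget: (read-only limit + mutating limit) per instance, times instance count."""
    return (read_only + mutating) * instances


defaults = total_inflight(400, 200)              # stock flag values, one instance
tuned = total_inflight(2000, 1000)               # tuned values, one instance
tuned_cluster = total_inflight(2000, 1000, 6)    # assumed 3 masters x 2 instances
print(defaults, tuned, tuned_cluster)
```

Raising the limits only helps if each instance has the CPU and memory to serve that many concurrent requests, so the numbers are best increased in steps while watching latency.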

1.3 Tuning Kubelet‑API Server Communication Frequency

# Adjust on each node: /var/lib/kubelet/config.yaml
nodeStatusUpdateFrequency: 20s   # extend from 10s to 20s; keep it well below kube-controller-manager's --node-monitor-grace-period
nodeStatusReportFrequency: 5m
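
To see why these intervals matter at 5,000 nodes, the sketch below estimates the node‑status write rate on the API Server. It assumes the worst case for `nodeStatusUpdateFrequency` (every computed status differs and is posted) and the best case for `nodeStatusReportFrequency` (nothing changes, so status is posted only at the report interval). The function name is illustrative.

```python
def status_writes_per_second(nodes: int, interval_seconds: float) -> float:
    """Average node-status writes per second if every node posts once per interval."""
    return nodes / interval_seconds


NODES = 5000
print(status_writes_per_second(NODES, 10))    # worst case at the 10s default
print(status_writes_per_second(NODES, 20))    # worst case at the tuned 20s
print(status_writes_per_second(NODES, 300))   # steady state at the 5m report interval
```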

2. etcd Deep Optimization

2.1 etcd Hardware Configuration and Deployment Architecture

etcd is the core data store of Kubernetes; its performance directly determines cluster stability. Recommended hardware for a 5,000‑node cluster:

CPU: 16 cores or more

Memory: 32 GB or more

Storage: NVMe SSD with IOPS > 10,000

Network: 10 GbE dedicated management network

Deployment Architecture

# Deploy a 5‑node etcd cluster (odd number recommended)
# etcd1: 10.0.2.11
# etcd2: 10.0.2.12
# etcd3: 10.0.2.13
# etcd4: 10.0.2.14
# etcd5: 10.0.2.15

etcd --name etcd1 \
  --data-dir /var/lib/etcd \
  --listen-peer-urls https://10.0.2.11:2380 \
  --listen-client-urls https://10.0.2.11:2379,https://127.0.0.1:2379 \
  --initial-advertise-peer-urls https://10.0.2.11:2380 \
  --advertise-client-urls https://10.0.2.11:2379 \
  --initial-cluster-token etcd-cluster-prod \
  --initial-cluster etcd1=https://10.0.2.11:2380,etcd2=https://10.0.2.12:2380,etcd3=https://10.0.2.13:2380,etcd4=https://10.0.2.14:2380,etcd5=https://10.0.2.15:2380 \
  --initial-cluster-state new \
  --heartbeat-interval 200 \
  --election-timeout 2000 \
  --snapshot-count 20000 \
  --max-snapshots 5 \
  --max-wals 5 \
  --quota-backend-bytes 8589934592   # 8 GB quota
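
The preference for an odd member count follows from Raft quorum arithmetic: a write needs a majority, so an even‑sized cluster tolerates no more failures than the odd size just below it. A small sketch (helper names are illustrative):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while the cluster still has quorum."""
    return members - quorum(members)


for size in (3, 4, 5):
    print(size, quorum(size), fault_tolerance(size))

# Ratio of election timeout to heartbeat interval for the flags used here
print(2000 / 200)
```

A 5‑member cluster survives two failures, at the cost of every write being replicated to more peers.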

2.2 etcd Performance Tuning Parameters

# Optimize etcd configuration (/etc/etcd/etcd.conf)
ETCD_MAX_REQUEST_BYTES=10485760          # 10 MB request size limit
ETCD_GRPC_KEEPALIVE_MIN_TIME=5s
ETCD_GRPC_KEEPALIVE_INTERVAL=2h
ETCD_GRPC_KEEPALIVE_TIMEOUT=20s

# Snapshot and compaction
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h

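# Manual compaction when DB grows large (discards all history older than the current revision)
rev=$(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
etcdctl compact "$rev"

# Defragmentation (blocks each member while it runs; prefer off-peak hours)
etcdctl defrag --cluster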
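
Compaction frees space logically, while defragmentation returns it to the filesystem. One way to decide whether a defrag is worthwhile is to compare `dbSize` with `dbSizeInUse` from `etcdctl endpoint status --write-out=json`. The sketch below uses a made‑up sample payload, and the 30 % threshold is an assumption rather than an etcd recommendation.

```python
import json

# Illustrative sample only: 6 GiB on disk, 2 GiB actually in use
SAMPLE = """[
  {"Endpoint": "https://10.0.2.11:2379",
   "Status": {"header": {"revision": 1200},
              "dbSize": 6442450944,
              "dbSizeInUse": 2147483648}}
]"""


def reclaimable_fraction(entry: dict) -> float:
    """Share of the on-disk database that defragmentation could release."""
    size = entry["Status"]["dbSize"]
    in_use = entry["Status"]["dbSizeInUse"]
    return (size - in_use) / size if size else 0.0


def needs_defrag(entry: dict, threshold: float = 0.3) -> bool:
    """True when the reclaimable share reaches the (assumed) threshold."""
    return reclaimable_fraction(entry) >= threshold


for entry in json.loads(SAMPLE):
    print(entry["Endpoint"], round(reclaimable_fraction(entry), 2), needs_defrag(entry))
```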

2.3 etcd Monitoring and Alerting

# Key metrics collection
curl https://10.0.2.11:2379/metrics | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_server_proposals|etcd_network_peer_round_trip"

# Healthy targets (alert when a value leaves the range)
# - etcd_disk_wal_fsync_duration_seconds p99 < 10 ms
# - etcd_disk_backend_commit_duration_seconds p99 < 25 ms
# - etcd_server_has_leader == 1
# - etcd_mvcc_db_total_size_in_bytes < 8 GB (the configured backend quota)

3. Scheduler Optimization: Improving Pod Scheduling Efficiency

3.1 Scheduler Performance Parameter Tuning

# Modify /etc/kubernetes/manifests/kube-scheduler.yaml
spec:
  containers:
  - command:
    - kube-scheduler
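    - --kube-api-qps=200          # increase API request QPS (default 50)
    - --kube-api-burst=300        # increase burst (default 100)
    - --bind-address=0.0.0.0
    - --leader-elect=true
    # No feature gate is needed for topology spread constraints (GA since v1.19);
    # PodTopologySpread is not a feature gate name, and an unrecognized gate prevents the scheduler from starting.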
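
The client‑side QPS limit puts a floor under how fast the scheduler can bind pods. A back‑of‑the‑envelope sketch, assuming at least one API call per scheduled pod and ignoring the burst allowance (the function name is illustrative):

```python
def min_bind_seconds(pods: int, qps: int) -> float:
    """Lower bound on the time to bind `pods` pods when limited to `qps` API calls per second."""
    return pods / qps


print(min_bind_seconds(10000, 50))    # default --kube-api-qps
print(min_bind_seconds(10000, 200))   # tuned value
```

Real scheduling takes longer because filtering and scoring also cost time, but the estimate shows why the default limit becomes visible during large rollouts.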

3.2 Configuring Pod Anti‑Affinity and Topology Spread Constraints

apiVersion: apps/v1
kind: Deployment
metadata:
  name: business-app
spec:
  replicas: 100
  selector:                      # required by apps/v1; must match the template labels
    matchLabels:
      app: business-app
  template:
    metadata:
      labels:
        app: business-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - business-app
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 3
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: business-app
      # containers: omitted for brevity
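
The `maxSkew` check can be illustrated with a few lines of arithmetic: for a candidate zone, the skew is the matching pod count in that zone after placement minus the smallest count in any zone. The sketch below is a simplified model with hypothetical zone names and counts, not the scheduler's code.

```python
from typing import Dict


def violates_max_skew(counts: Dict[str, int], zone: str, max_skew: int) -> bool:
    """True if placing one more matching pod in `zone` would exceed `max_skew`."""
    skew = counts[zone] + 1 - min(counts.values())
    return skew > max_skew


# Hypothetical current distribution of business-app pods per zone
counts = {"zone-a": 36, "zone-b": 33, "zone-c": 31}

for zone in counts:
    print(zone, violates_max_skew(counts, zone, max_skew=3))
```

With `whenUnsatisfiable: ScheduleAnyway` a violation only lowers the zone's score; with `DoNotSchedule` it would rule the zone out.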

3.3 Configuring Resource Reservations and Limits

# Update kubelet config (/var/lib/kubelet/config.yaml)
systemReserved:
  cpu: 1000m
  memory: 2Gi
  ephemeral-storage: 10Gi
kubeReserved:
  cpu: 1000m
  memory: 2Gi
  ephemeral-storage: 10Gi
evictionHard:
  memory.available: "1Gi"
  nodefs.available: "10%"
  imagefs.available: "10%"
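
These reservations reduce what the scheduler sees as allocatable on each node: allocatable = capacity − systemReserved − kubeReserved − hard eviction threshold. A sketch for a hypothetical node with 16 cores and 64 GiB of memory (the node size is an assumption for illustration):

```python
def allocatable(capacity: int, system_reserved: int, kube_reserved: int, eviction_hard: int = 0) -> int:
    """Node allocatable = capacity minus both reservations and the hard eviction threshold."""
    return capacity - system_reserved - kube_reserved - eviction_hard


# CPU in millicores (no eviction threshold applies to CPU)
cpu_m = allocatable(16000, 1000, 1000)
# Memory in MiB: 64 GiB capacity, 2 GiB + 2 GiB reserved, 1 GiB eviction threshold
mem_mib = allocatable(64 * 1024, 2 * 1024, 2 * 1024, 1024)
print(cpu_m, mem_mib)
```

On small nodes the same fixed reservations consume a much larger share, so they are usually scaled with node size.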

4. Network Performance Optimization

4.1 CNI Plugin Selection and Optimization

For large clusters, Cilium (eBPF‑based) or Calico (IPIP/VXLAN) are recommended.

Cilium Optimization Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-ipv4: "true"
  enable-ipv6: "false"
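  cluster-pool-ipv4-cidr: "10.128.0.0/11"   # a /16 pool holds only 256 /24 node blocks; 5,000 nodes need a /11 or larger (example range, adjust to your network)
  cluster-pool-ipv4-mask-size: "24"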
  tunnel: "disabled"
  enable-endpoint-routes: "true"
  auto-direct-node-routes: "true"
  enable-bandwidth-manager: "true"
  enable-local-redirect-policy: "true"
  kube-proxy-replacement: "strict"
  bpf-lb-algorithm: "maglev"
  bpf-lb-mode: "dsr"
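
With cluster‑pool IPAM, each node is handed a block of size `cluster-pool-ipv4-mask-size` out of the pool, so the pool prefix caps how many nodes can receive a pod CIDR. The sketch below checks that capacity with Python's `ipaddress` module; the helper names are illustrative.

```python
import ipaddress
import math


def node_capacity(pool_cidr: str, node_mask: int) -> int:
    """Number of per-node blocks of size /node_mask that fit in the pool."""
    pool = ipaddress.ip_network(pool_cidr)
    return 2 ** (node_mask - pool.prefixlen)


def longest_pool_prefix(nodes: int, node_mask: int) -> int:
    """Longest pool prefix (smallest pool) that still provides `nodes` per-node blocks."""
    return node_mask - math.ceil(math.log2(nodes))


print(node_capacity("10.244.0.0/16", 24))   # blocks available in a /16 pool
print(longest_pool_prefix(5000, 24))        # prefix needed for 5,000 nodes
print(node_capacity("10.128.0.0/11", 24))   # blocks available in a /11 pool
```

A /16 pool with /24 node blocks runs out at 256 nodes, which is why pod CIDR sizing has to be settled before scaling rather than during it.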

4.2 CoreDNS Configuration Optimization

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        cache 60 {
            success 10000 3600
            denial 5000 60
        }
        loop
        reload
        loadbalance round_robin
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      containers:
      - name: coredns
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 256Mi

4.3 Service Traffic Optimization

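# Switch kube-proxy to IPVS mode (only relevant when kube-proxy is kept; with Cilium's kube-proxy replacement, eBPF handles Service traffic instead)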
kubectl edit configmap kube-proxy -n kube-system
# Set mode: "ipvs" and scheduler: "rr"
# Restart kube-proxy
kubectl rollout restart daemonset kube-proxy -n kube-system
# Verify IPVS rules
ipvsadm -Ln | head -20

Practical Case: 5,000‑Node Cluster Optimization Full Record

Case Background

API Server P99 latency grew from 50 ms to 2 s.

Pod scheduling time increased from 5 s to 30 s.

etcd database reached 6 GB with frequent leader elections.

Node resource utilization stayed below 30 % while pods remained pending.

Optimization Implementation Plan

Phase 1: Emergency Fire‑Fighting (1 week)

Control‑plane expansion

# Original: 3 masters, 1 API Server each
# Optimized: 3 masters, 2 API Server instances per master
cat > /etc/kubernetes/manifests/kube-apiserver-2.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver-2
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.28.4
    command:
    - kube-apiserver
    - --advertise-address=10.0.1.10
    - --secure-port=6444
    - --max-requests-inflight=2000
    - --max-mutating-requests-inflight=1000
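    # Remaining flags (etcd endpoints, certificates, service CIDR, ...) must match the primary kube-apiserver manifest
EOF
# Update the HAProxy backend: add these lines under "backend kube-apiserver-backend" in /etc/haproxy/haproxy.cfg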
server master1-2 10.0.1.10:6444 check
server master2-2 10.0.1.11:6444 check
server master3-2 10.0.1.12:6444 check

etcd emergency compression & defragmentation

# Check size
etcdctl endpoint status --write-out=table
# Compact
REVISION=$(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
etcdctl compact "$REVISION"
# Defragment each node
for endpoint in 10.0.2.11:2379 10.0.2.12:2379 10.0.2.13:2379; do
  etcdctl defrag --endpoints=$endpoint
  sleep 60
done

Adjust Kubelet reporting frequency

# Batch modify kubelet config
ansible k8s-nodes -m lineinfile -a "path=/var/lib/kubelet/config.yaml regexp='^nodeStatusUpdateFrequency' line='nodeStatusUpdateFrequency: 20s'"
# Rolling restart
ansible k8s-nodes -m systemd -a "name=kubelet state=restarted" --limit 'batch1'
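
Restarting kubelets in batches keeps the re‑registration load on the API Server bounded. A minimal sketch of splitting a host list into fixed‑size batches, which could then be turned into inventory groups such as `batch1`; the host names and batch size are hypothetical.

```python
from typing import Iterator, List


def batches(hosts: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `hosts`, each with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(hosts), size):
        yield hosts[start:start + size]


nodes = [f"node-{i:04d}" for i in range(1, 1001)]   # hypothetical host names
groups = list(batches(nodes, 50))
print(len(groups), groups[0][:3])
```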

Phase 2: Systematic Optimization (1 month)

etcd cluster expansion & hardware upgrade

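# Add new members one at a time: register, start, and confirm health before adding the next
etcdctl member add etcd4 --peer-urls=https://10.0.2.14:2380
# Start etcd4 against the existing cluster (--initial-cluster lists only current members plus etcd4;
# listen/advertise URLs and TLS flags, as on the other members, are omitted here)
etcd --name etcd4 \
  --data-dir /var/lib/etcd \
  --initial-cluster-state existing \
  --initial-cluster etcd1=https://10.0.2.11:2380,etcd2=https://10.0.2.12:2380,etcd3=https://10.0.2.13:2380,etcd4=https://10.0.2.14:2380
# Once etcd4 is healthy, repeat the same steps for etcd5
etcdctl endpoint health --cluster
etcdctl member add etcd5 --peer-urls=https://10.0.2.15:2380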

Network migration from Flannel to Cilium

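# Native routing (tunnel disabled) with masquerading also needs ipv4NativeRoutingCIDR set to your pod network range
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --version 1.14.5 \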
  --namespace kube-system \
  --set tunnel=disabled \
  --set autoDirectNodeRoutes=true \
  --set kubeProxyReplacement=strict \
  --set bpf.masquerade=true

Scheduler descheduler deployment

apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy
  namespace: kube-system
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      RemoveDuplicates:
        enabled: true
      LowNodeUtilization:
        enabled: true
        params:
          nodeResourceUtilizationThresholds:
            thresholds:
              cpu: 30
              memory: 30
              pods: 30
            targetThresholds:
              cpu: 60
              memory: 60
              pods: 60
      RemovePodsViolatingNodeAffinity:
        enabled: true
        params:
          nodeAffinityType:
          - requiredDuringSchedulingIgnoredDuringExecution
      RemovePodsViolatingInterPodAntiAffinity:
        enabled: true
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: descheduler
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: descheduler
            image: registry.k8s.io/descheduler/descheduler:v0.29.0
            command:
            - /bin/descheduler
            - --policy-config-file=/policy/policy.yaml
            - --v=3
            volumeMounts:
            - name: policy
              mountPath: /policy
          volumes:
          - name: policy
            configMap:
              name: descheduler-policy
          restartPolicy: Never
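
The `LowNodeUtilization` thresholds split nodes into three groups: a node is underutilized only if all resources are below `thresholds`, and overutilized if any resource is above `targetThresholds`; pods are then evicted from overutilized nodes so they can land on underutilized ones. A simplified sketch of that classification, with usage expressed as a percentage of allocatable (the sample nodes are made up):

```python
from typing import Dict

THRESHOLDS = {"cpu": 30, "memory": 30, "pods": 30}
TARGET_THRESHOLDS = {"cpu": 60, "memory": 60, "pods": 60}


def classify(usage: Dict[str, float]) -> str:
    """Classify a node by its requested-resource percentages."""
    if all(usage[r] < THRESHOLDS[r] for r in THRESHOLDS):
        return "underutilized"
    if any(usage[r] > TARGET_THRESHOLDS[r] for r in TARGET_THRESHOLDS):
        return "overutilized"
    return "appropriately utilized"


print(classify({"cpu": 12, "memory": 20, "pods": 8}))
print(classify({"cpu": 75, "memory": 40, "pods": 35}))
print(classify({"cpu": 45, "memory": 25, "pods": 20}))
```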

Phase 3: Continuous Expansion to 5,000 Nodes (3 months)

Monitoring system construction

# Deploy Prometheus federation
# Master Prometheus scrapes control‑plane metrics
# Worker Prometheus (5 instances) each collects ~1,000 node metrics
scrape_configs:
- job_name: 'federate'
  scrape_interval: 30s
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job=~"kubernetes-.*"}'
    - '{__name__=~"node_.*"}'
  static_configs:
  - targets:
    - 'prometheus-worker-1:9090'
    - 'prometheus-worker-2:9090'
    - 'prometheus-worker-3:9090'
    - 'prometheus-worker-4:9090'
    - 'prometheus-worker-5:9090'
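
Splitting roughly 5,000 nodes across five worker Prometheus instances needs a stable assignment so each target is scraped by exactly one worker. The sketch below shows the hash‑modulo idea behind such sharding; it is not Prometheus's exact `hashmod` implementation, and the target names are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(target: str, shards: int) -> int:
    """Map a target name to a shard index in [0, shards) using a stable hash."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % shards


targets = [f"node-{i:04d}:9100" for i in range(5000)]   # hypothetical node exporter targets
distribution = Counter(shard_for(t, 5) for t in targets)
print(sorted(distribution.items()))
```

Because the assignment depends only on the target name, adding or restarting a worker does not reshuffle targets as long as the shard count stays the same.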

# Alert rules example
groups:
- name: k8s-cluster
  rules:
  - alert: APIServerHighLatency
    expr: histogram_quantile(0.99, sum by (le) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))) > 3
    for: 5m
    annotations:
      summary: "API Server response latency too high"
  - alert: EtcdHighFsyncDuration
    expr: histogram_quantile(0.99, sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))) > 0.1
    for: 5m
    annotations:
      summary: "etcd disk fsync latency too high"
  - alert: SchedulerPendingPods
    expr: sum(scheduler_pending_pods) > 100
    for: 10m
    annotations:
      summary: "Too many pods pending scheduling"
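
The latency alerts rely on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. A simplified sketch of that estimate, using made‑up bucket counts (in an alert expression the counts would be per‑second rates):

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs sorted by bound, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound          # rank falls in the +Inf bucket: report the highest finite bound
            if count == prev_count:
                return bound               # empty bucket edge case, avoids division by zero
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical request-duration buckets in seconds
sample = [(0.1, 900), (0.5, 980), (1.0, 995), (math.inf, 1000)]
print(histogram_quantile(0.99, sample))
```

The estimate can never be more precise than the bucket boundaries, which is worth remembering when an alert threshold sits between two widely spaced buckets.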

Automated operation toolchain

# Node health‑check script (example)
cat > /usr/local/bin/node-health-check.sh <<'EOF'
#!/bin/bash
# CPU usage = 100 - idle, sampled over one second (vmstat column 15 is "id")
CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
MEM_USAGE=$(free | awk '/^Mem:/ {print ($3/$2) * 100.0}')
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')
if (( $(echo "$CPU_USAGE > 90" | bc -l) )); then echo "HIGH_CPU"; exit 1; fi
if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then echo "HIGH_MEMORY"; exit 1; fi
if [ "$DISK_USAGE" -gt 85 ]; then echo "HIGH_DISK"; exit 1; fi
systemctl is-active --quiet kubelet || { echo "KUBELET_DOWN"; exit 1; }
echo "HEALTHY"
EOF
chmod +x /usr/local/bin/node-health-check.sh
# Deploy Node Problem Detector
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

Key Experience Summary

Phase‑wise implementation: emergency fire‑fighting → systematic optimization → continuous expansion reduces risk.

Monitoring first: a complete monitoring stack drives data‑driven decisions.

Canary verification: test major changes in a staging cluster before rolling them out.

Automation is essential: at 5,000 nodes manual operations become infeasible.

Capacity planning: reserve ~30 % resource headroom six months ahead.
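
The headroom rule translates into a simple sizing formula. The sketch below reads "30 % headroom" as the share of total capacity that should stay unused, which is one possible interpretation; the demand figure is hypothetical.

```python
def nodes_required(demand_nodes: int, headroom_percent: int) -> int:
    """Nodes to provision so that `headroom_percent` of total capacity stays free (rounded up)."""
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    usable = 100 - headroom_percent
    return -(-demand_nodes * 100 // usable)   # integer ceiling division


# Hypothetical forecast: workloads that would fill 3,500 nodes
print(nodes_required(3500, 30))
```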

Conclusion and Outlook

Through three phases of systematic optimization, the cluster successfully scaled to 5,000 nodes with significant improvements in API Server latency, pod scheduling time, etcd write latency, node utilization, and network latency. Future directions include virtual cluster technologies, edge‑computing architectures, AI‑driven intelligent scheduling, and exploring etcd alternatives such as Kine to break storage bottlenecks.

For operations teams, continuous learning, deep understanding of Kubernetes internals, and disciplined practice are the keys to mastering large‑scale cluster management.
