
Scaling Kubernetes from 1k to 5k Nodes: Complete Performance Tuning Playbook

This article presents a comprehensive, real‑world guide for expanding a Kubernetes cluster from 1,000 to 5,000 nodes, covering control‑plane HA, etcd optimization, network and scheduler tuning, monitoring, and automation, with detailed configurations, code snippets, and a step‑by‑step case study of a large‑scale production environment.

Raymond Ops

Introduction

Kubernetes is the de facto standard for container orchestration, but once a cluster grows beyond a thousand nodes, performance bottlenecks emerge in the control plane, etcd, the network, and the scheduler. This guide presents a systematic, production‑grade methodology for scaling a cluster from 1,000 to 5,000 nodes.

Core Challenges in Large‑Scale Clusters

Control‑plane components (API Server, Scheduler, Controller Manager) become saturated.

etcd read/write latency grows with request volume.

Network bandwidth and Service‑mesh traffic cause congestion.

Resource fragmentation leaves many nodes under‑utilized while pods remain pending.

Optimization Dimensions

Control‑plane high availability and parameter tuning.

etcd cluster sizing, hardware upgrades, and configuration tweaks.

Network architecture selection and CNI tuning.

Scheduler policy refinement and resource reservation.

Observability, alerting and automated remediation.

Control‑Plane Optimization

API Server Horizontal Scaling and Load Balancing

Run multiple API Server instances across the masters and place a layer‑4 load balancer (HAProxy or Nginx) in front of them to distribute client traffic.

# /etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    maxconn 50000
    nbthread 4               # use threads; nbproc was deprecated and removed in HAProxy 2.5+
    cpu-map auto:1/1-4 0-3

defaults
    mode tcp
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend kube-apiserver
    bind *:6443
    default_backend kube-apiserver-backend

backend kube-apiserver-backend
    balance roundrobin
    option tcp-check
    server master1 10.0.1.10:6443 check inter 2000 rise 2 fall 3
    server master2 10.0.1.11:6443 check inter 2000 rise 2 fall 3
    server master3 10.0.1.12:6443 check inter 2000 rise 2 fall 3

API Server Parameter Tuning

# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-apiserver
    - --max-requests-inflight=2000
    - --max-mutating-requests-inflight=1000
    - --watch-cache-sizes=nodes#1000,pods#5000,replicasets#1000
    - --default-watch-cache-size=500
    - --enable-aggregator-routing=true
    - --target-ram-mb=8192
    - --event-ttl=1h
    - --enable-priority-and-fairness=true

Kubelet Reporting Frequency

# /var/lib/kubelet/config.yaml
nodeStatusUpdateFrequency: 20s   # default 10s
nodeStatusReportFrequency: 5m
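
Back-of-the-envelope arithmetic shows why relaxing this interval matters at scale. The helper below is purely illustrative (function name and figures are my own, not from any Kubernetes API):

```python
def status_update_qps(nodes: int, update_interval_s: float) -> float:
    """Aggregate node-status write rate hitting the API server."""
    return nodes / update_interval_s

# 5,000 nodes at the 10s default vs. the relaxed 20s interval
default_qps = status_update_qps(5000, 10)   # 500 status writes/s
tuned_qps = status_update_qps(5000, 20)     # 250 status writes/s
print(f"default: {default_qps:.0f}/s, tuned: {tuned_qps:.0f}/s "
      f"({1 - tuned_qps / default_qps:.0%} fewer status writes)")
```

Doubling the interval halves the node-status traffic alone; actual API Server load reduction depends on what share of total requests these writes represent.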

etcd Deep Optimization

Hardware and Deployment Architecture

CPU ≥ 16 cores

Memory ≥ 32 GB

NVMe SSD with > 10,000 IOPS

10 GbE dedicated management network
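
Before placing etcd on a disk, it is worth verifying that fsync latency actually meets the target. etcd's docs recommend an fio-based test; the sketch below (function name and defaults are my own) approximates it by timing `write + fsync` of a WAL-sized block on the candidate volume:

```python
import os
import statistics
import tempfile
import time

def fsync_p99_ms(path: str = ".", samples: int = 200,
                 block: bytes = b"x" * 2300) -> float:
    """Time `write + fsync` of a WAL-sized block; return the p99 in milliseconds."""
    latencies = []
    fd, name = tempfile.mkstemp(dir=path)
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, block)
            os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.remove(name)
    return statistics.quantiles(latencies, n=100)[98]  # 99th percentile

print(f"p99 fsync latency: {fsync_p99_ms():.2f} ms (etcd target: < 10 ms)")
```

Run it with `path` pointing at the etcd data volume; a p99 near or above 10 ms is a red flag regardless of the drive's nominal IOPS rating.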

etcd Configuration Parameters

# /etc/etcd/etcd.conf
ETCD_MAX_REQUEST_BYTES=10485760   # 10 MB
ETCD_GRPC_KEEPALIVE_MIN_TIME=5s
ETCD_GRPC_KEEPALIVE_INTERVAL=2h
ETCD_GRPC_KEEPALIVE_TIMEOUT=20s
ETCD_AUTO_COMPACTION_MODE=periodic
ETCD_AUTO_COMPACTION_RETENTION=1h

Monitoring and Alerts

# Example curl to fetch metrics (adjust the cert paths to your deployment; etcd usually requires client TLS)
curl -s --cacert /etc/etcd/ca.crt --cert /etc/etcd/client.crt --key /etc/etcd/client.key \
  https://10.0.2.11:2379/metrics | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_server_proposals|etcd_network_peer_round_trip"
# Important thresholds:
# - etcd_disk_wal_fsync_duration_seconds < 10 ms
# - etcd_disk_backend_commit_duration_seconds < 25 ms
# - etcd_server_has_leader == 1 (0 means the member has no leader)
# - etcd_mvcc_db_total_size_in_bytes < 8 GB
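
These thresholds can also be checked programmatically. A minimal sketch follows; the metric names are real etcd metrics, but the parsing is deliberately simplistic (no labels or histogram buckets) and the function names are illustrative:

```python
def parse_metrics(text: str) -> dict:
    """Parse plain `name value` lines from Prometheus text format."""
    values = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, _, value = line.rpartition(" ")
        try:
            values[name] = float(value)
        except ValueError:
            pass  # skip labeled/malformed lines in this simplified parser
    return values

def check_etcd(values: dict) -> list:
    """Return alert strings for the thresholds listed above."""
    alerts = []
    if values.get("etcd_server_has_leader", 1) != 1:
        alerts.append("no leader")
    db = values.get("etcd_mvcc_db_total_size_in_bytes", 0)
    if db > 8 * 1024**3:
        alerts.append(f"db size {db / 1024**3:.1f} GiB exceeds 8 GiB")
    return alerts

sample = """etcd_server_has_leader 1
etcd_mvcc_db_total_size_in_bytes 9663676416"""
print(check_etcd(parse_metrics(sample)))  # ['db size 9.0 GiB exceeds 8 GiB']
```

In practice the same checks belong in Prometheus alerting rules; a script like this is useful for one-off spot checks.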

Scheduler Optimization

Scheduler Parameters

# /etc/kubernetes/manifests/kube-scheduler.yaml (excerpt)
spec:
  containers:
  - command:
    - kube-scheduler
    - --kube-api-qps=200
    - --kube-api-burst=300
    - --bind-address=0.0.0.0
    - --leader-elect=true
    # PodTopologySpread has been GA since v1.19; its feature gate is no longer needed (and is rejected on newer releases)

Node Affinity & Anti‑Affinity Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: business-app
spec:
  replicas: 100
  selector:
    matchLabels:
      app: business-app
  template:
    metadata:
      labels:
        app: business-app
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - business-app
              topologyKey: kubernetes.io/hostname
      topologySpreadConstraints:
      - maxSkew: 3
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: business-app
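
Here maxSkew: 3 means the pod-count difference between the most and least loaded zones should not exceed 3; with whenUnsatisfiable: ScheduleAnyway this is only a soft preference, not a hard constraint. A quick illustration of how skew is computed (the placement data is invented):

```python
from collections import Counter

def zone_skew(pod_zones: list) -> int:
    """Skew as the scheduler evaluates it: max minus min pod count across zones."""
    counts = Counter(pod_zones)
    return max(counts.values()) - min(counts.values())

# Hypothetical placement of 100 replicas across three zones
placement = ["zone-a"] * 36 + ["zone-b"] * 33 + ["zone-c"] * 31
print(zone_skew(placement))  # 5 -> exceeds maxSkew: 3, but ScheduleAnyway only deprioritizes, never blocks
```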

Network Performance Optimization

CNI Plugin Choice (Cilium)

# cilium-config ConfigMap (excerpt)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  enable-ipv4: "true"
  enable-ipv6: "false"
  cluster-pool-ipv4-cidr: "10.244.0.0/16"
  cluster-pool-ipv4-mask-size: "24"
  tunnel: "disabled"
  enable-endpoint-routes: "true"
  auto-direct-node-routes: "true"
  enable-bandwidth-manager: "true"
  enable-local-redirect-policy: "true"
  kube-proxy-replacement: "strict"
  bpf-lb-algorithm: "maglev"
  bpf-lb-mode: "dsr"

CoreDNS Tuning

# CoreDNS ConfigMap (excerpt)
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
      errors
      health { lameduck 5s }
      ready
      kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
      }
      prometheus :9153
      cache 60 {
        success 10000 3600
        denial 5000 60
      }
      loop
      reload
      loadbalance round_robin
      forward . /etc/resolv.conf { max_concurrent 1000 }
    }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: coredns
  namespace: kube-system
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    spec:
      containers:
      - name: coredns
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 500m
            memory: 256Mi

Service IPVS Mode

# Edit kube-proxy ConfigMap to enable IPVS
mode: "ipvs"
ipvs:
  scheduler: "rr"
  syncPeriod: 30s
  minSyncPeriod: 5s
  strictARP: false
# Restart kube-proxy to pick up the change
kubectl rollout restart daemonset kube-proxy -n kube-system
# Verify rules
ipvsadm -Ln | head -20
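
The rr scheduler rotates through backends strictly, ignoring load and weights. A tiny simulation of the resulting request distribution (the backend IPs are made up):

```python
from itertools import cycle

def rr_distribute(backends: list, requests: int) -> dict:
    """Distribute requests the way IPVS round-robin does: strict rotation."""
    counts = dict.fromkeys(backends, 0)
    rotation = cycle(backends)
    for _ in range(requests):
        counts[next(rotation)] += 1
    return counts

print(rr_distribute(["10.244.1.5", "10.244.2.7", "10.244.3.9"], 10))
# {'10.244.1.5': 4, '10.244.2.7': 3, '10.244.3.9': 3}
```

For uneven backends, IPVS also offers `wrr` (weighted) and `lc` (least-connection) schedulers; `rr` is the simplest and cheapest at scale.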

Practical Case Study

Background

A large internet company expanded from 800 to 1,500 nodes in 2023 and observed severe degradation:

API Server P99 latency grew from ~50 ms to ~2 s.

Pod scheduling time increased from 5 s to 30 s.

etcd database reached 6 GB with frequent leader elections.

Node resource utilization stayed below 30 % while pods could not be scheduled.

Three‑Phase Optimization

Phase 1 – Emergency Fire‑fighting (1 week)

Added a second API Server instance on each master and updated the HAProxy backend to balance traffic across all instances.

Performed immediate etcd compaction and defragmentation, reducing DB size from 5.8 GB to 2.3 GB and latency from 120 ms to 15 ms.

Adjusted Kubelet nodeStatusUpdateFrequency to 20 s, cutting API Server request volume by ~40 %.

Phase 2 – Systematic Optimization (1 month)

Expanded etcd to five nodes, upgraded storage to NVMe SSDs, and applied performance parameters, achieving write latency ≈ 8 ms and read latency ≈ 3 ms.

Migrated the CNI from Flannel to Cilium with native routing, bandwidth manager and strict kube‑proxy replacement, reducing pod‑to‑pod latency by 35 % and Service latency by 50 %.

Deployed Descheduler with LowNodeUtilization and RemoveDuplicates strategies to eliminate resource fragmentation.

Phase 3 – Continuous Expansion to 5,000 Nodes (3 months)

Built a federated Prometheus stack (one master, five workers) and defined alerts for API latency, etcd fsync, and pending pods.

Implemented automated node‑health scripts and Node Problem Detector for self‑healing.

Results

After the three phases the cluster successfully scaled to 5,000 nodes with the following improvements:

API Server P99 latency reduced from 2,000 ms to 300 ms (≈85 % improvement).

Median pod scheduling time dropped from 30 s to 6 s (≈80 % improvement).

etcd write latency P99 fell from 120 ms to 12 ms (≈90 % improvement).

Average node resource utilization increased from 28 % to 65 %.

Network pod‑to‑pod latency improved from 2.5 ms to 1.6 ms.

Cluster failure incidents fell from three per month to 0.2 per month.
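
The quoted percentages follow directly from the before/after figures; a quick arithmetic check:

```python
def improvement(before: float, after: float) -> float:
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

# Figures from the results above
assert round(improvement(2000, 300), 2) == 0.85   # API Server P99 latency
assert round(improvement(30, 6), 2) == 0.80       # median pod scheduling time
assert round(improvement(120, 12), 2) == 0.90     # etcd write latency P99
print("all quoted improvement percentages check out")
```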

Key Takeaways

Adopt a phased approach: emergency fixes → systematic tuning → ongoing scaling to mitigate risk.

Establish comprehensive monitoring before making changes; data‑driven decisions are essential.

Validate major changes in a test cluster and roll out gradually using blue‑green or canary strategies.

Automate operations; manual interventions do not scale to thousands of nodes.

Plan capacity with at least 30 % headroom for future growth.

Conclusion and Outlook

Control‑plane and etcd performance are the primary bottlenecks for ultra‑large clusters. Network stack, scheduler policies, and observability must be engineered together. Future directions include virtual clusters, edge‑computing extensions, AI‑driven scheduling, and alternative key‑value stores to break current limits.

Written by

Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
