How to Scale Kubernetes Clusters: Quotas, Kernel Tweaks, and Best Practices
This guide outlines essential steps for scaling large Kubernetes clusters on public clouds, covering node quota adjustments, kernel parameter tuning, etcd high‑availability setup, API server and pod configurations, and best‑practice recommendations to ensure stable performance as node counts grow.
1. Node Quotas and Kernel Parameter Adjustments
When scaling Kubernetes clusters on public clouds, you may encounter quota limits and need to increase them on the cloud platform. Quotas to increase include:
Number of virtual machines
Number of vCPUs
Number of internal IP addresses
Number of external IP addresses
Number of security groups
Number of route tables
Persistent storage size
Reference GCE master node configurations based on node count:
1-5 nodes: n1-standard-1
6-10 nodes: n1-standard-2
11-100 nodes: n1-standard-4
101-250 nodes: n1-standard-8
251-500 nodes: n1-standard-16
more than 500 nodes: n1-standard-32
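The sizing list above is a simple threshold lookup. A minimal sketch, assuming the ranges are inclusive as written; the names `MASTER_SIZES` and `master_machine_type` are hypothetical and not part of any GCE or Kubernetes tooling:

```python
# Lookup table copied from the list above:
# (largest node count covered by the tier, GCE machine type).
MASTER_SIZES = [
    (5, "n1-standard-1"),
    (10, "n1-standard-2"),
    (100, "n1-standard-4"),
    (250, "n1-standard-8"),
    (500, "n1-standard-16"),
]
LARGEST_MASTER = "n1-standard-32"  # more than 500 nodes


def master_machine_type(node_count: int) -> str:
    """Return the reference master machine type for a given node count."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    for upper_bound, machine_type in MASTER_SIZES:
        if node_count <= upper_bound:
            return machine_type
    return LARGEST_MASTER
```

For example, `master_machine_type(120)` falls in the 101-250 tier and returns `n1-standard-8`.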
Reference Alibaba Cloud configuration (kernel parameters):
# fs.file-max sets the system-wide maximum number of open file handles
fs.file-max=1000000
# ARP cache size
net.ipv4.neigh.default.gc_thresh1=1024
net.ipv4.neigh.default.gc_thresh2=4096
net.ipv4.neigh.default.gc_thresh3=8192
# Netfilter connection tracking limits
net.netfilter.nf_conntrack_max=10485760
net.core.netdev_max_backlog=10000
net.netfilter.nf_conntrack_tcp_timeout_established=300
net.netfilter.nf_conntrack_buckets=655360
# Inotify limits
fs.inotify.max_user_instances=524288
fs.inotify.max_user_watches=524288
2. Etcd Database
Deploy a highly available etcd cluster that can automatically scale; the common solution is to use the etcd operator, which simplifies management of stateful applications.
Key features of the etcd operator:
create/destroy: automatically provision and delete etcd clusters
resize: dynamically scale the cluster
backup: support backup and restore
upgrade: upgrade without service interruption
Additional recommendations:
Store etcd data on SSDs
Increase --quota-backend-bytes (default 2 GB)
Configure a dedicated etcd cluster for kube-apiserver events
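The last two recommendations come down to two command-line flags. The sketch below only renders the flag strings; it is an illustration, not a deployment tool. The helper names, the 8 GiB quota, and the etcd URLs in the usage note are all hypothetical. The one flag not named in the text, `--etcd-servers-overrides`, is the kube-apiserver option for per-resource etcd servers, with the format `group/resource#servers` and semicolon-separated server URLs.

```python
from typing import Sequence

GIB = 1024 ** 3  # bytes in one GiB


def etcd_quota_flag(quota_gib: int) -> str:
    """Render etcd's --quota-backend-bytes flag for a quota given in GiB."""
    if quota_gib <= 0:
        raise ValueError("quota_gib must be positive")
    return f"--quota-backend-bytes={quota_gib * GIB}"


def events_override_flag(event_etcd_urls: Sequence[str]) -> str:
    """Render the kube-apiserver flag that sends events to their own etcd.

    Events live in the core API group, so the group part before the
    slash is empty and the resource key is "/events".
    """
    if not event_etcd_urls:
        raise ValueError("at least one etcd URL is required")
    return "--etcd-servers-overrides=/events#" + ";".join(event_etcd_urls)
```

With an illustrative 8 GiB quota, `etcd_quota_flag(8)` yields `--quota-backend-bytes=8589934592`, and `events_override_flag(["https://etcd-events-0:2379", "https://etcd-events-1:2379"])` yields a single override with the two placeholder URLs joined by a semicolon.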
3. Kube-APIServer Configuration
For node counts ≥ 3000, set:
--max-requests-inflight=3000
--max-mutating-requests-inflight=1000
For node counts between 1000 and 3000, set:
--max-requests-inflight=1500
--max-mutating-requests-inflight=500
Memory target (in MB) scales with node count:
--target-ram-mb=node_nums * 60
4. Pod Configuration
Best practices for running pods include setting resource requests and limits, especially for core add-on services:
spec.containers[].resources.limits.cpu
spec.containers[].resources.limits.memory
spec.containers[].resources.requests.cpu
spec.containers[].resources.requests.memory
spec.containers[].resources.limits.ephemeral-storage
spec.containers[].resources.requests.ephemeral-storage
Kubernetes classifies pods into QoS classes based on these settings: Guaranteed, Burstable, and BestEffort. When node resources run short, the kubelet evicts BestEffort pods first, then Burstable pods, and Guaranteed pods last.
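How a pod ends up in one of the three QoS classes can be sketched as follows. This is a simplified model and not the kubelet's code: only cpu and memory are considered (ephemeral-storage does not affect the class), quantities are compared as plain strings, so "1" and "1000m" count as different, and the function name `qos_class` and the dict layout are hypothetical.

```python
def qos_class(containers):
    """Classify a pod from its containers' resource settings.

    containers: list of dicts such as
        {"requests": {"cpu": "100m"}, "limits": {"cpu": "100m"}}
    """
    any_set = False     # did any container set a cpu/memory request or limit?
    guaranteed = True   # does every container have requests equal to limits?
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit, if one is set.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod whose only container sets a cpu request and nothing else is Burstable under this model; setting matching cpu and memory requests and limits on every container makes it Guaranteed.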
Use nodeAffinity, podAffinity, and podAntiAffinity to spread critical workloads, e.g., for kube-dns:
affinity:
  podAntiAffinity:
    # weight is only valid under preferredDuringSchedulingIgnoredDuringExecution
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - kube-dns
        topologyKey: kubernetes.io/hostname
Prefer managing containers with higher-level controllers such as Deployment, StatefulSet, DaemonSet, or Job.
On kube-scheduler and kube-controller-manager, set --kube-api-qps to 100 (defaults 50 and 20, respectively) and --kube-api-burst to 100 (the kube-controller-manager default is 30).
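The control-plane numbers in this guide (the kube-apiserver in-flight limits and memory target from section 3, plus the client rate limits just mentioned) can be collected into one small helper. This is a sketch under stated assumptions: the function names are hypothetical, 3000 nodes is treated as the higher tier, clusters under 1000 nodes get no in-flight flags because the guide gives no values for them, and `--kube-api-qps` / `--kube-api-burst` are taken to be the flags behind "API QPS" and "burst".

```python
def apiserver_flags(node_count: int) -> list:
    """Render kube-apiserver flags for a cluster size, per section 3."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    flags = []
    if node_count >= 3000:
        flags.append("--max-requests-inflight=3000")
        flags.append("--max-mutating-requests-inflight=1000")
    elif node_count >= 1000:
        flags.append("--max-requests-inflight=1500")
        flags.append("--max-mutating-requests-inflight=500")
    # Below 1000 nodes the guide gives no values, so the defaults stay.
    flags.append(f"--target-ram-mb={node_count * 60}")  # 60 MB per node
    return flags


def client_rate_flags() -> list:
    """Flags for kube-scheduler and kube-controller-manager."""
    return ["--kube-api-qps=100", "--kube-api-burst=100"]
```

For a 2000-node cluster, `apiserver_flags(2000)` returns the 1500/500 in-flight limits together with `--target-ram-mb=120000`.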
MaGe Linux Operations
