How to Build a Production‑Ready Kubernetes Cluster with kubeasz: From Architecture to Full Lifecycle
This guide explains how to use kubeasz and Ansible to design, deploy, scale, secure, monitor, and maintain a production‑grade Kubernetes cluster, covering control‑plane HA, etcd reliability, networking, storage, capacity planning, upgrade strategies, and disaster‑recovery practices.
Production‑grade Kubernetes clusters
A production‑ready cluster must go beyond a simple kubectl get nodes check. It requires control‑plane high availability; consistent, low‑latency etcd; a closed loop for image distribution, networking, storage, logging, and monitoring; repeatable scaling, upgrades, and node replacement; and capacity planning for the API server, kubelet, containerd, and CNI under high load.
Why kubeasz fits production environments
kubeasz is an Ansible‑driven automation framework that uses declarative inventory and variable files to deliver standard Kubernetes components. It focuses on the engineering delivery process—initialization, scaling, upgrading and cleanup—rather than just binary installation.
Four‑layer cluster architecture
Access Layer
- SLB / F5 / Keepalived + HAProxy
- Ingress / Gateway API / API Gateway
Control Layer
- kube-apiserver
- kube-controller-manager
- kube-scheduler
- etcd
Run Layer
- kubelet
- containerd
- CNI plugin
- CSI plugin
Support Layer
- Image registry
- Monitoring & alerting
- Logging system
- Backup & restore
- GitOps / CI‑CD
Most tutorials only cover the control and run layers; production stability also depends on the access and support layers.
Recommended production topology
                +------------------------+
                |       VIP / SLB        |
                |    10.10.0.10:6443     |
                +-----------+------------+
                            |
          +-----------------+-----------------+
          |                                   |
  +-------+-------+                   +-------+-------+
  | LB01 HAProxy  |                   | LB02 HAProxy  |
  |  Keepalived   |                   |  Keepalived   |
  +-------+-------+                   +-------+-------+
          |                                   |
          +-----------------+-----------------+
                            |
    +------------------------------------------------+
    |                  Master Nodes                  |
    |          master1    master2    master3         |
    +------------------------------------------------+
    |                   Etcd Nodes                   |
    |            etcd1     etcd2     etcd3           |
    +------------------------------------------------+
    |                  Worker Pools                  |
    |  system  |  stateless  |  batch  |  middleware |
    +------------------------------------------------+
Node role planning
master – control plane (8 CPU 16 GiB+, no workloads)
system – DNS, monitoring, logging, ingress (8 CPU 16 GiB, taint/toleration)
stateless – general micro‑services (16 CPU 32 GiB, HPA + PDB)
batch – offline jobs, consumers (32 CPU 64 GiB, low priority, pre‑emptible)
middleware – Kafka/ES/Redis/Nacos (resource‑based, isolated, dedicated storage)
This separation prevents resource contention between system components and business workloads.
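As a rough illustration of how these pools can be wired up after deployment, the commands below label and taint nodes by pool. The node names match the inventory later in this guide, while the node-pool label and the dedicated taint key are illustrative choices, not kubeasz defaults.
# Assign workers to pools via labels (label key/values are illustrative)
kubectl label node system-1 system-2 node-pool=system
kubectl label node worker-a-1 worker-a-2 worker-a-3 worker-a-4 node-pool=stateless

# Keep business pods off the system pool; DNS/monitoring/ingress add-ons must tolerate this taint
kubectl taint node system-1 system-2 dedicated=system:NoSchedule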
Capacity planning methodology
Pod density (80‑110 pods per node, 20 % redundancy)
Control‑plane QPS
Etcd IOPS and latency
Image distribution throughput
Example calculation: 3000 total pods ÷ 90 pods per node ÷ 0.8 ≈ 42 worker nodes.
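The same rule can be scripted so capacity reviews stay repeatable; a minimal sketch using the example numbers above:
#!/usr/bin/env bash
# workers = ceil(total_pods / pods_per_node / headroom); headroom 0.8 keeps ~20% spare capacity
total_pods=3000
pods_per_node=90
awk -v p="$total_pods" -v n="$pods_per_node" -v h=0.8 \
  'BEGIN { nodes = p / n / h; if (nodes > int(nodes)) nodes = int(nodes) + 1; print nodes " worker nodes" }'
# prints: 42 worker nodes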
Image distribution in restricted networks
Three‑tier strategy:
Public image accelerator
Private enterprise registry (e.g., Harbor)
Offline image packages / local cache
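One way to implement the second and third tiers is to mirror required images into the private registry and, for fully offline sites, move them as tarballs. A hedged sketch using skopeo and ctr follows; the Harbor hostname matches the registry used elsewhere in this guide, and the image tags are only examples.
# Tier 2: mirror a public image into the enterprise registry
skopeo copy \
  docker://registry.k8s.io/metrics-server/metrics-server:v0.7.1 \
  docker://harbor.company.local/google_containers/metrics-server:v0.7.1

# Tier 3: export to a tarball, carry it across the air gap, import on the node
skopeo copy docker://docker.io/library/nginx:1.25 docker-archive:/tmp/nginx_1.25.tar:nginx:1.25
ctr -n k8s.io images import /tmp/nginx_1.25.tar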
Operating system baseline
Supported Linux distros: Rocky Linux 8/9, Ubuntu 22.04 LTS, openEuler 22.x+
Kernel ≥ 5.4 for general workloads, ≥ 5.10 for Cilium eBPF features
Baseline configuration for all nodes
Chrony for time sync
Disable swap
Load overlay and br_netfilter modules
sysctl settings for networking, file limits, vm.max_map_count, vm.swappiness=0
Production‑grade configuration files
Inventory example (/etc/kubeasz/clusters/prod/hosts)
# /etc/kubeasz/clusters/prod/hosts
[all]
10.10.0.11 ansible_host=10.10.0.11 ip=10.10.0.11 etcd_name=etcd-1 node_name=master-1
10.10.0.12 ansible_host=10.10.0.12 ip=10.10.0.12 etcd_name=etcd-2 node_name=master-2
10.10.0.13 ansible_host=10.10.0.13 ip=10.10.0.13 etcd_name=etcd-3 node_name=master-3
10.10.1.21 ansible_host=10.10.1.21 ip=10.10.1.21 node_name=system-1
10.10.1.22 ansible_host=10.10.1.22 ip=10.10.1.22 node_name=system-2
10.10.2.31 ansible_host=10.10.2.31 ip=10.10.2.31 node_name=worker-a-1
10.10.2.32 ansible_host=10.10.2.32 ip=10.10.2.32 node_name=worker-a-2
10.10.2.33 ansible_host=10.10.2.33 ip=10.10.2.33 node_name=worker-a-3
10.10.2.34 ansible_host=10.10.2.34 ip=10.10.2.34 node_name=worker-a-4
[kube_master]
10.10.0.11
10.10.0.12
10.10.0.13
[etcd]
10.10.0.11
10.10.0.12
10.10.0.13
[kube_node]
10.10.1.21
10.10.1.22
10.10.2.31
10.10.2.32
10.10.2.33
10.10.2.34
[ex_lb]
10.10.0.21
10.10.0.22
[all:vars]
ansible_user=root
ansible_ssh_port=22
CLUSTER=prod
CONTAINER_RUNTIME=containerd
Core configuration (config.yml)
# /etc/kubeasz/clusters/prod/config.yml
CLUSTER_NAME: "prod-k8s"
K8S_VER: "1.29.6"
CONTAINER_RUNTIME: "containerd"
RUNTIME_BIN_DIR: "/usr/bin"
TASK_INSTALL_CONTAINERD: true
ENABLE_LOCAL_DNS_CACHE: true
VIP: "10.10.0.10"
VIP_IF: "eth0"
CLUSTER_CIDR: "10.244.0.0/16"
SERVICE_CIDR: "10.96.0.0/16"
NODE_PORT_RANGE: "30000-32767"
PROXY_MODE: "ipvs"
DNS_DOMAIN: "cluster.local"
CNI_PLUGIN: "cilium"
CILIUM_TUNNEL_MODE: "native"
CILIUM_ENABLE_BPF_MASQUERADE: true
CILIUM_ENABLE_HUBBLE: true
ETCD_DATA_DIR: "/var/lib/etcd"
ETCD_WAL_DIR: "/var/lib/etcd/wal"
ETCD_AUTO_COMPACTION_RETENTION: "8"
ETCD_SNAPSHOT_COUNT: "10000"
KUBE_APISERVER_BIND_PORT: 6443
KUBE_APISERVER_MAX_REQUESTS_INFLIGHT: 3000
KUBE_APISERVER_MAX_MUTATING_REQUESTS_INFLIGHT: 1500
KUBE_APISERVER_EVENT_TTL: "1h"
KUBE_APISERVER_ENABLE_ADMISSION_PLUGINS:
- NodeRestriction
- NamespaceLifecycle
- LimitRanger
- ServiceAccount
- DefaultStorageClass
- ResourceQuota
- Priority
- MutatingAdmissionWebhook
- ValidatingAdmissionWebhook
KUBELET_ROOT_DIR: "/var/lib/kubelet"
KUBE_RESERVED_ENABLED: true
KUBE_RESERVED:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "5Gi"
SYSTEM_RESERVED:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "5Gi"
EVICTION_HARD:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
METRICS_SERVER_ENABLED: true
INGRESS_NGINX_ENABLED: false
CERT_MANAGER_ENABLED: true
REGISTRY_MIRRORS:
- "https://harbor.company.local"
- "https://registry.aliyuncs.com"
SANDBOX_IMAGE: "harbor.company.local/google_containers/pause:3.9"
Key production settings include explicit KUBE_RESERVED and SYSTEM_RESERVED to protect system resources, EVICTION_HARD thresholds, etcd auto‑compaction, and a private registry for the pause image.
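After deployment it is worth confirming that the reservations actually show up on the nodes; Allocatable should be roughly Capacity minus kube-reserved, system-reserved and the eviction threshold. A quick check (the node name is illustrative):
kubectl get node worker-a-1 -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
kubectl describe node worker-a-1 | grep -A 7 -E 'Capacity|Allocatable'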
Automation scripts
OS baseline script
#!/usr/bin/env bash
set -euo pipefail
swapoff -a
sed -ri '/\sswap\s/s/^#?/#/' /etc/fstab
modprobe overlay
modprobe br_netfilter
cat >/etc/modules-load.d/k8s.conf <<'EOF'
overlay
br_netfilter
EOF
cat >/etc/sysctl.d/99-kubernetes-cri.conf <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 8192
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 1048576
fs.file-max = 2097152
vm.max_map_count = 262144
vm.swappiness = 0
EOF
sysctl --system
if command -v dnf >/dev/null 2>&1; then
  dnf install -y chrony conntrack-tools ipvsadm ipset jq curl wget socat tar
else
  apt-get update
  apt-get install -y chrony conntrack ipvsadm ipset jq curl wget socat
fi
systemctl enable --now chronyd || systemctl enable --now chrony
Containerd production config (/etc/containerd/config.toml)
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "harbor.company.local/google_containers/pause:3.9"
max_concurrent_downloads = 6
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "runc"
discard_unpacked_layers = false
[plugins."io.containerd.grpc.v1.cri".registry]
config_path = "/etc/containerd/certs.d"
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["https://harbor.company.local", "https://registry-1.docker.io"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
endpoint = ["https://harbor.company.local"]
[metrics]
address = "127.0.0.1:1338"
grpc_histogram = trueImportant parameters: max_concurrent_downloads limits bandwidth spikes during scale‑out, sandbox_image pins the pause image to a private registry, and config_path centralizes registry certificates.
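To check what containerd actually loaded, the rendered config and the CRI view of the registry settings can be inspected on a node; a quick sketch (output fields vary between containerd versions):
# Effective configuration as containerd sees it
containerd config dump | grep -E 'sandbox_image|max_concurrent_downloads|config_path'

# CRI runtime status and a pull test through the mirror
crictl info | jq '.config.registry'
crictl pull harbor.company.local/google_containers/pause:3.9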
API Server tuning for high concurrency
apiServer:
  maxRequestsInflight: 3000
  maxMutatingRequestsInflight: 1500
  requestTimeout: "1m"
  enableProfiling: false
  auditLogMaxAge: 7
  auditLogMaxBackup: 10
  auditLogMaxSize: 100
Enable API Priority and Fairness (APF) and limit custom controller QPS to protect the control plane.
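APF ships with built-in FlowSchema and PriorityLevelConfiguration objects; a quick way to confirm it is active and to watch for throttling, using the standard apiserver_flowcontrol_* metric series:
kubectl get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io
kubectl get flowschemas.flowcontrol.apiserver.k8s.io

# Rejections or deep queues here mean some client is being throttled
kubectl get --raw /metrics | grep -E 'apiserver_flowcontrol_(rejected_requests_total|current_inqueue_requests)'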
Kubelet production settings
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: 0.0.0.0
readOnlyPort: 0
cgroupDriver: systemd
maxPods: 110
serializeImagePulls: false
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 70
containerLogMaxSize: 50Mi
containerLogMaxFiles: 5
podPidsLimit: 4096
evictionHard:
  memory.available: "500Mi"
  imagefs.available: "15%"
  nodefs.available: "10%"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s
Key points: disable image pull serialization, set aggressive eviction thresholds, and reserve resources for system and Kubernetes components.
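The live kubelet configuration for any node can be read back through the API server, which is handy for confirming these settings after an upgrade (node name is illustrative):
kubectl get --raw "/api/v1/nodes/worker-a-1/proxy/configz" \
  | jq '.kubeletconfig | {maxPods, serializeImagePulls, podPidsLimit, evictionHard}'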
Etcd operational guidelines
Deploy on SSD/NVMe with low‑latency network.
Separate from high‑IO middleware (Kafka, ES, MySQL).
Run periodic compaction and defragmentation.
Monitor etcd_disk_wal_fsync_duration_seconds, leader changes, DB size, and peer round‑trip latency.
Typical failure symptoms include API server timeouts, node status flapping, and delayed controller reconciliations.
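A hedged maintenance and health-check sketch for the points above; the endpoints and certificate paths mirror the backup command later in this guide:
# Per-member health, DB size, raft index and leader
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.10.0.11:2379,https://10.10.0.12:2379,https://10.10.0.13:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  endpoint status -w table

# Defragment during a quiet window; it briefly blocks the member, so go one endpoint at a time
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.10.0.11:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  defrag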
CoreDNS and network plugin recommendation
Run at least two CoreDNS replicas.
Deploy NodeLocal DNSCache for large clusters.
Cache high‑frequency external domains.
Prefer Cilium (or Calico) over Flannel for high performance, fine‑grained policies, and eBPF observability.
Storage design considerations
System‑level storage for monitoring, logs, images, and etcd.
Business persistent storage for databases and queues.
Object and backup storage for logs, snapshots, and archives.
Etcd must use dedicated high‑performance disks; avoid placing heavy stateful middleware on shared storage.
Node expansion process
Provision hardware and apply OS baseline.
Configure SSH trust and monitoring agents.
Pre‑warm critical images.
Add nodes to the new_nodes group in the inventory.
Run the kubeasz scale playbook (21.scale.yml), as sketched after this list.
Label, taint and add alert targets.
Gradually shift traffic using topology spread constraints.
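In kubeasz the scale-out itself is usually driven through the ezctl wrapper; a sketch of steps 4 to 6 (verify the exact ezctl syntax against your kubeasz release, and treat the IP, node name and pool label as illustrative):
# Add the new worker with ezctl (wraps the kubeasz scale playbook)
ezctl add-node prod 10.10.2.35

# Place the node into its pool and confirm it registered and is Ready
kubectl label node worker-a-5 node-pool=stateless
kubectl get node worker-a-5 -o wide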
Upgrade strategy
Validate on a test cluster.
Check CNI, CSI, Ingress, Metrics Server, cert‑manager and custom admission webhooks for compatibility.
Backup etcd and core manifests.
Upgrade control‑plane nodes one by one.
Perform staged, gray‑scale upgrades of worker nodes.
Verify API latency, etcd health and application readiness after each step.
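For the worker stage, the per-node rollout typically looks like the sketch below (standard kubectl drain/uncordon; the node name is illustrative, and the binary upgrade itself is done by the kubeasz upgrade playbook or your tooling of choice):
NODE=worker-a-1

# Stop new scheduling and evict pods while respecting PodDisruptionBudgets
kubectl cordon "$NODE"
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m

# ...upgrade kubelet/containerd on the node here...

# Return the node to service and verify version and readiness
kubectl uncordon "$NODE"
kubectl get node "$NODE" -o wide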
Etcd backup & restore
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.10.0.11:2379,https://10.10.0.12:2379,https://10.10.0.13:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  snapshot save /backup/etcd/etcd-snapshot-$(date +%F-%H%M%S).db
Restore steps: stop the control plane, verify snapshot integrity, restore to a new data directory with the current cluster token, then bring API servers back online.
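A restore sketch matching those steps; the snapshot restore flags are standard etcdctl options, but the snapshot filename, member name, peer URLs and cluster token must match your own environment:
# On each etcd member, after stopping kube-apiserver and etcd
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/etcd-snapshot-<timestamp>.db \
  --name etcd-1 \
  --initial-cluster "etcd-1=https://10.10.0.11:2380,etcd-2=https://10.10.0.12:2380,etcd-3=https://10.10.0.13:2380" \
  --initial-cluster-token etcd-cluster-prod \
  --initial-advertise-peer-urls https://10.10.0.11:2380 \
  --data-dir /var/lib/etcd-restored

# Point the etcd unit at the restored directory (or move it into place), start etcd, then restart the API servers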
Security baseline
Disable anonymous API access.
Enable RBAC and NodeRestriction.
Collect and centralize audit logs.
Restrict privileged pods by default.
Apply ResourceQuota and LimitRange per namespace.
Enforce image signing, scanning and admission checks.
Encrypt Secrets and integrate external KMS.
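Two quick spot checks for the first items on this list (anonymous access and RBAC scope); the VIP, namespace and service account are illustrative:
# With anonymous auth disabled, an unauthenticated request should return 401
curl -k -o /dev/null -w '%{http_code}\n' https://10.10.0.10:6443/api

# A workload service account should carry only a minimal permission set
kubectl auth can-i --list --as=system:serviceaccount:prod-business:default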
Network isolation policies (example)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod-business
spec:
  podSelector: {}
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: prod-business
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - port: 8080
          protocol: TCP
Observability stack
Cluster‑level metrics: API latency, etcd fsync, node disk pressure.
Application metrics via Prometheus.
Alert rules covering high API latency, etcd fsync latency, and container runtime disk usage.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-platform-rules
  namespace: monitoring
spec:
  groups:
    - name: kubernetes-platform
      rules:
        - alert: KubeAPIServerHighLatency
          expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "API Server P99 latency is high"
        - alert: EtcdHighFsyncLatency
          expr: histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le)) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Etcd fsync latency exceeds 50ms"
        - alert: NodeDiskPressureRisk
          expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/containerd"} / node_filesystem_size_bytes{mountpoint="/var/lib/containerd"}) < 0.15
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container runtime disk usage is too high"
Logging recommendations
Standard output/error for application logs.
Separate system logs from container logs.
Isolate audit and security logs.
Use Loki, ELK or OpenSearch with hot/cold tiers and retention policies.
Common pitfalls and troubleshooting
Slow or failed image pulls
Test pull with ctr -n k8s.io images pull <image>.
Verify /etc/containerd/certs.d configuration.
Check network and TLS to the private registry.
Confirm base images are synchronized.
Frequent Node NotReady
Kubelet resource limits.
CNI failures.
Clock drift.
Disk pressure.
API server or etcd timeouts.
Inspect journalctl -u kubelet, journalctl -u containerd, CNI pod logs, disk/inode usage, and API latency.
Performance degradation after scaling
Image pull bandwidth saturation.
Insufficient CoreDNS replicas.
Missing NodeLocal DNSCache.
Images not pre‑warmed on new nodes.
CNI BPF map overload.
Massive Pod Pending
ResourceQuota limits.
Taint/toleration mismatches.
Node label selectors.
PVC binding failures.
PDB blocking rollouts.
Image pull errors.
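The usual triage commands for these causes, all standard kubectl (placeholders in angle brackets):
# Pending pods and the scheduler's reason for each
kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pending-pod> -n <namespace> | sed -n '/Events:/,$p'

# Quota headroom and unbound claims in the affected namespace
kubectl describe resourcequota -n <namespace>
kubectl get pvc -n <namespace> | grep -v Bound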
Roadmap to production
Phase 1 – Stabilize the cluster
3‑master HA with load balancer.
Private enterprise registry.
Standardized deployment and scaling scripts.
Prometheus‑Grafana‑Alertmanager stack.
Etcd backup schedule.
Phase 2 – Enforce application governance
Namespace segmentation.
Quota, LimitRange and NetworkPolicy enforcement.
Release, rollback and resource‑request standards.
Standardized HPA, PDB, probes and graceful termination.
Phase 3 – Engineer the platform
GitOps for declarative cluster state.
Cluster audit and change‑approval workflow.
Image signing and scanning pipeline.
Multi‑environment delivery pipelines.
Regular failure‑simulation exercises.
Phase 4 – Build scalability features
Multiple resource pools (system, stateless, batch, middleware).
Multi‑datacenter or multi‑cluster federation.
Capacity modeling and auto‑scaling.
Automatic node onboarding.
Cost‑optimized elasticity.
Final checklist before production launch
Control plane of ≥ 3 nodes, load balancer with no single point of failure.
Etcd on SSD/NVMe with backup policy.
Enterprise registry reachable and critical images mirrored.
All nodes share OS baseline and synchronized clocks.
Containerd, Kubelet and CNI tuned for production workloads.
CoreDNS replicas and optional NodeLocal DNSCache deployed.
Namespace, ResourceQuota, LimitRange and NetworkPolicy in place.
Monitoring, logging, alerting and audit pipelines operational.
Validated scripts for scaling, node replacement, upgrade and rollback.
Completed at least one Etcd restore drill and one master‑failure drill.
Core business Deployments include probes, PDB, HPA and graceful termination.