
How to Build a Production‑Ready Kubernetes Cluster with kubeasz: From Architecture to Full Lifecycle

This guide explains how to use kubeasz and Ansible to design, deploy, scale, secure, monitor, and maintain a production‑grade Kubernetes cluster, covering control‑plane HA, etcd reliability, networking, storage, capacity planning, upgrade strategies, and disaster‑recovery practices.

Ray's Galactic Tech

Production‑grade Kubernetes clusters

A production‑ready cluster must go beyond a simple kubectl get nodes check. It requires control‑plane high availability; consistent, low‑latency etcd; a closed loop for image distribution; networking, storage, logging and monitoring; repeatable scaling, upgrading and node replacement; and capacity planning for the API server, kubelet, containerd and CNI under high load.

Why kubeasz fits production environments

kubeasz is an Ansible‑driven automation framework that uses declarative inventory and variable files to deliver standard Kubernetes components. It focuses on the engineering delivery process—initialization, scaling, upgrading and cleanup—rather than just binary installation.

Four‑layer cluster architecture

Access Layer
  - SLB / F5 / Keepalived + HAProxy
  - Ingress / Gateway API / API Gateway

Control Layer
  - kube-apiserver
  - kube-controller-manager
  - kube-scheduler
  - etcd

Run Layer
  - kubelet
  - containerd
  - CNI plugin
  - CSI plugin

Support Layer
  - Image registry
  - Monitoring & alerting
  - Logging system
  - Backup & restore
  - GitOps / CI‑CD

Most tutorials only cover the control and run layers; production stability also depends on the access and support layers.

Recommended production topology

                +------------------------+
                |        VIP / SLB       |
                |     10.10.0.10:6443    |
                +-----------+------------+
                            |
          +-----------------+-----------------+
          |                                   |
   +------+-------+                   +-------+------+
   | LB01 HAProxy |                   | LB02 HAProxy |
   | Keepalived   |                   | Keepalived   |
   +------+-------+                   +-------+------+
          |                                   |
          +-----------------+-----------------+
                            |
          +-----------------------------------------+
          |               Master Nodes              |
          |       master1    master2    master3     |
          +-----------------------------------------+
          |                Etcd Nodes               |
          |        etcd1     etcd2     etcd3        |
          +-----------------------------------------+
          |               Worker Pools              |
          | system | stateless | batch | middleware |
          +-----------------------------------------+

Node role planning

master – control plane (8 CPU 16 GiB+, no workloads)

system – DNS, monitoring, logging, ingress (8 CPU 16 GiB, taint/toleration)

stateless – general micro‑services (16 CPU 32 GiB, HPA + PDB)

batch – offline jobs, consumers (32 CPU 64 GiB, low priority, pre‑emptible)

middleware – Kafka/ES/Redis/Nacos (resource‑based, isolated, dedicated storage)

This separation prevents resource contention between system components and business workloads.
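The pool separation above is enforced with node labels and taints plus matching pod scheduling fields. A sketch for the system pool, assuming a hypothetical pool label and taint key (e.g. kubectl label node system-1 pool=system and kubectl taint node system-1 pool=system:NoSchedule):

```yaml
# Pod template fragment pinning a workload to the system pool
# ("pool" is an illustrative label/taint key, not a kubeasz default)
spec:
  nodeSelector:
    pool: system
  tolerations:
  - key: "pool"
    operator: "Equal"
    value: "system"
    effect: "NoSchedule"
```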

Capacity planning methodology

Pod density (80‑110 pods per node, 20 % redundancy)

Control‑plane QPS

Etcd IOPS and latency

Image distribution throughput

Example calculation: 3000 total pods ÷ 90 pods per node ÷ 0.8 ≈ 42 worker nodes.
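The same arithmetic can be scripted so capacity reviews use one formula; a minimal sketch with the numbers from the example:

```shell
#!/usr/bin/env bash
# Worker-node estimate from the capacity inputs above.
total_pods=3000      # expected peak pod count
pods_per_node=90     # planned pod density
redundancy=0.8       # keep 20% headroom

# ceil(total_pods / pods_per_node / redundancy)
nodes=$(awk -v t="$total_pods" -v p="$pods_per_node" -v r="$redundancy" \
  'BEGIN { n = t / p / r; print (n == int(n)) ? n : int(n) + 1 }')
echo "worker nodes needed: $nodes"
```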

Image distribution in restricted networks

Three‑tier strategy:

Public image accelerator

Private enterprise registry (e.g., Harbor)

Offline image packages / local cache

Operating system baseline

Supported Linux distros: Rocky Linux 8/9, Ubuntu 22.04 LTS, openEuler 22.x+

Kernel ≥ 5.4 for general workloads, ≥ 5.10 for Cilium eBPF features

Baseline configuration for all nodes

Chrony for time sync

Disable swap

Load overlay and br_netfilter modules

sysctl settings for networking, file limits, vm.max_map_count, and vm.swappiness=0

Production‑grade configuration files

Inventory example (/etc/kubeasz/clusters/prod/hosts)

# /etc/kubeasz/clusters/prod/hosts
[all]
10.10.0.11 ansible_host=10.10.0.11 ip=10.10.0.11 etcd_name=etcd-1 node_name=master-1
10.10.0.12 ansible_host=10.10.0.12 ip=10.10.0.12 etcd_name=etcd-2 node_name=master-2
10.10.0.13 ansible_host=10.10.0.13 ip=10.10.0.13 etcd_name=etcd-3 node_name=master-3

10.10.1.21 ansible_host=10.10.1.21 ip=10.10.1.21 node_name=system-1
10.10.1.22 ansible_host=10.10.1.22 ip=10.10.1.22 node_name=system-2

10.10.2.31 ansible_host=10.10.2.31 ip=10.10.2.31 node_name=worker-a-1
10.10.2.32 ansible_host=10.10.2.32 ip=10.10.2.32 node_name=worker-a-2
10.10.2.33 ansible_host=10.10.2.33 ip=10.10.2.33 node_name=worker-a-3
10.10.2.34 ansible_host=10.10.2.34 ip=10.10.2.34 node_name=worker-a-4

[kube_master]
10.10.0.11
10.10.0.12
10.10.0.13

[etcd]
10.10.0.11
10.10.0.12
10.10.0.13

[kube_node]
10.10.1.21
10.10.1.22
10.10.2.31
10.10.2.32
10.10.2.33
10.10.2.34

[ex_lb]
10.10.0.21
10.10.0.22

[all:vars]
ansible_user=root
ansible_ssh_port=22
CLUSTER=prod
CONTAINER_RUNTIME=containerd

Core configuration (config.yml)

# /etc/kubeasz/clusters/prod/config.yml
CLUSTER_NAME: "prod-k8s"
K8S_VER: "1.29.6"
CONTAINER_RUNTIME: "containerd"
RUNTIME_BIN_DIR: "/usr/bin"
TASK_INSTALL_CONTAINERD: true
ENABLE_LOCAL_DNS_CACHE: true
VIP: "10.10.0.10"
VIP_IF: "eth0"
CLUSTER_CIDR: "10.244.0.0/16"
SERVICE_CIDR: "10.96.0.0/16"
NODE_PORT_RANGE: "30000-32767"
PROXY_MODE: "ipvs"
DNS_DOMAIN: "cluster.local"
CNI_PLUGIN: "cilium"
CILIUM_TUNNEL_MODE: "native"
CILIUM_ENABLE_BPF_MASQUERADE: true
CILIUM_ENABLE_HUBBLE: true
ETCD_DATA_DIR: "/var/lib/etcd"
ETCD_WAL_DIR: "/var/lib/etcd/wal"
ETCD_AUTO_COMPACTION_RETENTION: "8"
ETCD_SNAPSHOT_COUNT: "10000"
KUBE_APISERVER_BIND_PORT: 6443
KUBE_APISERVER_MAX_REQUESTS_INFLIGHT: 3000
KUBE_APISERVER_MAX_MUTATING_REQUESTS_INFLIGHT: 1500
KUBE_APISERVER_EVENT_TTL: "1h"
KUBE_APISERVER_ENABLE_ADMISSION_PLUGINS:
  - NodeRestriction
  - NamespaceLifecycle
  - LimitRanger
  - ServiceAccount
  - DefaultStorageClass
  - ResourceQuota
  - Priority
  - MutatingAdmissionWebhook
  - ValidatingAdmissionWebhook
KUBELET_ROOT_DIR: "/var/lib/kubelet"
KUBE_RESERVED_ENABLED: true
KUBE_RESERVED:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "5Gi"
SYSTEM_RESERVED:
  cpu: "500m"
  memory: "1Gi"
  ephemeral-storage: "5Gi"
EVICTION_HARD:
  memory.available: "500Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
METRICS_SERVER_ENABLED: true
INGRESS_NGINX_ENABLED: false
CERT_MANAGER_ENABLED: true
REGISTRY_MIRRORS:
  - "https://harbor.company.local"
  - "https://registry.aliyuncs.com"
SANDBOX_IMAGE: "harbor.company.local/google_containers/pause:3.9"

Key production settings include explicit KUBE_RESERVED and SYSTEM_RESERVED to protect system resources, EVICTION_HARD thresholds, etcd auto‑compaction, and a private registry for the pause image.
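These reservations combine into the node's allocatable capacity (allocatable = capacity − kube‑reserved − system‑reserved − eviction‑hard). A quick sanity check for a 16 GiB worker, using the memory values from the config above:

```shell
#!/usr/bin/env bash
# Allocatable memory for a 16 GiB worker under the reservations above.
capacity_mi=$((16 * 1024))
kube_reserved_mi=1024      # KUBE_RESERVED memory: "1Gi"
system_reserved_mi=1024    # SYSTEM_RESERVED memory: "1Gi"
eviction_hard_mi=500       # EVICTION_HARD memory.available: "500Mi"

allocatable_mi=$((capacity_mi - kube_reserved_mi - system_reserved_mi - eviction_hard_mi))
echo "allocatable memory: ${allocatable_mi}Mi"
```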

Automation scripts

OS baseline script

#!/usr/bin/env bash
set -euo pipefail

swapoff -a
sed -ri '/\sswap\s/s/^#?/#/' /etc/fstab

modprobe overlay
modprobe br_netfilter

cat >/etc/modules-load.d/k8s.conf <<'EOF'
overlay
br_netfilter
EOF

cat >/etc/sysctl.d/99-kubernetes-cri.conf <<'EOF'
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 32768
net.ipv4.tcp_max_syn_backlog = 8192
fs.inotify.max_user_instances = 8192
fs.inotify.max_user_watches = 1048576
fs.file-max = 2097152
vm.max_map_count = 262144
vm.swappiness = 0
EOF

sysctl --system

if command -v dnf >/dev/null 2>&1; then
  dnf install -y chrony conntrack-tools ipvsadm ipset jq curl wget socat tar
else
  apt-get update
  apt-get install -y chrony conntrack ipvsadm ipset jq curl wget socat
fi

systemctl enable --now chronyd || systemctl enable --now chrony

Containerd production config (/etc/containerd/config.toml)

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "harbor.company.local/google_containers/pause:3.9"
  max_concurrent_downloads = 6

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
  discard_unpacked_layers = false

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://harbor.company.local", "https://registry-1.docker.io"]

[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.k8s.io"]
  endpoint = ["https://harbor.company.local"]

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = true

Important parameters: max_concurrent_downloads caps concurrent layer downloads per pull, smoothing bandwidth spikes during scale‑out; sandbox_image pins the pause image to the private registry; and config_path centralizes per‑registry mirror and certificate configuration.

API Server tuning for high concurrency

apiServer:
  maxRequestsInflight: 3000
  maxMutatingRequestsInflight: 1500
  requestTimeout: "1m"
  enableProfiling: false
  auditLogMaxAge: 7
  auditLogMaxBackup: 10
  auditLogMaxSize: 100

Enable API Priority and Fairness (APF) and limit custom controller QPS to protect the control plane.
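One way to cap a noisy custom controller under APF is a FlowSchema that routes its ServiceAccount to a low‑priority level (flowcontrol.apiserver.k8s.io/v1 is GA in 1.29; the controller name and namespace here are illustrative):

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: custom-controllers-low      # illustrative name
spec:
  priorityLevelConfiguration:
    name: workload-low              # built-in low-priority level
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: order-controller      # hypothetical controller SA
        namespace: prod-business
    resourceRules:
    - verbs: ["*"]
      apiGroups: ["*"]
      resources: ["*"]
      namespaces: ["*"]
```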

Kubelet production settings

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
address: 0.0.0.0
readOnlyPort: 0
cgroupDriver: systemd
maxPods: 110
serializeImagePulls: false
imageGCHighThresholdPercent: 80
imageGCLowThresholdPercent: 70
containerLogMaxSize: 50Mi
containerLogMaxFiles: 5
podPidsLimit: 4096
evictionHard:
  memory.available: "500Mi"
  imagefs.available: "15%"
  nodefs.available: "10%"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s

Key points: disable image pull serialization, set aggressive eviction thresholds, and reserve resources for system and Kubernetes components.

Etcd operational guidelines

Deploy on SSD/NVMe with low‑latency network.

Separate from high‑IO middleware (Kafka, ES, MySQL).

Run periodic compaction and defragmentation.

Monitor etcd_disk_wal_fsync_duration_seconds, leader changes, DB size, and peer round‑trip latency.

Typical failure symptoms include API server timeouts, node status flapping, and delayed controller reconciliations.

CoreDNS and network plugin recommendation

Run at least two CoreDNS replicas.

Deploy NodeLocal DNSCache for large clusters.

Cache high‑frequency external domains.

Prefer Cilium (or Calico) over Flannel for high performance, fine‑grained policies, and eBPF observability.
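Caching high‑frequency external domains can be done with an extra Corefile server block in the CoreDNS ConfigMap; a sketch in which the zone name and upstream resolver are assumptions for illustration:

```
# Extra Corefile server block: cache one hot external zone for 300s
api.partner.example.com:53 {
    forward . 10.10.0.53
    cache 300
    log
}
```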

Storage design considerations

System‑level storage for monitoring, logs, images, and etcd.

Business persistent storage for databases and queues.

Object and backup storage for logs, snapshots, and archives.

Etcd must use dedicated high‑performance disks; avoid placing heavy stateful middleware on shared storage.

Node expansion process

Provision hardware and apply OS baseline.

Configure SSH trust and monitoring agents.

Pre‑warm critical images.

Add nodes to the new_nodes group in the inventory.

Run the kubeasz scale playbook (21.scale.yml).

Label, taint and add alert targets.

Gradually shift traffic using topology spread constraints.
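The topology spread constraints in the last step look like this in a workload's pod template (the app label is illustrative):

```yaml
# Pod template fragment: spread replicas evenly across nodes
spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # prefer spreading, don't block during ramp-up
    labelSelector:
      matchLabels:
        app: order-service
```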

Upgrade strategy

Validate on a test cluster.

Check CNI, CSI, Ingress, Metrics Server, cert‑manager and custom admission webhooks for compatibility.

Backup etcd and core manifests.

Upgrade control‑plane nodes one by one.

Perform staged, gray‑scale upgrades of worker nodes.

Verify API latency, etcd health and application readiness after each step.

Etcd backup & restore

ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.10.0.11:2379,https://10.10.0.12:2379,https://10.10.0.13:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.pem \
  --cert=/etc/kubernetes/pki/etcd/etcd.pem \
  --key=/etc/kubernetes/pki/etcd/etcd-key.pem \
  snapshot save /backup/etcd/etcd-snapshot-$(date +%F-%H%M%S).db
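Snapshots are only useful with a retention policy. A scheduled pruning wrapper, sketched under the assumption that a cron job writes snapshots into /backup/etcd with the command above (uses GNU find):

```shell
#!/usr/bin/env bash
# Retention pruning for scheduled etcd snapshots, a sketch.
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/backup/etcd}"
KEEP="${KEEP:-14}"   # e.g. 14 snapshots at two per day = 7 days retained

prune_snapshots() {
  [ -d "$BACKUP_DIR" ] || return 0
  # Newest first by mtime; keep the first $KEEP, delete the rest.
  find "$BACKUP_DIR" -maxdepth 1 -name 'etcd-snapshot-*.db' -printf '%T@ %p\n' \
    | sort -rn \
    | awk -v k="$KEEP" 'NR > k { sub(/^[^ ]+ /, ""); print }' \
    | xargs -r rm -f
}

prune_snapshots
```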

Restore steps: stop the control plane, verify snapshot integrity, restore to a new data directory with the current cluster token, then bring API servers back online.

Security baseline

Disable anonymous API access.

Enable RBAC and NodeRestriction.

Collect and centralize audit logs.

Restrict privileged pods by default.

Apply ResourceQuota and LimitRange per namespace.

Enforce image signing, scanning and admission checks.

Encrypt Secrets and integrate external KMS.
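Secret encryption at rest is configured through the API server's --encryption-provider-config file; a minimal sketch in which the aescbc key is a placeholder (an external KMS provider would take its place):

```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: "<base64-encoded 32-byte key>"   # placeholder, generate per cluster
  - identity: {}   # fallback so pre-existing plaintext secrets stay readable
```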

Network isolation policies (example)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: prod-business
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: prod-business
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx
    ports:
    - port: 8080
      protocol: TCP

Observability stack

Cluster‑level metrics: API latency, etcd fsync, node disk pressure.

Application metrics via Prometheus.

Alert rules covering high API latency, etcd fsync latency, and container runtime disk usage.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-platform-rules
  namespace: monitoring
spec:
  groups:
  - name: kubernetes-platform
    rules:
    - alert: KubeAPIServerHighLatency
      expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb)) > 1
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "API Server P99 latency is high"
    - alert: EtcdHighFsyncLatency
      expr: histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le)) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Etcd fsync latency exceeds 50ms"
    - alert: NodeDiskPressureRisk
      expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/containerd"} / node_filesystem_size_bytes{mountpoint="/var/lib/containerd"}) < 0.15
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Container runtime disk usage is too high"

Logging recommendations

Standard output/error for application logs.

Separate system logs from container logs.

Isolate audit and security logs.

Use Loki, ELK or OpenSearch with hot/cold tiers and retention policies.

Common pitfalls and troubleshooting

Slow or failed image pulls

Test pull with ctr -n k8s.io images pull <image>.

Verify /etc/containerd/certs.d configuration.

Check network and TLS to the private registry.

Confirm base images are synchronized.

Frequent Node NotReady

Kubelet resource limits.

CNI failures.

Clock drift.

Disk pressure.

API server or etcd timeouts.

Inspect journalctl -u kubelet, journalctl -u containerd, CNI pod logs, disk/inode usage, and API latency.

Performance degradation after scaling

Image pull bandwidth saturation.

Insufficient CoreDNS replicas.

Missing NodeLocal DNSCache.

Images not pre‑warmed on new nodes.

CNI BPF map overload.

Massive Pod Pending

ResourceQuota limits.

Taint/toleration mismatches.

Node label selectors.

PVC binding failures.

PDB blocking rollouts.

Image pull errors.

Roadmap to production

Phase 1 – Stabilize the cluster

3‑master HA with load balancer.

Private enterprise registry.

Standardized deployment and scaling scripts.

Prometheus‑Grafana‑Alertmanager stack.

Etcd backup schedule.

Phase 2 – Enforce application governance

Namespace segmentation.

Quota, LimitRange and NetworkPolicy enforcement.

Release, rollback and resource‑request standards.

Standardized HPA, PDB, probes and graceful termination.
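A minimal PDB matching these standards, with illustrative names:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service
  namespace: prod-business
spec:
  minAvailable: 2          # keep at least 2 replicas through voluntary disruptions
  selector:
    matchLabels:
      app: order-service
```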

Phase 3 – Engineer the platform

GitOps for declarative cluster state.

Cluster audit and change‑approval workflow.

Image signing and scanning pipeline.

Multi‑environment delivery pipelines.

Regular failure‑simulation exercises.

Phase 4 – Build scalability features

Multiple resource pools (system, stateless, batch, middleware).

Multi‑datacenter or multi‑cluster federation.

Capacity modeling and auto‑scaling.

Automatic node onboarding.

Cost‑optimized elasticity.

Final checklist before production launch

Control plane ≥ 3 nodes, load balancer without single point of failure.

Etcd on SSD/NVMe with backup policy.

Enterprise registry reachable and critical images mirrored.

All nodes share OS baseline and synchronized clocks.

Containerd, Kubelet and CNI tuned for production workloads.

CoreDNS replicas and optional NodeLocal DNSCache deployed.

Namespace, ResourceQuota, LimitRange and NetworkPolicy in place.

Monitoring, logging, alerting and audit pipelines operational.

Validated scripts for scaling, node replacement, upgrade and rollback.

Completed at least one Etcd restore drill and one master‑failure drill.

Core business Deployments include probes, PDB, HPA and graceful termination.
