Mastering Kubernetes High Availability: Control Plane, Nodes, Networking, Storage, and More
This comprehensive guide walks you through designing a highly available Kubernetes cluster, covering multi‑master control‑plane deployment, worker‑node resilience, advanced networking with Cilium, durable storage with Rook/Ceph, monitoring with Thanos, security policies, disaster‑recovery strategies, cost control, and automated rollouts, all illustrated with concrete configuration snippets and real‑world performance results.
Control‑Plane High‑Availability Design
Multi‑master deployment : Deploy three master nodes across availability zones and label etcd nodes with topology.kubernetes.io/zone to enforce distribution.
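Three control-plane nodes are the usual minimum because etcd commits a write only after a majority of its members acknowledge it. The sketch below works through that majority arithmetic. The function names are illustrative and are not part of any etcd tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can be lost while quorum is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5):
        print(f"{size} members: quorum={quorum(size)}, "
              f"tolerates {tolerated_failures(size)} failure(s)")
```

With three members the cluster survives the loss of one zone. A fourth member raises the quorum to three without raising the number of tolerated failures, which is why odd member counts are preferred.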
# etcd configuration (/etc/etcd/etcd.conf)
# Both intervals are plain integers in milliseconds; etcd does not accept a unit suffix.
ETCD_HEARTBEAT_INTERVAL="500"
ETCD_ELECTION_TIMEOUT="2500"
ETCD_MAX_REQUEST_BYTES="157286400" # raise the maximum accepted request size to 150 MiB
API Server load‑balancing (Nginx example) :
# Nginx upstream configuration with health checks and circuit breaking
# The check* directives require the third-party nginx_upstream_check_module; stock Nginx does not include them.
upstream kube-apiserver {
    # List one server line per control-plane node
    server 10.0.1.10:6443 max_fails=3 fail_timeout=10s;
    server 10.0.2.10:6443 max_fails=3 fail_timeout=10s;
    check interval=5000 rise=2 fall=3 timeout=3000 type=http;
    check_http_send "GET /readyz HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}
Worker‑Node High‑Availability Design
Cluster Autoscaler advanced strategy : Reserve a dedicated GPU node pool for AI training workloads.
# AWS EKS node‑group configuration
- name: gpu-nodegroup
  instanceTypes: ["p3.2xlarge"]
  labels:
    node.kubernetes.io/accelerator: "nvidia"
  taints:
    dedicated=gpu:NoSchedule
  scalingConfig:
    minSize: 1
    maxSize: 5
Custom HPA metric (Prometheus‑based QPS) :
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 500
Pod scheduling constraints (topology spread) :
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
GPU node tainting and tolerations for AI workloads :
# Label and taint the GPU node
kubectl label nodes gpu-node1 accelerator=nvidia
kubectl taint nodes gpu-node1 dedicated=ai:NoSchedule
# Pod spec snippet
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "ai"
      effect: "NoSchedule"
  containers:
    - resources:
        limits:
          nvidia.com/gpu: 1
Network High‑Availability Design
Cilium eBPF acceleration reduces CPU overhead by ~50 % and enables fine‑grained security policies.
# Install Cilium via Helm
# Cilium 1.14 and later expect kubeProxyReplacement=true; "strict" is the older spelling.
helm install cilium cilium/cilium --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=API_SERVER_IP \
  --set k8sServicePort=6443
Verification :
cilium status # should show "KubeProxyReplacement: Strict" ("True" on newer releases)
Performance comparison :
Calico: 1000 policies → 25 % throughput drop
Cilium: 1000 policies → 8 % throughput drop
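To put the two percentages in absolute terms, the sketch below applies them to a hypothetical 10 Gbit/s baseline. The baseline figure and the function name are assumptions made for illustration; only the 25 % and 8 % drops come from the comparison above.

```python
def throughput_after_drop(baseline_gbps: float, drop_percent: float) -> float:
    """Throughput left after losing `drop_percent` of the baseline."""
    if not 0 <= drop_percent <= 100:
        raise ValueError("drop_percent must be between 0 and 100")
    return baseline_gbps * (1 - drop_percent / 100)


BASELINE_GBPS = 10.0  # hypothetical link speed, not a measured value

calico_gbps = throughput_after_drop(BASELINE_GBPS, 25)
cilium_gbps = throughput_after_drop(BASELINE_GBPS, 8)

if __name__ == "__main__":
    print(f"Calico with 1000 policies: {calico_gbps:.2f} Gbit/s")
    print(f"Cilium with 1000 policies: {cilium_gbps:.2f} Gbit/s")
    print(f"Cilium retains {cilium_gbps / calico_gbps:.2f}x the Calico throughput")
```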
AWS Global Accelerator configuration (global load‑balancer) :
resource "aws_globalaccelerator_endpoint_group" "ingress" {
  listener_arn = aws_globalaccelerator_listener.ingress.arn
  endpoint_configuration {
    endpoint_id = aws_lb.ingress.arn
    weight      = 100
  }
}
Storage High‑Availability Design
Rook/Ceph production‑grade cluster :
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
spec:
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  storage:
    useAllNodes: false
    nodes:
      - name: "storage-node-1"
        devices:
          - name: "nvme0n1"
Velero cross‑region backup workflow :
# Schedule daily backup
velero schedule create daily-backup --schedule="0 3 * * *" \
  --include-namespaces=production \
  --ttl 168h
# Create backup location in secondary AWS region
velero backup-location create secondary --provider aws \
  --bucket velero-backup-dr \
  --config region=eu-west-1
Disaster‑recovery restore command (etcd snapshot) :
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir /var/lib/etcd-new
Monitoring & Logging
Thanos long‑term storage tuning (example arguments):
# thanos compact arguments (retention is enforced by the Compactor, not the Store Gateway)
--retention.resolution-raw=14d
--retention.resolution-5m=180d
--objstore.config-file=/etc/thanos/s3.yml
EFK log filtering (Fluentd example) :
# Parse JSON-formatted log lines into structured fields
<filter kubernetes.**>
  @type parser
  key_name log
  reserve_data true
  <parse>
    @type json
  </parse>
</filter>
Security & Compliance
OPA Gatekeeper constraint to forbid privileged containers :
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
spec:
  match:
    kinds: [{apiGroups: [""], kinds: ["Pod"]}]
  parameters:
    privileged: false
Falco runtime security (detect privileged container start) :
# Run Falco with JSON output and the embedded web server enabled
falco -r /etc/falco/falco_rules.yaml \
  -o json_output=true \
  -o "webserver.enabled=true"
OPA image‑scan admission policy (reject high‑severity CVSS ≥ 7.0) :
# image_scan.rego
package kubernetes.admission
deny[msg] {
  input.request.kind.kind == "Pod"
  image := input.request.object.spec.containers[_].image
  vuln_score := data.vulnerabilities[image].maxScore
  vuln_score >= 7.0
  msg := sprintf("Image %v has high-severity vulnerability (CVSS %.1f)", [image, vuln_score])
}
Disaster Recovery & Chaos Engineering
Federated service traffic split (multi‑cluster) :
apiVersion: types.kubefed.io/v1beta1
kind: FederatedService
metadata:
  name: frontend
spec:
  placement:
    clusters:
      - name: cluster-us
      - name: cluster-eu
  trafficSplit:
    - cluster: cluster-us
      weight: 70
    - cluster: cluster-eu
      weight: 30
Chaos Mesh network partition to simulate AZ failure :
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: simulate-az-failure
spec:
  action: partition
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      "app": "frontend"
  direction: both
  duration: "10m"
PodChaos to periodically kill a kube-apiserver pod :
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-master
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [kube-system]
    labelSelectors:
      "component": "kube-apiserver"
  # `scheduler` is the Chaos Mesh 1.x field; 2.x moves recurring runs to a separate Schedule resource
  scheduler:
    cron: "@every 10m"
  duration: "5m"
Expected results :
API Server recovery time < 1 minute
Worker‑node pod scheduling continuity
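One way to check the recovery-time target during a chaos run is to probe the API server at a fixed interval and measure the longest unhealthy stretch. The sketch below does only the measurement step. The probe log is hard-coded and hypothetical; in practice the samples might come from polling the `/readyz` endpoint, which is an assumption rather than something the experiments above prescribe.

```python
from typing import Iterable, Optional, Tuple

RECOVERY_TARGET_SECONDS = 60.0  # "API Server recovery time < 1 minute"


def longest_outage_seconds(samples: Iterable[Tuple[float, bool]]) -> float:
    """Longest span from a first failed probe to the next healthy probe.

    `samples` holds (timestamp_in_seconds, healthy) pairs in chronological
    order. An outage still open at the end is counted up to the last sample.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    last_ts: Optional[float] = None
    for ts, healthy in samples:
        if healthy:
            if outage_start is not None:
                longest = max(longest, ts - outage_start)
                outage_start = None
        elif outage_start is None:
            outage_start = ts
        last_ts = ts
    if outage_start is not None and last_ts is not None:
        longest = max(longest, last_ts - outage_start)
    return longest


# Hypothetical probe log: one sample every 5 s, unhealthy from t=5 until t=45.
probe_log = [(float(t), not (5 <= t < 45)) for t in range(0, 65, 5)]

if __name__ == "__main__":
    outage = longest_outage_seconds(probe_log)
    verdict = "within" if outage < RECOVERY_TARGET_SECONDS else "over"
    print(f"longest outage: {outage:.0f} s ({verdict} target)")
```

The measurement resolution equals the probe interval, so a 5 s interval can overstate or understate an outage by up to 5 s.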
Cost Control
Kubecost budget example (monthly USD 5000 for team‑A) :
apiVersion: kubecost.com/v1alpha1
kind: Budget
metadata:
  name: team-budget
spec:
  target:
    namespace: team-a
  amount:
    value: 5000
    currency: USD
  period: monthly
  notifications:
    - threshold: 80%
      message: "Team A cost has reached 80% of budget"
Automation
Argo Rollouts canary deployment :
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        # setWeight takes an integer percentage, without a % sign
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 30m}
        - setWeight: 100
      analysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: my-service
Automatic rollback condition : abort rollout when request error rate > 5 %.
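The rollback condition reduces to a single comparison on request counts. The sketch below shows only that arithmetic; the function names are illustrative, and in Argo Rollouts the comparison itself would live in the `success-rate` analysis template rather than in application code.

```python
ERROR_RATE_THRESHOLD = 0.05  # abort when more than 5 % of requests fail


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total_requests < 0 or failed_requests < 0:
        raise ValueError("request counts must be non-negative")
    if failed_requests > total_requests:
        raise ValueError("failed_requests cannot exceed total_requests")
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def should_abort(total_requests: int, failed_requests: int,
                 threshold: float = ERROR_RATE_THRESHOLD) -> bool:
    """True when the observed error rate is strictly above the threshold."""
    return error_rate(total_requests, failed_requests) > threshold


if __name__ == "__main__":
    for total, failed in [(1000, 12), (1000, 50), (1000, 51)]:
        print(f"{failed}/{total} failed -> abort={should_abort(total, failed)}")
```

Treating zero traffic as a 0 % error rate is a design choice: a canary that has received no requests is not aborted, but it has not been validated either, so a minimum request count may be worth adding.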
Key Performance Indicators
Control plane: API Server P99 latency < 500 ms
Data plane: Pod start‑up time < 5 s (cold start)
Network: Cross‑AZ latency < 10 ms
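A P99 target is checked against a percentile of raw latency samples, not against their average. The sketch below uses the nearest-rank method; the sample values and the helper names are hypothetical, and a production setup would more likely read the percentile from its metrics system.

```python
import math
from typing import Sequence

API_SERVER_P99_TARGET_MS = 500.0  # from the control-plane KPI above


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct % at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_p99_target(latencies_ms: Sequence[float],
                     target_ms: float = API_SERVER_P99_TARGET_MS) -> bool:
    """True when the 99th-percentile latency is below the target."""
    return percentile(latencies_ms, 99) < target_ms


# Hypothetical samples: one slow request in 100 passes, two in 100 fails.
one_slow = [120.0] * 99 + [900.0]
two_slow = [120.0] * 98 + [900.0] * 2

if __name__ == "__main__":
    print("one slow request: ", meets_p99_target(one_slow))
    print("two slow requests:", meets_p99_target(two_slow))
```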
Real‑World Case Study – E‑commerce Platform
After applying the above practices, the platform achieved:
API Server availability: 99.99 % (up from 99.2 %)
Node‑failure recovery time: 2 min (down from 15 min)
Cluster scaling speed: 50 nodes/min (up from 10 nodes/min)
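The availability figures are easier to compare once converted into downtime. The sketch below does the conversion for a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_percent) / 100 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.2, 99.99):
        minutes = annual_downtime_minutes(availability)
        print(f"{availability} % -> {minutes:.1f} min/year "
              f"({minutes / 60:.1f} h/year)")
```

Moving from 99.2 % to 99.99 % cuts the implied downtime from about 70 hours per year to about 53 minutes per year.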
Recommended Toolchain
Network diagnostics: Cilium Network Observability
Storage analysis: Rook Dashboard
Cost monitoring: Kubecost + Grafana
Policy management: OPA Gatekeeper + Kyverno
dbaplus Community