Mastering Kubernetes High Availability: Control Plane, Nodes, Networking, Storage, and More

This comprehensive guide walks you through designing a highly available Kubernetes cluster, covering multi‑master control‑plane deployment, worker‑node resilience, advanced networking with Cilium, durable storage with Rook/Ceph, monitoring with Thanos, security policies, disaster‑recovery strategies, cost control, and automated rollouts, all illustrated with concrete configuration snippets and real‑world performance results.

dbaplus Community

Jun 3, 2025

Control‑Plane High‑Availability Design

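Multi-master deployment : Deploy three master nodes, one per availability zone, so that losing a single zone still leaves two of the three etcd members and quorum is preserved. Label the etcd nodes with topology.kubernetes.io/zone to enforce that distribution.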

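# etcd configuration (/etc/etcd/etcd.conf)
# Both timing values are plain integers in milliseconds, without a unit suffix.
ETCD_HEARTBEAT_INTERVAL="500"
ETCD_ELECTION_TIMEOUT="2500"
ETCD_MAX_REQUEST_BYTES="157286400"  # max client request size in bytes (150 MiB; etcd's default is 1572864, i.e. 1.5 MiB)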
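The choice of three members and the 500/2500 timing pair can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the "election timeout at least 5x the heartbeat interval" rule is assumed from etcd's startup validation.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """Number of members that can fail while a majority stays reachable."""
    return members - quorum(members)


def election_timeout_ok(heartbeat_ms: int, election_ms: int) -> bool:
    """Assumed rule: the election timeout must be at least 5x the heartbeat."""
    return election_ms >= 5 * heartbeat_ms


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, "members -> quorum", quorum(size),
              ", tolerates", fault_tolerance(size), "failure(s)")
    print("500/2500 ms valid:", election_timeout_ok(500, 2500))
```

A four-member cluster tolerates no more failures than a three-member one, which is why odd member counts are the usual choice.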

API Server load‑balancing (Nginx example) :

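# Nginx upstream with passive (max_fails/fail_timeout) and active (check) health checks.
# The check* directives are not part of stock Nginx; they come from the
# third-party nginx_upstream_check_module (bundled with Tengine).
# NOTE: port 6443 serves TLS, while type=http sends a plain-HTTP probe; verify the
# probe type against your setup (the module also offers tcp and ssl_hello checks).
upstream kube-apiserver {
  server 10.0.1.10:6443 max_fails=3 fail_timeout=10s;
  server 10.0.2.10:6443 max_fails=3 fail_timeout=10s;
  # list the third master here as well so all three API servers receive traffic
  check interval=5000 rise=2 fall=3 timeout=3000 type=http;
  check_http_send "GET /readyz HTTP/1.0\r\n\r\n";
  check_http_expect_alive http_2xx http_3xx;
}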
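The rise=2 and fall=3 parameters describe a small hysteresis: a backend is marked down only after three consecutive failed probes, and marked up again only after two consecutive successes. A minimal sketch of that state machine follows; the class and function names are hypothetical, and this is not the module's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    rise: int = 2       # consecutive successes needed to mark a backend up
    fall: int = 3       # consecutive failures needed to mark a backend down
    up: bool = True     # backends are assumed healthy at start (an assumption)
    successes: int = 0
    failures: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the resulting up/down state."""
        if probe_ok:
            self.successes += 1
            self.failures = 0
            if not self.up and self.successes >= self.rise:
                self.up = True
        else:
            self.failures += 1
            self.successes = 0
            if self.up and self.failures >= self.fall:
                self.up = False
        return self.up


def replay(results):
    """Return the backend state after each probe in `results`."""
    state = HealthState()
    return [state.record(ok) for ok in results]


if __name__ == "__main__":
    print(replay([False, False, False, True, True]))
```

A single failed probe between successes never removes a backend, which keeps brief API server hiccups from draining traffic.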

Worker‑Node High‑Availability Design

Cluster Autoscaler advanced strategy : Reserve a dedicated GPU node pool for AI training workloads.

# AWS EKS node‑group configuration
- name: gpu-nodegroup
  instanceTypes: ["p3.2xlarge"]
  labels:
    node.kubernetes.io/accelerator: "nvidia"
  taints:
  - key: dedicated
    value: gpu
    effect: NoSchedule
  scalingConfig:
    minSize: 1
    maxSize: 5

Custom HPA metric (Prometheus‑based QPS) :

metrics:
- type: Pods
  pods:
    metric:
      name: http_requests_per_second
    target:
      type: AverageValue
      averageValue: 500
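With an AverageValue target of 500 requests per second per pod, the replica count follows the formula documented for the Horizontal Pod Autoscaler: desired = ceil(current replicas x current average / target average), with no change when the ratio is within a tolerance of 1.0. The sketch below assumes the default tolerance of 0.1; the function name is hypothetical.

```python
import math
from fractions import Fraction


def desired_replicas(current_replicas: int, current_avg: int, target_avg: int,
                     tolerance: Fraction = Fraction(1, 10)) -> int:
    """HPA scaling formula, using exact fractions to avoid float rounding."""
    ratio = Fraction(current_avg, target_avg)
    if abs(ratio - 1) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # 4 pods averaging 750 req/s each against a 500 req/s target
    print(desired_replicas(4, 750, 500))
    # 4 pods averaging 520 req/s each: within tolerance, no change
    print(desired_replicas(4, 520, 500))
```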

Pod scheduling constraints (topology spread) :

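spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:          # without a selector the constraint matches no pods
      matchLabels:
        app: frontend       # example label; use the label of the pods being spread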
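To see what maxSkew: 1 permits, the check can be written out by hand: a pod may land in a zone only if that zone's matching-pod count, including the new pod, exceeds the smallest zone's count by at most maxSkew. This is a simplified sketch with a hypothetical function name; it assumes every eligible zone appears in the input, including zones with zero pods.

```python
def placement_allowed(pods_per_zone: dict, zone: str, max_skew: int = 1) -> bool:
    """Return True if scheduling one more pod into `zone` respects max_skew."""
    global_minimum = min(pods_per_zone.values())
    skew = pods_per_zone[zone] + 1 - global_minimum
    return skew <= max_skew


if __name__ == "__main__":
    zones = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
    for name in zones:
        print(name, placement_allowed(zones, name))
```

With DoNotSchedule, a pod that fails this check in every zone stays Pending rather than unbalancing the spread.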

GPU node tainting and tolerations for AI workloads :

# Label and taint the GPU node
kubectl label nodes gpu-node1 accelerator=nvidia
kubectl taint nodes gpu-node1 dedicated=ai:NoSchedule

# Pod spec snippet
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "ai"
    effect: "NoSchedule"
  containers:
  - resources:
      limits:
        nvidia.com/gpu: 1
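Whether a pod may run on a tainted node comes down to a per-taint match between toleration and taint. The sketch below mirrors the matching rules as a plain function so they can be tested in isolation; the function name and the dictionary layout are assumptions for illustration.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if `toleration` matches `taint`."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # Equal (the default operator) requires both key and value to match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


if __name__ == "__main__":
    ai_toleration = {"key": "dedicated", "operator": "Equal",
                     "value": "ai", "effect": "NoSchedule"}
    print(tolerates(ai_toleration,
                    {"key": "dedicated", "value": "ai", "effect": "NoSchedule"}))
    print(tolerates(ai_toleration,
                    {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}))
```

The second call returns False: a toleration for dedicated=ai does not match the dedicated=gpu taint used in the node-group example earlier, so the two values need to agree if those pods should land on that pool.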

Network High‑Availability Design

Cilium eBPF acceleration reduces CPU overhead by ~50 % and enables fine‑grained security policies.

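# Install Cilium via Helm
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium --namespace kube-system \
  --set kubeProxyReplacement=strict \
  --set k8sServiceHost=API_SERVER_IP \
  --set k8sServicePort=6443
# Note: Cilium 1.14 and later use kubeProxyReplacement=true instead of "strict"

Verification :

cilium status  # should show "KubeProxyReplacement: Strict" ("True" on newer releases)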

Performance comparison :

Calico: 1000 policies → 25 % throughput drop

Cilium: 1000 policies → 8 % throughput drop

AWS Global Accelerator configuration (global load‑balancer) :

resource "aws_globalaccelerator_endpoint_group" "ingress" {
  listener_arn = aws_globalaccelerator_listener.ingress.arn
  endpoint_configuration {
    endpoint_id = aws_lb.ingress.arn
    weight      = 100
  }
}

Storage High‑Availability Design

Rook/Ceph production‑grade cluster :

apiVersion: ceph.rook.io/v1
kind: CephCluster
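metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # cephVersion.image must also be set, pinned to a supported Ceph release image
  dataDirHostPath: /var/lib/rook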
  mon:
    count: 3
    allowMultiplePerNode: false
  storage:
    useAllNodes: false
    nodes:
    - name: "storage-node-1"
      devices:
      - name: "nvme0n1"

Velero cross‑region backup workflow :

# Create the backup location in the secondary AWS region first
velero backup-location create secondary --provider aws \
  --bucket velero-backup-dr \
  --config region=eu-west-1

# Schedule a daily backup at 03:00, kept for 7 days, written to that location
velero schedule create daily-backup --schedule="0 3 * * *" \
  --include-namespaces=production \
  --storage-location secondary \
  --ttl 168h

Disaster‑recovery restore command (etcd snapshot) :

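ETCDCTL_API=3 etcdctl snapshot restore snapshot.db --data-dir /var/lib/etcd-new
# etcd 3.5+ deprecates this form in favour of: etcdutl snapshot restore snapshot.db --data-dir /var/lib/etcd-new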

Monitoring & Logging

Thanos long‑term storage tuning (example arguments):

# thanos compact arguments (retention is enforced by the compactor, not the store gateway)
--retention.resolution-raw=14d
--retention.resolution-5m=180d
--objstore.config-file=/etc/thanos/s3.yml

EFK log filtering (Fluentd example) :

# Parse the JSON payload in the "log" field, keeping the record's other fields
<filter kubernetes.**>
  @type parser
  key_name log
  reserve_data true
  <parse>
    @type json
  </parse>
</filter>

Security & Compliance

OPA Gatekeeper constraint to forbid privileged containers :

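apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivilegedContainer
metadata:
  name: psp-privileged-container   # example name; metadata.name is required
spec:
  match:
    kinds: [{apiGroups: [""], kinds: ["Pod"]}]
  # The gatekeeper-library template of this kind rejects any container with
  # securityContext.privileged: true and takes no "privileged" parameter.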

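Falco runtime security (launching Falco to detect events such as a privileged container start) :

# Run Falco with the given rules file, JSON alert output, and the embedded web server (a health endpoint, not a web UI)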
falco -r /etc/falco/falco_rules.yaml \
  -o json_output=true \
  -o "webserver.enabled=true"

OPA image‑scan admission policy (reject high‑severity CVSS ≥ 7.0) :

# image_scan.rego
package kubernetes.admission

deny[msg] {
  input.request.kind.kind == "Pod"
  image := input.request.object.spec.containers[_].image
  vuln_score := data.vulnerabilities[image].maxScore
  vuln_score >= 7.0
  msg := sprintf("Image %v has high‑severity vulnerability (CVSS %.1f)", [image, vuln_score])
}
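The decision logic of the Rego rule can be mirrored in a few lines of Python, which makes it easy to unit-test the threshold behaviour outside OPA. This is a sketch, not a replacement for the policy: the function name, the image name, and the shape of the `vulnerabilities` mapping (image to {"maxScore": ...}, as in `data.vulnerabilities` above) are assumptions.

```python
def deny_messages(admission_review: dict, vulnerabilities: dict,
                  threshold: float = 7.0) -> list:
    """Return one denial message per container image at or above the threshold."""
    request = admission_review["request"]
    if request["kind"]["kind"] != "Pod":
        return []
    messages = []
    for container in request["object"]["spec"]["containers"]:
        image = container["image"]
        score = vulnerabilities.get(image, {}).get("maxScore")
        # Like the Rego rule, an image with no scan data is not denied.
        if score is not None and score >= threshold:
            messages.append(
                f"Image {image} has high-severity vulnerability (CVSS {score:.1f})")
    return messages


if __name__ == "__main__":
    review = {"request": {"kind": {"kind": "Pod"}, "object": {"spec": {"containers": [
        {"image": "registry.example.com/web:1.0"},    # hypothetical image
        {"image": "registry.example.com/cache:2.3"},  # hypothetical image
    ]}}}}
    scans = {"registry.example.com/web:1.0": {"maxScore": 9.8},
             "registry.example.com/cache:2.3": {"maxScore": 4.1}}
    print(deny_messages(review, scans))
```

Note that, as in the Rego rule, only `spec.containers` is inspected; init containers would need a second pass.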

Disaster Recovery & Chaos Engineering

Federated service traffic split (multi‑cluster) :

apiVersion: types.kubefed.io/v1beta1
kind: FederatedService
metadata:
  name: frontend
spec:
  placement:
    clusters:
    - name: cluster-us
    - name: cluster-eu
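  # NOTE: trafficSplit is not a field of the upstream KubeFed FederatedService
  # schema (template, placement, overrides); treat this block as illustrative.
  trafficSplit:
  - cluster: cluster-us
    weight: 70
  - cluster: cluster-eu
    weight: 30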
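One way to picture a 70/30 split is smooth weighted round-robin, where each cluster accumulates its weight every round and the current leader is picked and then penalised by the total. This is a generic sketch of that technique with a hypothetical function name, not a description of how any particular multi-cluster load balancer implements the split.

```python
def smooth_weighted_round_robin(weights: dict, picks: int) -> list:
    """Return the sequence of chosen clusters for `picks` requests."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    order = []
    for _ in range(picks):
        for name, weight in weights.items():
            current[name] += weight
        chosen = max(current, key=current.get)  # ties go to the first-listed cluster
        current[chosen] -= total
        order.append(chosen)
    return order


if __name__ == "__main__":
    sequence = smooth_weighted_round_robin({"cluster-us": 70, "cluster-eu": 30}, 10)
    print(sequence)
    print({name: sequence.count(name) for name in ("cluster-us", "cluster-eu")})
```

Over every ten requests the split is exactly seven to three, and the cluster-eu picks are interleaved rather than bunched at the end.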

Chaos Mesh network partition to simulate AZ failure :

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: simulate-az-failure
spec:
  action: partition
  mode: all
  selector:
    namespaces: [production]
    labelSelectors:
      "app": "frontend"
  direction: both
  duration: "10m"

PodChaos to kill a kube-apiserver pod periodically :

apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: kill-master
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces: [kube-system]
    labelSelectors:
      "component": "kube-apiserver"
  # Chaos Mesh 1.x syntax; 2.x drops spec.scheduler in favour of a separate Schedule resource
  scheduler:
    cron: "@every 10m"
  duration: "5m"

Expected results of the experiment :

API Server recovery time < 1 minute

Worker-node pod scheduling continues without interruption

Cost Control

Kubecost budget example (monthly USD 5000 for team‑A) :

apiVersion: kubecost.com/v1alpha1
kind: Budget
metadata:
  name: team-budget
spec:
  target:
    namespace: team-a
  amount:
    value: 5000
    currency: USD
  period: monthly
  notifications:
  - threshold: 80%
    message: "Team A cost has reached 80% of budget"
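The 80 % notification threshold on a USD 5000 budget corresponds to USD 4000 of spend. A minimal sketch of that check follows; the function names are hypothetical and unrelated to Kubecost's own API.

```python
def alert_threshold_usd(budget_usd: float, threshold_pct: float) -> float:
    """Spend level at which the notification fires."""
    return budget_usd * threshold_pct / 100


def alert_due(spend_usd: float, budget_usd: float = 5000,
              threshold_pct: float = 80) -> bool:
    """True once spend has reached the notification threshold."""
    return spend_usd >= alert_threshold_usd(budget_usd, threshold_pct)


if __name__ == "__main__":
    print(alert_threshold_usd(5000, 80))
    print(alert_due(3999.99), alert_due(4000))
```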

Automation

Argo Rollouts canary deployment :

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      # Background analysis runs alongside the steps and aborts the rollout on failure
      analysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: my-service
      steps:
      - setWeight: 10   # integer percentage, without a "%" sign
      - pause: {duration: 5m}
      - setWeight: 50
      - pause: {duration: 30m}
      - setWeight: 100

Automatic rollback condition : abort rollout when request error rate > 5 %.
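The rollback condition itself is a simple ratio test. In Argo Rollouts it would normally live in the AnalysisTemplate's success or failure condition; the sketch below only illustrates the arithmetic, with a hypothetical function name and request counts assumed to come from the metrics backend.

```python
def should_abort(total_requests: int, failed_requests: int,
                 max_error_rate: float = 0.05) -> bool:
    """True when the observed error rate exceeds the allowed maximum."""
    if total_requests == 0:
        return False  # no canary traffic yet, nothing to judge
    return failed_requests / total_requests > max_error_rate


if __name__ == "__main__":
    print(should_abort(1000, 50))  # exactly 5 %: not above the limit
    print(should_abort(1000, 51))  # 5.1 %: abort
```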

Key Performance Indicators

Control plane: API Server P99 latency < 500 ms

Data plane: Pod start‑up time < 5 s (cold start)

Network: Cross‑AZ latency < 10 ms

Real‑World Case Study – E‑commerce Platform

After applying the above practices, the platform achieved:

API Server availability: 99.99 % (up from 99.2 %)

Node‑failure recovery time: 2 min (down from 15 min)

Cluster scaling speed: 50 nodes/min (up from 10 nodes/min)
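The availability figures are easier to compare when converted into allowed downtime per year. A small sketch of the conversion follows; the function name is hypothetical and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Unavailable minutes per year implied by an availability percentage."""
    return (100 - availability_pct) / 100 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.2, 99.99):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct} % -> {minutes:.1f} min/year ({minutes / 60:.1f} h)")
```

Moving from 99.2 % to 99.99 % shrinks the yearly downtime budget from roughly 70 hours to under an hour.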

Recommended Toolchain

Network diagnostics: Hubble (Cilium's network observability layer)

Storage analysis: Ceph Dashboard (enabled through Rook)

Cost monitoring: Kubecost + Grafana

Policy management: OPA Gatekeeper + Kyverno

