20 Essential Kubernetes Ops Tips to Keep Production Clusters Stable
This guide compiles twenty practical Kubernetes operations tips drawn from real‑world production experience, covering high availability, performance tuning, monitoring, automation, security, and advanced learning to help teams build and maintain reliable, resilient clusters.
1. High Availability & Stability: Common Failure Points
1. Build a truly HA architecture
Production baseline: run etcd with an odd number of members, at least three, so the cluster keeps quorum when one member fails. Run at least two control-plane nodes, and spread all components across different availability zones or physical machines.
etcd ≥ 3 nodes (odd)
control‑plane ≥ 2 nodes
Distribute across zones/hosts
Lesson: a single‑node etcd is a single point of failure that can cause a cluster‑wide outage.
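As one way to picture the baseline above, here is a minimal sketch of a kubeadm configuration that points the control plane at a three-member external etcd cluster. The article does not name an installer, so kubeadm is an assumption; the endpoint addresses and the load-balancer hostname are hypothetical placeholders, and the certificate paths are the ones kubeadm conventionally uses.

```yaml
# Sketch only: assumes kubeadm and an external etcd cluster (not stated in the article).
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
# Hypothetical load-balanced address in front of all API servers.
controlPlaneEndpoint: "k8s-api.example.internal:6443"
etcd:
  external:
    # Three members (odd count), ideally one per availability zone. Addresses are placeholders.
    endpoints:
      - https://10.0.1.10:2379
      - https://10.0.2.10:2379
      - https://10.0.3.10:2379
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

With three members, etcd keeps quorum through the loss of any single member; a two-member cluster tolerates no failures at all, which is why the member count should be odd.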
2. Use Pod Affinity & Anti‑Affinity wisely
Goal: avoid single‑point failures by preventing pods of the same service from landing on the same host.
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: order-service
      topologyKey: kubernetes.io/hostname
3. Configure PodDisruptionBudget (PDB)
Missing PDB often leads to “upgrade = downtime”.
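apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: order-service-pdb  # illustrative name
spec:
  minAvailable: 2
  selector:
    matchLabels: {app: order-service}
Without a PDB, a node drain during an upgrade can evict every replica of a workload at once.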
4. Rolling updates with graceful termination
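Deployments should set three things:
maxSurge
maxUnavailable
preStop + terminationGracePeriodSeconds
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 10"]
The preStop sleep delays SIGTERM so that endpoints and load balancers can stop routing traffic to the pod before it exits. Keep terminationGracePeriodSeconds longer than the sleep plus the application's own shutdown time.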
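To show where the rolling-update settings named above sit relative to each other, here is a minimal Deployment sketch. The name, labels, image, replica count, and the specific values for maxSurge, maxUnavailable, and the grace period are illustrative assumptions, not recommendations from the article.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service            # illustrative name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                # at most one extra pod during the rollout
      maxUnavailable: 0          # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: order-service
    spec:
      # Covers the 10 s preStop sleep plus the application's own shutdown time.
      terminationGracePeriodSeconds: 30
      containers:
        - name: order-service
          image: registry.example.com/order-service:1.0.0   # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sh", "-c", "sleep 10"]
```

The exec hook needs a shell inside the image, so a distroless image without `sh` would need a different preStop mechanism.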
5. Autoscaling beyond a single HPA
HPA: scales the number of pods
VPA: adjusts resource requests
Cluster Autoscaler: adds nodes
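Running HPA without defined requests is like driving blind: utilization targets are calculated as a percentage of each container's request, so without requests the HPA has nothing to measure against.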
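As a sketch of the HPA side, the manifest below scales a Deployment on average CPU utilization. The target name, replica bounds, and the 70% threshold are assumed values for illustration; the target Deployment's containers must declare CPU requests for the utilization figure to be computable.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service          # assumed to exist and to set CPU requests
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the containers' CPU requests
```

If the Cluster Autoscaler is also running, pods that the HPA adds but the scheduler cannot place are what trigger new nodes.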
2. Performance & Resource Optimization
6. Set requests and limits for every pod
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"
Pods that omit both fall into the BestEffort QoS class and are the first to be evicted when the node runs out of resources.
7. Namespace ResourceQuota
Prevents a single team from exhausting cluster resources.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota   # illustrative name
  namespace: team-a    # illustrative namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
8. Network plugin & kube‑proxy tuning
Calico: feature‑rich
Flannel: simple
kube‑proxy: IPVS > iptables
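For clusters with heavy traffic or a large number of Services, IPVS is usually the preferred mode, because it looks up rules in hash tables instead of walking iptables chains sequentially.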
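Switching kube-proxy to IPVS is a configuration change. A minimal sketch of the relevant part of a kube-proxy configuration follows; the round-robin scheduler is an assumed choice, and how the file reaches kube-proxy (for example through a ConfigMap) depends on how the cluster was installed.

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"   # round robin; an assumed choice for this sketch
```

The nodes need the IPVS kernel modules loaded for this mode to work.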
9. Image and storage best practices
Images: use minimal base images such as alpine or distroless
Storage: SSD + high-performance StorageClass
Access mode: choose it deliberately, not arbitrarily; ReadWriteOnce, for example, allows read-write mounting by a single node only
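To make the storage points concrete, here is a sketch of a PersistentVolumeClaim that requests a single-node read-write volume from an SSD-backed class. The claim name, the size, and the `fast-ssd` StorageClass name are hypothetical; the class itself would have to be defined for whatever provisioner the cluster uses.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: order-db-data            # illustrative name
spec:
  accessModes:
    - ReadWriteOnce              # one node mounts the volume read-write
  storageClassName: fast-ssd     # hypothetical SSD-backed StorageClass
  resources:
    requests:
      storage: 50Gi
```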
10. Regularly clean up unused resources
Completed Jobs
Evicted Pods
Dangling images
Neglected cleanup leads to slow clusters and full disks.
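For completed Jobs, part of the cleanup can be delegated to Kubernetes itself. The sketch below sets a time-to-live on a Job so that it and its pods are deleted automatically after it finishes; the Job name, command, and the one-hour TTL are assumptions for illustration.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report           # illustrative name
spec:
  ttlSecondsAfterFinished: 3600  # delete the Job and its pods one hour after it finishes
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: report
          image: busybox:1.36
          command: ["sh", "-c", "echo generating report"]
```

Evicted pods and dangling images are not covered by this setting and still need their own cleanup routine.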
3. Monitoring & Troubleshooting
11. Core monitoring & alerting stack
Prometheus
Grafana
Alertmanager
No alerts means you are unaware of failures.
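As an example of what a first alert could look like, here is a sketch of a Prometheus rule file that fires when a container keeps restarting. It assumes kube-state-metrics is installed (the article does not mention it), since that is where the restart counter comes from; the thresholds and the severity label are illustrative.

```yaml
groups:
  - name: pod-health
    rules:
      - alert: PodRestartingFrequently
        # kube_pod_container_status_restarts_total is exported by kube-state-metrics (assumed installed).
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container in {{ $labels.namespace }}/{{ $labels.pod }} restarted more than 3 times in 15 minutes"
```

Alertmanager then decides where alerts carrying this severity label are routed.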
12. Centralized logging
Choose either EFK or Loki to enable log queries per pod, node, and namespace.
13. kubectl troubleshooting commands
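kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
kubectl exec -it <pod-name> -n <namespace> -- sh
kubectl get events -n <namespace> --sort-by=.lastTimestamp
14. etcd backup – cluster insurance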
etcdctl snapshot save backup.db
Schedule regular backups
Store snapshots on a different machine
Practice restore procedures
etcd holds the state of every object in the cluster, so losing it without a backup means losing all of the applied manifests.
15. Visual tools are not decorative
K9s – terminal UI
Lens – multi‑cluster management
Kuboard – China‑friendly UI
4. Automation & CI/CD
16. GitOps as the preferred Kubernetes workflow
Git is the single source of truth
Argo CD / Flux for continuous delivery
Rollback‑able releases are true releases.
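With Argo CD, the link between a Git path and a cluster namespace is itself a manifest. The sketch below assumes Argo CD is installed in an `argocd` namespace; the repository URL, path, and target namespace are hypothetical placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: order-service            # illustrative name
  namespace: argocd              # assumes Argo CD runs in this namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deployments.git   # hypothetical repository
    targetRevision: main
    path: apps/order-service
  destination:
    server: https://kubernetes.default.svc
    namespace: team-a            # illustrative namespace
  syncPolicy:
    automated:
      prune: true                # remove resources that were deleted from Git
      selfHeal: true             # revert manual changes made in the cluster
```

A rollback in this model is a Git revert: once the commit is reverted, the controller syncs the cluster back to the previous state.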
17. Helm for complex applications
Parameterization
Versioning
Rollback capability
If an application needs more than about five YAML files, Helm saves you from painful manual updates.
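Versioning in Helm starts in the chart's metadata file. A minimal `Chart.yaml` sketch follows; the chart name, description, and version numbers are illustrative assumptions.

```yaml
apiVersion: v2
name: order-service              # illustrative chart name
description: Helm chart for the order service
type: application
version: 0.3.0                   # version of the chart itself; bump on every chart change
appVersion: "1.4.2"              # version of the application the chart deploys
```

Keeping the chart version separate from the application version is what lets a release be rolled back to an earlier revision.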
18. Connect CI/CD pipelines to Kubernetes
Jenkins / GitLab CI: build, scan, deploy, rollback
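The four steps above can be sketched as a `.gitlab-ci.yml` pipeline. This is a sketch under several assumptions: the runner has `docker`, `trivy`, and `kubectl` available and is already authenticated against the cluster; Trivy as the scanner is my choice, not the article's; and the Deployment and container are both named `order-service`.

```yaml
stages: [build, scan, deploy]

variables:
  # Tag each image with the commit it was built from.
  IMAGE: "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"

build:
  stage: build
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$IMAGE" .
    - docker push "$IMAGE"

scan:
  stage: scan
  script:
    # Fail the pipeline on high or critical findings (Trivy is an assumed tool).
    - trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE"

deploy:
  stage: deploy
  script:
    - kubectl set image deployment/order-service order-service="$IMAGE"
    - kubectl rollout status deployment/order-service --timeout=120s

rollback:
  stage: deploy
  when: manual                   # triggered by hand if the release misbehaves
  script:
    - kubectl rollout undo deployment/order-service
```

In a GitOps setup, the deploy job would commit the new image tag to the Git repository instead of calling `kubectl` directly.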
5. Security & Compliance
19. RBAC least‑privilege principle
Avoid granting cluster-admin indiscriminately
Assign ServiceAccounts specific permissions
Uncontrolled permissions expand the internal attack surface.
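A least-privilege setup usually pairs a dedicated ServiceAccount with a namespaced Role. In the sketch below, the account can only read ConfigMaps in its own namespace; all names and the namespace are illustrative assumptions.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: order-service            # illustrative name
  namespace: team-a              # illustrative namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: team-a
rules:
  - apiGroups: [""]              # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: order-service-configmap-reader
  namespace: team-a
subjects:
  - kind: ServiceAccount
    name: order-service
    namespace: team-a
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

A Role and RoleBinding keep the grant inside one namespace, whereas a ClusterRoleBinding to cluster-admin applies everywhere.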
20. NetworkPolicy + TLS
Default deny all traffic
Allow only as needed
Ingress must use TLS
Without network isolation, a single compromised pod can jeopardize the entire cluster.
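"Default deny, then allow as needed" could look like the two policies below. The namespace, labels, and port are illustrative assumptions. The policies only take effect if the network plugin enforces NetworkPolicy; Calico does, while Flannel on its own does not.

```yaml
# 1) Deny all ingress and egress for every pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: team-a              # illustrative namespace
spec:
  podSelector: {}                # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
---
# 2) Allow only the gateway to reach the order service on its port.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-order-service
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: order-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway   # hypothetical client workload
      ports:
        - protocol: TCP
          port: 8080             # assumed service port
```

The default-deny policy also blocks DNS lookups, so workloads that resolve names need an additional egress rule toward the cluster DNS service.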
Advanced Learning Paths
Service mesh: Istio / Linkerd
Stateful workloads: StatefulSet + Operator
Multi‑cluster management: Cluster API / Karmada
80% of Kubernetes incidents stem from the 20% of basic operations that are not done properly.
If you can only start with five items, prioritize: PDB, requests/limits, etcd backup, monitoring & alerts, and ResourceQuota.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Ray's Galactic Tech
