Common Kubernetes Pitfalls and How to Fix Them
This article outlines frequent Kubernetes operational mistakes, including misconfigured resource requests, missing probes, improper load-balancer exposure, naïve autoscaling, IAM/RBAC misuse, lack of anti-affinity, absent PodDisruptionBudgets, multi-tenant pitfalls, and a suboptimal externalTrafficPolicy, and gives concrete remediation steps and best-practice code examples for each.
1. Resource Requests and Limits
Many users omit CPU requests or set them too low, which over-commits nodes and leads to CPU contention, throttling, and increased latency; similarly, memory limits set too low lead to OOM kills. The examples below illustrate the resulting QoS classes, starting with BestEffort (empty resources):
resources: {}
Very low CPU example:
resources:
  requests:
    cpu: "1m"
Burstable (prone to OOM):
resources:
  requests:
    memory: "128Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: 2
Guaranteed (requests equal limits):
resources:
  requests:
    memory: "128Mi"
    cpu: 2
  limits:
    memory: "128Mi"
    cpu: 2
Use kubectl top pods and kubectl top nodes to monitor current usage, and consider Prometheus or DataDog for historical data; VerticalPodAutoscaler can automate adjustments.
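As a rough illustration of the VerticalPodAutoscaler suggestion, the sketch below runs it in recommendation-only mode so that it reports suggested requests without evicting pods. It assumes the VPA components are installed in the cluster (they are not part of a default installation); the names my-app and my-app-vpa are hypothetical.

```yaml
# Sketch only: assumes the VerticalPodAutoscaler CRD and controllers are installed.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical Deployment to observe
  updatePolicy:
    updateMode: "Off"       # compute recommendations only; never evict or resize pods
```

The recommendations can then be read with kubectl describe vpa my-app-vpa and compared against the requests currently set on the workload.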
2. Liveness and Readiness Probes
Without probes, a container that is still running but hung, or not yet ready to serve, is neither restarted nor removed from service. A liveness probe restarts the container when it fails; a readiness probe removes the pod from Service endpoints until it passes again. Both run throughout the pod lifecycle and should be configured carefully to avoid cascading failures.
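A minimal sketch of both probes on one container might look like the following. The image, port, and the /ready and /healthz paths are assumptions for illustration; the application has to actually serve those endpoints, and the timings need tuning per workload.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                              # hypothetical name
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:1.0    # hypothetical image
      ports:
        - containerPort: 8080
      readinessProbe:                           # failing: pod is removed from Service endpoints
        httpGet:
          path: /ready                          # assumed endpoint
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 10
      livenessProbe:                            # failing repeatedly: container is restarted
        httpGet:
          path: /healthz                        # assumed endpoint
          port: 8080
        initialDelaySeconds: 15
        periodSeconds: 20
        failureThreshold: 3
```

Keeping the liveness endpoint independent of downstream dependencies such as databases is one way to reduce the risk of cascading restarts when a shared dependency is briefly unavailable.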
3. Expose All HTTP Services via Load Balancer
Creating many type: LoadBalancer services can be costly; instead, expose services as NodePort behind a single external load balancer using an ingress controller (e.g., nginx‑ingress or Traefik) and route internally with ClusterIP services.
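The single-entry-point setup described above could be sketched as one Ingress that routes to several ClusterIP services. The host name, service names, and the nginx ingress class are assumptions; they depend on which ingress controller is installed and how it is configured.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress               # hypothetical name
spec:
  ingressClassName: nginx         # assumes an nginx ingress controller with this class
  rules:
    - host: app.example.com       # hypothetical host
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: api-svc     # hypothetical ClusterIP service
                port:
                  number: 80
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-svc     # hypothetical ClusterIP service
                port:
                  number: 80
```

With this layout only the ingress controller itself needs an external load balancer; the application services stay internal.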
4. Cluster Autoscaling Without Kubernetes Awareness
External autoscalers that only look at average CPU usage may fail to scale up when pods are pending because their resource requests cannot be satisfied. Use the native cluster-autoscaler, which respects affinities, taints, tolerations, and QoS constraints for both scaling up and graceful scaling down.
5. Avoid Misusing IAM/RBAC
Prefer IAM roles and service accounts with temporary credentials over long‑lived user keys. Example ServiceAccount with IAM role annotation:
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role
  name: my-serviceaccount
  namespace: default
Grant only the minimal permissions required; avoid cluster-admin for routine workloads.
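On the Kubernetes side, minimal permissions could look like a namespaced Role bound to the service account instead of a cluster-wide role. This is a sketch under assumptions: the workload is assumed to need only read access to ConfigMaps, and the Role and RoleBinding names are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader                      # hypothetical name
  namespace: default
rules:
  - apiGroups: [""]                           # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]           # read-only; no create, update, or delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-serviceaccount-configmap-reader    # hypothetical name
  namespace: default
subjects:
  - kind: ServiceAccount
    name: my-serviceaccount
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```

The effective permissions can be spot-checked with kubectl auth can-i list configmaps --as=system:serviceaccount:default:my-serviceaccount.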
6. Pod Self Anti‑Affinities
Define explicit anti‑affinity rules to spread replicas across nodes, preventing a single node failure from taking down all pods:
# omitted for brevity
      labels:
        app: zk
# omitted for brevity
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: "app"
                    operator: In
                    values:
                      - zk
              topologyKey: "kubernetes.io/hostname"
7. No PodDisruptionBudget
Define a PodDisruptionBudget to guarantee a minimum number of replicas during node maintenance:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zk
8. Multiple Tenants or Environments in One Cluster
Namespaces do not provide strong isolation; achieve fairness and isolation via resource quotas, priority classes, affinities, taints, tolerations, and possibly separate clusters for production vs. non‑production workloads.
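As one example of the quota mechanism mentioned above, a ResourceQuota caps what a single tenant namespace can request in total. The namespace name and every number here are hypothetical and would need sizing against real cluster capacity.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota        # hypothetical name
  namespace: team-a         # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "10"      # sum of CPU requests across all pods in the namespace
    requests.memory: 20Gi
    limits.cpu: "20"        # sum of CPU limits across all pods in the namespace
    limits.memory: 40Gi
    pods: "50"              # maximum number of pods
```

Once a quota covers CPU or memory, pods in that namespace that do not specify the corresponding requests or limits are rejected, so a quota is commonly paired with a LimitRange that supplies defaults.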
9. externalTrafficPolicy: Cluster
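With the default externalTrafficPolicy: Cluster, the load balancer may send traffic to any node, which then forwards it to a pod that can sit on a different node; the extra hop adds latency and can add cross-zone transfer cost. Switching to externalTrafficPolicy: Local sends traffic only to nodes that run the target pods, which improves latency, reduces egress charges, and preserves the client source IP.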
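A minimal sketch of a Service using the Local policy follows. The service name, selector label, and ports are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-lb                    # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local    # only nodes running a matching pod receive traffic
  selector:
    app: web                      # hypothetical pod label
  ports:
    - port: 80
      targetPort: 8080            # assumed container port
```

Because traffic is no longer rebalanced across nodes, load can become uneven when replicas are spread unevenly, so this setting tends to work best together with anti-affinity rules that distribute the pods.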
10. Treating the Cluster as a Pet and Control‑Plane Pressure
Avoid naming clusters with whimsical names that hide their purpose; treat clusters as production assets, regularly test disaster recovery, and prune unused objects (e.g., old ConfigMaps/Secrets) to keep the control plane responsive.
Conclusion
Kubernetes is not a silver bullet; misconfigured applications can still cause problems. Invest time in proper resource management, probes, autoscaling, security, multi‑tenant isolation, and control‑plane hygiene to achieve a truly cloud‑native, reliable deployment.