
Common Kubernetes Pitfalls and How to Fix Them

This article outlines frequent Kubernetes operational mistakes—such as misconfigured resource requests, missing probes, improper load‑balancer exposure, naïve autoscaling, IAM/RBAC misuse, lack of anti‑affinity, absent PodDisruptionBudgets, multi‑tenant pitfalls, and suboptimal externalTrafficPolicy—providing concrete remediation steps and best‑practice code examples.

Java Architect Essentials

1. Resource Requests and Limits

Many users omit CPU requests or set them too low, causing node over-commitment, CPU throttling, and increased latency; similarly, missing or overly tight memory limits lead to OOM kills. The snippets below show how resource settings map to QoS classes, from riskiest to safest:

BestEffort (no requests or limits; first to be evicted under memory pressure):

resources: {}

Very low CPU request (the scheduler over-packs the node and the pod is throttled):

resources:
  requests:
    cpu: "1m"

Burstable (requests lower than limits; OOM-killed before Guaranteed pods under node pressure):

resources:
  requests:
    memory: "128Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: 2

Guaranteed (requests equal limits):

resources:
  requests:
    memory: "128Mi"
    cpu: 2
  limits:
    memory: "128Mi"
    cpu: 2

Use kubectl top pods and kubectl top nodes to monitor current usage, and rely on Prometheus or Datadog for historical data; the VerticalPodAutoscaler can automate request adjustments based on observed consumption.
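A minimal VerticalPodAutoscaler sketch (this assumes the VPA components are installed in the cluster; the Deployment name my-app is illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"   # apply recommendations by evicting and recreating pods

With updateMode: "Off" the VPA only publishes recommendations, which is a safer way to start before letting it act on workloads.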

2. Liveness and Readiness Probes

Without probes, unhealthy pods are not restarted or removed from service. Liveness restarts failing pods; readiness removes them from endpoints until healthy. Both run throughout the pod lifecycle and should be configured carefully to avoid cascading failures.
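A minimal sketch of both probes on a hypothetical HTTP service (the paths /healthz and /ready and port 8080 are illustrative, not a standard):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10   # give the app time to start before the first check
  periodSeconds: 10
  failureThreshold: 3       # restart only after three consecutive failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5          # remove from Service endpoints while not ready

Keep the liveness check cheap and dependency-free: a liveness probe that calls a downstream database can restart healthy pods during a database outage, turning one failure into a cascading one.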

3. Expose All HTTP Services via Load Balancer

Creating many type: LoadBalancer services can be costly; instead, expose services as NodePort behind a single external load balancer using an ingress controller (e.g., nginx‑ingress or Traefik) and route internally with ClusterIP services.
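A sketch of an Ingress resource routing one hostname to a ClusterIP service behind a single ingress controller (the host, service name, and class nginx are illustrative assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
spec:
  ingressClassName: nginx   # assumes an nginx ingress controller is installed
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web       # plain ClusterIP service inside the cluster
            port:
              number: 80

One cloud load balancer then fronts the ingress controller, and every additional HTTP service costs only a new Ingress rule rather than a new load balancer.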

4. Cluster Autoscaling Without Kubernetes Awareness

External autoscalers that only look at average CPU usage may fail to scale when pods are pending because their resource requests cannot be satisfied. Use the native cluster-autoscaler, which scales on pending pods and respects affinities, taints, tolerations, and QoS constraints for both scaling up and graceful scaling down.

5. Avoid Misusing IAM/RBAC

Prefer IAM roles and service accounts with temporary credentials over long‑lived user keys. Example ServiceAccount with IAM role annotation:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role
  name: my-serviceaccount
  namespace: default

Grant only the minimal permissions required; avoid cluster‑admin for routine workloads.
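A least-privilege sketch binding the ServiceAccount above to read-only ConfigMap access (the Role and binding names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader
  namespace: default
rules:
- apiGroups: [""]           # "" is the core API group
  resources: ["configmaps"]
  verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-configmaps
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-serviceaccount
  namespace: default
roleRef:
  kind: Role
  name: configmap-reader
  apiGroup: rbac.authorization.k8s.io

A namespaced Role like this limits the blast radius of a compromised pod far more than a ClusterRoleBinding to cluster-admin ever could.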

6. Pod Self Anti‑Affinities

Define explicit anti‑affinity rules to spread replicas across nodes, preventing a single node failure from taking down all pods:

# omitted for brevity
labels:
  app: zk
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: "app"
          operator: In
          values:
          - zk
      topologyKey: "kubernetes.io/hostname"

7. No PodDisruptionBudget

Define a PodDisruptionBudget to guarantee a minimum number of replicas during node maintenance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

8. Multiple Tenants or Environments in One Cluster

Namespaces do not provide strong isolation; achieve fairness and isolation via resource quotas, priority classes, affinities, taints, tolerations, and possibly separate clusters for production vs. non‑production workloads.
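A ResourceQuota sketch capping one tenant's namespace (the namespace name team-a and the numbers are illustrative; size them from observed usage):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"       # sum of CPU requests across all pods
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"               # cap the total pod count in the namespace

Note that once a quota covers CPU or memory, every pod in the namespace must declare requests and limits for those resources, or it will be rejected at admission.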

9. externalTrafficPolicy: Cluster

With the default Cluster policy, traffic arriving at any node may be forwarded to a pod on another node, adding an extra network hop and SNAT that obscures the client source IP. Switching to externalTrafficPolicy: Local keeps traffic on the nodes that actually run the pods, preserving the client IP and avoiding the hop, at the cost of potentially uneven load across pods.
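A sketch of a Service using the Local policy (the name, selector, and ports are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client IP; skip the extra node hop
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080

Cloud load balancer health checks will mark nodes without a matching pod as unhealthy, which is how traffic ends up only on nodes that can serve it.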

10. Treating the Cluster as a Pet and Control‑Plane Pressure

Avoid naming clusters with whimsical names that hide their purpose; treat clusters as production assets, regularly test disaster recovery, and prune unused objects (e.g., old ConfigMaps/Secrets) to keep the control plane responsive.

Conclusion

Kubernetes is not a silver bullet; misconfigured applications can still cause problems. Invest time in proper resource management, probes, autoscaling, security, multi‑tenant isolation, and control‑plane hygiene to achieve a truly cloud‑native, reliable deployment.

Tags: Kubernetes, Resource Management, Autoscaling, Best Practices, Security, Probes
Written by

Java Architect Essentials

Committed to sharing quality articles and tutorials to help Java programmers progress from junior to mid-level to senior architect. We curate high-quality learning resources, interview questions, videos, and projects from across the internet to help you systematically improve your Java architecture skills. Follow and reply '1024' to get Java programming resources. Learn together, grow together.
