Avoid These 10 Common Kubernetes Mistakes to Boost Reliability and Cost Efficiency

This article shares a practical guide to the most frequent Kubernetes pitfalls—from misconfigured resource requests and limits to improper liveness/readiness probes, load‑balancer settings, IAM misuse, pod anti‑affinity, and disruption budgets—offering concrete YAML examples and remediation steps to help operators run more reliable and cost‑effective clusters.

dbaplus Community

Jul 19, 2021

Drawing on extensive experience with Kubernetes clusters across GCP, AWS, and Azure, the author outlines recurring errors and provides actionable fixes to improve reliability, performance, and cost efficiency.

1. Resource Requests and Limits

Two common mistakes are omitting CPU requests and setting them too low. Both let the scheduler pack too many pods onto a node, which leads to node over-commitment and degraded application performance.

Example of no request (BestEffort):

resources: {}

Example of an excessively low CPU request:

resources:
  requests:
    cpu: "1m"

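CPU limits that are set too low cause unnecessary throttling. Memory is less forgiving: a container that exceeds its memory limit is OOM-killed, and on an over-committed node, pods whose memory requests are lower than their limits are more likely to be killed. The following Burstable configuration (requests lower than limits) is exposed to that risk: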

resources:
  requests:
    memory: "128Mi"
    cpu: "500m"
  limits:
    memory: "256Mi"
    cpu: 2

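To make OOM kills less likely, avoid over-committing memory and use the Guaranteed QoS class, where requests equal limits for every resource: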

resources:
  requests:
    memory: "128Mi"
    cpu: 2
  limits:
    memory: "128Mi"
    cpu: 2

Use metrics-server to view current usage:

kubectl top pods
kubectl top pods --containers
kubectl top nodes

For historical metrics, integrate Prometheus, DataDog, or similar systems, and consider the VerticalPodAutoscaler to automate request/limit adjustments.
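
As a minimal sketch of the VerticalPodAutoscaler suggestion, the manifest below runs the autoscaler in recommendation-only mode. It assumes the VPA components are already installed in the cluster; the names my-app-vpa and my-app are hypothetical placeholders, not taken from the article.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa          # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical Deployment to observe
  updatePolicy:
    updateMode: "Off"       # compute recommendations only; never evict pods
```

With updateMode set to "Off", the recommended requests appear in the object's status (for example via kubectl describe vpa my-app-vpa) and can be reviewed before any manifest is changed.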

2. Liveness and Readiness Probes

By default, no probes are configured. The two probe types behave differently:

If the liveness probe fails, the container is restarted.

If the readiness probe fails, the pod is removed from the Service endpoints until the probe passes again; it is not restarted.

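Readiness probes run for the whole life of the pod, not only at startup, so they can also take a pod out of rotation when it becomes too “hot” to handle more requests. Liveness probes are riskier, because a misconfigured one can restart healthy containers; it is generally safer to start with a readiness probe alone.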
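
A readiness-only setup might look like the container fragment below. This is an illustrative sketch: the container name, image, port, and the /healthz path are assumptions, and the timing values should be tuned to the application.

```yaml
# Fragment of a pod template; all names, the path, and the port are hypothetical.
containers:
- name: my-app
  image: registry.example.com/my-app:1.0.0
  ports:
  - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /healthz          # assumed health endpoint
      port: 8080
    initialDelaySeconds: 5    # wait before the first check
    periodSeconds: 10         # check every 10 seconds
    failureThreshold: 3       # mark unready after 3 consecutive failures
```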

3. HTTP Service Load Balancer

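Every Service of type LoadBalancer provisions its own external load balancer, which gets expensive as the number of services grows. It is usually more economical to share one external load balancer: expose an ingress controller (e.g., nginx-ingress or Traefik) through a single NodePort service and let it route HTTP traffic to the other services.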
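
As a rough illustration of the shared-ingress approach, the Ingress below routes one hostname to an internal service. It assumes an nginx ingress controller is installed and registered under the ingress class nginx; the host app.example.com and the service name my-app are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-ingress        # hypothetical name
spec:
  ingressClassName: nginx     # assumes an nginx ingress controller
  rules:
  - host: app.example.com     # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app      # hypothetical ClusterIP service
            port:
              number: 80
```

Each additional application then needs only another Ingress rule and a ClusterIP service, not another cloud load balancer.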

4. Cluster‑Aware Autoscaling

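Whether a pod can be scheduled depends on more than spare CPU: node and pod affinity, taints and tolerations, resource requests, and QoS all play a part. An external autoscaler that ignores these constraints can add nodes that pending pods cannot use, or remove nodes that are still needed. The community cluster-autoscaler runs inside the cluster, integrates with major cloud providers, and takes these constraints into account for both scale-out and scale-in operations.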

5. IAM/RBAC Practices

Avoid embedding static IAM user secrets; instead, use IAM roles and service accounts to obtain temporary credentials. Example ServiceAccount with IAM role annotation:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-app-role
  name: my-serviceaccount
  namespace: default

Do not grant admin or cluster‑admin privileges to service accounts unless absolutely necessary.
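
A least-privilege alternative to cluster-admin could be sketched as a namespaced Role bound to the service account from the example above. The role name and the choice of read-only access to ConfigMaps are assumptions for illustration only.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: configmap-reader              # hypothetical role name
  namespace: default
rules:
- apiGroups: [""]                     # "" is the core API group
  resources: ["configmaps"]
  verbs: ["get", "list", "watch"]     # read-only access
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: configmap-reader-binding      # hypothetical binding name
  namespace: default
subjects:
- kind: ServiceAccount
  name: my-serviceaccount
  namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: configmap-reader
```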

6. Pod Anti‑Affinity

Define explicit pod anti‑affinity to spread replicas across nodes, preventing a single node failure from taking down all replicas:

# omitted for brevity
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - zk
        topologyKey: "kubernetes.io/hostname"

7. PodDisruptionBudget

Create a PodDisruptionBudget to guarantee a minimum number of available pods during node maintenance:

apiVersion: policy/v1  # policy/v1beta1 is no longer served as of Kubernetes 1.25
kind: PodDisruptionBudget
metadata:
  name: zk-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: zookeeper

8. Namespace Isolation

Namespaces do not provide strong isolation; mixing production and non‑production workloads can lead to resource contention. Separate clusters are recommended for true isolation.

9. externalTrafficPolicy Settings

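With the default externalTrafficPolicy: Cluster, every node accepts external traffic and may forward it to a pod on another node, which adds a network hop and can incur cross-zone egress charges. Setting externalTrafficPolicy: Local keeps traffic on the nodes that actually run the service's pods, reducing latency and egress costs.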
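
The setting lives on the Service spec. The sketch below applies it to a LoadBalancer service in front of an ingress controller; the service name and the selector label are hypothetical.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-controller        # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local    # only nodes running a matching pod receive traffic
  selector:
    app: ingress-controller       # hypothetical pod label
  ports:
  - name: http
    port: 80
    targetPort: 80
```

Because nodes without a matching pod stop receiving traffic, this setting works best when the pods are spread across enough nodes to keep the load balanced.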

10. Stateful (Pet) Clusters and Control‑Plane Load

Clusters that are treated as pets (long-lived and maintained by hand) need extra care: rehearse disaster-recovery procedures regularly, and keep load on the control plane in check by avoiding excessive object creation.

Summary

Kubernetes is not a silver bullet; misconfigurations can cause complexity, performance degradation, and reliability issues. Apply the above best practices, monitor resources, and design clusters with realistic expectations to achieve true cloud‑native resilience.
