
Master Kubernetes Troubleshooting: From Pod Crashes to Network Failures

This comprehensive guide walks you through Kubernetes fault diagnosis: it reviews the core components, classifies six major failure types, presents a three‑step troubleshooting methodology, and details six real‑world case studies with commands, manifests, monitoring setups, and preventive best practices.

Raymond Ops

Introduction

Kubernetes is the de facto standard for container orchestration, but its complex architecture introduces operational challenges. Over 60% of production incidents stem from misconfigurations or resource‑management errors. This guide presents a systematic methodology for diagnosing issues across the pod lifecycle, networking, storage, and scheduling, illustrated with six real‑world cases.

Technical Background

Kubernetes Architecture Overview

Control‑plane components

kube-apiserver : entry point for all REST requests.

etcd : single source of truth for cluster state.

kube-scheduler : makes pod‑placement decisions.

kube-controller-manager : runs built‑in controllers such as ReplicaSet and Deployment.

cloud-controller-manager : integrates cloud‑provider APIs.

Node components

kubelet : agent that manages pod lifecycle on each node.

kube-proxy : maintains network rules and implements Service abstraction.

Container runtime : Docker, containerd, CRI‑O, etc.
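
On kubeadm‑style clusters most control‑plane components run as static pods in the kube-system namespace, so a quick health check is to list them there; on managed offerings (EKS, GKE, AKS) the control plane is hidden and only node components are visible. A minimal sketch:

# Control-plane and add-on pods (kubeadm-style clusters)
kubectl get pods -n kube-system -o wide
# Kubelet, kube-proxy and container-runtime versions per node
kubectl get nodes -o wide
kubectl describe node NODE_NAME | grep -A 8 "System Info"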

Common Failure Types

Pod status anomalies – Pending, CrashLoopBackOff, ImagePullBackOff, Error.

Resource scheduling problems – insufficient resources, affinity conflicts, taint‑toleration mismatches.

Network communication failures – Service unreachable, DNS resolution errors, cross‑node connectivity issues.

Storage mounting problems – PVC binding failures, mount timeouts, permission errors.

Node‑level faults – NotReady, disk pressure, memory pressure.

Configuration errors – YAML syntax, RBAC issues, improper resource limits.
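
A quick first‑pass triage that surfaces most of these failure classes at once might look like the sketch below; note that pods stuck in CrashLoopBackOff can still report phase Running, so the warning events are often the more reliable signal.

# Pods whose phase is neither Running nor Succeeded
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Recent warning events across all namespaces
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20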

Fault‑Diagnosis Methodology

Three‑step approach

Step 1 – Information Collection

# List pods with wide output
kubectl get pods -o wide
kubectl describe pod POD_NAME
kubectl get events --sort-by='.lastTimestamp'

Step 2 – Log Analysis

# View container logs
kubectl logs POD_NAME
kubectl logs POD_NAME -c CONTAINER_NAME
kubectl logs POD_NAME --previous   # previous crashed container logs

Step 3 – Deep Diagnosis

# Exec into the container
kubectl exec -it POD_NAME -- /bin/sh
# Inspect node status
kubectl describe node NODE_NAME

Core Content

Pod Lifecycle and Status

Pod statuses (the phases plus common waiting reasons such as CrashLoopBackOff) and their typical causes:

Status            Meaning                            Common Causes
Pending           Waiting for scheduling/resources   Insufficient resources, image pull, volume not ready
Running           Normal operation                   -
Succeeded         Job/CronJob completed              -
Failed            Execution failure                  Container exited with a non-zero code
Unknown           Status cannot be obtained          Node communication failure
CrashLoopBackOff  Repeated container crashes         Application start failure, failed health checks
ImagePullBackOff  Image pull failure                 Image not found, authentication error, network issue

Useful inspection commands:

# Detailed pod status
kubectl get pod POD_NAME -o yaml | grep -A 10 status
# Container restart count
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[*].restartCount}'
# Container readiness
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[*].ready}'

Common Troubleshooting Commands

Basic information

# List all pods across namespaces
kubectl get pods -A
# Wide view with IP, node, start time
kubectl get pods -o wide -n NAMESPACE
# Full YAML of a pod
kubectl get pod POD_NAME -o yaml
# Most used debugging command
kubectl describe pod POD_NAME -n NAMESPACE

Log inspection

# Tail last 100 lines
kubectl logs POD_NAME --tail=100
# Follow logs (like tail -f)
kubectl logs -f POD_NAME
# Specific container in multi‑container pod
kubectl logs POD_NAME -c CONTAINER_NAME
# All containers
kubectl logs POD_NAME --all-containers=true
# Previous (crashed) container logs
kubectl logs POD_NAME --previous
# Add timestamps
kubectl logs POD_NAME --timestamps=true
# Logs from the last hour
kubectl logs POD_NAME --since=1h

Event view

# Cluster‑wide events sorted by time
kubectl get events --sort-by='.lastTimestamp'
# Namespace‑specific events
kubectl get events -n NAMESPACE --sort-by='.lastTimestamp'
# Events related to a specific pod
kubectl get events --field-selector involvedObject.name=POD_NAME
# Warning‑level events only
kubectl get events --field-selector type=Warning

Resource usage

# Node resource usage (requires metrics‑server)
kubectl top nodes
# Pod resource usage
kubectl top pods -n NAMESPACE
# Specific pod resource usage (including containers)
kubectl top pod POD_NAME --containers

Log Analysis Techniques

Check container exit codes:

kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

0 – normal exit

1 – application error

137 – OOMKilled

143 – SIGTERM (graceful stop)

255 – exit code out of range
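
Exit codes above 128 encode a fatal signal (128 + signal number), so you can decode them directly in a shell:

# 137 - 128 = 9  -> SIGKILL (typically the OOM killer or a hard kill)
kill -l $((137 - 128))   # prints KILL
# 143 - 128 = 15 -> SIGTERM (normal graceful shutdown)
kill -l $((143 - 128))   # prints TERM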

Detect OOMKilled:

# Search for OOMKilled in pod description
kubectl describe pod POD_NAME | grep -i "OOMKilled"
# Verify termination reason
kubectl get pod POD_NAME -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Aggregate logs across pods:

# View logs of multiple pods
kubectl logs -l app=nginx --tail=50
# Use stern (recommended)
stern POD_PREFIX -n NAMESPACE

Resource Limits and Scheduling Issues

Example resource request/limit manifest:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Commands to investigate scheduling failures:

# Show why a pod cannot be scheduled
kubectl describe pod POD_NAME | grep -A 5 "Events"
# Show node available resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Show node labels and taints
kubectl get nodes --show-labels
kubectl describe node NODE_NAME | grep Taints

Typical failure messages:

# Insufficient memory
0/3 nodes are available: 3 Insufficient memory.
# Node selector mismatch
0/3 nodes are available: 3 node(s) didn't match node selector.
# Taint‑toleration mismatch
0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.

Storage and Persistence Problems

PVC status commands:

# List PVCs
kubectl get pvc -n NAMESPACE
# Detailed PVC info
kubectl describe pvc PVC_NAME
# List PVs
kubectl get pv
# Detailed PV info
kubectl describe pv PV_NAME

Example PVC manifest (ReadWriteOnce):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard

To switch to a multi‑writer volume (if supported):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
spec:
  accessModes:
  - ReadWriteMany   # RWX
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-client

Network Troubleshooting Tools and Methods

Service connectivity tests:

# Service details
kubectl get svc -o wide
kubectl describe svc SERVICE_NAME
# Endpoints
kubectl get endpoints SERVICE_NAME
# Test from inside a pod
kubectl exec -it POD_NAME -- curl SERVICE_NAME:PORT
kubectl exec -it POD_NAME -- nslookup SERVICE_NAME
# Cross‑namespace access
kubectl exec -it POD_NAME -- curl SERVICE_NAME.NAMESPACE.svc.cluster.local

NetworkPolicy inspection:

# List and describe policies
kubectl get networkpolicies -n NAMESPACE
kubectl describe networkpolicy POLICY_NAME

DNS troubleshooting:

# CoreDNS pod status
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS from a pod
kubectl exec -it POD_NAME -- nslookup kubernetes.default

Practical Cases

Case 1 – CrashLoopBackOff

Symptom

$ kubectl get pods
NAME                     READY   STATUS            RESTARTS   AGE
webapp-deployment-7d8f9c 0/1     CrashLoopBackOff 5          3m

Investigation

Describe pod – notice BackOff warning.

Check logs – panic due to database connection refusal.

Inspect exit code – 2 (application error).
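
The investigation above maps to a handful of commands (a sketch using the pod name from the symptom output):

# Events recorded for the pod (look for the Back-off warning)
kubectl describe pod webapp-deployment-7d8f9c | grep -A 10 Events
# Logs of the previously crashed container (shows the connection panic)
kubectl logs webapp-deployment-7d8f9c --previous
# Exit code of the last terminated container
kubectl get pod webapp-deployment-7d8f9c -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'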

Root cause : Application cannot reach MySQL service.

Solution

Verify MySQL service and endpoints.

# Service
kubectl get svc mysql-service
# Endpoints
kubectl get endpoints mysql-service

Update deployment to use correct service name and add health probes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:v1.0
        env:
        - name: DB_HOST
          value: "mysql-service"
        - name: DB_RETRY_INTERVAL
          value: "5"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

Apply changes and confirm pods run.

# Apply
kubectl apply -f webapp-deployment.yaml
# Watch pods
kubectl get pods -w

Case 2 – ImagePullBackOff

Symptom

$ kubectl get pods
NAME        READY   STATUS          RESTARTS   AGE
nginx-app-5d7f8b 0/1   ImagePullBackOff 0   2m

Investigation

Describe pod – error shows pull access denied for private registry.
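
To see the exact error, pull up the pod events and the image reference the pod is trying to use (a sketch based on the pod name above):

# Pull error appears under Events (e.g. "pull access denied" / "unauthorized")
kubectl describe pod nginx-app-5d7f8b | grep -A 10 Events
# Confirm the image reference spelled out in the spec
kubectl get pod nginx-app-5d7f8b -o jsonpath='{.spec.containers[0].image}'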

Solution

Create a Docker registry secret.

kubectl create secret docker-registry harbor-secret \
  --docker-server=harbor.company.com \
  --docker-username=admin \
  --docker-password=Harbor12345 \
  --docker-email=admin@company.com -n default

Reference the secret in the Deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      imagePullSecrets:
      - name: harbor-secret
      containers:
      - name: nginx
        image: harbor.company.com/prod/nginx:v2.0
        ports:
        - containerPort: 80

Patch the default ServiceAccount to use the secret.

kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "harbor-secret"}]}'

After applying, the pod reaches Running state.

Case 3 – Service Unreachable (Network Fault)

Symptom

# From a client pod
kubectl exec -it client-pod -- curl backend-service:8080
curl: (7) Failed to connect to backend-service port 8080: Connection refused

Investigation

Service has no endpoints.

kubectl get svc backend-service -o wide
kubectl describe svc backend-service

The Service selector (app=backend) does not match the pod labels (app=backend-app), so no endpoints are created.
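
Comparing the selector and the labels side by side makes the mismatch obvious (sketch):

# Selector the Service is using
kubectl get svc backend-service -o jsonpath='{.spec.selector}'
# Labels actually carried by the backend pods
kubectl get pods --show-labels | grep backend
# Empty endpoints confirm nothing matches the selector
kubectl get endpoints backend-service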

Resolution

Fix Service selector to match pod labels.

apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend-app   # corrected
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080

Or adjust pod labels to app=backend.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-deploy
spec:
  selector:
    matchLabels:
      app: backend
  template:
    metadata:
      labels:
        app: backend
    spec:
      containers:
      - name: backend
        image: backend:v1.0
        ports:
        - containerPort: 8080

Verification shows endpoints populated and curl succeeds.
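
A minimal verification after applying either fix:

# Endpoints should now list the backend pod IPs
kubectl get endpoints backend-service
# Retest from the client pod
kubectl exec -it client-pod -- curl backend-service:8080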

Case 4 – Node NotReady

Symptom

$ kubectl get nodes
NAME    STATUS    ROLES   AGE   VERSION
node-2  NotReady  worker  30d   v1.24.0

Investigation

Describe node – DiskPressure true, container runtime not ready, network plugin not ready.

Check kubelet and containerd services; containerd is failed.

Disk usage > 95% on root and containerd directory.
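
The checks above translate roughly into the following commands, run partly against the API and partly on the node itself (paths assume a default containerd installation):

# Node conditions: Ready, DiskPressure, MemoryPressure, PIDPressure
kubectl describe node node-2 | grep -A 10 Conditions
# On the node: runtime and kubelet service health
systemctl status containerd kubelet
# On the node: disk usage of the root filesystem and the containerd data dir
df -h / /var/lib/containerd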

Resolution Steps

Clean disk space (prune images, remove old containers, delete old logs).

# Prune unused images
crictl rmi --prune
# Remove stopped containers
crictl rm $(crictl ps -a -q --state=Exited)
# Delete old pod logs
find /var/log/pods -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=7d

Restart containerd and kubelet.

systemctl restart containerd
systemctl restart kubelet

Verify node status returns to Ready.

kubectl get nodes
kubectl describe node node-2 | grep -A 5 Conditions

Case 5 – Scheduling Failure Due to Resource Shortage

Symptom

$ kubectl get pods
NAME                       READY   STATUS    RESTARTS   AGE
java-app-deployment-7f8d   0/1     Pending   0          5m

Investigation

Events show Insufficient cpu/memory on all nodes.

Pod requests 4Gi memory, 2000m CPU.

Node allocated resources indicate no node has enough free memory.
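
The same findings can be reproduced with (sketch using the pod name from the symptom output):

# Scheduler events explaining why the pod is Pending
kubectl describe pod java-app-deployment-7f8d | grep -A 10 Events
# What the pod is requesting
kubectl get pod java-app-deployment-7f8d -o jsonpath='{.spec.containers[0].resources}'
# What each node has already handed out
kubectl describe nodes | grep -A 5 "Allocated resources"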

Solutions

Reduce resource requests (recommended).

resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "4Gi"
    cpu: "2000m"

Scale out the cluster (e.g., add nodes via cloud provider).

# Example for Alibaba Cloud
aliyun cs ScaleOutCluster --ClusterId=c1234567890 --count=2 --worker-instance-types=ecs.g6.2xlarge

Free up resources: delete unused pods, reduce replica counts.

# Find high‑memory pods
kubectl top pods -A --sort-by=memory | head -20
# Delete unnecessary pod
kubectl delete pod UNUSED_POD -n NAMESPACE
# Scale deployment
kubectl scale deployment DEPLOYMENT_NAME --replicas=1

Case 6 – PVC Mount Failure (Multi‑Attach Error)

Symptom

$ kubectl get pods
NAME                  READY   STATUS              RESTARTS   AGE
mysql-statefulset-0   0/1     ContainerCreating   0          3m

Investigation

Describe pod – warning: FailedAttachVolume, volume already attached to another node.

PVC is bound to a RWO volume.

PV shows node affinity to a specific zone.
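
A quick way to reproduce these findings (the PVC name mysql-data matches the manifest used later in this case):

# FailedAttachVolume / Multi-Attach error shows up under Events
kubectl describe pod mysql-statefulset-0 | grep -A 10 Events
# Access mode of the bound claim (RWO means single-node attach)
kubectl get pvc mysql-data -o jsonpath='{.spec.accessModes}'
# VolumeAttachment objects show which node currently holds the volume
kubectl get volumeattachments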

Resolution Options

Force delete the old pod that still holds the volume.

kubectl delete pod mysql-statefulset-0 --grace-period=0 --force

Manually unmount the volume on the node (use with caution).

# SSH to node
ssh root@node-2
# Find mount point
mount | grep pvc-abc123
# Unmount
umount /var/lib/kubelet/pods/.../volumes/kubernetes.io~aws-ebs/pvc-abc123

Switch the PVC to ReadWriteMany if the storage class supports it.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-client

Best Practices

Monitoring and Alerting

Prometheus + Grafana rule example for pod restarts and not‑ready pods:

# alerts.yaml
groups:
- name: kubernetes-pods
  rules:
  - alert: PodRestartingTooOften
    expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarts too frequently"
  - alert: PodNotReady
    expr: kube_pod_status_phase{phase!~"Running|Succeeded"} > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is not ready"
- name: kubernetes-nodes
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} NotReady"
  - alert: NodeDiskPressure
    expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node {{ $labels.node }} under disk pressure"

Log Collection (EFK Stack)

Fluentd DaemonSet example:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

Toolbox

kubectl-debug – install and use for on‑the‑fly debugging.

# Install
curl -Lo kubectl-debug.tar.gz https://github.com/aylei/kubectl-debug/releases/download/v0.1.1/kubectl-debug_0.1.1_linux_amd64.tar.gz
tar -zxvf kubectl-debug.tar.gz kubectl-debug
mv kubectl-debug /usr/local/bin/
# Debug a pod
kubectl debug POD_NAME --agentless --port-forward=true
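
Note that the aylei/kubectl-debug plugin is no longer actively maintained; on recent Kubernetes versions (ephemeral containers are enabled by default from v1.23) the built-in kubectl debug command covers the same use case, roughly as follows:

# Attach an ephemeral debug container to a running pod
kubectl debug -it POD_NAME --image=nicolaka/netshoot --target=CONTAINER_NAME
# Or clone a crashed pod with an extra debug container
kubectl debug POD_NAME -it --image=busybox --copy-to=POD_NAME-debug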

stern – tail logs from multiple pods.

# Install
wget https://github.com/stern/stern/releases/download/v1.22.0/stern_1.22.0_linux_amd64.tar.gz
tar -zxvf stern_1.22.0_linux_amd64.tar.gz
mv stern /usr/local/bin/
# Example usage
stern -n production backend-*
stern -l app=nginx

netshoot – a network‑debug pod.

apiVersion: v1
kind: Pod
metadata:
  name: netshoot
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["sleep", "3600"]

Preventive Measures

ResourceQuota to cap CPU, memory, PVC count.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
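
To check how much of the quota is already consumed:

# Current usage vs. the hard limits defined above
kubectl describe resourcequota compute-quota -n production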

PodDisruptionBudget to guarantee availability.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: backend

Robust liveness, readiness and startup probes.

apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  containers:
  - name: app
    image: webapp:v1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      failureThreshold: 30   # allows up to 5 min start time

Daily health‑check script.

#!/bin/bash
# k8s_health_check.sh

echo "=== Node Status ==="
kubectl get nodes -o wide

echo -e "
=== Abnormal Pods ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded

echo -e "
=== Pods with High Restarts ==="
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]?.restartCount > 5) | "\(.metadata.namespace)/\(.metadata.name) - Restarts: \(.status.containerStatuses[0].restartCount)"'

echo -e "
=== Top Resource Consumers ==="
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -10

echo -e "
=== Recent Warning Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "warning\|error" | tail -20

Conclusion and Outlook

Kubernetes troubleshooting is a systematic engineering discipline that requires a solid grasp of architecture, hands‑on command proficiency, and proactive monitoring. By mastering the three‑step method, leveraging the presented commands, and applying the best‑practice checklist, operators can quickly locate and resolve issues while building preventive safeguards.

Systematic workflow: information collection → log analysis → deep diagnosis.

Tool mastery: kubectl, describe, logs, events, top, and auxiliary tools like stern and netshoot.

Understanding of underlying mechanisms: pod scheduling, network model, storage binding.

Established monitoring: Prometheus alerts + EFK log pipeline.

Preventive mindset: resource quotas, health probes, PDBs, regular health‑check scripts.

Future trends shaping Kubernetes operations include AIOps for predictive fault detection, eBPF‑based deep observability (Cilium, Pixie), service‑mesh enhancements (Istio, Linkerd), GitOps workflows (Argo CD, Flux) for declarative configuration and automated rollbacks, and edge‑native extensions (KubeEdge) expanding Kubernetes to edge devices.

Tags: Network, Troubleshooting, Storage, Pod
Written by Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
