
Master Kubernetes Troubleshooting: From CrashLoopBackOff to Network Failures

This comprehensive guide walks you through Kubernetes fault diagnosis, covering pod lifecycle issues, resource scheduling, network communication errors, storage mounting problems, and node failures, with step‑by‑step methodologies, essential kubectl commands, real‑world case studies, and best‑practice recommendations to quickly identify and resolve production incidents.

MaGe Linux Operations

K8s Troubleshooting Handbook: Complete Solutions from Pod Crashes to Network Anomalies

1. Introduction

In the cloud‑native era, Kubernetes has become the de‑facto standard for container orchestration, but its complex architecture brings unprecedented operational challenges. From frequent pod restarts to inter‑service communication failures, scheduling issues to storage mount problems, each fault can affect business stability. Statistics show that over 60% of production K8s incidents stem from configuration errors and resource mismanagement.

This guide takes a hands-on approach: it lays out a Kubernetes troubleshooting methodology across pod lifecycle, scheduling, networking, storage, and node health, then works through five real-world cases. Whether you are a beginner or a seasoned SRE, it aims to help you locate and fix issues quickly while building a preventive operational mindset.

2. Technical Background

2.1 Review of Key K8s Components

Kubernetes follows a classic Master‑Worker architecture; understanding component interactions is the foundation of troubleshooting:

Control‑plane components:

kube-apiserver : entry point for all operations, handles REST requests

etcd : sole storage for cluster state data

kube-scheduler : makes pod scheduling decisions

kube-controller-manager : manages controllers such as ReplicaSet and Deployment

cloud‑controller‑manager : integrates cloud provider APIs

Node components:

kubelet : node agent that manages pod lifecycle

kube-proxy : maintains network rules to implement Service abstraction

Container runtime : e.g., containerd, CRI-O (Docker Engine needs the cri-dockerd adapter since dockershim was removed in Kubernetes 1.24)

2.2 Common Fault Types

Pod status anomalies : Pending, CrashLoopBackOff, ImagePullBackOff, Error

Resource scheduling problems : insufficient resources, affinity conflicts, taint‑toleration mismatches

Network communication faults : Service unreachable, DNS failures, cross‑node connectivity issues

Storage mount issues : PVC binding failures, mount timeouts, permission errors

Node‑level failures : NotReady, disk pressure, memory shortage

Configuration errors : YAML syntax mistakes, RBAC insufficiency, improper resource limits

2.3 Fault Diagnosis Methodology

Three‑step diagnosis:

Step 1: Information collection

# View resource status
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'

Step 2: Log analysis

# View container logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous  # previous crash container logs

Step 3: Deep diagnosis

# Exec into container for investigation
kubectl exec -it <pod-name> -- /bin/sh
# View node status
kubectl describe node <node-name>

3. Core Content

3.1 Pod Lifecycle and Status Interpretation

Pod lifecycle consists of several phases, each with specific meaning:

Key status explanations:

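Pending : accepted by the cluster but not yet running (waiting for scheduling, image pulls, or storage)

Running : bound to a node with at least one container running or starting

Succeeded : all containers terminated successfully and will not restart (typical for Jobs)

Failed : all containers terminated, at least one with a non-zero exit code or killed by the system

Unknown : status cannot be obtained, usually because the node is unreachable

CrashLoopBackOff : not a phase but a container waiting reason shown in the STATUS column; the container keeps crashing and the kubelet delays each restart (application start failure, failing liveness probe)

ImagePullBackOff : also a waiting reason; the image pull keeps failing (image missing, authentication error, network issue)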

Container status inspection:

# Detailed pod status
kubectl get pod <pod-name> -o yaml | grep -A 10 status
# Container restart count
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].restartCount}'
# Container readiness
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].ready}'

3.2 Common Diagnostic Commands

Basic information retrieval:

# List pods in all namespaces
kubectl get pods -A
# Wide view (adds pod IP, node, nominated node, readiness gates)
kubectl get pods -o wide -n <namespace>
# Full YAML of a pod
kubectl get pod <pod-name> -o yaml
# Detailed description (most used for troubleshooting)
kubectl describe pod <pod-name> -n <namespace>

Log viewing tricks:

# Tail last 100 lines
kubectl logs <pod-name> --tail=100
# Follow logs (like tail -f)
kubectl logs -f <pod-name>
# Specify container in multi‑container pod
kubectl logs <pod-name> -c <container-name>
# View all containers
kubectl logs <pod-name> --all-containers=true
# View previous crash logs
kubectl logs <pod-name> --previous
# Add timestamps
kubectl logs <pod-name> --timestamps=true
# Logs from the last hour
kubectl logs <pod-name> --since=1h

Event inspection:

# Cluster events sorted by time
kubectl get events --sort-by='.lastTimestamp'
# Namespace‑specific events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Events related to a specific pod
kubectl get events --field-selector involvedObject.name=<pod-name>
# Warning‑level events only
kubectl get events --field-selector type=Warning

Resource usage view:

# Node resource usage (requires metrics‑server)
kubectl top nodes
# Pod resource usage
kubectl top pods -n <namespace>
# Specific pod resource usage per container
kubectl top pod <pod-name> --containers

3.3 Log Analysis Techniques

Key points:

Check container exit codes

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

0: normal exit

1: application error

137: killed by SIGKILL (128 + 9), most often the OOM killer; confirm with the OOMKilled reason below

143: SIGTERM (128 + 15), graceful stop

255: exit status out of range

Detect OOMKilled

kubectl describe pod <pod-name> | grep -i "OOMKilled"
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

Log aggregation queries (e.g., using stern)

# View logs of multiple pods with same label
stern -l app=nginx
# Without stern: tail the last 50 lines of every pod matching a label selector
kubectl logs -l app=nginx --tail=50
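The exit codes listed earlier in this section follow the usual convention that a value above 128 means "terminated by signal (code - 128)". The sketch below turns that into a lookup; the function name explain_exit_code and the wording of the messages are illustrative assumptions, and the result is a hint rather than a diagnosis.

```shell
#!/usr/bin/env bash
# explain_exit_code CODE   (hypothetical helper name)
# Prints a rough interpretation of a container exit code.
explain_exit_code() {
  local code="$1"
  # Reject anything that is not a non-negative integer.
  case "$code" in
    ''|*[!0-9]*) echo "not a numeric exit code"; return 1 ;;
  esac
  case "$code" in
    0)   echo "normal exit" ;;
    1)   echo "application error" ;;
    137) echo "killed by SIGKILL (signal 9); check for OOMKilled" ;;
    143) echo "terminated by SIGTERM (signal 15)" ;;
    255) echo "exit status out of range" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -lt 255 ]; then
        # Convention: 128 + signal number.
        echo "killed by signal $((code - 128))"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}
```

It can be fed directly from the jsonpath query shown above, for example explain_exit_code "$(kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')".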

3.4 Resource Limits and Scheduling Issues

Resource configuration example:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Scheduling investigation commands:

# Show why pod failed to schedule
kubectl describe pod <pod-name> | grep -A 5 "Events"
# View node available resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Show node labels
kubectl get nodes --show-labels
# Show node taints
kubectl describe node <node-name> | grep Taints

Common scheduling failure reasons:

# Insufficient memory
0/3 nodes are available: 3 Insufficient memory.
# Node selector mismatch
0/3 nodes are available: 3 node(s) don't match node selector.
# Taint‑toleration mismatch
0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
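When many pods are Pending at once, it helps to bucket the FailedScheduling messages instead of reading them one by one. The following sketch matches only the message fragments quoted above; the function name classify_scheduling_failure and the category labels are assumptions, and scheduler wording varies between Kubernetes versions, so unmatched text falls back to "unknown".

```shell
#!/usr/bin/env bash
# classify_scheduling_failure MESSAGE   (hypothetical helper name)
# Prints one category per line for each known fragment found in MESSAGE.
classify_scheduling_failure() {
  local msg="$1" found=0
  case "$msg" in *"Insufficient cpu"*)    echo "insufficient-cpu";    found=1 ;; esac
  case "$msg" in *"Insufficient memory"*) echo "insufficient-memory"; found=1 ;; esac
  case "$msg" in
    *"node selector"*|*"node affinity"*) echo "selector-or-affinity-mismatch"; found=1 ;;
  esac
  case "$msg" in *"taint"*) echo "untolerated-taint"; found=1 ;; esac
  # Nothing recognised: report that explicitly rather than printing nothing.
  [ "$found" -eq 1 ] || echo "unknown"
}
```

A message can fall into several buckets at once: "1 Insufficient cpu, 2 Insufficient memory" yields both resource categories, which usually means no single change (more CPU or more memory alone) will make the pod schedulable.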

3.5 Storage and Persistence Issues

PVC status view:

# List PVCs
kubectl get pvc -n <namespace>
# List PVs
kubectl get pv
# Detailed PVC info
kubectl describe pvc <pvc-name>
# List storage classes
kubectl get storageclass

Storage configuration example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard

3.6 Network Diagnosis Tools and Methods

Service connectivity test:

# Service details
kubectl get svc -o wide
kubectl describe svc <service-name>
# Endpoints
kubectl get endpoints <service-name>
# Test from inside a pod
kubectl exec -it <pod-name> -- curl <service-name>:<port>
kubectl exec -it <pod-name> -- nslookup <service-name>
# Cross‑namespace access
kubectl exec -it <pod-name> -- curl <service-name>.<namespace>.svc.cluster.local
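The short name <service-name> only resolves from pods in the same namespace, because resolution depends on the search domains in the pod's /etc/resolv.conf; across namespaces the fully qualified form is the safe choice. A minimal sketch of how that name is assembled, assuming the default cluster domain cluster.local (clusters can be configured with a different one) and a hypothetical helper name svc_fqdn:

```shell
#!/usr/bin/env bash
# svc_fqdn SERVICE [NAMESPACE] [CLUSTER_DOMAIN]   (hypothetical helper name)
# Prints the fully qualified DNS name of a Service.
# NAMESPACE defaults to "default"; CLUSTER_DOMAIN defaults to "cluster.local",
# which is an assumption about the cluster's configuration.
svc_fqdn() {
  local svc="$1" ns="${2:-default}" domain="${3:-cluster.local}"
  if [ -z "$svc" ]; then
    echo "usage: svc_fqdn <service> [namespace] [cluster-domain]" >&2
    return 2
  fi
  printf '%s.%s.svc.%s\n' "$svc" "$ns" "$domain"
}
```

For example, kubectl exec -it <pod-name> -- nslookup "$(svc_fqdn backend-service production)" tests resolution without depending on the client pod's search path.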

NetworkPolicy view:

kubectl get networkpolicies -n <namespace>
kubectl describe networkpolicy <policy-name>

DNS troubleshooting:

# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS from pod
kubectl exec -it <pod-name> -- nslookup kubernetes.default

4. Practical Cases

Case 1: CrashLoopBackOff Diagnosis and Resolution

Symptom:

$ kubectl get pods
NAME                     READY   STATUS            RESTARTS   AGE
webapp-deployment-7d8f9c 0/1     CrashLoopBackOff  5          3m

Steps:

Inspect pod details

$ kubectl describe pod webapp-deployment-7d8f9c

Check container logs

$ kubectl logs webapp-deployment-7d8f9c
panic: Failed to connect to database: dial tcp 10.0.1.100:3306: connect: connection refused

Check exit code

$ kubectl get pod webapp-deployment-7d8f9c -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
2

Analysis: Application cannot connect to database; exit code 2 indicates application error.

Solution:

Verify database service is reachable

$ kubectl get svc mysql-service
$ kubectl get endpoints mysql-service

Update the deployment with health checks and a retry interval (the application itself must read DB_RETRY_INTERVAL and implement the retry)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:v1.0
        env:
        - name: DB_HOST
          value: "mysql-service"
        - name: DB_RETRY_INTERVAL
          value: "5"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

Apply and verify

$ kubectl apply -f webapp-deployment.yaml
$ kubectl get pods -w

Case 2: ImagePullBackOff Failure

Symptom:

$ kubectl get pods
NAME               READY   STATUS             RESTARTS   AGE
nginx-app-5d7f8b   0/1     ImagePullBackOff   0          2m

Steps:

Inspect detailed error

$ kubectl describe pod nginx-app-5d7f8b
... Failed to pull image "harbor.company.com/prod/nginx:v2.0": rpc error: code = Unknown desc = Error response from daemon: pull access denied for harbor.company.com/prod/nginx, repository does not exist or may require 'docker login'

Create Docker registry secret

$ kubectl create secret docker-registry harbor-secret \
  --docker-server=harbor.company.com \
  --docker-username=admin \
  --docker-password=Harbor12345 \
  --docker-email=<email> -n default

Reference secret in deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      imagePullSecrets:
      - name: harbor-secret
      containers:
      - name: nginx
        image: harbor.company.com/prod/nginx:v2.0
        ports:
        - containerPort: 80

Apply and verify

$ kubectl apply -f nginx-deployment.yaml
$ kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
nginx-app-7c8d9f   1/1     Running   0          30s

Case 3: Service Unreachable Network Fault

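Symptom: curl from the client pod cannot reach backend-service. (A ClusterIP Service with no endpoints normally still resolves and then refuses or times out the connection; a resolution error like the one below can also point at DNS or a headless Service, so check both.)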

$ kubectl exec -it client-pod -- curl backend-service:8080
curl: (6) Could not resolve host: backend-service

Investigation:

Check the Service definition: the Endpoints field is empty

$ kubectl get svc backend-service -o wide
$ kubectl describe svc backend-service
Endpoints: <none>

Compare pod labels with the Service selector: they do not match

$ kubectl get pods -l app=backend
No resources found
$ kubectl get pods --show-labels
backend-deploy-5f6c7d 1/1 Running app=backend-app,version=v1

Fix selector or pod labels. Example fixing Service selector:

apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend-app
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080

Apply and verify connectivity

$ kubectl apply -f backend-service.yaml
$ kubectl get endpoints backend-service
NAME            ENDPOINTS                         AGE
backend-service 10.244.1.10:8080,10.244.2.15:8080 1m
$ kubectl exec -it client-pod -- curl backend-service:8080
{"status":"ok","version":"v1.0"}
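The root cause in this case is that a Service selects a pod only when every key=value pair in its selector is present, exactly, in the pod's labels. The sketch below reproduces that rule for equality-based selectors so the mismatch can be checked by hand; the function name selector_matches is an assumption, and it is not how Kubernetes itself is implemented.

```shell
#!/usr/bin/env bash
# selector_matches SELECTOR LABELS   (hypothetical helper name)
# Both arguments are comma-separated key=value lists.
# Returns 0 only when every selector pair appears verbatim among the labels.
selector_matches() {
  local selector="$1" labels="$2" pair
  [ -n "$selector" ] || return 2
  local IFS=','
  for pair in $selector; do
    # Wrap both sides in commas so "app=backend" cannot match "app=backend-app".
    case ",$labels," in
      *",$pair,"*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}
```

With the values from this case, selector_matches "app=backend" "app=backend-app,version=v1" fails while selector_matches "app=backend-app" "app=backend-app,version=v1" succeeds, which is exactly the change made to the Service selector.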

Case 4: Node NotReady Diagnosis

Node shows NotReady due to DiskPressure and network plugin not ready.

$ kubectl describe node node-2
Conditions:
  DiskPressure   True   KubeletHasDiskPressure   kubelet has disk pressure
  Ready          False  KubeletNotReady   container runtime not ready: RuntimeReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready
Events:
  Warning  ContainerRuntimeUnhealthy 5m kubelet container runtime is down: failed to connect to containerd

Resolution steps:

Clean disk space (prune images, logs)

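# Clean unused images (crictl needs a running runtime; restart containerd first if it is down)
crictl rmi --prune
# Clean stopped containers (xargs -r skips the call when there are none)
crictl ps -a -q --state exited | xargs -r crictl rm
# Clean old logs
find /var/log/pods -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=7d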

Restart containerd and kubelet

systemctl restart containerd
systemctl restart kubelet

Verify node status returns to Ready.
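To catch this before the kubelet reports DiskPressure, disk usage on the node can be checked against a warning level. By default the kubelet's hard eviction thresholds trigger when the node filesystem has less than 10% free (15% for the image filesystem), unless the kubelet configuration overrides them. The sketch below flags filesystems earlier than that; the function name mounts_over_threshold and the 85% default are assumptions for illustration.

```shell
#!/usr/bin/env bash
# mounts_over_threshold [PERCENT]   (hypothetical helper name)
# Reads `df -P` output on stdin and prints "mountpoint used%" for every
# filesystem whose usage is at or above PERCENT (default 85).
mounts_over_threshold() {
  local limit="${1:-85}"
  awk -v limit="$limit" 'NR > 1 {
    used = $5                 # Capacity column, e.g. "97%"
    sub(/%$/, "", used)       # strip the trailing percent sign
    if (used + 0 >= limit + 0) print $6, used "%"
  }'
}
```

On a node, df -P | mounts_over_threshold 85 lists the mounts worth cleaning first; the paths that matter most are the ones holding /var/lib/kubelet and the container runtime's image store.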

Case 5: Scheduling Failure Due to Resource Shortage

Pod remains Pending because its CPU and memory requests exceed what any node still has unallocated.

$ kubectl describe pod java-app-deployment-7f8d
Events:
  Warning  FailedScheduling 5m default-scheduler 0/3 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.

Solutions:

Reduce resource requests in the deployment.

Scale out cluster nodes.

Clean up high‑resource pods.
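The scheduler decides on requests, not on live usage: a pod fits a node only if its request plus the requests already placed there stays within the node's allocatable amount. The arithmetic can be sketched as follows, assuming whole-number memory quantities with binary suffixes (Ki, Mi, Gi, Ti); the helper names to_mib and request_fits are hypothetical, and decimal or fractional quantities are deliberately rejected.

```shell
#!/usr/bin/env bash
# to_mib QUANTITY   (hypothetical helper name)
# Converts a whole-number memory quantity with a Ki/Mi/Gi/Ti suffix to MiB,
# rounding down. Anything else is rejected with exit status 1.
to_mib() {
  local q="$1" n unit
  n="${q%??}"          # numeric part (everything but the last two characters)
  unit="${q#"$n"}"     # two-character suffix
  case "$n" in
    ''|*[!0-9]*) echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
  case "$unit" in
    Ki) echo $(( 10#$n / 1024 )) ;;
    Mi) echo $(( 10#$n )) ;;
    Gi) echo $(( 10#$n * 1024 )) ;;
    Ti) echo $(( 10#$n * 1024 * 1024 )) ;;
    *)  echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# request_fits REQUEST ALLOCATABLE ALREADY_REQUESTED   (hypothetical helper name)
# Returns 0 when the new request still fits within the node's allocatable memory.
request_fits() {
  local req alloc used
  req="$(to_mib "$1")"   || return 2
  alloc="$(to_mib "$2")" || return 2
  used="$(to_mib "$3")"  || return 2
  [ $(( used + req )) -le "$alloc" ]
}
```

The inputs come from kubectl describe node: ALLOCATABLE from the Allocatable section and ALREADY_REQUESTED from the memory requests under "Allocated resources". For a node with 7901244Ki allocatable and 6Gi already requested, a 4Gi request does not fit but a 1Gi request does.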

5. Best Practices

5.1 Monitoring and Alerting Configuration

Prometheus + Grafana monitoring stack with key alerts:

# Pod restart alert
- alert: PodRestartingTooOften
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting too frequently"
# Pod not ready alert
- alert: PodNotReady
  expr: kube_pod_status_phase{phase!~"Running|Succeeded"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} in abnormal state"
# Node not ready alert
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} NotReady"

5.2 Log Collection Solution

EFK stack (Elasticsearch, Fluentd, Kibana) with Fluentd DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
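      tolerations:
      # Current control-plane taint
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      # Legacy taint still present on older clusters
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule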
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers

5.3 Troubleshooting Toolbox

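kubectl plugins (e.g., the third-party kubectl-debug plugin; recent kubectl releases also ship a built-in kubectl debug command based on ephemeral containers, which uses different flags)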

# Install kubectl‑debug
curl -Lo kubectl-debug.tar.gz https://github.com/aylei/kubectl-debug/releases/download/v0.1.1/kubectl-debug_0.1.1_linux_amd64.tar.gz
tar -zxvf kubectl-debug.tar.gz kubectl-debug
mv kubectl-debug /usr/local/bin/
# Usage example
kubectl debug <pod-name> --agentless --port-forward=true

stern for multi‑pod log aggregation

# Install stern
wget https://github.com/stern/stern/releases/download/v1.22.0/stern_1.22.0_linux_amd64.tar.gz
tar -zxvf stern_1.22.0_linux_amd64.tar.gz
mv stern /usr/local/bin/
# View logs of all backend pods (the pod query is a regular expression, so quote it)
stern -n production '^backend-'

netshoot pod for network debugging

apiVersion: v1
kind: Pod
metadata:
  name: netshoot
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["sleep", "3600"]

5.4 Preventive Measures

Resource quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"

Pod Disruption Budget (PDB):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: backend

Health‑check best practices:

apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  containers:
  - name: app
    image: webapp:v1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      failureThreshold: 30
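While a startup probe is configured, liveness and readiness checks are held back until it succeeds, so its settings determine how long a slow application is given to boot. The worst-case budget is roughly initialDelaySeconds + failureThreshold × periodSeconds (probe timeouts are ignored here). A small sketch of that arithmetic, using a hypothetical helper name startup_budget_seconds:

```shell
#!/usr/bin/env bash
# startup_budget_seconds INITIAL_DELAY PERIOD FAILURE_THRESHOLD   (hypothetical helper name)
# Prints the approximate number of seconds a container may take to start
# before the kubelet gives up on the startup probe and restarts it.
startup_budget_seconds() {
  local delay="$1" period="$2" failures="$3"
  echo $(( delay + period * failures ))
}
```

For the startupProbe above, startup_budget_seconds 0 10 30 gives 300, that is, five minutes to start; with that in place the 60-second initialDelaySeconds on the liveness probe is no longer what protects a slow start.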

Daily inspection script (k8s_health_check.sh):

#!/bin/bash
# Node status check
echo "=== Node Status ==="
kubectl get nodes -o wide
# Abnormal pods
echo -e "\n=== Abnormal Pods ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Pods with high restarts (one line per pod, reporting its most-restarted container)
echo -e "\n=== Pods with High Restarts ==="
kubectl get pods -A -o json | jq -r '.items[] | ([.status.containerStatuses[]?.restartCount] | max // 0) as $r | select($r > 5) | "\(.metadata.namespace)/\(.metadata.name) - Restarts: \($r)"'
# Top resource usage (requires metrics-server)
echo -e "\n=== Top Resource Consumers ==="
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -10
# Recent warning events
echo -e "\n=== Recent Warning Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "warning\|error" | tail -20

6. Summary and Outlook

Kubernetes fault diagnosis is systematic work that draws on knowledge ranging from infrastructure to hands-on technique. This manual has covered pod lifecycle, resource scheduling, network communication, and storage management, and walked through five real-world cases from problem identification to resolution.

Key takeaways:

Systematic troubleshooting flow: information collection → log analysis → deep diagnosis.

Proficiency with core tools: kubectl describe, logs, get events, and top.

Understanding underlying mechanisms: pod scheduling, networking model, storage binding.

Establishing monitoring: Prometheus metrics, EFK logs, alert rules.

Preventive measures: resource quotas, health checks, PDB, regular inspections.

Future trends:

AIOps for predictive fault detection and automated remediation.

eBPF‑based deep observability (Cilium, Pixie).

Service‑mesh enhancements (Istio, Linkerd) for stronger traffic control and isolation.

GitOps operational model (Argo CD, Flux) for declarative configuration and automated rollbacks.

Edge‑computing extensions (KubeEdge) bringing K8s capabilities to edge nodes.

Continuous learning is essential for SREs. Follow the Kubernetes official blog, CNCF project updates, and engage in community discussions. Treat each incident as an opportunity to improve, document solutions, and automate repetitive tasks. Mastering Kubernetes troubleshooting empowers you to keep services reliable and resilient in the cloud‑native era.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
