Master Kubernetes Troubleshooting: From CrashLoopBackOff to Network Failures
This comprehensive guide walks you through Kubernetes fault diagnosis, covering pod lifecycle issues, resource scheduling, network communication errors, storage mounting problems, and node failures, with step‑by‑step methodologies, essential kubectl commands, real‑world case studies, and best‑practice recommendations to quickly identify and resolve production incidents.
K8s Troubleshooting Handbook: Complete Solutions from Pod Crashes to Network Anomalies
1. Introduction
In the cloud‑native era, Kubernetes has become the de‑facto standard for container orchestration, but its complex architecture brings unprecedented operational challenges. From frequent pod restarts to inter‑service communication failures, scheduling issues to storage mount problems, each fault can affect business stability. Statistics show that over 60% of production K8s incidents stem from configuration errors and resource mismanagement.
This guide adopts a hands‑on approach, systematically outlining Kubernetes troubleshooting methodology across pod lifecycle, networking, and storage management, and demonstrates five real‑world cases. Whether you are a beginner or a seasoned SRE, this manual equips you with the skills to quickly locate and fix issues while building preventive operational thinking.
2. Technical Background
2.1 Review of Key K8s Components
Kubernetes follows a classic Master‑Worker architecture; understanding component interactions is the foundation of troubleshooting:
Control‑plane components:
kube-apiserver : entry point for all operations, handles REST requests
etcd : sole storage for cluster state data
kube-scheduler : makes pod scheduling decisions
kube-controller-manager : manages controllers such as ReplicaSet and Deployment
cloud‑controller‑manager : integrates cloud provider APIs
Node components:
kubelet : node agent that manages pod lifecycle
kube-proxy : maintains network rules to implement Service abstraction
Container runtime : e.g., Docker, containerd, CRI‑O
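As a rough illustration of where these components surface during triage, the sketch below wraps three read-only checks in a shell function. The name check_control_plane is hypothetical, and on managed clusters, where the provider hosts the control plane, the kube-system listing will usually show only node-level add-ons.

```shell
#!/bin/sh
# Hypothetical helper: read-only snapshot of control-plane and node health.
check_control_plane() {
  # Aggregated API server readiness checks (the list includes an etcd check)
  kubectl get --raw='/readyz?verbose'
  # Control-plane and add-on pods; static pods show up here on kubeadm-style clusters
  kubectl get pods -n kube-system -o wide
  # Node conditions as reported by each kubelet
  kubectl get nodes -o wide
}
```

Calling check_control_plane before looking at individual workloads helps separate cluster-level problems from application-level ones.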
2.2 Common Fault Types
Pod status anomalies : Pending, CrashLoopBackOff, ImagePullBackOff, Error
Resource scheduling problems : insufficient resources, affinity conflicts, taint‑toleration mismatches
Network communication faults : Service unreachable, DNS failures, cross‑node connectivity issues
Storage mount issues : PVC binding failures, mount timeouts, permission errors
Node‑level failures : NotReady, disk pressure, memory shortage
Configuration errors : YAML syntax mistakes, RBAC insufficiency, improper resource limits
2.3 Fault Diagnosis Methodology
Three‑step diagnosis:
Step 1: Information collection
# View resource status
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
Step 2: Log analysis
# View container logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous # logs from the previous (crashed) container instance
Step 3: Deep diagnosis
# Exec into container for investigation
kubectl exec -it <pod-name> -- /bin/sh
# View node status
kubectl describe node <node-name>
3. Core Content
3.1 Pod Lifecycle and Status Interpretation
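A pod moves through several phases over its lifecycle, and the STATUS column of kubectl get pods can also show container-level reasons:
Key status explanations:
Pending : waiting for scheduling or resources (e.g., insufficient CPU, image pulling, storage not ready)
Running : bound to a node, with at least one container running or starting
Succeeded : all containers exited successfully (typical for Jobs)
Failed : all containers terminated and at least one failed (non‑zero exit code)
Unknown : unable to obtain status (node communication issue)
CrashLoopBackOff : a container waiting reason rather than a phase; the container keeps crashing and being restarted (application start failure, liveness or startup probe failure)
ImagePullBackOff : a container waiting reason rather than a phase; image pull failure (image missing, authentication error, network issue)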
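The status list above can be turned into a small lookup that suggests a first command for each status. This is only a sketch: the function name first_step_for_status and the chosen mapping are illustrative assumptions, not an official decision tree.

```shell
#!/bin/sh
# Hypothetical helper: map a STATUS value from `kubectl get pods`
# to a reasonable first diagnostic command.
first_step_for_status() {
  case "$1" in
    Pending)
      # Scheduling and volume problems show up in the pod's events
      echo "kubectl describe pod <pod-name>" ;;
    CrashLoopBackOff|Error)
      # The current container may have no logs yet, so read the previous one
      echo "kubectl logs <pod-name> --previous" ;;
    ImagePullBackOff|ErrImagePull)
      # The pull error text appears in the pod's events
      echo "kubectl describe pod <pod-name>" ;;
    Unknown)
      # Usually a node or kubelet communication problem
      echo "kubectl describe node <node-name>" ;;
    *)
      echo "kubectl get pod <pod-name> -o yaml" ;;
  esac
}
```

For example, first_step_for_status CrashLoopBackOff prints the kubectl logs command with --previous.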
Container status inspection:
# Detailed pod status
kubectl get pod <pod-name> -o yaml | grep -A 10 status
# Container restart count
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].restartCount}'
# Container readiness
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].ready}'
3.2 Common Diagnostic Commands
Basic information retrieval:
# List pods in all namespaces
kubectl get pods -A
# Wide view (adds pod IP and node)
kubectl get pods -o wide -n <namespace>
# Full YAML of a pod
kubectl get pod <pod-name> -o yaml
# Detailed description (most used for troubleshooting)
kubectl describe pod <pod-name> -n <namespace>
Log viewing tricks:
# Tail last 100 lines
kubectl logs <pod-name> --tail=100
# Follow logs (like tail -f)
kubectl logs -f <pod-name>
# Specify container in multi‑container pod
kubectl logs <pod-name> -c <container-name>
# View all containers
kubectl logs <pod-name> --all-containers=true
# View previous crash logs
kubectl logs <pod-name> --previous
# Add timestamps
kubectl logs <pod-name> --timestamps=true
# Logs from the last hour
kubectl logs <pod-name> --since=1h
Event inspection:
# Cluster events sorted by time
kubectl get events --sort-by='.lastTimestamp'
# Namespace‑specific events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Events related to a specific pod
kubectl get events --field-selector involvedObject.name=<pod-name>
# Warning‑level events only
kubectl get events --field-selector type=Warning
Resource usage view:
# Node resource usage (requires metrics‑server)
kubectl top nodes
# Pod resource usage
kubectl top pods -n <namespace>
# Specific pod resource usage per container
kubectl top pod <pod-name> --containers
3.3 Log Analysis Techniques
Key points:
Check container exit codes
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
0: normal exit
1: application error
137: killed by SIGKILL (128 + 9), most often OOMKilled
143: terminated by SIGTERM (128 + 15), graceful stop
255: exit code out of range
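The codes above follow the common convention that a value above 128 means the process was terminated by signal number (code - 128). A minimal sketch of that arithmetic, with the hypothetical function name explain_exit_code:

```shell
#!/bin/sh
# Hypothetical helper: interpret a container exit code.
explain_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "normal exit"
  elif [ "$code" -eq 255 ]; then
    echo "exit code out of range"
  elif [ "$code" -gt 128 ]; then
    # Convention: 128 + N means the process was terminated by signal N
    echo "terminated by signal $((code - 128))"
  else
    echo "application-defined error"
  fi
}
```

With this, 137 maps to signal 9 (SIGKILL) and 143 to signal 15 (SIGTERM). A 137 alone does not prove an out-of-memory kill, so it is still worth checking the terminated reason field.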
Detect OOMKilled
kubectl describe pod <pod-name> | grep -i "OOMKilled"
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
Log aggregation queries (e.g., using stern)
# View logs of multiple pods with same label
stern -l app=nginx
# Tail last 50 lines of all pods matching a label selector
kubectl logs -l app=nginx --tail=50
3.4 Resource Limits and Scheduling Issues
Resource configuration example:
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
Scheduling investigation commands:
# Show why pod failed to schedule
kubectl describe pod <pod-name> | grep -A 5 "Events"
# View node available resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Show node labels
kubectl get nodes --show-labels
# Show node taints
kubectl describe node <node-name> | grep Taints
Common scheduling failure reasons:
# Insufficient memory
0/3 nodes are available: 3 Insufficient memory.
# Node selector mismatch
0/3 nodes are available: 3 node(s) don't match node selector.
# Taint‑toleration mismatch
0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
3.5 Storage and Persistence Issues
PVC status view:
# List PVCs
kubectl get pvc -n <namespace>
# List PVs
kubectl get pv
# Detailed PVC info
kubectl describe pvc <pvc-name>
# List storage classes
kubectl get storageclass
Storage configuration example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
3.6 Network Diagnosis Tools and Methods
Service connectivity test:
# Service details
kubectl get svc -o wide
kubectl describe svc <service-name>
# Endpoints
kubectl get endpoints <service-name>
# Test from inside a pod
kubectl exec -it <pod-name> -- curl <service-name>:<port>
kubectl exec -it <pod-name> -- nslookup <service-name>
# Cross‑namespace access
kubectl exec -it <pod-name> -- curl <service-name>.<namespace>.svc.cluster.local
NetworkPolicy view:
kubectl get networkpolicies -n <namespace>
kubectl describe networkpolicy <policy-name>
DNS troubleshooting:
# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS from pod
kubectl exec -it <pod-name> -- nslookup kubernetes.default
4. Practical Cases
Case 1: CrashLoopBackOff Diagnosis and Resolution
Symptom:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
webapp-deployment-7d8f9c 0/1 CrashLoopBackOff 5 3m
Steps:
Inspect pod details
$ kubectl describe pod webapp-deployment-7d8f9c
Check container logs
$ kubectl logs webapp-deployment-7d8f9c
panic: Failed to connect to database: dial tcp 10.0.1.100:3306: connect: connection refused
Check exit code
$ kubectl get pod webapp-deployment-7d8f9c -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
2
Analysis: The application cannot connect to the database and panics at startup; the non‑zero exit code (2) indicates an application‑level error rather than a kill by the kubelet.
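One way to confirm this analysis, and to approximate the retry behaviour the application lacks, is a generic retry wrapper around a connectivity probe. The sketch below is an assumption-laden illustration: the function name retry is hypothetical, and the probe in the usage note presumes a running pod whose image ships nc.

```shell
#!/bin/sh
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the remaining arguments as the command under test
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt/$max_attempts failed: $*" >&2
    attempt=$((attempt + 1))
    # Only sleep if another attempt will follow
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

A possible invocation is retry 5 2 kubectl exec <client-pod> -- nc -z mysql-service 3306, which returns 0 as soon as the port accepts a connection and 1 if all five attempts fail.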
Solution:
Verify database service is reachable
$ kubectl get svc mysql-service
$ kubectl get endpoints mysql-service
Update deployment with health checks and retry logic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:v1.0
        env:
        - name: DB_HOST
          value: "mysql-service"
        - name: DB_RETRY_INTERVAL
          value: "5"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
Apply and verify
$ kubectl apply -f webapp-deployment.yaml
$ kubectl get pods -w
Case 2: ImagePullBackOff Failure
Symptom:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-app-5d7f8b 0/1 ImagePullBackOff 0 2m
Steps:
Inspect detailed error
$ kubectl describe pod nginx-app-5d7f8b
... Failed to pull image "harbor.company.com/prod/nginx:v2.0": rpc error: code = Unknown desc = Error response from daemon: pull access denied for harbor.company.com/prod/nginx, repository does not exist or may require 'docker login'
Create Docker registry secret
$ kubectl create secret docker-registry harbor-secret \
  --docker-server=harbor.company.com \
  --docker-username=admin \
  --docker-password=Harbor12345 \
  --docker-email=<email> -n default
Reference secret in deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      imagePullSecrets:
      - name: harbor-secret
      containers:
      - name: nginx
        image: harbor.company.com/prod/nginx:v2.0
        ports:
        - containerPort: 80
Apply and verify
$ kubectl apply -f nginx-deployment.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-app-7c8d9f 1/1 Running 0 30s
Case 3: Service Unreachable Network Fault
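Symptom: curl from the client pod cannot reach backend-service. Note that a selector mismatch on its own normally surfaces as a refused or timed‑out connection, because a ClusterIP Service name still resolves when it has no endpoints; a resolution error like the one below also warrants the DNS checks from section 3.6.
$ kubectl exec -it client-pod -- curl backend-service:8080
curl: (6) Could not resolve host: backend-service
Investigation: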
Check Service definition – Endpoints empty
$ kubectl get svc backend-service -o wide
$ kubectl describe svc backend-service
Endpoints: <none>
Verify pod labels do not match Service selector
$ kubectl get pods -l app=backend
No resources found
$ kubectl get pods --show-labels
backend-deploy-5f6c7d 1/1 Running app=backend-app,version=v1
Fix selector or pod labels. Example fixing Service selector:
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend-app
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
Apply and verify connectivity
$ kubectl apply -f backend-service.yaml
$ kubectl get endpoints backend-service
NAME ENDPOINTS AGE
backend-service 10.244.1.10:8080,10.244.2.15:8080 1m
$ kubectl exec -it client-pod -- curl backend-service:8080
{"status":"ok","version":"v1.0"}
Case 4: Node NotReady Diagnosis
Node shows NotReady due to DiskPressure and network plugin not ready.
$ kubectl describe node node-2
Conditions:
DiskPressure True KubeletHasDiskPressure kubelet has disk pressure
Ready False KubeletNotReady container runtime not ready: RuntimeReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready
Events:
Warning ContainerRuntimeUnhealthy 5m kubelet container runtime is down: failed to connect to containerd
Resolution steps:
Clean disk space (prune images, logs)
# Clean unused images
crictl rmi --prune
# Clean stopped containers
crictl rm $(crictl ps -a -q --state=Exited)
# Clean old logs
find /var/log/pods -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=7d
Restart containerd and kubelet
systemctl restart containerd
systemctl restart kubelet
Verify node status returns to Ready.
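The final verification step can be scripted as a polling loop over the node's Ready condition. The sketch below uses the hypothetical function name wait_for_node_ready with illustrative default intervals; the built-in kubectl wait --for=condition=Ready node/<node-name> --timeout=300s covers the same need without a custom loop.

```shell
#!/bin/sh
# Hypothetical helper: poll a node's Ready condition until it reports True.
# Usage: wait_for_node_ready <node-name> [max_checks] [interval_seconds]
wait_for_node_ready() {
  node="$1"
  max_checks="${2:-30}"
  interval="${3:-10}"
  i=0
  while [ "$i" -lt "$max_checks" ]; do
    # Extract the status ("True", "False" or "Unknown") of the Ready condition
    status="$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
    if [ "$status" = "True" ]; then
      echo "node $node is Ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "node $node did not become Ready" >&2
  return 1
}
```

For the node in this case, wait_for_node_ready node-2 checks every 10 seconds for up to 30 rounds and returns non-zero if the node never recovers.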
Case 5: Scheduling Failure Due to Resource Shortage
Pod remains Pending because requested memory exceeds node capacity.
$ kubectl describe pod java-app-deployment-7f8d
Events:
Warning FailedScheduling 5m default-scheduler 0/3 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Solutions:
Reduce resource requests in the deployment.
Scale out cluster nodes.
Clean up high‑resource pods.
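When many pods are Pending, it can help to break the FailedScheduling message from this case into one reason per line. The following is a minimal sketch under the assumption that the message has the "N/M nodes are available: reason, reason." shape shown above; newer scheduler versions may append extra clauses that this does not handle. The function name summarize_scheduling_failure is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: split a FailedScheduling message into one reason per line.
summarize_scheduling_failure() {
  printf '%s\n' "$1" |
    # Drop the "0/3 nodes are available: " prefix and the trailing period
    sed -e 's/^[^:]*: //' -e 's/\.$//' |
    # One reason per line, without leading spaces
    tr ',' '\n' |
    sed -e 's/^ *//'
}
```

Applied to the message above, it prints "1 Insufficient cpu" and "2 Insufficient memory" on separate lines, which makes it easy to see which resource blocks the most nodes.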
5. Best Practices
5.1 Monitoring and Alerting Configuration
Prometheus + Grafana monitoring stack with key alerts:
# Pod restart alert (increase() counts restarts over the window; rate() would be per second)
- alert: PodRestartingTooOften
  expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting too frequently"
# Pod not ready alert
- alert: PodNotReady
  expr: kube_pod_status_phase{phase!~"Running|Succeeded"} > 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} in abnormal state"
# Node not ready alert
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} NotReady"
5.2 Log Collection Solution
EFK stack (Elasticsearch, Fluentd, Kibana) with Fluentd DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      # Newer clusters taint control-plane nodes with this key instead
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
5.3 Troubleshooting Toolbox
kubectl plugins (e.g., kubectl‑debug)
# Install kubectl‑debug
curl -Lo kubectl-debug.tar.gz https://github.com/aylei/kubectl-debug/releases/download/v0.1.1/kubectl-debug_0.1.1_linux_amd64.tar.gz
tar -zxvf kubectl-debug.tar.gz kubectl-debug
mv kubectl-debug /usr/local/bin/
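# Usage example (plugin syntax; recent kubectl releases also ship a built-in `kubectl debug` with different flags)
kubectl debug <pod-name> --agentless --port-forward=true
stern for multi‑pod log aggregation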
# Install stern
wget https://github.com/stern/stern/releases/download/v1.22.0/stern_1.22.0_linux_amd64.tar.gz
tar -zxvf stern_1.22.0_linux_amd64.tar.gz
mv stern /usr/local/bin/
# View logs of all backend pods (the pod query is a regular expression; quote it so the shell does not expand it)
stern -n production 'backend-.*'
netshoot pod for network debugging
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["sleep", "3600"]
5.4 Preventive Measures
Resource quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
Pod Disruption Budget (PDB):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: backend
Health‑check best practices:
apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  containers:
  - name: app
    image: webapp:v1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      failureThreshold: 30
Daily inspection script (k8s_health_check.sh):
#!/bin/bash
# Node status check
echo "=== Node Status ==="
kubectl get nodes -o wide
# Abnormal pods
echo -e "\n=== Abnormal Pods ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Pods with high restarts (any container restarted more than 5 times; each pod listed once)
echo -e "\n=== Pods with High Restarts ==="
kubectl get pods -A -o json | jq -r '.items[] | select(any(.status.containerStatuses[]?; .restartCount > 5)) | "\(.metadata.namespace)/\(.metadata.name) - Restarts: \([.status.containerStatuses[].restartCount] | max)"'
# Top resource usage (requires metrics-server)
echo -e "\n=== Top Resource Consumers ==="
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -10
# Recent warning events
echo -e "\n=== Recent Warning Events ==="
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "warning\|error" | tail -20
6. Summary and Outlook
Kubernetes fault diagnosis is a systematic engineering effort that requires knowledge ranging from infrastructure fundamentals to hands‑on techniques. This manual has covered pod lifecycle, resource scheduling, network communication, and storage management, and walked through five real‑world cases that demonstrate complete problem identification and resolution workflows.
Key takeaways:
Systematic troubleshooting flow: information collection → log analysis → deep diagnosis.
Proficiency with core tools: kubectl describe, kubectl logs, and kubectl get events.
Understanding underlying mechanisms: pod scheduling, networking model, storage binding.
Establishing monitoring: Prometheus metrics, EFK logs, alert rules.
Preventive measures: resource quotas, health checks, PDB, regular inspections.
Future trends:
AIOps for predictive fault detection and automated remediation.
eBPF‑based deep observability (Cilium, Pixie).
Service‑mesh enhancements (Istio, Linkerd) for stronger traffic control and isolation.
GitOps operational model (Argo CD, Flux) for declarative configuration and automated rollbacks.
Edge‑computing extensions (KubeEdge) bringing K8s capabilities to edge nodes.
Continuous learning is essential for SREs. Follow the Kubernetes official blog, CNCF project updates, and engage in community discussions. Treat each incident as an opportunity to improve, document solutions, and automate repetitive tasks. Mastering Kubernetes troubleshooting empowers you to keep services reliable and resilient in the cloud‑native era.
MaGe Linux Operations