13 Common Kubernetes Pod Failures and How to Diagnose Them
This article outlines the Kubernetes Pod lifecycle, describes the five Pod phases, and walks through 13 typical failure scenarios, including scheduling, image pull, dependency, init container, probe, and OOM issues. For each scenario it gives the error state and the likely root causes, and it closes with the kubectl commands used for diagnosis and remediation.
Pod Lifecycle Overview
Kubernetes treats a Pod as the fundamental unit of work. At any moment a Pod is in one of five phases (Pending, Running, Succeeded, Failed, or Unknown), each a high-level summary of the workload's state. Detailed conditions such as PodScheduled, Initialized, and Ready explain why a Pod is in a particular phase.
Pod Phases
Pending: The Pod has been accepted by the cluster, but one or more containers are not yet running. This covers time spent waiting for scheduling, image pull, or container creation.
Running: The Pod is bound to a node, all containers have been created, and at least one is running, starting, or restarting.
Succeeded: All containers terminated successfully and will not be restarted.
Failed: All containers have terminated, and at least one exited with a non-zero code or was killed by the system.
Unknown: The Pod's status cannot be obtained, usually because of node-side problems.
Jobs typically end in Succeeded, while Deployments aim to keep Pods in Running until they are deliberately deleted or encounter a failure.
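As a rough illustration of how the phase and its conditions can be read together, the sketch below wraps a single `kubectl get` call with a JSONPath template. The function name `pod_phase` and the fallback to the `default` namespace are assumptions made for this example, not part of the original article.

```shell
# pod_phase prints the Pod's phase on the first line, then one
# "Type=Status Reason" line per condition (Reason is empty when unset).
# Usage: pod_phase <podName> [namespace]   (namespace defaults to "default")
pod_phase() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.phase}{"\n"}{range .status.conditions[*]}{.type}={.status} {.reason}{"\n"}{end}'
}
```

Calling `pod_phase <podName> <namespace>` on a Pod stuck in Pending should make it easy to see which condition (for example PodScheduled) is still False.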
Common Failure Scenarios (13 Cases)
Pod problems can be grouped into two broad categories: issues that occur before a container starts (e.g., scheduling) and issues that appear while a container is running.
1. Scheduling Failure (Unschedulable)
When the scheduler cannot find a node that satisfies the Pod's resource requests and placement rules, the Pod remains in Pending, and its PodScheduled condition is False with the reason Unschedulable.
Insufficient node resources: No node has enough CPU, memory, or disk to meet the summed request values of all containers.
Namespace resource quota exceeded: The Pod's requested resources surpass the quota defined for its namespace. In this case the API server rejects the Pod at creation time, so the failure usually appears as a FailedCreate event on the owning ReplicaSet or Job rather than as a Pending Pod.
NodeSelector mismatch: No node carries the required label.
Affinity/anti-affinity rules not satisfied: Hard (required) affinity constraints block scheduling.
Node taints without tolerations: Pods lacking matching tolerations cannot be placed on tainted nodes.
No ready nodes: All nodes are NotReady or otherwise unavailable.
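The causes listed above can usually be narrowed down from scheduler events, node taints, and namespace quota. The following is a minimal sketch of those three checks; the function names are hypothetical, and the namespace is assumed to default to `default` when omitted.

```shell
# scheduling_events lists FailedScheduling events for one Pod, oldest first.
# Usage: scheduling_events <podName> [namespace]
scheduling_events() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling" \
    --sort-by=.lastTimestamp
}

# node_taints prints each node with its taint keys, for comparison
# against the Pod's tolerations.
node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# namespace_quota shows used versus hard limits for every ResourceQuota.
# Usage: namespace_quota [namespace]
namespace_quota() {
  kubectl describe resourcequota -n "${1:-default}"
}
```

The event message from `scheduling_events` normally names the failing predicate (for example insufficient CPU or an untolerated taint), which tells you which of the other two checks is worth running.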
2. Image Pull Failure (ImagePullBackOff)
The node cannot retrieve the container image. The kubelet first reports ErrImagePull and then, as it backs off between retries, ImagePullBackOff. Typical causes:
Incorrect image name or tag.
Missing or mis‑configured image pull secret for private registries.
Network connectivity problems (firewall, VPC, or public‑access restrictions).
Pull timeout due to large image size or limited bandwidth.
Concurrent pull limits causing throttling.
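To check the first two causes in the list above, it can help to print exactly which image references and pull secrets the Pod is using. This is a small sketch with hypothetical function names; it only reads the Pod spec and assumes the `default` namespace when none is given.

```shell
# pod_images prints "<containerName><TAB><image>" for every container.
# Usage: pod_images <podName> [namespace]
pod_images() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'
}

# pod_pull_secrets prints the names of the imagePullSecrets on the Pod
# (empty output means none are attached to the Pod spec).
# Usage: pod_pull_secrets <podName> [namespace]
pod_pull_secrets() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}'
}
```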
3. Dependency Errors (Error)
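A missing or unreadable ConfigMap, Secret, or PersistentVolume prevents the Pod from starting and keeps it in Pending until the dependency is resolved. Depending on how the object is referenced, the container status typically shows CreateContainerConfigError (environment references) or stays in ContainerCreating with FailedMount events (volumes).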
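A quick way to confirm a suspected missing dependency is to ask the API server whether the referenced object exists in the Pod's namespace. The helper below is a sketch; `dependency_exists` is a hypothetical name, and the object names passed to it are whatever the Pod spec references.

```shell
# dependency_exists returns 0 if the object exists and is readable,
# non-zero otherwise. Output is discarded; only the exit status matters.
# Usage: dependency_exists <configmap|secret|pvc> <name> [namespace]
dependency_exists() {
  kubectl get "$1" "$2" -n "${3:-default}" >/dev/null 2>&1
}

# Example:
#   dependency_exists configmap app-config prod || echo "app-config is missing"
```

Note that a non-zero status can also mean the caller lacks permission to read the object, so the result is a hint rather than proof that the object is absent.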
4. Container Creation Failure (Error)
Container creation can fail because of security policy violations, missing service-account permissions, an absent start command, or malformed command/args specifications.
5. Init Container Failure (CrashLoopBackOff)
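If an init container exits with a non-zero code, the main containers do not start until it succeeds. kubectl get pod reports the status with an Init: prefix, for example Init:CrashLoopBackOff. Check the init container logs for the root cause.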
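Init container logs have to be requested per container with `-c`. The sketch below loops over every init container declared in the Pod spec and prints its logs; `init_logs` is a hypothetical helper name, and the namespace is assumed to default to `default`.

```shell
# init_logs prints the logs of each init container in spec order,
# separated by a header line naming the container.
# Usage: init_logs <podName> [namespace]
init_logs() {
  # The JSONPath output is a space-separated list of container names.
  for c in $(kubectl get pod "$1" -n "${2:-default}" \
               -o jsonpath='{.spec.initContainers[*].name}'); do
    printf '== init container: %s ==\n' "$c"
    kubectl logs "$1" -c "$c" -n "${2:-default}"
  done
}
```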
6. PostStart/PreStop Hook Failure (FailedPostStartHook / FailedPreStopHook)
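A lifecycle hook that returns an error causes the kubelet to kill the container, which is then handled according to the Pod's restart policy. PostStart runs asynchronously with the container's entrypoint, so it may fire before the application is ready. PreStop runs before the termination signal is sent and must finish within the termination grace period.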
7. Readiness Probe Failure
The Pod remains NotReady and receives no traffic. Causes include the application not listening on the expected port, misconfigured probe parameters, or node-level resource pressure.
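Two things are usually worth confirming here: the Pod's Ready condition and the probe failure events recorded by the kubelet (reason `Unhealthy`). The following sketch shows both; the function names are hypothetical and the namespace falls back to `default`.

```shell
# ready_condition prints the status of the Pod's Ready condition
# ("True", "False", or "Unknown").
# Usage: ready_condition <podName> [namespace]
ready_condition() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# probe_events lists probe failure events for the Pod, oldest first.
# Usage: probe_events <podName> [namespace]
probe_events() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=Unhealthy" \
    --sort-by=.lastTimestamp
}
```

The event message states which probe failed and why (for example a refused connection or an HTTP status code), which helps separate a wrong port from a slow application.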
8. Liveness Probe Failure (CrashLoopBackOff)
A failing liveness probe makes the kubelet kill and restart the container, and repeated failures produce a restart loop. This is common during traffic bursts, when the container cannot answer the probe within its timeout.
9. Container Crash (CrashLoopBackOff)
Immediate crashes often stem from missing executable paths, permission issues, or absent foreground processes. Persistent crashes after some runtime indicate application bugs or resource exhaustion.
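Because a crashed container is replaced, the useful output is often in the previous instance's logs rather than the current one. A minimal sketch, assuming hypothetical helper names and a 50-line tail chosen only for illustration:

```shell
# previous_logs prints the last lines logged by the container instance
# that ran before the most recent restart.
# Usage: previous_logs <podName> <containerName> [namespace]
previous_logs() {
  kubectl logs "$1" -c "$2" -n "${3:-default}" --previous --tail=50
}

# restart_counts prints "<containerName><TAB><restartCount>" per container.
# Usage: restart_counts <podName> [namespace]
restart_counts() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
}
```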
10. OOMKilled
When a container exceeds its memory limit, the kernel's OOM killer terminates it with SIGKILL, and the container reports exit code 137 (128 + 9). Typical triggers are JVM heap settings that are not bounded by the container limit, or overall node memory overcommit.
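Exit codes above 128 conventionally encode the terminating signal as 128 plus the signal number. The sketch below decodes that and also reads the last termination reason from the Pod status; both function names are hypothetical.

```shell
# signal_from_exit_code prints the signal number encoded in an exit code
# above 128, or returns 1 if the code does not encode a signal.
# Usage: signal_from_exit_code <exitCode>
signal_from_exit_code() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    return 1
  fi
}

# last_termination prints "<container><TAB><reason><TAB><exitCode>" for the
# previous termination of each container (reason is OOMKilled for this case).
# Usage: last_termination <podName> [namespace]
last_termination() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

For example, `signal_from_exit_code 137` prints 9, the number of SIGKILL. Exit code 137 alone does not prove an OOM kill, so the reason field from `last_termination` is the more reliable indicator.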
11. Pod Eviction (Pod Evicted)
Node-level memory or disk pressure triggers the kubelet to evict lower-QoS Pods. Adjust requests/limits and ephemeral-storage settings to mitigate.
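Evicted Pods stay in the Failed phase with the status reason Evicted until they are cleaned up, so they can be listed along with the kubelet's eviction message. A sketch with a hypothetical function name:

```shell
# evicted_pods prints "<podName><TAB><evictionMessage>" for every evicted
# Pod in the namespace. The field selector narrows the query to Failed Pods;
# the JSONPath filter keeps only those whose status reason is Evicted.
# Usage: evicted_pods [namespace]
evicted_pods() {
  kubectl get pods -n "${1:-default}" --field-selector=status.phase=Failed \
    -o jsonpath='{range .items[?(@.status.reason=="Evicted")]}{.metadata.name}{"\t"}{.status.message}{"\n"}{end}'
}
```

The message typically names the resource that was under pressure, which indicates whether memory requests or ephemeral-storage settings need adjusting.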
12. Unknown State (Unknown)
This state is usually caused by a failing kubelet on the node, which prevents the API server from receiving status updates.
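Since the cause is normally on the node, a reasonable first step is to find the node the Pod is bound to and read that node's Ready condition. The helper below is a sketch under that assumption; `pod_node_ready` is a hypothetical name.

```shell
# pod_node_ready prints "<nodeName> Ready=<status>" for the node hosting
# the Pod. A status of "Unknown" means the control plane has stopped
# hearing from the node. Returns 1 if the Pod is not bound to a node.
# Usage: pod_node_ready <podName> [namespace]
pod_node_ready() {
  node=$(kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.spec.nodeName}')
  [ -n "$node" ] || { echo "pod $1 is not bound to a node" >&2; return 1; }
  printf '%s Ready=' "$node"
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
  printf '\n'
}
```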
13. Stuck in Terminating
Finalizers that do not complete, unresponsive containers, or node failures can leave a Pod in Terminating. Removing the finalizer via kubectl patch or force‑deleting the Pod are common remedies.
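The two remedies mentioned above can be sketched as follows. Both bypass normal cleanup, so they are best treated as a last resort after confirming what the finalizer was waiting for; the function names are hypothetical.

```shell
# pod_finalizers prints the finalizers still present on the Pod.
# Usage: pod_finalizers <podName> [namespace]
pod_finalizers() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.metadata.finalizers}'
}

# clear_finalizers removes all finalizers with a merge patch, which lets
# the API server finish deleting the Pod.
# Usage: clear_finalizers <podName> [namespace]
clear_finalizers() {
  kubectl patch pod "$1" -n "${2:-default}" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}

# force_delete removes the Pod object immediately without waiting for the
# kubelet to confirm that the containers have stopped.
# Usage: force_delete <podName> [namespace]
force_delete() {
  kubectl delete pod "$1" -n "${2:-default}" --grace-period=0 --force
}
```

After a force delete the containers may keep running on an unreachable node, so stateful workloads in particular deserve a check that the old instance is really gone.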
Diagnostic Commands
kubectl get pod <podName> -n <namespace> -o yaml
kubectl describe pod <podName> -n <namespace>
kubectl logs <podName> -c <containerName> -n <namespace>
kubectl exec -it <podName> -n <namespace> -c <containerName> -- <command>
EDAS Troubleshooting Toolkit (Reference)
Alibaba Cloud's EDAS platform aggregates these failure patterns, offering pre-check, event observation, and an interactive cloud-native toolbox (including Arthas, tcpdump, and pod-copy capabilities) to accelerate root-cause analysis.
Alibaba Cloud Native