Mastering Kubernetes Pod Lifecycle: Real‑World Troubleshooting Techniques

This comprehensive guide dissects every stage of the Kubernetes Pod lifecycle, explains underlying mechanisms, and equips operators with practical debugging commands, scripts, and best‑practice configurations to swiftly resolve common production issues such as pending pods, crash loops, slow startups, and network failures.

MaGe Linux Operations

Deep Dive: Kubernetes Pod Lifecycle and Production Troubleshooting

As an operations engineer, have you ever seen a Pod mysteriously stuck in Pending or crashing immediately after start? This article analyzes each detail of the Pod lifecycle from the source code perspective and provides the most practical troubleshooting techniques for production environments.

Preface: Why the Pod lifecycle matters

In my five years of Kubernetes operations, 80% of the production incidents I have seen were related to improper Pod lifecycle management. Understanding the complete lifecycle helps you locate problems quickly and design more robust application architectures.

Today we will explore:

Pod lifecycle's 7 key phases

Source code analysis of each phase

Top 10 common production Pod issues

Effective troubleshooting tools and techniques

1. Deep analysis of Pod lifecycle

1.1 Pod state machine model

The Pod lifecycle can be seen as a state machine in which each state transition has specific trigger conditions and handling logic.

# Example Pod manifest that exercises lifecycle hooks and probes
apiVersion: v1
kind: Pod
metadata:
  name: lifecycle-demo
spec:
  containers:
  - name: app
    image: nginx:1.20
    lifecycle:
      postStart:
        exec:
          command: ["/bin/sh", "-c", "echo 'PostStart hook executed' > /var/log/poststart.log"]
      preStop:
        exec:
          command: ["/bin/sh", "-c", "sleep 15 && echo 'PreStop hook executed' > /var/log/prestop.log"]
    readinessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 3
    livenessProbe:
      httpGet:
        path: /   # the stock nginx image serves no /health endpoint
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 10

1.2 The 7 key phases of Pod lifecycle

Phase 1: Pending (waiting for scheduling)

When a Pod is created it first enters Pending. The Scheduler must select a suitable Node for the Pod.

Core principle:

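The scheduler picks a Node in two steps, historically called Predicates (filtering) and Priorities (scoring). In the current scheduling framework these correspond to the Filter and Score plugin stages.

Filtering removes Nodes that cannot run the Pod, for example because they lack resources or carry taints the Pod does not tolerate.

Scoring ranks the remaining Nodes, and the Pod is bound to the highest‑scoring one.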

// Simplified sketch of the scheduler's core loop (not the actual upstream code)
func (sched *Scheduler) scheduleOne(ctx context.Context) {
    // 1. Get the next Pod from the queue
    podInfo := sched.NextPod()
    pod := podInfo.Pod

    // 2. Filtering (feasibility)
    feasibleNodes, err := sched.findNodesThatFitPod(ctx, pod)
    if err != nil || len(feasibleNodes) == 0 {
        // Pod stays Pending and is retried in a later scheduling attempt
        return
    }

    // 3. Scoring
    priorityList, err := sched.prioritizeNodes(ctx, pod, feasibleNodes)
    if err != nil {
        return
    }

    // 4. Choose the highest-scoring Node
    host, err := sched.selectHost(priorityList)
    if err != nil {
        return
    }

    // 5. Bind the Pod to the Node
    if err := sched.bind(ctx, pod, host); err != nil {
        // Binding failed; the Pod is queued again
        return
    }
}
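
The filter‑then‑score flow above can also be sketched as a small runnable toy in Python. This is an illustration only, not the scheduler's real logic: the Node fields, the single taint check, and the "most CPU left" scoring rule are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int      # free CPU in millicores (hypothetical field)
    free_mem_mi: int     # free memory in MiB (hypothetical field)
    taints: set = field(default_factory=set)


def schedule(nodes, cpu_m, mem_mi, tolerations=frozenset()):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    # Filter: drop nodes that lack resources or carry untolerated taints.
    feasible = [
        n for n in nodes
        if n.free_cpu_m >= cpu_m
        and n.free_mem_mi >= mem_mi
        and n.taints <= set(tolerations)
    ]
    if not feasible:
        return None
    # Score: toy rule, prefer the node with the most CPU left after placement.
    best = max(feasible, key=lambda n: n.free_cpu_m - cpu_m)
    return best.name


nodes = [
    Node("node-a", free_cpu_m=300, free_mem_mi=4096),
    Node("node-b", free_cpu_m=2000, free_mem_mi=8192, taints={"gpu"}),
    Node("node-c", free_cpu_m=1000, free_mem_mi=4096),
]
print(schedule(nodes, cpu_m=500, mem_mi=2048))                       # node-c
print(schedule(nodes, cpu_m=500, mem_mi=2048, tolerations={"gpu"}))  # node-b
print(schedule(nodes, cpu_m=4000, mem_mi=2048))                      # None
```

The last call returns None because no node passes the filter step, which is the situation a Pod stuck in Pending is in.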

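Phase 2: ContainerCreating

After the Pod is bound to a Node, the kubelet on that Node creates the containers: it pulls images, sets up the Pod network, and mounts volumes. During this step kubectl shows ContainerCreating in the STATUS column, while status.phase is still Pending.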

# Monitor container creation steps
# View Pod events
kubectl describe pod <pod-name>
# View kubelet logs
journalctl -u kubelet -f | grep <pod-name>
# View container runtime logs (example for containerd)
crictl ps -a | grep <pod-name>
crictl logs <container-id>

Phase 3: Running

When containers start successfully, the Pod enters Running. However, the application is not yet considered ready until health checks pass.

PostStart Hook timing

lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "echo \"Container started at $(date)\" >> /tmp/startup.log"]

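Key point: the PostStart hook runs asynchronously with the container's main process, so there is no guarantee that it executes before the ENTRYPOINT. The container is not reported as Running until the hook completes, and if the hook fails the container is killed and handled according to the Pod's restartPolicy.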

Phase 4: Ready

Only after passing the readinessProbe does the Pod become Ready and start receiving traffic.

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
  successThreshold: 1

Deep analysis: the Ready condition decides whether the Pod's IP is listed as a ready endpoint of the Services that select it, which makes it the critical routing decision point.
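
How failureThreshold and successThreshold interact is easier to see in a replay. The sketch below is a simplified model written for this article, not kubelet code: it assumes the container starts as not ready and that only consecutive results count toward a threshold.

```python
def readiness_transitions(results, failure_threshold=3, success_threshold=1):
    """Replay probe results (True = success) and return the Ready state after each probe."""
    ready = False              # a container is not ready before its first successful probe
    successes = failures = 0   # lengths of the current consecutive runs
    states = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if not ready and successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if ready and failures >= failure_threshold:
                ready = False
        states.append(ready)
    return states


probes = [False, True, False, False, True, False, False, False, True]
print(readiness_transitions(probes))
```

With the thresholds from the manifest above, two failures in a row do not remove the Pod from the endpoints, but the third consecutive failure does, and a single success brings it back.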

1.3 Health check mechanisms deep dive

Liveness Probe: process‑level health

# Example: Python Flask health endpoints
import time

import psutil
import redis
from flask import Flask, jsonify

app = Flask(__name__)
startup_time = time.time()


@app.route('/health')
def liveness_check():
    # Liveness must not fail just because the app is still starting,
    # otherwise the kubelet restarts it in a loop; use a startupProbe instead.
    uptime = time.time() - startup_time
    # Note: inside a container psutil typically reports node-wide memory.
    memory_percent = psutil.virtual_memory().percent
    if memory_percent > 90:
        return jsonify({'status': 'unhealthy', 'reason': 'memory_exhausted'}), 503
    return jsonify({'status': 'healthy', 'uptime': uptime})


@app.route('/ready')
def readiness_check():
    # Readiness covers dependencies: not ready until Redis answers
    try:
        r = redis.Redis(host='redis-service', port=6379, socket_timeout=1)
        r.ping()
        return jsonify({'status': 'ready'})
    except Exception as e:
        return jsonify({'status': 'not_ready', 'reason': str(e)}), 503


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Startup Probe: rescue for slow‑starting apps

startupProbe:
  httpGet:
    path: /startup
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 30  # wait up to 300 seconds
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

2. Common production Pod problems and practical troubleshooting

2.1 Pod stuck in Pending

Problem symptoms: Pod stays Pending and cannot be scheduled to any Node.

Investigation steps:

# 1. View Pod events
kubectl describe pod <pod-name> | grep -A 10 Events
# 2. Check Node resource status
kubectl describe nodes | grep -A 5 "Allocated resources"
# 3. Check taints and tolerations
kubectl get nodes -o json | jq '.items[].spec.taints'

Common causes & solutions:

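# Cause 1: insufficient resources
# The requests must fit into one Node's free allocatable capacity;
# lower them or add capacity if no Node has 2Gi / 500m available.
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo   # placeholder name
spec:
  containers:
  - name: app
    image: myapp:v1.0   # placeholder image
    resources:
      requests:
        memory: "2Gi"
        cpu: "500m"
      limits:
        memory: "4Gi"
        cpu: "1000m"
---
# Cause 2: node selector misconfiguration
# Every nodeSelector label must exist on at least one Node
# (check with: kubectl get nodes --show-labels).
apiVersion: v1
kind: Pod
metadata:
  name: selector-demo   # placeholder name
spec:
  nodeSelector:
    kubernetes.io/os: linux
  tolerations:
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
  containers:
  - name: app
    image: myapp:v1.0   # placeholder image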
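
To check by hand whether a request such as the one in Cause 1 can fit on a Node, the quantities first have to be converted to comparable numbers. The sketch below is a deliberately partial parser (it handles only the common suffixes), and the allocatable and in‑use figures are invented sample values, not output from a real cluster.

```python
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,   # binary suffixes
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,      # decimal suffixes
}


def parse_cpu(quantity):
    """Convert a CPU quantity such as "500m" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory(quantity):
    """Convert a memory quantity such as "2Gi" or "512M" to bytes."""
    for suffix, factor in _FACTORS.items():
        if quantity.endswith(suffix):
            return round(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)  # no suffix: plain bytes


def fits(request, allocatable, in_use):
    """True if the request fits into allocatable minus what is already requested."""
    free_cpu = parse_cpu(allocatable["cpu"]) - parse_cpu(in_use["cpu"])
    free_mem = parse_memory(allocatable["memory"]) - parse_memory(in_use["memory"])
    return (parse_cpu(request["cpu"]) <= free_cpu
            and parse_memory(request["memory"]) <= free_mem)


request = {"cpu": "500m", "memory": "2Gi"}
allocatable = {"cpu": "4", "memory": "8Gi"}          # sample Node capacity
print(fits(request, allocatable, {"cpu": "3800m", "memory": "3Gi"}))
print(fits(request, allocatable, {"cpu": "3000m", "memory": "3Gi"}))
```

In the first call only 200m of CPU is free, so the Pod would stay Pending on that Node even though memory is plentiful.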

2.2 CrashLoopBackOff

Diagnostic script:

#!/bin/bash
# Prints a restart analysis report for one Pod
POD_NAME=${1:?usage: $0 <pod-name> [namespace]}
NAMESPACE=${2:-default}
echo "=== Pod restart analysis report ==="
echo "Pod: $POD_NAME"
echo "Namespace: $NAMESPACE"
echo "Time: $(date)"

# Current status
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o wide

# Restart history
kubectl get pod "$POD_NAME" -n "$NAMESPACE" -o json | \
  jq -r '.status.containerStatuses[]? | "Container: \(.name), Restarts: \(.restartCount)"'

# Recent logs
kubectl logs "$POD_NAME" -n "$NAMESPACE" --tail=50
# Logs of the previous (crashed) container instance; fails harmlessly if there is none
kubectl logs "$POD_NAME" -n "$NAMESPACE" --previous --tail=50 || true
# Pod events
kubectl describe pod "$POD_NAME" -n "$NAMESPACE" | grep -A 20 Events
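
When reading the restart history it helps to know how long the kubelet waits between restarts. The sketch below models the documented default back‑off (10 s, doubling each time, capped at 300 s). The function name is made up, and the reset that happens after a container has run cleanly for a while is not modeled.

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Back-off delay in seconds before each of the first `restarts` restarts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]


delays = crashloop_delays(7)
print(delays)        # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))   # 910
```

A Pod that has restarted seven times has therefore spent about 15 minutes in back‑off alone, which is why CrashLoopBackOff Pods appear to "do nothing" for long stretches.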

Typical case: memory leak causing restarts

// Problematic Go application example
package main
import (
    "fmt"
    "time"
)
func main() {
    var memoryHog [][]byte
    for {
        data := make([]byte, 1024*1024) // allocate 1MiB
        memoryHog = append(memoryHog, data)
        fmt.Printf("Allocated memory: %d MiB\n", len(memoryHog))
        time.Sleep(1 * time.Second)
        // without proper cleanup, OOMKilled will occur
    }
}
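
To confirm that restarts like these come from the OOM killer rather than from an application error, look at the termination reason in the Pod status. The following sketch works on the JSON that kubectl get pod <pod-name> -o json returns; the sample dictionary is hand‑written for illustration and shows only the fields the function reads.

```python
def oom_killed_containers(pod):
    """Return names of containers whose current or last termination reason is OOMKilled."""
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        for key in ("state", "lastState"):
            terminated = cs.get(key, {}).get("terminated")
            if terminated and terminated.get("reason") == "OOMKilled":
                names.append(cs["name"])
                break
    return names


# Hand-written sample of the relevant part of a Pod object
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 4,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            },
            {
                "name": "sidecar",
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    }
}
print(oom_killed_containers(sample_pod))   # ['app']
```

Against a live cluster, the same function can be fed with json.load(sys.stdin) after piping the kubectl output into the script.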

Solution:

# Apply proper resource limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: memory-safe-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: memory-safe-app
  template:
    metadata:
      labels:
        app: memory-safe-app   # must match spec.selector.matchLabels
    spec:
      containers:
      - name: app
        image: myapp:v1.0
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        env:
        # GOGC=20 makes the Go GC run more often; it lowers peak heap but does not fix a leak
        - name: GOGC
          value: "20"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5

2.3 Slow pod startup

Analysis tools:

# Python helper that prints a Pod's startup event timeline
import json
import subprocess
import sys


def run_kubectl(*args):
    """Run kubectl with JSON output and return the parsed result, or None on error."""
    result = subprocess.run(["kubectl", *args, "-o", "json"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Error: {result.stderr.strip()}")
        return None
    return json.loads(result.stdout)


def get_pod_events(pod_name, namespace="default"):
    events = run_kubectl("get", "events", "-n", namespace,
                         "--field-selector", f"involvedObject.name={pod_name}")
    if events is None:
        return []
    timeline = []
    for event in events['items']:
        timeline.append({
            # Some events carry eventTime instead of firstTimestamp
            'time': event.get('firstTimestamp') or event.get('eventTime') or '',
            'reason': event.get('reason', ''),
            'message': event.get('message', '')
        })
    timeline.sort(key=lambda x: x['time'])
    for i, ev in enumerate(timeline, 1):
        print(f"{i}. [{ev['time']}] {ev['reason']}: {ev['message']}")
    return timeline


def analyze_startup_bottlenecks(pod_name, namespace="default"):
    pod_info = run_kubectl("get", "pod", pod_name, "-n", namespace)
    if pod_info is None:
        return
    print("\nImage analysis:")
    for c in pod_info['spec']['containers']:
        print(f"  - {c['name']}: {c['image']}")
    # Additional image size check could be added here


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit(f"usage: {sys.argv[0]} <pod-name> [namespace]")
    pod = sys.argv[1]
    ns = sys.argv[2] if len(sys.argv) > 2 else "default"
    get_pod_events(pod, ns)
    analyze_startup_bottlenecks(pod, ns)
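
A timeline only becomes useful once the gaps between events are measured. The sketch below takes a list shaped like the timeline returned by get_pod_events above and ranks the gaps. It assumes second‑resolution UTC timestamps in the form 2024-01-01T10:00:00Z, and the sample events are invented for illustration.

```python
from datetime import datetime


def event_gaps(timeline):
    """Return (seconds, from_reason, to_reason) for consecutive events, slowest first."""
    def parse(ts):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

    ordered = sorted(timeline, key=lambda ev: parse(ev["time"]))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        delta = (parse(cur["time"]) - parse(prev["time"])).total_seconds()
        gaps.append((delta, prev["reason"], cur["reason"]))
    return sorted(gaps, reverse=True)


# Invented sample timeline in the same shape as get_pod_events() output
sample = [
    {"time": "2024-01-01T10:00:00Z", "reason": "Scheduled", "message": ""},
    {"time": "2024-01-01T10:00:02Z", "reason": "Pulling", "message": ""},
    {"time": "2024-01-01T10:00:47Z", "reason": "Pulled", "message": ""},
    {"time": "2024-01-01T10:00:48Z", "reason": "Created", "message": ""},
    {"time": "2024-01-01T10:00:49Z", "reason": "Started", "message": ""},
]
for seconds, start, end in event_gaps(sample):
    print(f"{seconds:6.0f}s  {start} -> {end}")
```

In this sample the 45 seconds between Pulling and Pulled dominate, which points at image size or registry throughput rather than at the application.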

Optimization strategies:

# Multi‑stage Dockerfile to reduce image size
FROM alpine:3.18 AS builder
RUN apk add --no-cache go git
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o main .

FROM scratch
COPY --from=builder /app/main /main
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
EXPOSE 8080
ENTRYPOINT ["/main"]

2.4 Network related issues

Network debug pod:

apiVersion: v1
kind: Pod
metadata:
  name: network-debug
spec:
  containers:
  - name: debug
    image: nicolaka/netshoot
    command: ["/bin/bash"]
    args: ["-c","while true; do sleep 30; done"]
    securityContext:
      capabilities:
        add: ["NET_ADMIN","NET_RAW"]

Common commands:

# DNS test
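nslookup kubernetes.default.svc.cluster.local
# 10.96.0.10 is the kubeadm default cluster DNS IP; check yours with: kubectl get svc -n kube-system kube-dns
dig @10.96.0.10 kubernetes.default.svc.cluster.local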
# Connectivity
ping <target-ip>
telnet <service-name> <port>
curl -v http://<service-name>:<port>/health
# Routing table
ip route show
iptables -t nat -L
# Interface info
ip addr show
ss -tuln
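
When the debug image is not available, the same two basic checks (name and TCP reachability) can be done with nothing but the Python standard library. This is a minimal sketch; the function names are made up, and cluster.local is assumed as the cluster domain, which is the usual default but can be changed per cluster.

```python
import socket


def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Build the in-cluster DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def tcp_check(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


print(service_fqdn("kubernetes"))
```

Inside a Pod, tcp_check(service_fqdn("redis-service"), 6379) then answers in one call whether the Service name resolves and the port accepts connections.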

3. Advanced debugging tips and best practices

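3.1 Using kubectl debug

# Current kubectl versions ship a built-in "kubectl debug" command, so no plugin is required.
# On older setups the third-party kubectl-debug plugin can be installed instead:
curl -Lo kubectl-debug.tar.gz https://github.com/aylei/kubectl-debug/releases/download/v0.1.1/kubectl-debug_0.1.1_linux_amd64.tar.gz
tar -zxvf kubectl-debug.tar.gz kubectl-debug
sudo mv kubectl-debug /usr/local/bin/
# Built-in command: copy the Pod and attach a toolbox container that shares its process namespace
kubectl debug <pod-name> -it --image=nicolaka/netshoot --share-processes --copy-to=debug-pod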

3.2 Pod lifecycle monitoring script

#!/bin/bash
NAMESPACE=${1:-default}
INTERVAL=${2:-5}
while true; do
  clear
  echo "=== Pod lifecycle monitor - $(date) ==="
  echo "Pod status summary:"
  kubectl get pods -n "$NAMESPACE" -o json | jq -r '.items | group_by(.status.phase) | map({phase: .[0].status.phase, count: length}) | .[] | "\(.phase): \(.count)"'
  echo "Problem pods:"
  # Sum restarts per Pod so that multi-container Pods print exactly one line
  kubectl get pods -n "$NAMESPACE" -o json | jq -r '.items[] | ([.status.containerStatuses[]?.restartCount] | add // 0) as $restarts | select((.status.phase != "Running" and .status.phase != "Succeeded") or $restarts > 0) | "\(.metadata.name) | \(.status.phase) | Restarts: \($restarts)"'
  echo "Recent events:"
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -5
  sleep "$INTERVAL"
done

3.3 Pod security best practices

# Secure pod example
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: runtime/default
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:v1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]
    volumeMounts:   # belongs to the container, not to its securityContext
    - name: tmp
      mountPath: /tmp
    - name: var-run
      mountPath: /var/run
    resources:
      requests:
        memory: "64Mi"
        cpu: "50m"
      limits:
        memory: "128Mi"
        cpu: "100m"
  volumes:
  - name: tmp
    emptyDir: {}
  - name: var-run
    emptyDir: {}
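
Settings like these are easy to forget, so it can be worth checking manifests automatically. The sketch below is a deliberately small linter over a Pod spec given as a dictionary. The function name and the five rules are choices made for this example; it is not a substitute for an admission policy.

```python
def security_findings(pod_spec):
    """Return human-readable findings for a Pod spec given as a dict."""
    findings = []
    pod_sc = pod_spec.get("securityContext", {})
    for c in pod_spec.get("containers", []):
        name = c.get("name", "<unnamed>")
        sc = c.get("securityContext", {})
        # A container-level runAsNonRoot overrides the Pod-level value
        if not sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False)):
            findings.append(f"{name}: runAsNonRoot is not enabled")
        if sc.get("allowPrivilegeEscalation", True):
            findings.append(f"{name}: allowPrivilegeEscalation is not disabled")
        if not sc.get("readOnlyRootFilesystem", False):
            findings.append(f"{name}: root filesystem is writable")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            findings.append(f"{name}: capabilities are not dropped")
        if "limits" not in c.get("resources", {}):
            findings.append(f"{name}: no resource limits")
    return findings


secure_spec = {
    "securityContext": {"runAsNonRoot": True, "runAsUser": 1000},
    "containers": [{
        "name": "app",
        "image": "myapp:v1.0",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"], "add": ["NET_BIND_SERVICE"]},
        },
        "resources": {"limits": {"memory": "128Mi", "cpu": "100m"}},
    }],
}
bare_spec = {"containers": [{"name": "app", "image": "myapp:v1.0"}]}

print(security_findings(secure_spec))
for finding in security_findings(bare_spec):
    print(finding)
```

The hardened spec produces no findings, while the bare one trips every rule.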

Summary and action recommendations

Mastering the Pod lifecycle not only means understanding state transitions, but also designing robust architectures, automating monitoring, applying preventive resource and security configurations, and continuously learning new Kubernetes features.

Build systematic thinking: understand the scheduling, networking, storage, and security dimensions.

Tool‑driven operations: create automated monitoring and troubleshooting scripts.

Preventive design: use proper resource limits, health checks, and security policies.

Continuous learning: stay aware of evolving Kubernetes features.
