
How I Cut Kubernetes Troubleshooting Time from 30 Minutes to 3 Minutes

This article presents a complete, step‑by‑step method for reducing average Kubernetes fault‑diagnosis time from half an hour to under three minutes, covering the root causes of slow manual debugging, a one‑click diagnostic script, efficient kubectl shortcuts, visual tools, log aggregation, automated response workflows, and real‑world case studies.


Introduction

"Pod crashed again! Why is Ingress unreachable? Why does a PV mount fail?" These alerts woke me up at 3 am. I ran the usual sequence of kubectl get pods, kubectl describe, and kubectl logs, but after 30 minutes the problem was still not located. After six months of practice and optimisation I reduced the average troubleshooting time from 30 minutes to under 3 minutes, with core‑issue identification often under 1 minute. This article shares the complete method, including an automated diagnostic script, efficient command sets, visual tools, and a full troubleshooting workflow.

Technical Background: Why Traditional Kubernetes Troubleshooting Is Slow

Kubernetes Fault Complexity

Kubernetes is a multi‑layered system. A simple application failure may involve any of the following layers:

Application layer: process state, config, logs inside the container

Pod layer: container status, restart count, resource limits, health checks

Service layer: Service, Endpoints, load balancing

Network layer: CNI plugin, NetworkPolicy, Ingress, DNS

Storage layer: PV, PVC, StorageClass, mount status

Scheduling layer: node resources, taints, affinity rules

Control plane: API Server, Controller Manager, Scheduler

Because a fault can exist in any layer—or be a combination of several layers—traditional debugging requires checking each layer sequentially, which is time‑consuming and error‑prone.

Pain Points of Traditional Debugging

Verbose and repetitive commands

# Typical manual debugging flow
kubectl get pods -n production
kubectl describe pod nginx-xxx -n production
kubectl logs nginx-xxx -n production
kubectl logs nginx-xxx -n production --previous
kubectl get svc -n production
kubectl describe svc nginx-service -n production
kubectl get endpoints -n production
kubectl get ingress -n production
# ... more than 10 commands for a basic diagnosis

Scattered log information

Pod logs are on each node

System events are in the Event object

Audit logs are in the API Server

Network logs are in the CNI plugin

Application logs are inside the container

There is no unified log query entry, so locating the problem requires jumping between many places.

Difficult to correlate status information

Pod abnormal → need to check node status

Service unreachable → need to check Endpoints

Ingress 502 → need to check Service, Pod, network

Storage mount failure → need to check PVC, PV, StorageClass

Correlating these pieces requires multiple commands and manual analysis.
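
For illustration, here is what that manual correlation typically looks like for the Service‑to‑Pod case; the namespace, Service name, and label below are placeholders, not values from the incident described later:

# Manually correlating a Service with its backend Pods
NS=production
SVC=my-service

# 1. Read the Service selector
kubectl get svc "$SVC" -n "$NS" -o jsonpath='{.spec.selector}'

# 2. List the Pods that carry matching labels (substitute the selector printed above)
kubectl get pods -n "$NS" -l app=my-app -o wide

# 3. Confirm the Endpoints object actually contains those Pod IPs
kubectl get endpoints "$SVC" -n "$NS"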

Lack of historical comparison

Did resource quota change?

Did replica count change?

Did configuration change?

Did image version change?

Without historical data it is hard to know whether a change caused the issue.
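
When the workload is managed by a Deployment, the rollout history plus a diff against the intended manifest answers most of these questions; the Deployment name, revision number, and manifest file below are placeholders:

# What changed recently?
kubectl rollout history deployment/my-app -n production
kubectl rollout history deployment/my-app -n production --revision=3

# Compare the live object against the manifest you believe should be running
kubectl diff -f my-app-deployment.yaml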

Real Incident: 30‑Minute Painful Debugging

At 14:00 an alert indicated a massive API timeout. The manual steps taken were:

14:02 kubectl get pods – three pods looked normal

14:05 kubectl describe pod – no obvious abnormality

14:08 kubectl logs – no clear error

14:12 kubectl get svc & kubectl get endpoints – only one endpoint

14:15 kubectl describe pod – found two pods with Ready=0/1

14:18 kubectl describe – health check failed

14:22 kubectl exec – port not reachable

14:25 kubectl logs – database connection timeout

14:28 kubectl describe configmap – database config was wrong

The whole process took 30 minutes, even though the root cause was simple.

Core Optimisations: From 30 Minutes to 3 Minutes

Strategy 1 – One‑Click Diagnostic Script (Core Optimisation)

A comprehensive Bash script automates 90% of the manual steps. It checks cluster connectivity, namespace existence, pod status, recent logs, services, ingress, storage, node health, recent events, and finally produces a concise summary.

#!/bin/bash
################################################################################
# Script name: k8s-quick-diagnose.sh
# Description: Kubernetes quick diagnostic script
# Version: v2.0
# Usage: ./k8s-quick-diagnose.sh [namespace] [pod-name-pattern]
################################################################################

# Configuration area
NAMESPACE="${1:-default}"
POD_PATTERN="${2:-.*}"
REPORT_FILE="/tmp/k8s-diagnose-$(date +%Y%m%d_%H%M%S).log"

# Colour definitions (removed for brevity)

log(){ echo -e "[$(date +"%H:%M:%S")] $1" | tee -a "$REPORT_FILE"; }
log_section(){ echo; echo "═══════════════════════════════════════════════════════════"; echo -e "\033[0;34m $1\033[0m" | tee -a "$REPORT_FILE"; echo "═══════════════════════════════════════════════════════════"; }
log_error(){ echo -e "\033[0;31m[✗ ERROR] $1\033[0m" | tee -a "$REPORT_FILE"; }
log_warning(){ echo -e "\033[0;33m[⚠ WARNING] $1\033[0m" | tee -a "$REPORT_FILE"; }
log_ok(){ echo -e "\033[0;32m[✓ OK] $1\033[0m" | tee -a "$REPORT_FILE"; }
log_info(){ echo -e "\033[0;34m[ℹ INFO] $1\033[0m" | tee -a "$REPORT_FILE"; }

check_cluster_connection(){
  log_section "1. Cluster connection check"
  if ! kubectl cluster-info &>/dev/null; then
    log_error "Cannot connect to Kubernetes cluster"
    log_info "Check KUBECONFIG: echo $KUBECONFIG"
    exit 1
  fi
  CONTEXT=$(kubectl config current-context)
  CLUSTER=$(kubectl config view -o jsonpath="{.contexts[?(@.name==\"$CONTEXT\")].context.cluster}")
  log_ok "Cluster connection OK"
  log "Current Context: $CONTEXT"
  log "Current Cluster: $CLUSTER"
  SERVER_VERSION=$(kubectl version --short 2>/dev/null | grep Server | awk '{print $3}')
  # --short was removed in newer kubectl releases; fall back to the default output
  [ -z "$SERVER_VERSION" ] && SERVER_VERSION=$(kubectl version 2>/dev/null | grep "Server Version" | awk '{print $3}')
  log "Cluster version: $SERVER_VERSION"
}

check_namespace(){
  log_section "2. Namespace check"
  if ! kubectl get namespace "$NAMESPACE" &>/dev/null; then
    log_error "Namespace $NAMESPACE does not exist"
    log "Available namespaces:"
    kubectl get namespaces | awk '{print "  "$0}' | tee -a "$REPORT_FILE"
    exit 1
  fi
  log_ok "Namespace $NAMESPACE exists"
  QUOTA=$(kubectl get resourcequota -n "$NAMESPACE" -o name 2>/dev/null)
  if [ -n "$QUOTA" ]; then
    log "ResourceQuota:"
    kubectl describe resourcequota -n "$NAMESPACE" | grep -A 10 "Used" | awk '{print "  "$0}' | tee -a "$REPORT_FILE"
  fi
}

check_pods(){
  log_section "3. Pod status check"
  PODS=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r ".items[] | select(.metadata.name | test(\"$POD_PATTERN\")) | .metadata.name")
  if [ -z "$PODS" ]; then
    log_warning "No pod matches pattern '$POD_PATTERN'"
    return
  fi
  POD_COUNT=$(echo "$PODS" | wc -l)
  log_info "Found $POD_COUNT pods"
  echo "$PODS" | while read pod; do
    STATUS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.phase}')
    READY=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    RESTARTS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[0].restartCount}')
    log "Pod: $pod"
    case "$STATUS" in
      "Running")
        if [ "$READY" == "True" ]; then
          log_ok "  Status: $STATUS (Ready)"
        else
          log_warning "  Status: $STATUS (Not Ready)"
          CONTAINER_STATUS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{range .status.containerStatuses[*]}Container:{.name} Ready:{.ready} State:{.state}{"\n"}{end}')
          echo "$CONTAINER_STATUS" | while read line; do log_warning "    $line"; done
        fi
        ;;
      "Pending")
        log_error "  Status: $STATUS (Pending scheduling)"
        EVENTS=$(kubectl get events -n "$NAMESPACE" --field-selector involvedObject.name="$pod" --sort-by='.lastTimestamp' | tail -5)
        log "  Recent events:"; echo "$EVENTS" | awk '{print "    "$0}' | tee -a "$REPORT_FILE"
        ;;
      "Failed"|"CrashLoopBackOff"|"Error")
        log_error "  Status: $STATUS"
        ;;
      *)
        log_warning "  Status: $STATUS"
        ;;
    esac
    if [ "$RESTARTS" -gt 0 ]; then
      log_warning "  Restarts: $RESTARTS"
      TERMINATED_REASON=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
      if [ -n "$TERMINATED_REASON" ]; then
        log_warning "    Last termination reason: $TERMINATED_REASON"
      fi
    fi
    CPU_REQUEST=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
    MEM_REQUEST=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.requests.memory}')
    CPU_LIMIT=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.limits.cpu}')
    MEM_LIMIT=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.limits.memory}')
    log "  Resource config: Request CPU=$CPU_REQUEST Memory=$MEM_REQUEST"
    log "  Resource config: Limit CPU=$CPU_LIMIT Memory=$MEM_LIMIT"
    NODE=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.nodeName}')
    POD_IP=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.podIP}')
    log "  Node: $NODE"
    log "  Pod IP: $POD_IP"
    HAS_LIVENESS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].livenessProbe}')
    HAS_READINESS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].readinessProbe}')
    if [ -n "$HAS_LIVENESS" ]; then log "  Liveness probe: Configured"; else log_warning "  Liveness probe: Not configured"; fi
    if [ -n "$HAS_READINESS" ]; then log "  Readiness probe: Configured"; else log_warning "  Readiness probe: Not configured"; fi
  done
}

check_pod_logs(){
  log_section "4. Pod logs (last 50 lines)"
  PODS=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r ".items[] | select(.metadata.name | test(\"$POD_PATTERN\")) | .metadata.name")
  echo "$PODS" | while read pod; do
    log "=== Pod: $pod logs ==="
    CONTAINERS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[*].name}')
    CONTAINER_COUNT=$(echo "$CONTAINERS" | wc -w)
    if [ $CONTAINER_COUNT -gt 1 ]; then
      log "Pod has $CONTAINER_COUNT containers: $CONTAINERS"
      for container in $CONTAINERS; do
        log "--- Container: $container ---"
        kubectl logs "$pod" -n "$NAMESPACE" -c "$container" --tail=20 2>&1 | awk '{print "  "$0}' | tee -a "$REPORT_FILE"
      done
    else
      kubectl logs "$pod" -n "$NAMESPACE" --tail=20 2>&1 | awk '{print "  "$0}' | tee -a "$REPORT_FILE"
    fi
    RESTARTS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[0].restartCount}')
    if [ "${RESTARTS:-0}" -gt 0 ]; then
      log_warning "Detected restarts, checking previous crash logs"
      kubectl logs "$pod" -n "$NAMESPACE" --previous --tail=20 2>&1 | awk '{print "  "$0}' | tee -a "$REPORT_FILE"
    fi
  done
}

check_services(){
  log_section "5. Service and Endpoints check"
  SERVICES=$(kubectl get svc -n "$NAMESPACE" -o name 2>/dev/null)
  if [ -z "$SERVICES" ]; then
    log_warning "No Service in namespace $NAMESPACE"
    return
  fi
  echo "$SERVICES" | while read svc_name; do
    SVC=$(echo "$svc_name" | cut -d'/' -f2)
    log "Service: $SVC"
    TYPE=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.type}')
    CLUSTER_IP=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.clusterIP}')
    PORTS=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.ports[*].port}')
    SELECTOR=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.selector}')
    log "  Type: $TYPE"
    log "  ClusterIP: $CLUSTER_IP"
    log "  Ports: $PORTS"
    log "  Selector: $SELECTOR"
    ENDPOINTS=$(kubectl get endpoints "$SVC" -n "$NAMESPACE" -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
    ENDPOINT_COUNT=$(echo "$ENDPOINTS" | wc -w)
    if [ $ENDPOINT_COUNT -eq 0 ]; then
      log_error "  Endpoints: 0 (no backend pods)"
      log_warning "    Possible reasons: 1) Pods not ready 2) Selector mismatch 3) Pods missing"
      if [ -n "$SELECTOR" ]; then
        SELECTOR_STR=$(echo "$SELECTOR" | sed 's/map\[//g' | sed 's/\]//g' | sed 's/ /,/g')
        MATCHING_PODS=$(kubectl get pods -n "$NAMESPACE" -l "$SELECTOR_STR" -o name 2>/dev/null | wc -l)
        log "    Matching pod count: $MATCHING_PODS"
      fi
    else
      log_ok "  Endpoints: $ENDPOINT_COUNT"
      log "    Backend IPs: $ENDPOINTS"
    fi
    if [ "$TYPE" == "LoadBalancer" ]; then
      EXTERNAL_IP=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
      if [ -n "$EXTERNAL_IP" ]; then
        log_ok "  External IP: $EXTERNAL_IP"
      else
        log_warning "  External IP: Pending (awaiting allocation)"
      fi
    fi
  done
}

check_ingress(){
  log_section "6. Ingress check"
  INGRESSES=$(kubectl get ingress -n "$NAMESPACE" -o name 2>/dev/null)
  if [ -z "$INGRESSES" ]; then
    log "No Ingress in namespace $NAMESPACE"
    return
  fi
  echo "$INGRESSES" | while read ing_name; do
    ING=$(echo "$ing_name" | cut -d'/' -f2)
    log "Ingress: $ING"
    CLASS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.ingressClassName}')
    HOSTS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.rules[*].host}')
    log "  IngressClass: $CLASS"
    log "  Hosts: $HOSTS"
    RULES=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{range .spec.rules[*]}{.host}{" -> "}{range .http.paths[*]}{.path}{" -> "}{.backend.service.name}{":"}{.backend.service.port.number}{"\n"}{end}{end}')
    log "  Routing rules:"
    echo "$RULES" | while read rule; do log "    $rule"; done
    ADDRESS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
    if [ -n "$ADDRESS" ]; then
      log_ok "  Address: $ADDRESS"
    else
      log_warning "  Address: Pending (awaiting Ingress controller allocation)"
    fi
    TLS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.tls}')
    if [ -n "$TLS" ]; then
      log_ok "  TLS: Configured"
      TLS_SECRETS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.tls[*].secretName}')
      log "    Secret: $TLS_SECRETS"
    else
      log "  TLS: Not configured"
    fi
  done
}

check_storage(){
  log_section "7. Storage check (PVC/PV)"
  PVCS=$(kubectl get pvc -n "$NAMESPACE" -o name 2>/dev/null)
  if [ -z "$PVCS" ]; then
    log "No PVC in namespace $NAMESPACE"
    return
  fi
  echo "$PVCS" | while read pvc_name; do
    PVC=$(echo "$pvc_name" | cut -d'/' -f2)
    log "PVC: $PVC"
    STATUS=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.status.phase}')
    CAPACITY=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.status.capacity.storage}')
    SC=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.spec.storageClassName}')
    PV=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.spec.volumeName}')
    case "$STATUS" in
      "Bound") log_ok "  Status: $STATUS";;
      "Pending")
        log_error "  Status: $STATUS (waiting for bind)"
        if [ -n "$SC" ]; then
          SC_EXISTS=$(kubectl get storageclass "$SC" -o name 2>/dev/null)
          if [ -z "$SC_EXISTS" ]; then log_error "    StorageClass '$SC' does not exist"; fi
        else
          log_warning "    No StorageClass specified"
        fi
        ;;
      *) log_warning "  Status: $STATUS";;
    esac
    log "  Capacity: $CAPACITY"
    log "  StorageClass: $SC"
    log "  Bound PV: $PV"
    if [ -n "$PV" ]; then
      PV_STATUS=$(kubectl get pv "$PV" -o jsonpath='{.status.phase}' 2>/dev/null)
      PV_PATH=$(kubectl get pv "$PV" -o jsonpath='{.spec.hostPath.path}' 2>/dev/null)
      log "  PV status: $PV_STATUS"
      if [ -n "$PV_PATH" ]; then log "  Host path: $PV_PATH"; fi
    fi
  done
}

check_nodes(){
  log_section "8. Node status check"
  NODES=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r ".items[] | select(.metadata.name | test(\"$POD_PATTERN\")) | .spec.nodeName" | sort -u)
  if [ -z "$NODES" ]; then
    log_warning "No related nodes found"
    return
  fi
  echo "$NODES" | while read node; do
    log "Node: $node"
    STATUS=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    if [ "$STATUS" == "True" ]; then log_ok "  Status: Ready"; else log_error "  Status: Not Ready"; fi
    CPU_CAP=$(kubectl get node "$node" -o jsonpath='{.status.capacity.cpu}')
    MEM_CAP=$(kubectl get node "$node" -o jsonpath='{.status.capacity.memory}')
    CPU_ALLOC=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.cpu}')
    MEM_ALLOC=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.memory}')
    log "  Capacity: CPU=$CPU_CAP Memory=$MEM_CAP"
    log "  Allocatable: CPU=$CPU_ALLOC Memory=$MEM_ALLOC"
    DISK_PRESSURE=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}')
    MEM_PRESSURE=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}')
    PID_PRESSURE=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="PIDPressure")].status}')
    [ "$DISK_PRESSURE" == "True" ] && log_error "  Disk pressure: Yes"
    [ "$MEM_PRESSURE" == "True" ] && log_error "  Memory pressure: Yes"
    [ "$PID_PRESSURE" == "True" ] && log_error "  PID pressure: Yes"
    POD_COUNT=$(kubectl get pods -n "$NAMESPACE" --field-selector spec.nodeName="$node" 2>/dev/null | grep -v NAME | wc -l)
    log "  Pods on node: $POD_COUNT"
  done
}

check_events(){
  log_section "9. Recent events (last 30)"
  EVENTS=$(kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -30)
  if [ -z "$EVENTS" ]; then
    log "No recent events"
    return
  fi
  WARNING_COUNT=$(echo "$EVENTS" | grep -c "Warning" || echo 0)
  ERROR_COUNT=$(echo "$EVENTS" | grep -c "Error" || echo 0)
  log "Warning events: $WARNING_COUNT"
  log "Error events: $ERROR_COUNT"
  echo "$EVENTS" | awk '{print "  "$0}' | tee -a "$REPORT_FILE"
  log_info "Key events:"
  echo "$EVENTS" | grep -E "Failed|Error|BackOff|Unhealthy|FailedMount" | awk '{print "  "$0}' | tee -a "$REPORT_FILE"
}

generate_summary(){
  log_section "10. Diagnostic summary"
  ERROR_COUNT=$(grep -c "\[✗ ERROR\]" "$REPORT_FILE" || true)
  WARNING_COUNT=$(grep -c "\[⚠ WARNING\]" "$REPORT_FILE" || true)
  log "Diagnostic completed at $(date '+%Y-%m-%d %H:%M:%S')"
  log "Namespace: $NAMESPACE"
  log "Pod pattern: $POD_PATTERN"
  if [ $ERROR_COUNT -gt 0 ]; then log_error "Found $ERROR_COUNT errors"; fi
  if [ $WARNING_COUNT -gt 0 ]; then log_warning "Found $WARNING_COUNT warnings"; fi
  if [ $ERROR_COUNT -eq 0 ] && [ $WARNING_COUNT -eq 0 ]; then log_ok "No obvious issues"; fi
  log "Full report saved to: $REPORT_FILE"
  log_section "Problem localisation hints"
  if grep -q "Endpoints: 0" "$REPORT_FILE"; then
    log_error "Service has no available Endpoints"
    log "  → Check if Pods are Ready"
    log "  → Verify Service selector matches Pod labels"
  fi
  if grep -q "CrashLoopBackOff\|Failed" "$REPORT_FILE"; then
    log_error "Pod start failure or crash detected"
    log "  → View pod logs: kubectl logs <pod> -n $NAMESPACE"
    log "  → Check image correctness"
    log "  → Verify container start command and env vars"
  fi
  if grep -q "Pending.*waiting for scheduling" "$REPORT_FILE"; then
    log_error "Pod cannot be scheduled"
    log "  → Check node resource availability"
    log "  → Verify PV/PVC binding"
    log "  → Check node affinity and taint tolerations"
  fi
  if grep -q "Not Ready" "$REPORT_FILE"; then
    log_warning "Pod or node not Ready"
    log "  → Check health‑check configuration"
    log "  → Verify application start‑up time vs initialDelaySeconds"
  fi
}

main(){
  echo "════════════════════════════════════════════════════════════════"
  echo "           Kubernetes Quick Diagnose Script v2.0"
  echo "════════════════════════════════════════════════════════════════"
  echo ""
  if ! command -v kubectl &>/dev/null; then echo "Error: kubectl not installed"; exit 1; fi
  if ! command -v jq &>/dev/null; then echo "Warning: jq not installed, some features may be limited"; fi
  check_cluster_connection
  check_namespace
  check_pods
  check_pod_logs
  check_services
  check_ingress
  check_storage
  check_nodes
  check_events
  generate_summary
  echo ""
  echo "════════════════════════════════════════════════════════════════"
  echo "           Diagnosis completed!"
  echo "════════════════════════════════════════════════════════════════"
}

main "$@"

Usage

# Make the script executable
chmod +x k8s-quick-diagnose.sh

# Diagnose all pods in the default namespace
./k8s-quick-diagnose.sh

# Diagnose a specific namespace
./k8s-quick-diagnose.sh production

# Diagnose pods matching a regex in a namespace
./k8s-quick-diagnose.sh production "nginx.*"

# Run and stream output to a log file
./k8s-quick-diagnose.sh production | tee diagnosis.log

Strategy 2 – Efficient kubectl Command Cheatsheet

Define short aliases to speed up frequent operations:

# Add to ~/.bashrc or ~/.zshrc
alias k='kubectl'
alias kg='kubectl get'
alias kd='kubectl describe'
alias kdel='kubectl delete'
alias kl='kubectl logs'
alias kex='kubectl exec -it'
alias kgp='kubectl get pods'
alias kgn='kubectl get nodes'
alias kgs='kubectl get svc'
alias kgi='kubectl get ingress'
# Advanced aliases
alias kgpa='kubectl get pods --all-namespaces'
alias kgpw='kubectl get pods --watch'
alias kdelp='kubectl delete pod'
alias klf='kubectl logs -f'
alias keti='kubectl exec -it'
# Quick namespace switch
alias kns='kubectl config set-context --current --namespace'
# Resource sorting
alias kgpnode='kubectl get pods -o wide --sort-by=.spec.nodeName'
alias kgprestart='kubectl get pods --sort-by=.status.containerStatuses[0].restartCount'
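
Aliases cover single commands; for multi‑step checks, small shell functions help. Two sketches follow (the function names are my own, adjust to taste):

# Show only pods that are not fully Ready or have restarted
kbad(){
  local ns="${1:-default}"
  kubectl get pods -n "$ns" --no-headers | awk '{split($2, r, "/"); if (r[1] != r[2] || $4+0 > 0) print}'
}

# Follow logs of the newest pod whose name matches a pattern
klast(){
  local ns="${1:-default}" pattern="${2:-.}"
  local pod
  pod=$(kubectl get pods -n "$ns" --sort-by=.metadata.creationTimestamp -o name | grep "$pattern" | tail -1)
  [ -n "$pod" ] && kubectl logs -f "$pod" -n "$ns"
}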

Strategy 3 – Visual Tools to Accelerate Diagnosis

K9s – Powerful Terminal UI

K9s provides a real‑time view of cluster resources. Highlights:

Fast navigation with : commands (e.g., :pods, :svc, :deploy)

Automatic refresh and colour‑coded abnormal states

Quick actions: l for logs, d for describe, e for edit, s for shell, Ctrl‑K to delete, / to filter

Multi‑container support with easy switching

Lens – Enterprise‑grade Desktop Client

Lens offers a graphical IDE for Kubernetes with features such as multi‑cluster management, topology visualisation, built‑in Prometheus monitoring, and integrated terminal.

Strategy 4 – Log Aggregation for Rapid Root‑Cause Identification

Option 1: ELK/EFK Stack

Architecture: Pods → Fluent Bit → Elasticsearch → Kibana. Deploy Fluent Bit as a DaemonSet to ship container logs to Elasticsearch and visualise them in Kibana.
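
A minimal way to get the collector running is the official Fluent Bit Helm chart; the namespace and values file below are placeholders, and the Elasticsearch output is configured through the chart's values rather than shown here:

# Deploy Fluent Bit as a DaemonSet via the official Helm chart
# (point its output at your Elasticsearch through the values file; the file name is a placeholder)
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit \
  --namespace logging \
  --create-namespace \
  --values fluent-bit-values.yaml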

Option 2: Loki (lighter weight)

Loki integrates tightly with Grafana. Deploy via Helm:

# Helm installation
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
  --namespace=logging \
  --create-namespace \
  --set grafana.enabled=true

Query examples:

# Error logs in a namespace
{namespace="production"} |= "error" | json
# Logs from a specific pod
{pod=~"nginx-.*"} |= "exception"
# Error rate over 5 minutes
rate({namespace="production"} |= "error" [5m])

Option 3: Stern – Multi‑Pod Log Aggregation

Stern streams logs from multiple pods simultaneously.

# Install Stern
brew install stern   # macOS
# Or download binary for Linux
curl -LO https://github.com/stern/stern/releases/download/v1.25.0/stern_1.25.0_linux_amd64.tar.gz
# Usage example
stern nginx -n production          # All nginx pods
stern nginx -n production -c nginx  # Specific container
stern nginx -n production --tail=100 | grep -i error

Strategy 5 – Automated Incident Response Workflow

A decision tree classifies common failures and triggers specialised scripts.

# Failure classification decision tree
# Alert
# ├─ Pod abnormal
# │   ├─ CrashLoopBackOff → check logs → verify start command/config
# │   ├─ ImagePullBackOff → check image name/registry credentials
# │   ├─ Pending → check node resources/PV status/affinity
# │   └─ Error → inspect Events → check resource limits
# ├─ Service unreachable
# │   ├─ Check Endpoints count
# │   ├─ Verify Pod Ready status
# │   └─ Ensure selector matches pod labels
# ├─ Ingress 502/503
# │   ├─ Check Service → Endpoints → Pod health
# │   ├─ Inspect Ingress controller logs
# │   └─ Validate routing rules
# └─ Storage issues
#     ├─ PVC Pending → verify StorageClass/PV availability
#     ├─ Mount failure → check node permissions/path existence
#     └─ Out of space → expand or clean up

Example scripts for pod‑crash handling (pod-crash-handler.sh) and service troubleshooting (service-troubleshoot.sh) are provided in the original article; a minimal sketch of the pod‑crash handler follows.
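
The sketch below only illustrates the shape of such a handler: it collects the evidence the decision tree asks for and is not the original script.

#!/bin/bash
# Minimal sketch of a pod-crash handler (not the original pod-crash-handler.sh)
NAMESPACE="${1:?usage: $0 <namespace> <pod>}"
POD="${2:?usage: $0 <namespace> <pod>}"

echo "== Last container state =="
kubectl describe pod "$POD" -n "$NAMESPACE" | grep -A 5 "Last State"

echo "== Current logs =="
kubectl logs "$POD" -n "$NAMESPACE" --tail=50

echo "== Previous crash logs (if any) =="
kubectl logs "$POD" -n "$NAMESPACE" --previous --tail=50 2>/dev/null

echo "== Related events =="
kubectl get events -n "$NAMESPACE" \
  --field-selector involvedObject.name="$POD" \
  --sort-by='.lastTimestamp' | tail -10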

Practical Case: 30‑Minute vs 3‑Minute Real‑World Comparison

Before Optimisation (30 minutes)

At 19:05 an alert indicated a high error rate on the payment API. The engineer manually ran a series of kubectl commands, inspected pods, services, endpoints, and logs, discovered a database connection timeout caused by a full disk, and finally cleaned up the disk. Total time: 30 minutes.

After Optimisation (3 minutes)

Using K9s the engineer instantly located the failing pod, viewed logs, switched to the related Service, saw missing endpoints, inspected the problematic database pods, entered the container shell, identified the full disk, ran an automated cleanup script, and the system recovered within 3 minutes.

Tools & Scripts Summary

A cheat sheet (k8s-cheatsheet.md) consolidates the most useful kubectl one‑liners, and an installation script (install-k8s-tools.sh) sets up K9s, Stern, kubectx/kubens, jq, and the kubectl‑tree plugin; a sketch of such an installer follows.
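
This is only a sketch of what the installer might look like, assuming Homebrew (macOS or Linuxbrew) and the kubectl krew plugin manager are already available:

#!/bin/bash
# Sketch of install-k8s-tools.sh (assumes Homebrew and krew are already installed)
set -e

brew install k9s       # terminal UI
brew install stern     # multi-pod log tailing
brew install kubectx   # installs kubectx and kubens
brew install jq        # JSON processing used by the diagnostic script

# kubectl-tree ships as a krew plugin
kubectl krew install tree

echo "Done. Try: k9s, stern --version, kubectx, kubectl tree --help"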

Monitoring Alert Recommendations

Define Prometheus rules that catch frequent restarts, pods that stay not ready, services without endpoints, and similar conditions. Example:

groups:
- name: kubernetes-pods
  rules:
  - alert: PodRestartingFrequently
    expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting frequently"
      description: "Pod has restarted {{ $value }} times in the last 15 minutes"
  - alert: PodNotReady
    expr: kube_pod_status_ready{condition="false"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
- name: kubernetes-services
  rules:
  - alert: ServiceWithoutEndpoints
    expr: kube_endpoint_address_available == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Service {{ $labels.namespace }}/{{ $labels.service }} has no available Endpoints"

Best‑Practice Checklist

✓ Configure liveness and readiness probes for all containers (a quick audit one‑liner follows this checklist)

✓ Set sensible resource requests and limits

✓ Use structured (JSON) logs with TraceID/RequestID

✓ Define clear labels and annotations for selector matching

✓ Prefer automated diagnostic scripts over manual command sequences

✓ Keep complete diagnostic logs for post‑mortem analysis

✓ Maintain a fault‑classification SOP and run regular drills

✓ Use visual tools (K9s, Lens) for quick state overview

✓ Aggregate logs centrally (ELK, Loki, Stern)

✓ Set up multi‑layer alerts (Pod, Node, Service) with proper thresholds
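
As a quick audit for the first two items (probes and resource limits), a jq one‑liner over the Pod spec lists the offenders; adjust the scope to your own namespaces:

# List containers missing a liveness probe or resource limits (requires jq)
kubectl get pods -A -o json | jq -r '
  .items[] | . as $p
  | $p.spec.containers[]
  | select(.livenessProbe == null or .resources.limits == null)
  | "\($p.metadata.namespace)/\($p.metadata.name)  container=\(.name)"'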

Conclusion and Outlook

Reducing troubleshooting time from 30 minutes to 3 minutes is not just a speed win; it represents a systematic shift in how we approach Kubernetes incidents. By combining automation, efficient command shortcuts, visual tools, log aggregation, and standardised response procedures, we achieve a robust, repeatable workflow that dramatically cuts business impact.

Future directions include AI‑assisted root‑cause analysis, deeper service‑mesh observability (Istio/Linkerd), eBPF‑based performance tracing, and GitOps‑driven configuration management. Regardless of the tools, a solid understanding of Kubernetes internals and a disciplined process remain the foundation of effective operations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Monitoring, automation, devops, Troubleshooting, scripts, k9s, cloud‑native
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
