How I Cut Kubernetes Troubleshooting Time from 30 Minutes to 3 Minutes
This article presents a complete, step‑by‑step method for reducing average Kubernetes fault‑diagnosis time from half an hour to under three minutes, covering the root causes of slow manual debugging, a one‑click diagnostic script, efficient kubectl shortcuts, visual tools, log aggregation, automated response workflows, and real‑world case studies.
Introduction
"Pod crashed again! Why is Ingress unreachable? Why does a PV mount fail?" Alerts like these woke me up at 3 am. I would run the usual sequence of kubectl get pods, kubectl describe, and kubectl logs, yet 30 minutes later I still had not located the problem. After six months of practice and optimisation I cut the average troubleshooting time from 30 minutes to under 3 minutes, with core‑issue identification often taking less than a minute. This article shares the complete method: an automated diagnostic script, efficient command sets, visual tools, and a full troubleshooting workflow.
Technical Background: Why Traditional Kubernetes Troubleshooting Is Slow
Kubernetes Fault Complexity
Kubernetes is a multi‑layered system. A simple application failure may involve any of the following layers:
Application layer : process state, config, logs inside the container
Pod layer : container status, restart count, resource limits, health checks
Service layer : Service, Endpoints, load balancing
Network layer : CNI plugin, NetworkPolicy, Ingress, DNS
Storage layer : PV, PVC, StorageClass, mount status
Scheduling layer : node resources, taints, affinity rules
Control plane : API Server, Controller Manager, Scheduler
Because a fault can exist in any layer—or be a combination of several layers—traditional debugging requires checking each layer sequentially, which is time‑consuming and error‑prone.
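A first pass of triage can still be quick if you run one check per layer. The listing below is illustrative only: `myapp` and `<node>` are placeholders for your own workload and node names, not values from this article.

```shell
# One quick check per layer ("myapp" and "<node>" are placeholders)
kubectl logs deploy/myapp --tail=20                  # application layer: recent log lines
kubectl get pods -l app=myapp -o wide                # pod layer: status, restarts, node placement
kubectl get endpoints myapp                          # service layer: backing pod IPs
kubectl get ingress,networkpolicy -A                 # network layer: routing and policies
kubectl get pvc,pv                                   # storage layer: binding status
kubectl describe node <node> | grep -A 5 Taints      # scheduling layer: taints
kubectl get --raw='/readyz?verbose'                  # control plane: component health checks
```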
Pain Points of Traditional Debugging
Verbose and repetitive commands
# Typical manual debugging flow
kubectl get pods -n production
kubectl describe pod nginx-xxx -n production
kubectl logs nginx-xxx -n production
kubectl logs nginx-xxx -n production --previous
kubectl get svc -n production
kubectl describe svc nginx-service -n production
kubectl get endpoints -n production
kubectl get ingress -n production
# ... more than 10 commands for a basic diagnosis
Scattered log information
Pod logs are on each node
System events are in the Event object
Audit logs are in the API Server
Network logs are in the CNI plugin
Application logs are inside the container
There is no unified log query entry, so locating the problem requires jumping between many places.
Difficult to correlate status information
Pod abnormal → need to check node status
Service unreachable → need to check Endpoints
Ingress 502 → need to check Service, Pod, network
Storage mount failure → need to check PVC, PV, StorageClass
Correlating these pieces requires multiple commands and manual analysis.
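This correlation can itself be scripted. As a sketch (`svc_backends` is a made-up helper name, not a kubectl feature), the function below reads a Service's selector and lists only the pods it actually matches, in one call:

```shell
# Sketch: list the pods a Service's selector actually matches.
# svc_backends is an illustrative name; not part of kubectl.
svc_backends() {
  local svc="$1" ns="${2:-default}"
  local sel
  # Render the selector map as key=value,key=value via a go-template
  sel=$(kubectl get svc "$svc" -n "$ns" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' 2>/dev/null)
  sel=${sel%,}   # drop the trailing comma
  if [ -z "$sel" ]; then
    echo "Service $svc has no selector (headless or external?)" >&2
    return 1
  fi
  # Pods matching the selector, with Ready state and node placement
  kubectl get pods -n "$ns" -l "$sel" -o wide
}
```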
Lack of historical comparison
Did resource quota change?
Did replica count change?
Did configuration change?
Did image version change?
Without historical data it is hard to know whether a change caused the issue.
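One low-effort source of history is the rollout record Kubernetes already keeps. As a sketch (`deploy_changes` is an illustrative helper, not a standard command), this surfaces recent revisions plus the current replicas and image for comparison:

```shell
# Sketch: summarize what changed recently on a Deployment.
# deploy_changes is an illustrative name, not a standard command.
deploy_changes() {
  local deploy="$1" ns="${2:-default}"
  # Revision history of past rollouts (change-cause shows up if recorded)
  kubectl rollout history deployment/"$deploy" -n "$ns"
  # Current replicas and image, for comparison against the last known-good state
  kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.replicas}{" replicas, image: "}{.spec.template.spec.containers[0].image}{"\n"}'
  # To spot config drift, diff a local manifest against the live object:
  # kubectl diff -f deployment.yaml -n "$ns"
}
```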
Real Incident: 30‑Minute Painful Debugging
At 14:00 an alert indicated a massive API timeout. The manual steps taken were:
14:02 kubectl get pods – three pods looked normal
14:05 kubectl describe pod – no obvious abnormality
14:08 kubectl logs – no clear error
14:12 kubectl get svc & kubectl get endpoints – only one endpoint
14:15 kubectl describe pod – found two pods with Ready=0/1
14:18 kubectl describe – health check failed
14:22 kubectl exec – port not reachable
14:25 kubectl logs – database connection timeout
14:28 kubectl describe configmap – database config was wrong
The whole process took 30 minutes, even though the root cause was simple.
Core Optimisations: From 30 Minutes to 3 Minutes
Strategy 1 – One‑Click Diagnostic Script (Core Optimisation)
A comprehensive Bash script automates 90 % of the manual steps. It checks cluster connectivity, namespace existence, pod status, recent logs, services, ingress, storage, node health, recent events, and finally produces a concise summary.
#!/bin/bash
################################################################################
# Script name: k8s-quick-diagnose.sh
# Description: Kubernetes quick diagnostic script
# Version: v2.0
# Usage: ./k8s-quick-diagnose.sh [namespace] [pod-name-pattern]
################################################################################
# Configuration area
NAMESPACE="${1:-default}"
POD_PATTERN="${2:-.*}"
REPORT_FILE="/tmp/k8s-diagnose-$(date +%Y%m%d_%H%M%S).log"
# Colour definitions (removed for brevity)
log(){ echo -e "[$(date +"%H:%M:%S")] $1" | tee -a "$REPORT_FILE"; }
log_section(){ echo; echo "═══════════════════════════════════════════════════════════"; echo -e "\033[0;34m $1\033[0m" | tee -a "$REPORT_FILE"; echo "═══════════════════════════════════════════════════════════"; }
log_error(){ echo -e "\033[0;31m[✗ ERROR] $1\033[0m" | tee -a "$REPORT_FILE"; }
log_warning(){ echo -e "\033[0;33m[⚠ WARNING] $1\033[0m" | tee -a "$REPORT_FILE"; }
log_ok(){ echo -e "\033[0;32m[✓ OK] $1\033[0m" | tee -a "$REPORT_FILE"; }
log_info(){ echo -e "\033[0;34m[ℹ INFO] $1\033[0m" | tee -a "$REPORT_FILE"; }
check_cluster_connection(){
log_section "1. Cluster connection check"
if ! kubectl cluster-info &>/dev/null; then
log_error "Cannot connect to Kubernetes cluster"
log_info "Check KUBECONFIG: echo \$KUBECONFIG"
exit 1
fi
CONTEXT=$(kubectl config current-context)
CLUSTER=$(kubectl config view -o jsonpath="{.contexts[?(@.name==\"$CONTEXT\")].context.cluster}")
log_ok "Cluster connection OK"
log "Current Context: $CONTEXT"
log "Current Cluster: $CLUSTER"
SERVER_VERSION=$(kubectl version -o json 2>/dev/null | jq -r '.serverVersion.gitVersion')  # "--short" was removed in newer kubectl
log "Cluster version: $SERVER_VERSION"
}
check_namespace(){
log_section "2. Namespace check"
if ! kubectl get namespace "$NAMESPACE" &>/dev/null; then
log_error "Namespace $NAMESPACE does not exist"
log "Available namespaces:"
kubectl get namespaces | awk '{print " "$0}' | tee -a "$REPORT_FILE"
exit 1
fi
log_ok "Namespace $NAMESPACE exists"
QUOTA=$(kubectl get resourcequota -n "$NAMESPACE" -o name 2>/dev/null)
if [ -n "$QUOTA" ]; then
log "ResourceQuota:"
kubectl describe resourcequota -n "$NAMESPACE" | grep -A 10 "Used" | awk '{print " "$0}' | tee -a "$REPORT_FILE"
fi
}
check_pods(){
log_section "3. Pod status check"
PODS=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r ".items[] | select(.metadata.name | test(\"$POD_PATTERN\")) | .metadata.name")
if [ -z "$PODS" ]; then
log_warning "No pod matches pattern '$POD_PATTERN'"
return
fi
POD_COUNT=$(echo "$PODS" | wc -l)
log_info "Found $POD_COUNT pods"
echo "$PODS" | while read pod; do
STATUS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.phase}')
READY=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
RESTARTS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[0].restartCount}')
log "Pod: $pod"
case "$STATUS" in
"Running")
if [ "$READY" == "True" ]; then
log_ok " Status: $STATUS (Ready)"
else
log_warning " Status: $STATUS (Not Ready)"
CONTAINER_STATUS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{range .status.containerStatuses[*]}Container:{.name} Ready:{.ready} State:{.state}{"\n"}{end}')
echo "$CONTAINER_STATUS" | while read line; do log_warning " $line"; done
fi
;;
"Pending")
log_error " Status: $STATUS (Pending scheduling)"
EVENTS=$(kubectl get events -n "$NAMESPACE" --field-selector involvedObject.name="$pod" --sort-by='.lastTimestamp' | tail -5)
log " Recent events:"; echo "$EVENTS" | awk '{print " "$0}' | tee -a "$REPORT_FILE"
;;
"Failed"|"Unknown") # valid .status.phase values; CrashLoopBackOff shows up as a container state, not a phase
log_error " Status: $STATUS"
;;
*)
log_warning " Status: $STATUS"
;;
esac
if [ "${RESTARTS:-0}" -gt 0 ]; then
log_warning " Restarts: $RESTARTS"
TERMINATED_REASON=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')
if [ -n "$TERMINATED_REASON" ]; then
log_warning " Last termination reason: $TERMINATED_REASON"
fi
fi
CPU_REQUEST=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.requests.cpu}')
MEM_REQUEST=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.requests.memory}')
CPU_LIMIT=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.limits.cpu}')
MEM_LIMIT=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].resources.limits.memory}')
log " Resource config: Request CPU=$CPU_REQUEST Memory=$MEM_REQUEST"
log " Resource config: Limit CPU=$CPU_LIMIT Memory=$MEM_LIMIT"
NODE=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.nodeName}')
POD_IP=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.podIP}')
log " Node: $NODE"
log " Pod IP: $POD_IP"
HAS_LIVENESS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].livenessProbe}')
HAS_READINESS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].readinessProbe}')
if [ -n "$HAS_LIVENESS" ]; then log " Liveness probe: Configured"; else log_warning " Liveness probe: Not configured"; fi
if [ -n "$HAS_READINESS" ]; then log " Readiness probe: Configured"; else log_warning " Readiness probe: Not configured"; fi
done
}
check_pod_logs(){
log_section "4. Pod logs (last 50 lines)"
PODS=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r ".items[] | select(.metadata.name | test(\"$POD_PATTERN\")) | .metadata.name")
echo "$PODS" | while read pod; do
log "=== Pod: $pod logs ==="
CONTAINERS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.spec.containers[*].name}')
CONTAINER_COUNT=$(echo "$CONTAINERS" | wc -w)
if [ $CONTAINER_COUNT -gt 1 ]; then
log "Pod has $CONTAINER_COUNT containers: $CONTAINERS"
for container in $CONTAINERS; do
log "--- Container: $container ---"
kubectl logs "$pod" -n "$NAMESPACE" -c "$container" --tail=20 2>&1 | awk '{print " "$0}' | tee -a "$REPORT_FILE"
done
else
kubectl logs "$pod" -n "$NAMESPACE" --tail=20 2>&1 | awk '{print " "$0}' | tee -a "$REPORT_FILE"
fi
RESTARTS=$(kubectl get pod "$pod" -n "$NAMESPACE" -o jsonpath='{.status.containerStatuses[0].restartCount}')
if [ "${RESTARTS:-0}" -gt 0 ]; then
log_warning "Detected restarts, checking previous crash logs"
kubectl logs "$pod" -n "$NAMESPACE" --previous --tail=20 2>&1 | awk '{print " "$0}' | tee -a "$REPORT_FILE"
fi
done
}
check_services(){
log_section "5. Service and Endpoints check"
SERVICES=$(kubectl get svc -n "$NAMESPACE" -o name 2>/dev/null)
if [ -z "$SERVICES" ]; then
log_warning "No Service in namespace $NAMESPACE"
return
fi
echo "$SERVICES" | while read svc_name; do
SVC=$(echo "$svc_name" | cut -d'/' -f2)
log "Service: $SVC"
TYPE=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.type}')
CLUSTER_IP=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.clusterIP}')
PORTS=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.ports[*].port}')
SELECTOR=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.spec.selector}')
log " Type: $TYPE"
log " ClusterIP: $CLUSTER_IP"
log " Ports: $PORTS"
log " Selector: $SELECTOR"
ENDPOINTS=$(kubectl get endpoints "$SVC" -n "$NAMESPACE" -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
ENDPOINT_COUNT=$(echo "$ENDPOINTS" | wc -w)
if [ $ENDPOINT_COUNT -eq 0 ]; then
log_error " Endpoints: 0 (no backend pods)"
log_warning " Possible reasons: 1) Pods not ready 2) Selector mismatch 3) Pods missing"
if [ -n "$SELECTOR" ]; then
SELECTOR_STR=$(echo "$SELECTOR" | sed -e 's/map\[//g' -e 's/\]//g' -e 's/:/=/g' -e 's/ /,/g')  # label selectors need key=value, not key:value
MATCHING_PODS=$(kubectl get pods -n "$NAMESPACE" -l "$SELECTOR_STR" -o name 2>/dev/null | wc -l)
log " Matching pod count: $MATCHING_PODS"
fi
else
log_ok " Endpoints: $ENDPOINT_COUNT"
log " Backend IPs: $ENDPOINTS"
fi
if [ "$TYPE" == "LoadBalancer" ]; then
EXTERNAL_IP=$(kubectl get svc "$SVC" -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
if [ -n "$EXTERNAL_IP" ]; then
log_ok " External IP: $EXTERNAL_IP"
else
log_warning " External IP: Pending (awaiting allocation)"
fi
fi
done
}
check_ingress(){
log_section "6. Ingress check"
INGRESSES=$(kubectl get ingress -n "$NAMESPACE" -o name 2>/dev/null)
if [ -z "$INGRESSES" ]; then
log "No Ingress in namespace $NAMESPACE"
return
fi
echo "$INGRESSES" | while read ing_name; do
ING=$(echo "$ing_name" | cut -d'/' -f2)
log "Ingress: $ING"
CLASS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.ingressClassName}')
HOSTS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.rules[*].host}')
log " IngressClass: $CLASS"
log " Hosts: $HOSTS"
RULES=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{range .spec.rules[*]}{.host}{" -> "}{range .http.paths[*]}{.path}{" -> "}{.backend.service.name}{":"}{.backend.service.port.number}{"\n"}{end}{end}')
log " Routing rules:"
echo "$RULES" | while read rule; do log " $rule"; done
ADDRESS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
if [ -n "$ADDRESS" ]; then
log_ok " Address: $ADDRESS"
else
log_warning " Address: Pending (awaiting Ingress controller allocation)"
fi
TLS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.tls}')
if [ -n "$TLS" ]; then
log_ok " TLS: Configured"
TLS_SECRETS=$(kubectl get ingress "$ING" -n "$NAMESPACE" -o jsonpath='{.spec.tls[*].secretName}')
log " Secret: $TLS_SECRETS"
else
log " TLS: Not configured"
fi
done
}
check_storage(){
log_section "7. Storage check (PVC/PV)"
PVCS=$(kubectl get pvc -n "$NAMESPACE" -o name 2>/dev/null)
if [ -z "$PVCS" ]; then
log "No PVC in namespace $NAMESPACE"
return
fi
echo "$PVCS" | while read pvc_name; do
PVC=$(echo "$pvc_name" | cut -d'/' -f2)
log "PVC: $PVC"
STATUS=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.status.phase}')
CAPACITY=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.status.capacity.storage}')
SC=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.spec.storageClassName}')
PV=$(kubectl get pvc "$PVC" -n "$NAMESPACE" -o jsonpath='{.spec.volumeName}')
case "$STATUS" in
"Bound") log_ok " Status: $STATUS";;
"Pending")
log_error " Status: $STATUS (waiting for bind)"
if [ -n "$SC" ]; then
SC_EXISTS=$(kubectl get storageclass "$SC" -o name 2>/dev/null)
if [ -z "$SC_EXISTS" ]; then log_error " StorageClass '$SC' does not exist"; fi
else
log_warning " No StorageClass specified"
fi
;;
*) log_warning " Status: $STATUS";;
esac
log " Capacity: $CAPACITY"
log " StorageClass: $SC"
log " Bound PV: $PV"
if [ -n "$PV" ]; then
PV_STATUS=$(kubectl get pv "$PV" -o jsonpath='{.status.phase}' 2>/dev/null)
PV_PATH=$(kubectl get pv "$PV" -o jsonpath='{.spec.hostPath.path}' 2>/dev/null)
log " PV status: $PV_STATUS"
if [ -n "$PV_PATH" ]; then log " Host path: $PV_PATH"; fi
fi
done
}
check_nodes(){
log_section "8. Node status check"
NODES=$(kubectl get pods -n "$NAMESPACE" -o json | jq -r ".items[] | select(.metadata.name | test(\"$POD_PATTERN\")) | .spec.nodeName" | sort -u)
if [ -z "$NODES" ]; then
log_warning "No related nodes found"
return
fi
echo "$NODES" | while read node; do
log "Node: $node"
STATUS=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
if [ "$STATUS" == "True" ]; then log_ok " Status: Ready"; else log_error " Status: Not Ready"; fi
CPU_CAP=$(kubectl get node "$node" -o jsonpath='{.status.capacity.cpu}')
MEM_CAP=$(kubectl get node "$node" -o jsonpath='{.status.capacity.memory}')
CPU_ALLOC=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.cpu}')
MEM_ALLOC=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.memory}')
log " Capacity: CPU=$CPU_CAP Memory=$MEM_CAP"
log " Allocatable: CPU=$CPU_ALLOC Memory=$MEM_ALLOC"
DISK_PRESSURE=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}')
MEM_PRESSURE=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}')
PID_PRESSURE=$(kubectl get node "$node" -o jsonpath='{.status.conditions[?(@.type=="PIDPressure")].status}')
[ "$DISK_PRESSURE" == "True" ] && log_error " Disk pressure: Yes"
[ "$MEM_PRESSURE" == "True" ] && log_error " Memory pressure: Yes"
[ "$PID_PRESSURE" == "True" ] && log_error " PID pressure: Yes"
POD_COUNT=$(kubectl get pods -n "$NAMESPACE" --field-selector spec.nodeName="$node" 2>/dev/null | grep -v NAME | wc -l)
log " Pods on node: $POD_COUNT"
done
}
check_events(){
log_section "9. Recent events (last 30)"
EVENTS=$(kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -30)
if [ -z "$EVENTS" ]; then
log "No recent events"
return
fi
WARNING_COUNT=$(echo "$EVENTS" | grep -c "Warning")  # grep -c already prints 0 on no match; "|| echo 0" would emit a second 0
ERROR_COUNT=$(echo "$EVENTS" | grep -c "Error")
log "Warning events: $WARNING_COUNT"
log "Error events: $ERROR_COUNT"
echo "$EVENTS" | awk '{print " "$0}' | tee -a "$REPORT_FILE"
log_info "Key events:"
echo "$EVENTS" | grep -E "Failed|Error|BackOff|Unhealthy|FailedMount" | awk '{print " "$0}' | tee -a "$REPORT_FILE"
}
generate_summary(){
log_section "10. Diagnostic summary"
ERROR_COUNT=$(grep -c "\[✗ ERROR\]" "$REPORT_FILE")  # grep -c already prints 0 on no match
WARNING_COUNT=$(grep -c "\[⚠ WARNING\]" "$REPORT_FILE")
log "Diagnostic completed at $(date '+%Y-%m-%d %H:%M:%S')"
log "Namespace: $NAMESPACE"
log "Pod pattern: $POD_PATTERN"
if [ $ERROR_COUNT -gt 0 ]; then log_error "Found $ERROR_COUNT errors"; fi
if [ $WARNING_COUNT -gt 0 ]; then log_warning "Found $WARNING_COUNT warnings"; fi
if [ $ERROR_COUNT -eq 0 ] && [ $WARNING_COUNT -eq 0 ]; then log_ok "No obvious issues"; fi
log "Full report saved to: $REPORT_FILE"
log_section "Problem localisation hints"
if grep -q "Endpoints: 0" "$REPORT_FILE"; then
log_error "Service has no available Endpoints"
log " → Check if Pods are Ready"
log " → Verify Service selector matches Pod labels"
fi
if grep -q "CrashLoopBackOff\|Failed" "$REPORT_FILE"; then
log_error "Pod start failure or crash detected"
log " → View pod logs: kubectl logs <pod> -n $NAMESPACE"
log " → Check image correctness"
log " → Verify container start command and env vars"
fi
if grep -q "Pending scheduling" "$REPORT_FILE"; then
log_error "Pod cannot be scheduled"
log " → Check node resource availability"
log " → Verify PV/PVC binding"
log " → Check node affinity and taint tolerations"
fi
if grep -q "Not Ready" "$REPORT_FILE"; then
log_warning "Pod or node not Ready"
log " → Check health‑check configuration"
log " → Verify application start‑up time vs initialDelaySeconds"
fi
}
main(){
echo "════════════════════════════════════════════════════════════════"
echo " Kubernetes Quick Diagnose Script v2.0"
echo "════════════════════════════════════════════════════════════════"
echo ""
if ! command -v kubectl &>/dev/null; then echo "Error: kubectl not installed"; exit 1; fi
if ! command -v jq &>/dev/null; then echo "Warning: jq not installed, some features may be limited"; fi
check_cluster_connection
check_namespace
check_pods
check_pod_logs
check_services
check_ingress
check_storage
check_nodes
check_events
generate_summary
echo ""
echo "════════════════════════════════════════════════════════════════"
echo " Diagnosis completed!"
echo "════════════════════════════════════════════════════════════════"
}
main "$@"
Usage
# Make the script executable
chmod +x k8s-quick-diagnose.sh
# Diagnose all pods in the default namespace
./k8s-quick-diagnose.sh
# Diagnose a specific namespace
./k8s-quick-diagnose.sh production
# Diagnose pods matching a regex in a namespace
./k8s-quick-diagnose.sh production "nginx.*"
# Run and stream output to a log file
./k8s-quick-diagnose.sh production | tee diagnosis.log
Strategy 2 – Efficient kubectl Command Cheatsheet
Define short aliases to speed up frequent operations:
# Add to ~/.bashrc or ~/.zshrc
alias k='kubectl'
alias kg='kubectl get'
alias kd='kubectl describe'
alias kdel='kubectl delete'
alias kl='kubectl logs'
alias kex='kubectl exec -it'
alias kgp='kubectl get pods'
alias kgn='kubectl get nodes'
alias kgs='kubectl get svc'
alias kgi='kubectl get ingress'
# Advanced aliases
alias kgpa='kubectl get pods --all-namespaces'
alias kgpw='kubectl get pods --watch'
alias kdelp='kubectl delete pod'
alias klf='kubectl logs -f'
alias keti='kubectl exec -it'
# Quick namespace switch
alias kns='kubectl config set-context --current --namespace'
# Resource sorting
alias kgpnode='kubectl get pods -o wide --sort-by=.spec.nodeName'
alias kgprestart='kubectl get pods --sort-by=.status.containerStatuses[0].restartCount'
Strategy 3 – Visual Tools to Accelerate Diagnosis
K9s – Powerful Terminal UI
K9s provides a real‑time view of cluster resources. Highlights:
Fast navigation with : commands (e.g., :pods, :svc, :deploy)
Automatic refresh and colour‑coded abnormal states
Quick actions: l for logs, d for describe, e for edit, s for shell, Ctrl‑K to delete, / to filter
Multi‑container support with easy switching
Lens – Enterprise‑grade Desktop Client
Lens offers a graphical IDE for Kubernetes with features such as multi‑cluster management, topology visualisation, built‑in Prometheus monitoring, and integrated terminal.
Strategy 4 – Log Aggregation for Rapid Root‑Cause Identification
Option 1: ELK/EFK Stack
Architecture: Pods → Fluent Bit → Elasticsearch → Kibana. Deploy Fluent Bit as a DaemonSet to ship container logs to Elasticsearch and visualise them in Kibana.
Option 2: Loki (lighter weight)
Loki integrates tightly with Grafana. Deploy via Helm:
# Helm installation
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack \
--namespace=logging \
--create-namespace \
--set grafana.enabled=true
Query examples:
# Error logs in a namespace
{namespace="production"} |= "error" | json
# Logs from a specific pod
{pod=~"nginx-.*"} |= "exception"
# Error rate over 5 minutes
rate({namespace="production"} |= "error" [5m])
Option 3: Stern – Multi‑Pod Log Aggregation
Stern streams logs from multiple pods simultaneously.
# Install Stern
brew install stern # macOS
# Or download binary for Linux
curl -LO https://github.com/stern/stern/releases/download/v1.25.0/stern_1.25.0_linux_amd64.tar.gz
# Usage example
stern nginx -n production # All nginx pods
stern nginx -n production -c nginx # Specific container
stern nginx -n production --tail=100 | grep -i error
Strategy 5 – Automated Incident Response Workflow
A decision tree classifies common failures and triggers specialised scripts.
# Failure classification decision tree
# Alert
# ├─ Pod abnormal
# │ ├─ CrashLoopBackOff → check logs → verify start command/config
# │ ├─ ImagePullBackOff → check image name/registry credentials
# │ ├─ Pending → check node resources/PV status/affinity
# │ └─ Error → inspect Events → check resource limits
# ├─ Service unreachable
# │ ├─ Check Endpoints count
# │ ├─ Verify Pod Ready status
# │ └─ Ensure selector matches pod labels
# ├─ Ingress 502/503
# │ ├─ Check Service → Endpoints → Pod health
# │ ├─ Inspect Ingress controller logs
# │ └─ Validate routing rules
# └─ Storage issues
# ├─ PVC Pending → verify StorageClass/PV availability
# ├─ Mount failure → check node permissions/path existence
# └─ Out of space → expand or clean up
Example scripts for pod crash handling ( pod-crash-handler.sh) and service troubleshooting ( service-troubleshoot.sh) are provided in the original article.
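Those scripts are not reproduced here, but a minimal handler following the decision tree above might look like the sketch below. The function name and every step are illustrative, not the original script:

```shell
# Minimal sketch of a pod-crash handler following the decision tree above.
# handle_pod_crash and its steps are illustrative, not the original script.
handle_pod_crash() {
  local pod="$1" ns="${2:-default}"
  local reason
  reason=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].state.waiting.reason}' 2>/dev/null)
  case "$reason" in
    CrashLoopBackOff)
      # The previous container's logs usually contain the crash cause
      kubectl logs "$pod" -n "$ns" --previous --tail=50 ;;
    ImagePullBackOff|ErrImagePull)
      # Check the image reference, then the pull-related events
      kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.containers[0].image}{"\n"}'
      kubectl get events -n "$ns" \
        --field-selector involvedObject.name="$pod" | tail -5 ;;
    *)
      # Fall back to describe: events, probes, resource limits
      kubectl describe pod "$pod" -n "$ns" | tail -20 ;;
  esac
}
```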
Practical Case: 30‑Minute vs 3‑Minute Real‑World Comparison
Before Optimisation (30 minutes)
At 19:05 an alert indicated high error rate on the payment‑API. The engineer manually executed a series of kubectl commands, inspected pods, services, endpoints, and logs, discovered a database connection timeout caused by a full disk, and finally cleaned up the disk. Total time: 30 minutes.
After Optimisation (3 minutes)
Using K9s the engineer instantly located the failing pod, viewed logs, switched to the related Service, saw missing endpoints, inspected the problematic database pods, entered the container shell, identified the full disk, ran an automated cleanup script, and the system recovered within 3 minutes.
Tools & Scripts Summary
A cheat‑sheet ( k8s-cheatsheet.md) consolidates the most useful kubectl one‑liners, and an installation script ( install-k8s-tools.sh) sets up K9s, Stern, kubectx/kubens, jq, and the kubectl‑tree plugin.
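As a sketch of what install-k8s-tools.sh could contain: the package names below are real, but the script itself is an assumption, since the original is not reproduced here.

```shell
# Sketch of an install-k8s-tools.sh; the original script is not reproduced,
# so treat the exact contents as illustrative.
install_k8s_tools() {
  if command -v brew >/dev/null 2>&1; then
    # macOS/Linuxbrew: k9s TUI, stern log tailer, kubectx/kubens, jq
    brew install k9s stern kubectx jq
  else
    echo "No Homebrew found; install k9s, stern, kubectx and jq from their release pages" >&2
  fi
  # kubectl-tree plugin via krew, if the krew plugin manager is present
  if kubectl krew version >/dev/null 2>&1; then
    kubectl krew install tree
  fi
}
```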
Monitoring Alert Recommendations
Prometheus rules to catch frequent restarts, pods not ready, services without endpoints, etc. Example:
groups:
- name: kubernetes-pods
rules:
- alert: PodRestartingFrequently
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} restarting frequently"
description: "Pod has restarted {{ $value }} times in the last 15 minutes"
- alert: PodNotReady
expr: kube_pod_status_ready{condition="false"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} not ready"
- name: kubernetes-services
rules:
- alert: ServiceWithoutEndpoints
expr: kube_endpoint_address_available == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Service {{ $labels.namespace }}/{{ $labels.service }} has no available Endpoints"
Best‑Practice Checklist
✓ Configure liveness and readiness probes for all containers
✓ Set sensible resource requests and limits
✓ Use structured (JSON) logs with TraceID/RequestID
✓ Define clear labels and annotations for selector matching
✓ Prefer automated diagnostic scripts over manual command sequences
✓ Keep complete diagnostic logs for post‑mortem analysis
✓ Maintain a fault‑classification SOP and run regular drills
✓ Use visual tools (K9s, Lens) for quick state overview
✓ Aggregate logs centrally (ELK, Loki, Stern)
✓ Set up multi‑layer alerts (Pod, Node, Service) with proper thresholds
Conclusion and Outlook
Reducing troubleshooting time from 30 minutes to 3 minutes is not just a speed win; it represents a systematic shift in how we approach Kubernetes incidents. By combining automation, efficient command shortcuts, visual tools, log aggregation, and standardised response procedures, we achieve a robust, repeatable workflow that dramatically cuts business impact.
Future directions include AI‑assisted root‑cause analysis, deeper service‑mesh observability (Istio/Linkerd), eBPF‑based performance tracing, and GitOps‑driven configuration management. Regardless of the tools, a solid understanding of Kubernetes internals and a disciplined process remain the foundation of effective operations.
MaGe Linux Operations