Mastering Kubernetes Performance Bottlenecks: The Ultimate Troubleshooting Guide
This guide covers seven key performance indicators in three categories (resource, application, and system component) and provides step-by-step methods, advanced tips, and tool recommendations for diagnosing and resolving Kubernetes performance bottlenecks, working from the cluster level down to individual pods.
Core Idea
First examine resources, then applications, and finally system components. Performance bottlenecks usually result from multiple factors and must be investigated layer by layer, from the whole cluster down to individual nodes and pods.
1. Seven Key Performance Indicators
The metrics are grouped into three categories: resource metrics, application metrics, and system component metrics.
1.1 Resource Metrics – "Are resources sufficient?"
CPU usage vs. requests/limits
Definition: actual CPU consumption compared with the pod’s request and limit.
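Importance: sustained usage above roughly 85% of the limit may slow the application; once a container uses up its CPU quota for a scheduling period it is throttled until the next period, which can cause a sharp performance drop.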
Investigation:
Run kubectl top pods for real‑time usage.
Use Prometheus to view historical trends and correlate with request/limit.
Check container_cpu_cfs_throttled_seconds_total and container_cpu_cfs_throttled_periods_total to confirm throttling.
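Advanced tip: kubectl top reports usage averaged over a sampling window, so a container with short CPU bursts can be throttled even when its reported usage looks low. Low usage alone can give a false impression of health; always check the throttling counters as well.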
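To make the throttling check concrete, here is a minimal sketch that turns two readings of the throttling counters into a ratio. It assumes container_cpu_cfs_periods_total (the cAdvisor counter of elapsed CFS periods, not named above) as the denominator; the sample values are hypothetical.

```python
def throttle_ratio(periods_start, periods_end, throttled_start, throttled_end):
    """Fraction of CFS periods in which the container was throttled."""
    elapsed_periods = periods_end - periods_start
    if elapsed_periods <= 0:
        # No periods elapsed (or the counter reset): nothing to report.
        return 0.0
    throttled = max(throttled_end - throttled_start, 0)
    return throttled / elapsed_periods


# Hypothetical counter readings taken one minute apart for one container.
ratio = throttle_ratio(periods_start=12_000, periods_end=12_600,
                       throttled_start=300, throttled_end=450)
print(f"throttled in {ratio:.0%} of CFS periods")
```

A ratio that stays well above zero while kubectl top shows modest usage suggests bursty load hitting the limit.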
Memory usage vs. requests/limits
Definition: memory consumption compared with the pod’s request and limit.
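Importance: A container that exceeds its memory limit is terminated by the kernel OOM killer (reported as OOMKilled); memory pressure at the node level can also lead to pod eviction.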
Investigation:
Run kubectl top pods.
Run kubectl describe pod <pod-name> and look for a last state of Terminated with reason OOMKilled.
Advanced tip: Monitor in‑process memory leaks.
Java: jstat, jmap, JMX Exporter.
Go: pprof.
Node.js: clinic.
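Comparing usage against a limit means converting quantity strings such as 450Mi into bytes first. The following is a rough sketch of that conversion; the helper names are hypothetical and only a subset of the Kubernetes quantity suffixes is handled.

```python
# Binary and decimal suffixes (a subset of what Kubernetes quantities accept).
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    # Try two-letter suffixes before one-letter ones.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(float(quantity))  # plain byte count


def memory_fraction(usage: str, limit: str) -> float:
    """Return usage as a fraction of the limit."""
    return parse_memory(usage) / parse_memory(limit)


# Hypothetical values as they might appear in kubectl top and a pod spec.
print(f"{memory_fraction('450Mi', '512Mi'):.0%} of the limit in use")
```

A fraction that climbs steadily toward 1.0 across restarts is a common sign of a leak rather than a one-off spike.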
Disk I/O
Definition: read/write rate, throughput, and latency of persistent volumes (PV).
Importance: Slow disks block processes and degrade application performance.
Investigation:
Node level: iostat -x 1 to view %util, await, and avgqu-sz (newer sysstat releases report r_await/w_await and aqu-sz instead).
PV level: check CSI plugin or storage backend metrics.
Tip: await >50 ms or high queue length indicates a disk bottleneck.
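The rule of thumb above can be written as a small check over values read from iostat. This is only a sketch: the 50 ms await limit comes from the tip, while the queue-length and %util limits are assumed defaults chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    device: str
    util_pct: float   # %util column
    await_ms: float   # await column, in milliseconds
    queue_len: float  # avgqu-sz / aqu-sz column


def disk_findings(sample, await_limit_ms=50.0, queue_limit=4.0, util_limit_pct=90.0):
    """Return human-readable findings for one device; empty list if none."""
    findings = []
    if sample.await_ms > await_limit_ms:
        findings.append(f"{sample.device}: await {sample.await_ms} ms exceeds {await_limit_ms} ms")
    if sample.queue_len > queue_limit:  # assumed threshold
        findings.append(f"{sample.device}: queue length {sample.queue_len} exceeds {queue_limit}")
    if sample.util_pct > util_limit_pct:  # assumed threshold
        findings.append(f"{sample.device}: utilization {sample.util_pct}% exceeds {util_limit_pct}%")
    return findings


# Hypothetical reading for a busy device.
for line in disk_findings(DiskSample("sda", util_pct=97.0, await_ms=82.5, queue_len=6.1)):
    print(line)
```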
Network bandwidth & packet rate
Definition: node traffic (bytes/s) and packet rate (packets/s).
Importance: Saturated bandwidth leads to high latency and packet loss; excessive packet processing consumes CPU.
Investigation:
Node: iftop / nload.
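Prometheus metrics: node_network_receive_bytes_total and node_network_transmit_bytes_total for byte rate, node_network_receive_packets_total for packet rate.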
Advanced tip: Distinguish pod‑to‑pod internal traffic from egress traffic; use mtr or iperf3 for network diagnostics.
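Counters such as node_network_receive_bytes_total only ever increase, so bandwidth has to be derived as a rate between samples. Below is a minimal sketch of that calculation, loosely in the spirit of a Prometheus rate; the sample values and the 1 Gbit/s link speed are assumptions.

```python
def counter_rate(samples):
    """Average per-second rate over (timestamp_seconds, counter_value) samples.

    A decrease between two samples is treated as a counter reset, and the
    new value is counted as the increase since the reset.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


def link_utilization(bytes_per_second, link_bits_per_second):
    """Fraction of the link capacity in use."""
    return bytes_per_second * 8 / link_bits_per_second


# Hypothetical receive-counter samples taken 30 seconds apart.
rx = [(0, 1_000_000_000), (30, 2_500_000_000), (60, 4_000_000_000)]
rate = counter_rate(rx)
print(f"{rate / 1e6:.0f} MB/s, {link_utilization(rate, 1e9):.0%} of a 1 Gbit/s link")
```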
2. Application Metrics – "Is the application healthy?"
2.1 Latency and error rate
Definition: request latency (P95/P99) and HTTP 5xx error rate.
Importance: Directly impacts user experience; even with ample resources, slow queries or downstream failures cause high latency.
Investigation:
APM tools such as SkyWalking or Pinpoint.
Ingress/Istio metrics.
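Advanced tip: Apply the RED model (Rate, Errors, Duration) to monitor application health; saturation belongs to the complementary USE method (Utilization, Saturation, Errors), which is aimed at resources.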
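For readers who want to see what P95/P99 and the 5xx error rate mean in practice, here is a small sketch computed from raw request samples. It uses the nearest-rank percentile definition, which is one of several in use; APM tools may compute percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Share of responses with an HTTP 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if 500 <= code <= 599) / len(status_codes)


# Hypothetical latencies in milliseconds and response codes.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 95), percentile(latencies_ms, 99))
print(error_rate([200, 200, 500, 503]))
```

Averages hide tail latency, which is why the percentile values are the ones worth alerting on.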
2.2 Pod restart count
Definition: historical restart count of a pod.
Importance: Frequent restarts indicate instability; possible causes include Liveness probe failures, OOMKilled, or application crashes.
Investigation:
Run kubectl get pods to view the RESTARTS column.
Run kubectl describe pod <pod-name> to see events.
Run kubectl logs <pod-name> --previous to view the logs of the previous container instance.
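When many pods are involved, the same information can be pulled from the JSON form of the pod list (for example the output of kubectl get pods -o json). The sketch below is one possible way to do that; the threshold of 3 restarts and the sample pod names are assumptions.

```python
import json


def restart_report(pod_list_json, threshold=3):
    """Return (pod, container, restarts, last_reason) rows, noisiest first."""
    report = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        # Pending pods may not have containerStatuses yet.
        for status in pod.get("status", {}).get("containerStatuses", []):
            restarts = status.get("restartCount", 0)
            if restarts < threshold:
                continue
            terminated = status.get("lastState", {}).get("terminated", {})
            report.append((pod_name, status["name"], restarts,
                           terminated.get("reason", "unknown")))
    return sorted(report, key=lambda row: row[2], reverse=True)


# Hypothetical, heavily trimmed pod list.
SAMPLE = json.dumps({"items": [
    {"metadata": {"name": "api-7d9f"},
     "status": {"containerStatuses": [
         {"name": "api", "restartCount": 7,
          "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
    {"metadata": {"name": "web-5c2a"},
     "status": {"containerStatuses": [
         {"name": "web", "restartCount": 0, "lastState": {}}]}},
]})

for row in restart_report(SAMPLE):
    print(row)
```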
3. System Component Metrics – "Is the Kubernetes control plane healthy?"
3.1 API Server latency
Definition: request latency of the API server, especially LIST/GET operations.
Importance: The API server is the cluster brain; high latency slows pod status updates and scheduling.
Investigation:
Run kubectl get --raw "/metrics" and examine apiserver_request_duration_seconds_bucket.
Check readiness with kubectl get --raw "/readyz".
Advanced tip: High latency often correlates with etcd pressure; monitor etcd_disk_wal_fsync_duration_seconds, etcd_disk_backend_commit_duration_seconds, and etcd_network_peer_round_trip_time_seconds.
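The _bucket series of a histogram such as apiserver_request_duration_seconds holds cumulative counts per upper bound, so a latency percentile has to be estimated from them. Here is a minimal sketch of that estimate using linear interpolation inside the matching bucket, similar in spirit to what Prometheus does with histogram_quantile; the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have an upper bound of +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    target = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= target:
            if math.isinf(upper_bound):
                # Quantile lies beyond the largest finite bound.
                return lower_bound
            # Interpolate linearly within this bucket.
            share = (target - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * share
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative bucket counts (upper bound in seconds).
buckets = [(0.1, 800), (0.5, 950), (1.0, 990), (5.0, 999), (math.inf, 1000)]
print(f"P90 is about {histogram_quantile(0.9, buckets):.3f} s")
```

The estimate is only as fine-grained as the bucket boundaries, so treat it as an approximation rather than an exact latency.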
4. Advanced Dimensions
Scheduling & resource allocation
Investigate pod scheduling latency and node resource fragmentation using kubectl get events --sort-by=.lastTimestamp and kubectl describe nodes.
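Resource fragmentation means the cluster has enough free capacity in total, yet no single node can host the pod. A small sketch of that check follows, with hypothetical node names and free-capacity figures that would in practice come from the allocatable and allocated values shown by kubectl describe nodes.

```python
def schedulable_nodes(pod_request, nodes):
    """Names of nodes whose free CPU (millicores) and memory (MiB) both fit the pod."""
    cpu, mem = pod_request
    return [name for name, (free_cpu, free_mem) in nodes.items()
            if free_cpu >= cpu and free_mem >= mem]


def is_fragmented(pod_request, nodes):
    """True when total free capacity would fit the pod but no single node does."""
    cpu, mem = pod_request
    total_cpu = sum(free_cpu for free_cpu, _ in nodes.values())
    total_mem = sum(free_mem for _, free_mem in nodes.values())
    fits_in_total = total_cpu >= cpu and total_mem >= mem
    return fits_in_total and not schedulable_nodes(pod_request, nodes)


# Hypothetical free capacity per node: (millicores, MiB).
free = {"node-a": (900, 4096), "node-b": (700, 1024), "node-c": (600, 2048)}
print(is_fragmented((1000, 2048), free))
```

This ignores taints, affinity rules, and other scheduler constraints, so it covers only the raw capacity side of a Pending pod.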
Service & network layer
Check ClusterIP/NodePort/Ingress latency and CoreDNS performance (e.g., kubectl logs -n kube-system <coredns-pod>, metric coredns_dns_request_duration_seconds).
Storage & StatefulSet
Monitor PV IOPS, throughput, latency, and attach/detach delays.
Controller & replica management
Observe ReplicaSet/Deployment scaling latency and verify HPA effectiveness with kubectl get hpa.
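To judge whether the HPA is behaving as expected, it helps to recompute the replica count it should arrive at. The sketch below follows the scaling formula described in the Kubernetes documentation (ceil of current replicas times the metric ratio, with a default tolerance of 0.1); it leaves out stabilization windows and scaling policies, and the bounds used are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Replica count the HPA formula would propose for one metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical case: 4 replicas at 90% average CPU against a 60% target.
print(desired_replicas(4, current_metric=90, target_metric=60))
```

If kubectl get hpa shows the replica count pinned at the maximum while the metric stays above target, the bound rather than the formula is the limiting factor.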
Node health & OS
Watch CPU steal, iowait, load, and system events such as OOM, kernel panic, or network interruptions.
Container runtime
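Identify slow container start-up caused by image pulls or runtime issues; list containers with crictl ps, read container logs with crictl logs <container-id>, and check the runtime's own service logs on the node (for example journalctl -u containerd, assuming a containerd-based node).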
Application architecture & dependencies
Track external dependency latency (databases, caches, third‑party APIs), thread/connection pool exhaustion, and slow queries.
5. Troubleshooting Process (Macro → Micro)
Macro layer
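Cluster health: kubectl get nodes, kubectl get componentstatus (deprecated since Kubernetes v1.19; on newer clusters use kubectl get --raw "/readyz?verbose" instead).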
Resource trends: Prometheus + Grafana dashboards.
Mid layer
Inspect Namespace/Deployment, pod status, restart counts, HPA/replica set status, Service/Ingress latency.
Micro layer
Pod/container details: CPU, memory, I/O, network, application logs, probe status.
Node OS metrics: CPU steal, iowait, load, network packet loss, disk queue length.
Control plane: API server latency, etcd performance, Scheduler/Controller Manager pressure.
6. Recommended Tools
Cluster monitoring – Prometheus + Grafana (CPU, memory, disk, network, pod status trends).
Log aggregation – Loki or ELK (centralized log analysis).
Tracing – SkyWalking, Jaeger, Pinpoint (request tracing, slow request identification).
Node diagnostics – htop, iostat, iftop, netstat, vmstat (node performance, network, disk analysis).
Container diagnostics – crictl, docker logs (runtime issues).
Network diagnostics – mtr, iperf3 (latency, packet loss).
Storage diagnostics – CSI metrics, iostat (PV I/O performance).
Control-plane metrics – kubectl get --raw /metrics (API server metrics, including its view of etcd request latency; etcd's own metrics are served by the etcd members).
7. Practical Experience Summary
Follow the investigation order: Resources → Applications → System components → External dependencies.
Combine metrics: apply the RED model to request rate, error rate, and latency, and read the results alongside resource metrics (CPU, memory, I/O, network).
Use event‑driven debugging for Pod Pending, CrashLoopBackOff, OOMKilled, or probe failures.
Correlate monitoring data with logs (historical trends + real‑time logs) to pinpoint issues quickly.
Perform cross‑layer analysis covering container performance, node resources, network, storage, control plane, and application architecture.
This document consolidates the original seven key indicators, advanced troubleshooting techniques for CPU/memory/I/O/network/application/control‑plane, higher‑level dimensions (scheduling, Service, storage, controllers, nodes, containers, architecture), a macro‑to‑micro investigation workflow, tool recommendations, and hands‑on experience into a complete, systematic Kubernetes performance troubleshooting guide.
Ray's Galactic Tech
