Mastering Kubernetes Performance Bottlenecks: The Ultimate Troubleshooting Guide

This guide covers seven key performance indicators, grouped into resource, application, and system component metrics, and gives step‑by‑step methods, advanced tips, and tool recommendations for diagnosing and resolving Kubernetes performance bottlenecks, from the cluster as a whole down to individual pods.

Ray's Galactic Tech

Core Idea

First examine resources, then applications, and finally system components. Performance bottlenecks usually result from multiple factors and must be investigated layer by layer, from the whole cluster down to individual nodes and pods.

1. Seven Key Performance Indicators

The metrics are grouped into three categories: Resource metrics, Application metrics, and System component metrics.

1.1 Resource Metrics – "Are resources sufficient?"

CPU usage vs. requests/limits

Definition: actual CPU consumption compared with the pod’s request and limit.

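Importance: Sustained usage above roughly 85% of the limit leaves little headroom and may slow the app. Once a container uses up its CPU quota for a scheduling period, the kernel throttles it until the next period, causing a sharp performance drop.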

Investigation:

Run kubectl top pods for real‑time usage.

Use Prometheus to view historical trends and correlate with request/limit.

Check container_cpu_cfs_throttled_seconds_total and container_cpu_cfs_throttled_periods_total to confirm throttling.

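Advanced tip: kubectl top reports usage averaged over a sampling window, so it can look low even while short bursts exhaust the CPU quota and are throttled. Low average usage alone does not rule out throttling; confirm with the throttling counters.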
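A minimal sketch of how the throttling counters can be turned into a ratio, assuming two scrapes of the same container taken some time apart. CfsSample and throttle_ratio are hypothetical names, and container_cpu_cfs_periods_total is an additional cAdvisor counter, not mentioned above, that supplies the denominator.

```python
from dataclasses import dataclass


@dataclass
class CfsSample:
    """One scrape of the cAdvisor CFS counters for a single container."""
    periods_total: float            # container_cpu_cfs_periods_total
    throttled_periods_total: float  # container_cpu_cfs_throttled_periods_total


def throttle_ratio(earlier: CfsSample, later: CfsSample) -> float:
    """Fraction of CFS periods between two scrapes in which the container was throttled."""
    periods = later.periods_total - earlier.periods_total
    throttled = later.throttled_periods_total - earlier.throttled_periods_total
    if periods <= 0 or throttled < 0:
        # No elapsed periods, or the counters were reset (for example after a restart).
        return 0.0
    return throttled / periods


if __name__ == "__main__":
    before = CfsSample(periods_total=1000, throttled_periods_total=50)
    after = CfsSample(periods_total=1600, throttled_periods_total=200)
    print(f"throttled in {throttle_ratio(before, after):.0%} of periods")
```

A ratio that stays well above zero while average usage looks low suggests bursty load hitting the CPU limit.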

Memory usage vs. requests/limits

Definition: memory consumption compared with the pod’s request and limit.

Importance: A container that exceeds its memory limit is terminated by the OOM killer, and memory pressure on the node can cause pods to be evicted.

Investigation:

Run kubectl top pods.

Run kubectl describe pod <pod-name> and look for a last state of Terminated with reason OOMKilled.

Advanced tip: Profile the process itself to find memory leaks:

Java: jstat, jmap, JMX Exporter.

Go: pprof.

Node.js: clinic.
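To compare usage with a request or limit, the quantity strings that kubectl prints (for example 512Mi or 250m) first have to be converted to numbers. The sketch below handles only the common suffixes; parse_memory, parse_cpu, and usage_fraction are hypothetical helper names, not part of any Kubernetes client library.

```python
# Common Kubernetes quantity suffixes; rarer ones (Pi, Ei, P, E) are omitted for brevity.
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_DECIMAL = {"k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' or '1G' to bytes."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(float(quantity))  # plain number of bytes


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def usage_fraction(usage: str, limit: str) -> float:
    """Memory usage as a fraction of the limit (1.0 means at the limit)."""
    return parse_memory(usage) / parse_memory(limit)


if __name__ == "__main__":
    print(f"{usage_fraction('900Mi', '1Gi'):.1%} of the memory limit in use")
```

A fraction that keeps climbing toward 1.0 between restarts is a hint to start profiling with the language tools listed above.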

Disk I/O

Definition: read/write rate, throughput, and latency of persistent volumes (PV).

Importance: Slow disks block processes and degrade application performance.

Investigation:

Node level: iostat -x 1 to view %util, await, and avgqu-sz (newer sysstat releases report r_await/w_await and aqu-sz instead).

PV level: check CSI plugin or storage backend metrics.

Tip: As a rule of thumb, await above 50 ms or a persistently high queue length indicates a disk bottleneck.
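For nodes where iostat is not installed, similar figures can be approximated from two readings of /proc/diskstats. This is a rough sketch assuming the standard Linux field layout; DiskSample, parse_diskstats_line, and disk_metrics are hypothetical names, and the results may differ slightly from what iostat prints.

```python
from typing import NamedTuple


class DiskSample(NamedTuple):
    ios: int          # reads + writes completed
    io_ms: int        # milliseconds spent on reads + writes
    busy_ms: int      # milliseconds the device had I/O in flight
    weighted_ms: int  # weighted milliseconds, used for the average queue length


def parse_diskstats_line(line: str) -> DiskSample:
    """Parse one device line of /proc/diskstats (major, minor, name, then counters)."""
    f = line.split()
    return DiskSample(
        ios=int(f[3]) + int(f[7]),
        io_ms=int(f[6]) + int(f[10]),
        busy_ms=int(f[12]),
        weighted_ms=int(f[13]),
    )


def disk_metrics(before: str, after: str, interval_s: float) -> dict:
    """Approximate await (ms), utilization (0..1) and queue length over an interval."""
    a, b = parse_diskstats_line(before), parse_diskstats_line(after)
    ios = b.ios - a.ios
    interval_ms = interval_s * 1000
    return {
        "await_ms": (b.io_ms - a.io_ms) / ios if ios else 0.0,
        "util": (b.busy_ms - a.busy_ms) / interval_ms,
        "queue": (b.weighted_ms - a.weighted_ms) / interval_ms,
    }
```

With these values, the 50 ms rule of thumb from the tip above becomes a simple comparison on await_ms.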

Network bandwidth & packet rate

Definition: node traffic (bytes/s) and packet rate (packets/s).

Importance: Saturated bandwidth leads to high latency and packet loss; excessive packet processing consumes CPU.

Investigation:

Node: iftop / nload.

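Prometheus metrics: node_network_receive_bytes_total and node_network_transmit_bytes_total for bandwidth; node_network_receive_packets_total and node_network_transmit_packets_total for packet rate.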

Advanced tip: Distinguish pod‑to‑pod internal traffic from egress traffic; use mtr or iperf3 for network diagnostics.
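The byte and packet metrics are monotonically increasing counters, so a rate has to be derived from two samples. The following sketch shows the arithmetic under the assumption that the link speed is known; counter_rate and link_utilization are hypothetical helpers, and in practice PromQL's rate() does the first step for you.

```python
def counter_rate(previous: float, current: float, interval_s: float) -> float:
    """Per-second rate between two samples of a monotonically increasing counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = current - previous
    if delta < 0:
        # The counter went backwards, so assume it was reset and restarted from zero.
        delta = current
    return delta / interval_s


def link_utilization(bytes_per_s: float, link_speed_bits_per_s: float) -> float:
    """Fraction of the link capacity in use (1.0 means saturated)."""
    return bytes_per_s * 8 / link_speed_bits_per_s


if __name__ == "__main__":
    rx = counter_rate(1_000_000, 126_000_000, interval_s=10)
    print(f"{rx / 1e6:.1f} MB/s, {link_utilization(rx, 1e9):.0%} of a 1 Gbit/s link")
```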

2. Application Metrics – "Is the application healthy?"

2.1 Latency and error rate

Definition: request latency (P95/P99) and HTTP 5xx error rate.

Importance: Directly impacts user experience; even with ample resources, slow queries or downstream failures cause high latency.

Investigation:

APM tools such as SkyWalking or Pinpoint.

Ingress/Istio metrics.

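Advanced tip: Apply the RED model (Rate, Errors, Duration) to monitor application health. Saturation is not part of RED; it belongs to the USE method (Utilization, Saturation, Errors) and is covered by the resource metrics above.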
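As an illustration of the definition above, P95/P99 latency and the 5xx error rate can be computed from raw request samples as follows. This sketch uses the nearest-rank percentile definition, which is one of several in common use; percentile and error_rate are hypothetical names, and APM tools typically estimate these values from histograms instead.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of responses with an HTTP 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


if __name__ == "__main__":
    latencies_ms = [12, 15, 11, 240, 14, 13, 16, 900, 12, 13]
    print("P95:", percentile(latencies_ms, 95), "ms")
    print("5xx rate:", error_rate([200, 200, 502, 200]))
```

The tail percentiles expose the slow requests that an average would hide.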

2.2 Pod restart count

Definition: historical restart count of a pod.

Importance: Frequent restarts indicate instability; possible causes include Liveness probe failures, OOMKilled, or application crashes.

Investigation:

Run kubectl get pods to view RESTARTS.

Run kubectl describe pod <pod-name> to see events.

Run kubectl logs <pod-name> --previous to view the logs of the previous container instance.
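When many pods are involved, the same information can be pulled from the output of kubectl get pods -o json. A small sketch, assuming the standard pod status fields (containerStatuses, restartCount, lastState); restart_report and the threshold of 3 are hypothetical choices for illustration.

```python
import json


def restart_report(pod_list_json: str, threshold: int = 3) -> list:
    """Return (pod, container, restarts, last termination reason), most restarts first."""
    report = []
    for pod in json.loads(pod_list_json).get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            restarts = status.get("restartCount", 0)
            if restarts < threshold:
                continue
            # lastState.terminated is only present if the container has exited before.
            terminated = status.get("lastState", {}).get("terminated") or {}
            reason = terminated.get("reason", "Unknown")
            report.append((pod_name, status["name"], restarts, reason))
    return sorted(report, key=lambda row: row[2], reverse=True)
```

It could be fed with the captured output of kubectl get pods -o json, for example via subprocess or a saved file.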

3. System Component Metrics – "Is the Kubernetes control plane healthy?"

3.1 API Server latency

Definition: request latency of the API server, especially LIST/GET operations.

Importance: The API server is the cluster brain; high latency slows pod status updates and scheduling.

Investigation:

Run kubectl get --raw "/metrics" and examine apiserver_request_duration_seconds_bucket.

Check readiness with kubectl get --raw "/readyz".

Advanced tip: High latency often correlates with etcd pressure; monitor etcd_disk_wal_fsync_duration_seconds, etcd_disk_backend_commit_duration_seconds, and etcd_network_peer_round_trip_time_seconds.
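The _bucket series are cumulative histogram buckets, so a latency percentile has to be estimated from them. The sketch below applies linear interpolation within a bucket, similar in spirit to PromQL's histogram_quantile but simplified; the function here is a standalone illustration, and the bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs, including +Inf."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if bound == math.inf:
                # The quantile falls beyond the largest finite bound; report that bound.
                return ordered[-2][0] if len(ordered) > 1 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print("estimated P95 latency (s):", histogram_quantile(0.95, sample))
```

Because of the interpolation, the result is only as precise as the bucket boundaries allow.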

4. Advanced Dimensions

Scheduling & resource allocation

Investigate pod scheduling latency and node resource fragmentation using kubectl get events --sort-by=.lastTimestamp and kubectl describe nodes.
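Resource fragmentation means the cluster has enough free capacity in total, but no single node can hold the pod. A simplified sketch of that check, assuming free capacity per node has already been computed as allocatable minus requested (both shown by kubectl describe nodes); the function names are hypothetical, and taints, affinity, and other scheduler constraints are ignored.

```python
def schedulable_nodes(pod_request: dict, nodes_free: dict) -> list:
    """Names of nodes whose free resources cover every resource the pod requests."""
    return [
        name
        for name, free in nodes_free.items()
        if all(free.get(resource, 0) >= need for resource, need in pod_request.items())
    ]


def is_fragmented(pod_request: dict, nodes_free: dict) -> bool:
    """True if the pod fits into the cluster's total free capacity but on no single node."""
    totals = {
        resource: sum(free.get(resource, 0) for free in nodes_free.values())
        for resource in pod_request
    }
    fits_in_total = all(totals[resource] >= need for resource, need in pod_request.items())
    return fits_in_total and not schedulable_nodes(pod_request, nodes_free)


if __name__ == "__main__":
    GIB = 2**30
    pod = {"cpu": 2.0, "memory": 4 * GIB}
    nodes = {
        "node-a": {"cpu": 1.5, "memory": 8 * GIB},
        "node-b": {"cpu": 1.0, "memory": 2 * GIB},
        "node-c": {"cpu": 3.0, "memory": 3 * GIB},
    }
    print("fragmented:", is_fragmented(pod, nodes))
```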

Service & network layer

Check ClusterIP/NodePort/Ingress latency and CoreDNS performance (e.g., kubectl logs -n kube-system <coredns-pod>, metric coredns_dns_request_duration_seconds).

Storage & StatefulSet

Monitor PV IOPS, throughput, latency, and attach/detach delays.

Controller & replica management

Observe ReplicaSet/Deployment scaling latency and verify HPA effectiveness with kubectl get hpa.
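To judge whether the HPA is behaving as expected, it helps to know the replica count it should arrive at. The sketch below follows the core formula from the Kubernetes documentation, desired = ceil(current replicas × current metric / target metric), with the default 10% tolerance; it deliberately ignores pod readiness, missing metrics, and stabilization windows, and the parameter names are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA calculation for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: no scaling action.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 replicas at 90% average CPU utilization with a 60% target.
    print(desired_replicas(4, current_metric=90, target_metric=60))
```

If kubectl get hpa shows a replica count that stays far from this value, the limits (min/max replicas) or missing metrics are worth checking first.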

Node health & OS

Watch CPU steal, iowait, load, and system events such as OOM, kernel panic, or network interruptions.
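CPU steal and iowait can be read directly from the aggregate cpu line of /proc/stat by comparing two readings. A minimal sketch assuming the standard Linux field order; parse_cpu_line and cpu_percentages are hypothetical names, and the guest fields are left out because they are already included in user and nice time.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:9])))


def cpu_percentages(before: str, after: str) -> dict:
    """Share of CPU time per state between two readings, in percent."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {field: b[field] - a[field] for field in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {field: 0.0 for field in FIELDS}
    return {field: 100 * ticks / total for field, ticks in deltas.items()}


if __name__ == "__main__":
    first = "cpu 1000 0 500 8000 300 0 50 150 0 0"
    second = "cpu 1400 0 700 8200 400 0 50 250 0 0"
    usage = cpu_percentages(first, second)
    print(f"steal {usage['steal']:.1f}%  iowait {usage['iowait']:.1f}%")
```

Persistently high steal points at the hypervisor or noisy neighbours rather than at the workload itself, while high iowait points back to the disk checks.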

Container runtime

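Identify slow container start‑up caused by image pull or runtime issues: list containers with crictl ps, read a container's output with crictl logs <container-id>, and check the pod events from kubectl describe pod for image pull timing.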

Application architecture & dependencies

Track external dependency latency (databases, caches, third‑party APIs), thread/connection pool exhaustion, and slow queries.

5. Troubleshooting Process (Macro → Micro)

Macro layer

Cluster health: kubectl get nodes and kubectl get --raw "/readyz?verbose" (kubectl get componentstatus is deprecated since Kubernetes v1.19).

Resource trends: Prometheus + Grafana dashboards.

Mid layer

Inspect Namespace/Deployment, pod status, restart counts, HPA/replica set status, Service/Ingress latency.

Micro layer

Pod/container details: CPU, memory, I/O, network, application logs, probe status.

Node OS metrics: CPU steal, iowait, load, network packet loss, disk queue length.

Control plane: API server latency, etcd performance, Scheduler/Controller Manager pressure.

6. Recommended Tools

Cluster monitoring – Prometheus + Grafana (CPU, memory, disk, network, pod status trends).

Log aggregation – Loki or ELK (centralized log analysis).

Tracing – SkyWalking, Jaeger, Pinpoint (request tracing, slow request identification).

Node diagnostics – htop, iostat, iftop, netstat, vmstat (node performance, network, disk analysis).

Container diagnostics – crictl, or docker logs on nodes that still run Docker Engine (runtime issues).

Network diagnostics – mtr, iperf3 (latency, packet loss).

Storage diagnostics – CSI metrics, iostat (PV I/O performance).

Control‑plane metrics – kubectl get --raw /metrics (API server metrics, including the latency of its requests to etcd).

7. Practical Experience Summary

Follow the investigation order: Resources → Applications → System components → External dependencies.

Combine metrics: pair resource metrics (CPU, memory, I/O, network) with the RED metrics for requests (rate, error rate, latency).

Use event‑driven debugging for Pod Pending, CrashLoopBackOff, OOMKilled, or probe failures.

Correlate monitoring data with logs (historical trends + real‑time logs) to pinpoint issues quickly.

Perform cross‑layer analysis covering container performance, node resources, network, storage, control plane, and application architecture.
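The event‑driven step in the list above can be captured as a simple lookup from symptom to first checks. This is only an illustrative sketch: the commands are the ones used earlier in this guide, while FIRST_CHECKS and first_checks are hypothetical names and the mapping is not exhaustive.

```python
# Symptom (pod status or termination reason) -> commands to run first.
FIRST_CHECKS = {
    "Pending": [
        "kubectl describe pod <pod-name>",
        "kubectl get events --sort-by=.lastTimestamp",
        "kubectl describe nodes",
    ],
    "CrashLoopBackOff": [
        "kubectl logs <pod-name> --previous",
        "kubectl describe pod <pod-name>",
    ],
    "OOMKilled": [
        "kubectl describe pod <pod-name>",
        "kubectl top pods",
    ],
    "ProbeFailure": [
        "kubectl describe pod <pod-name>",
        "kubectl logs <pod-name>",
    ],
}


def first_checks(symptom: str) -> list:
    """Return the suggested first commands for a symptom, or a generic fallback."""
    return FIRST_CHECKS.get(symptom, ["kubectl describe pod <pod-name>"])


if __name__ == "__main__":
    for command in first_checks("CrashLoopBackOff"):
        print(command)
```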

In summary, this guide combines the seven key indicators, advanced troubleshooting techniques for CPU, memory, I/O, network, application, and control plane, the higher‑level dimensions (scheduling, Service, storage, controllers, nodes, containers, architecture), a macro‑to‑micro investigation workflow, tool recommendations, and practical experience into one systematic approach to Kubernetes performance troubleshooting.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
