How to Detect and Resolve Slow Calls in Kubernetes: Best Practices & Real‑World Cases

This article explains why slow calls in Kubernetes can jeopardize user experience, project timelines, and system stability, outlines five common causes, introduces the golden‑signal and USE analysis framework, and walks through three practical case studies with step‑by‑step troubleshooting and remediation techniques.

Alibaba Cloud Native

Nov 30, 2021

Why Slow Calls Matter

Slow calls are a common anomaly with several costs. They raise front‑end latency, which can push users to uninstall the app. They cause missed Service Level Objectives (SLOs) and project delays. They can also set off cascading failures, when an overloaded service triggers mass retries that exhaust resources.

Common Causes of Slow Calls

High resource utilization (CPU, memory, disk, network) on the service host.

Poor code design, e.g., complex SQL queries that join many tables.

Downstream dependency latency (slow responses from databases, caches, or other services).

Architectural issues such as massive tables without sharding or missing caching for expensive operations.

Network problems, including cross‑continent latency, high packet loss, or low bandwidth.

Analysis Framework

Golden Signals (Google SRE)

Latency – request‑level response time (average, P90/P95/P99).

Traffic – request volume (QPS/TPS).

Errors – HTTP 4xx/5xx or other failure codes.

Saturation – resource pressure (CPU, memory, disk, queue length, connection count).
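As a minimal sketch of how the first three signals can be derived from raw request records, the snippet below computes average and P99 latency, QPS, and error rate for one time window. The `Request` record shape and the sample data are assumptions made for illustration; saturation is not included because it comes from resource metrics rather than request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record shape)."""
    timestamp_s: float  # arrival time in seconds
    latency_ms: float   # end-to-end response time
    status: int         # HTTP status code


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(requests, window_s):
    """Summarize latency, traffic, and errors for one time window."""
    latencies = [r.latency_ms for r in requests]
    errors = sum(1 for r in requests if r.status >= 400)
    return {
        "avg_ms": sum(latencies) / len(latencies),
        "p99_ms": percentile(latencies, 99),
        "qps": len(requests) / window_s,
        "error_rate": errors / len(requests),
    }


# 98 fast requests plus two slow failures within a 10 s window.
sample = [Request(i * 0.1, 20.0, 200) for i in range(98)]
sample += [Request(9.8, 1500.0, 500), Request(9.9, 1800.0, 503)]
summary = golden_signals(sample, window_s=10)
print(summary)
```

In this sample the average latency is 52.6 ms while P99 is 1500 ms, which is why alerts are usually defined on a high percentile rather than on the mean.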

USE Method (Utilization‑Saturation‑Errors)

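For each resource (CPU, memory, disk, network), check three things: utilization (how busy the resource is), saturation (how much work is queued waiting for it), and errors. Brendan Gregg, who devised the method, describes it as solving roughly 80 % of server performance issues with a small fraction of the effort.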
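The per-resource checklist can be sketched as a small function that flags every resource whose utilization, saturation, or error count crosses a limit. The `ResourceSample` shape, the limits (70 % utilization, saturation above 1.0), and the sample values are illustrative assumptions, not figures from the article's monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One resource reading (hypothetical shape)."""
    name: str
    utilization: float  # fraction of capacity in use, 0.0 to 1.0
    saturation: float   # queued or waiting work, e.g. run-queue length per core
    errors: int         # error events in the sampling window


def use_findings(samples, util_limit=0.7, sat_limit=1.0):
    """Return (resource, reason) pairs for every check that exceeds its limit."""
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.name, "errors"))
        if s.utilization > util_limit:
            findings.append((s.name, "utilization"))
        if s.saturation > sat_limit:
            findings.append((s.name, "saturation"))
    return findings


# Made-up readings: a saturated CPU and a network interface reporting errors.
samples = [
    ResourceSample("cpu", utilization=0.98, saturation=3.2, errors=0),
    ResourceSample("memory", utilization=0.55, saturation=0.0, errors=0),
    ResourceSample("disk", utilization=0.20, saturation=0.1, errors=0),
    ResourceSample("network", utilization=0.30, saturation=0.0, errors=12),
]
findings = use_findings(samples)
print(findings)
```

Errors are checked first for each resource because they are usually the quickest signal to interpret.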

Global Topology View

Visualize the full service graph to locate bottlenecks that span multiple components.

Best‑Practice Workflow

Define proactive alerts on critical paths (e.g., P99 latency > 1 s, CPU > 70 %).

When an alert fires, verify abnormal latency or error spikes via golden signals.

Drill down to resource‑level metrics using the USE method (CPU, memory, network, disk).

If the service itself appears healthy, trace downstream dependencies and examine the topology to pinpoint the root cause.

Apply remediation: horizontal pod autoscaling (HPA), query optimization, caching, or network tuning.
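The dependency-tracing step can be sketched as a walk over the service graph: starting from the alerting service, follow only slow dependencies and report the slow services that have no slow dependency of their own. This is a heuristic under stated assumptions, not the article's tooling. The adjacency map, the P99 values, and the names `cart-service` and `redis` are hypothetical.

```python
def likely_culprits(topology, p99_ms, entry, threshold_ms=1000):
    """Walk downstream from entry; return slow services with no slow dependency."""
    culprits, seen, stack = [], set(), [entry]
    while stack:
        svc = stack.pop()
        if svc in seen:
            continue
        seen.add(svc)
        if p99_ms.get(svc, 0) <= threshold_ms:
            continue  # service is within its latency budget: do not descend
        slow_deps = [d for d in topology.get(svc, [])
                     if p99_ms.get(d, 0) > threshold_ms]
        if slow_deps:
            stack.extend(slow_deps)  # the slowness is probably inherited
        else:
            culprits.append(svc)     # slow, and nothing below it is slow
    return culprits


# Hypothetical topology: caller -> list of downstream dependencies.
topology = {
    "gateway": ["product-service", "cart-service"],
    "product-service": ["mysql"],
    "cart-service": ["redis"],
}
# Hypothetical P99 latencies in milliseconds.
p99_ms = {
    "gateway": 1400,
    "product-service": 1350,
    "mysql": 1300,
    "cart-service": 40,
    "redis": 2,
}
print(likely_culprits(topology, p99_ms, "gateway"))  # ['mysql']
```

A service flagged this way is a starting point for the USE checks, not a confirmed root cause.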

Case Study 1 – Node CPU Saturation

In an e‑commerce Kubernetes cluster, chaosblade fault injection drove CPU usage on the gateway's node to 100 %. The gateway alert (P99 latency > 1 s) fired, and the golden‑signal view showed a sharp latency increase and thousands of slow‑call events. Pod‑level metrics showed the gateway pod near full CPU usage, and the node‑level metric confirmed 100 %.

Remediation: configure a Horizontal Pod Autoscaler (HPA) with a CPU target of 70 %, for example:

kubectl autoscale deployment gateway --cpu-percent=70 --min=3 --max=10

After HPA triggered, additional pod replicas were created, slow‑call counts dropped, and latency returned to normal.
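To see roughly how many replicas such a target produces, the sketch below applies the scaling rule from the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1 (0.1 by default), then clamped to the min/max bounds. The utilization figures are made-up inputs. The real controller also handles details omitted here, such as unready pods and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA scaling rule for a CPU utilization target."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    wanted = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, wanted))


# 3 gateway replicas averaging 98 % CPU against the 70 % target.
print(desired_replicas(3, 98, 70, min_replicas=3, max_replicas=10))  # 5
# 72 % is within the 10 % tolerance of the target, so nothing changes.
print(desired_replicas(3, 72, 70, min_replicas=3, max_replicas=10))  # 3
```

HPA measures CPU utilization relative to each pod's CPU request, so the deployment needs CPU requests set for a percentage target to work.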

Case Study 2 – Downstream Service Latency

A slow MySQL query was injected into ProductService using chaosblade. Alerts on both the gateway and ProductService (P99 > 1 s) triggered. Tracing revealed a complex multi‑table JOIN that consumed > 1 s on the MySQL side.

Resolution steps:

Analyze the SQL statement and add appropriate indexes.

Rewrite the query to reduce join complexity.

Introduce a cache layer (e.g., Redis) for frequently accessed product data.
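The caching step is commonly implemented as a cache-aside read: check the cache first, fall back to the database on a miss, then store the result with a time-to-live. The sketch below uses a small in-process `TTLCache` as a stand-in for Redis so that it runs without external services. `TTLCache`, `get_product`, and `slow_query` are hypothetical names, not part of the article's ProductService.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for a cache such as Redis."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            return None  # missing or expired
        return entry[1]

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl_s, value)


def get_product(product_id, cache, load_from_db):
    """Cache-aside read: serve from cache, query the database only on a miss."""
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    product = load_from_db(product_id)
    cache.set(product_id, product)
    return product


db_calls = []


def slow_query(product_id):
    """Placeholder for the expensive multi-table JOIN."""
    db_calls.append(product_id)
    return {"id": product_id, "name": "demo product"}


cache = TTLCache(ttl_s=60)
get_product(42, cache, slow_query)
get_product(42, cache, slow_query)
print(len(db_calls))  # 1: the second read is served from the cache
```

The TTL bounds how stale product data can get. Choosing it is a trade-off between database load and freshness.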

Case Study 3 – Network Performance Degradation

A 30 % packet loss was injected on the MySQL node with chaosblade. Alerts on the gateway and ProductService fired, showing P99 latency > 1 s. RTT metrics along the path (gateway → ProductService → MySQL) spiked, confirming a network‑level issue.

Mitigation options include improving the network path (e.g., using a higher‑performance VPC), adjusting timeout settings, or adding redundant network routes.
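A back-of-the-envelope calculation shows why 30 % loss hurts tail latency so much. Assuming each packet is lost independently (a simplification, since real loss is often bursty), the chance that a request exchanging n packets sees at least one loss is 1 − (1 − p)^n. The packet counts below are illustrative.

```python
def p_any_loss(loss_rate, packets):
    """Probability that at least one of `packets` is lost, assuming independent losses."""
    return 1.0 - (1.0 - loss_rate) ** packets


for n in (1, 5, 10):
    print(n, round(p_any_loss(0.30, n), 3))
# 1 0.3
# 5 0.832
# 10 0.972
```

Under this model nearly every multi-packet exchange with MySQL waits on at least one retransmission, which is consistent with the RTT spike seen along the whole path.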

Key Takeaways

Deploy default RED (Rate, Errors, Duration) alerts to surface anomalies early.

Combine golden signals, USE metrics, and distributed tracing to locate root causes quickly.

Leverage a topology‑aware view of the service graph for global‑scope troubleshooting and continuous system improvement.
