Why Kubernetes Requests Slowed: Tracing Nginx, Ingress, and Kernel Delays
This article walks through a detailed investigation of a slow‑response issue in a Kubernetes‑deployed application, covering network topology, full‑link latency statistics, packet captures, kernel‑level bottlenecks, and the final remediation steps to restore performance.
1 Specific Symptom
An application running in the online environment started exhibiting extremely slow API responses.
Many possible causes exist; this post documents several investigations and the final root‑cause determination, offering a reference for similar cluster issues.
2 Network Topology
When a request enters the cluster, the flow is:
user request => Nginx => Ingress => uwsgi
Both Nginx and Ingress are kept for historical reasons.
3 Initial Diagnosis
To rule out application‑level slowness, a simple fast endpoint was added to uwsgi and periodically called. After a few days the endpoint itself became slow, indicating the problem lies in the network path.
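A probe of this kind can be sketched as follows. This is a minimal illustration, not the author's actual checker: the function names and the 1-second threshold are assumptions chosen for the example.

```python
import time
import urllib.request


def timed_call(func):
    """Run func() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


def probe(url, timeout=10.0):
    """Fetch url once and return the elapsed time in seconds."""
    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status

    _, elapsed = timed_call(fetch)
    return elapsed


def slow_fraction(latencies, threshold=1.0):
    """Fraction of samples at or above threshold seconds (assumed cutoff)."""
    if not latencies:
        return 0.0
    return sum(1 for t in latencies if t >= threshold) / len(latencies)
```

Calling probe() on the ping endpoint at a fixed interval and feeding the collected samples to slow_fraction() gives a rough view of how often a trivially cheap request is slow.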
4 Second Pass: Simple Full‑Link Statistics
Because there are two Nginx layers, each was examined separately using ELK logs. The following Elasticsearch query was used to collect relevant records:
{
  "bool": {
    "must": [
      {"match_all": {}},
      {"match_phrase": {"app_name": {"query": "xxxx"}}},
      {"match_phrase": {"path": {"query": "/app/v1/user/ping"}}},
      {"range": {"request_time": {"gte": 1, "lt": 10}}},
      {"range": {"@timestamp": {"gt": "2020-11-09 00:00:00", "lte": "2020-11-12 00:00:00", "format": "yyyy-MM-dd HH:mm:ss", "time_zone": "+08:00"}}}
    ]
  }
}
Data were aggregated by trace_id and split into three metrics: request_time(nginx), request_time(ingress), and their difference.
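The aggregation step could look roughly like the sketch below. It assumes each exported log record carries a trace_id, a request_time, and a 'layer' field saying which proxy produced it; the 'layer' field is hypothetical and stands in for whatever distinguishes the two log sources in practice.

```python
from collections import defaultdict


def join_by_trace(records):
    """Pair Nginx and Ingress request_time values that share a trace_id.

    Each record is a dict with 'trace_id', 'layer' ('nginx' or 'ingress')
    and 'request_time' in seconds. 'layer' is an assumed field name.
    """
    by_trace = defaultdict(dict)
    for rec in records:
        by_trace[rec["trace_id"]][rec["layer"]] = float(rec["request_time"])

    rows = []
    for trace_id, times in by_trace.items():
        # Skip traces that were logged by only one of the two layers.
        if "nginx" in times and "ingress" in times:
            rows.append({
                "trace_id": trace_id,
                "nginx": times["nginx"],
                "ingress": times["ingress"],
                "diff": round(times["nginx"] - times["ingress"], 3),
            })
    return rows
```

The 'diff' column approximates the time spent between the two proxies, which is the third metric described above.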
Statistical results (≈3000 records) are shown in three figures (images not reproduced here):
Figure 1: NGINX response time
Figure 2: Ingress response time
Figure 3: NGINX‑Ingress response time difference
Result Analysis
Figure 1: Over half of the requests fall into the 1‑2 s range; longer latencies become rarer.
Figure 2: About a quarter of requests return within 0.1 s, another quarter take 1‑1.1 s, with the rest following a similar pattern.
Combining Figures 1 and 2 suggests that a portion of the delay originates on the Ingress side.
Figure 3: Roughly two‑thirds of requests have consistent response times, while one‑third experience an additional ~1 s delay.
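Distributions like the ones described for these figures can be produced by bucketing the latencies. The following is a small sketch under the assumption that the values are available as a list of seconds; the 100 ms bucket width is an arbitrary choice for illustration.

```python
from collections import Counter


def bucket_counts(values, width_ms=100):
    """Count latency values (seconds) per bucket of width_ms milliseconds.

    Keys are the lower bound of each bucket in seconds. Working in whole
    milliseconds avoids floating-point surprises at bucket edges.
    """
    counts = Counter()
    for v in values:
        ms = round(v * 1000)
        lower_ms = (ms // width_ms) * width_ms
        counts[lower_ms / 1000] += 1
    return dict(sorted(counts.items()))
```

A second peak near the 1.0 bucket in the difference metric is the kind of pattern that points to a fixed extra delay rather than general slowness.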
5 Deeper Investigation – Packet Capture
Packet captures focused on the Ingress → uwsgi path. An example log entry:
{
  "_source": {
    "INDEX": "51",
    "path": "/app/v1/media/",
    "user_agent": "okhttp/4.8.1",
    "upstream_connect_time": "1.288",
    "upstream_response_time": "1.400",
    "request": "POST /app/v1/media/ HTTP/1.0",
    "status": "200",
    "request_time": "1.403",
    "trace_id": "87bad3cf9d184df0:87bad3cf9d184df0:0:1"
  }
}
Figure: Ingress packet capture (image not reproduced)
Figure: uwsgi packet capture (image not reproduced)
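To see how much of that request was spent establishing the upstream connection, the fields of the log entry can be compared directly. This sketch assumes a single upstream attempt, so that upstream_connect_time holds one number; the helper name is illustrative.

```python
def connect_share(entry):
    """Return upstream_connect_time as a fraction of request_time."""
    connect = float(entry["upstream_connect_time"])
    total = float(entry["request_time"])
    if total <= 0:
        return 0.0
    return connect / total


# Values copied from the example log entry above.
sample = {
    "upstream_connect_time": "1.288",
    "upstream_response_time": "1.400",
    "request_time": "1.403",
}
share = connect_share(sample)
print(f"{share:.1%} of request_time was spent connecting")  # prints 91.8% ...
```

For this entry, nearly all of the 1.403 s was spent before the connection to uwsgi was established, and the request itself took only about 0.1 s after that.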
Packet Flow
Reviewing the TCP three‑way handshake revealed retransmissions but no actual packet loss: the original packets did arrive, only late, so most of the latency came from delayed delivery rather than from lost packets.
Not Only SYN‑ACK Delays
Random captures showed retransmissions of SYN‑ACK and FIN‑ACK packets, indicating probabilistic packet‑delay behavior.
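Spotting such repeats in a capture can be automated once the packets are parsed. The sketch below works on already-parsed records and uses a hypothetical record layout (ts, src, dst, flags, seq), not the output format of any particular capture tool. It only considers segments carrying SYN or FIN, since a repeated sequence number on those is a clear sign of retransmission.

```python
def find_retransmissions(packets):
    """Return (packet, gap_seconds) for each repeated SYN/FIN segment.

    Each packet is a dict with 'ts' (seconds), 'src' and 'dst'
    ("ip:port" strings), 'flags' (e.g. "S", "SA", "FA", "A") and 'seq'.
    A packet is a repeat when the same (src, dst, flags, seq) was seen
    earlier; gap_seconds is the time since the first sighting.
    """
    first_seen = {}
    repeats = []
    for pkt in sorted(packets, key=lambda p: p["ts"]):
        if "S" not in pkt["flags"] and "F" not in pkt["flags"]:
            continue  # ignore pure ACK and data segments in this sketch
        key = (pkt["src"], pkt["dst"], pkt["flags"], pkt["seq"])
        if key in first_seen:
            repeats.append((pkt, round(pkt["ts"] - first_seen[key], 3)))
        else:
            first_seen[key] = pkt["ts"]
    return repeats
```

The gap values show how long the sender waited before retransmitting, which can then be compared with the extra delay seen in the proxy logs.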
Summary of Packet Findings
Packet delays, especially during TCP connection establishment, correlate with increased upstream_connect_time values observed in Nginx logs.
Initial hypothesis: the extra time is spent in TCP handshake due to short‑lived connections; adding the $upstream_connect_time variable to metrics can help quantify this.
Further Work
Based on the hypothesis, the author modified wrk to record connection times (see PR https://github.com/wg/wrk/pull/447) to monitor backend service health.
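The same idea can be tried without wrk. The snippet below is not the wrk patch; it is a minimal sketch that times only the TCP connection setup to a backend, using the standard socket module.

```python
import socket
import time


def connect_time(host, port, timeout=5.0):
    """Return seconds spent establishing a TCP connection to host:port."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed
```

Sampling connect_time() against a backend pod at intervals gives a connection-setup metric that is independent of how long the application takes to answer.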
6 Consulting Experts
After several dead‑ends, the author consulted senior K8s engineers who suggested checking host‑side latency and periodic kernel‑intensive tasks such as cgroup statistics, which can degrade network performance.
High host latency may be caused by frequent kernel‑level operations (e.g., cgroup reads) that increase CPU load and delay TCP handshakes.
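One quick way to check whether such reads are slow on a given host is to time them. This is a rough sketch; the source does not say which files were involved, so any cgroup path passed to it is an assumption.

```python
import time


def time_read(path, repeats=5):
    """Read path several times and return the slowest read in seconds."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as fh:
            fh.read()
        worst = max(worst, time.perf_counter() - start)
    return worst
```

On a cgroup v1 host, something like time_read("/sys/fs/cgroup/memory/memory.stat") would be a natural first candidate. A read that takes a noticeable fraction of a second suggests the kernel is doing heavy work to assemble the statistics.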
A kernel tracing tool (trace‑irqoff) was used to identify long IRQ‑off periods.
Flame graphs from kubelet tracing confirmed that a large portion of time was spent reading kernel information.
Corresponding code snippets were captured (image omitted for brevity).
Final Diagnosis
Excessive periodic tasks on the host caused kernel cache buildup, slowing kernel operations, extending TCP handshake times, and degrading user experience. The remediation was to increase task intervals and clear caches:
sync && echo 3 > /proc/sys/vm/drop_caches
7 Overall Summary
The investigation started from an application‑level latency symptom, progressed through network‑layer tracing, custom metric collection, packet analysis, and kernel‑level profiling, ultimately pinpointing host‑side kernel slowdown as the root cause. The author hopes this walkthrough helps others troubleshoot similar cluster issues.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
