Why Kubernetes Requests Slowed: Tracing Nginx, Ingress, and Kernel Delays

This article walks through a detailed investigation of a slow‑response issue in a Kubernetes‑deployed application, covering network topology, full‑link latency statistics, packet captures, kernel‑level bottlenecks, and the final remediation steps to restore performance.

Open Source Linux

1 Specific Symptom

An application running in the online environment started exhibiting extremely slow API responses.

Many possible causes exist; this post documents several investigations and the final root‑cause determination, offering a reference for similar cluster issues.

2 Network Topology

When a request enters the cluster, it follows this path:

user request => Nginx => Ingress => uwsgi

Both Nginx and Ingress are kept for historical reasons.

3 Initial Localization

To rule out application‑level slowness, a simple fast endpoint was added to uwsgi and periodically called. After a few days the endpoint itself became slow, indicating the problem lies in the network path.
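The periodic check described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual probe: the function names probe and slow_calls, the 1 s threshold, and the way the endpoint is called are all assumptions.

```python
import time
from typing import Callable, List


def probe(call: Callable[[], None], count: int, interval_s: float = 0.0) -> List[float]:
    """Invoke `call` `count` times and return each call's latency in seconds."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        call()  # e.g. an HTTP GET against the fast endpoint
        latencies.append(time.perf_counter() - start)
        time.sleep(interval_s)  # pause between probes
    return latencies


def slow_calls(latencies: List[float], threshold_s: float = 1.0) -> List[float]:
    """Return only the latencies at or above the (assumed) slowness threshold."""
    return [t for t in latencies if t >= threshold_s]
```

In practice `call` could be something like `lambda: urllib.request.urlopen(url, timeout=10).read()`, where `url` points at the fast endpoint. If an endpoint that does almost no work still shows up in slow_calls, the time is being spent outside the application code.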

4 Second Localization: Simple Full‑Link Statistics

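Because the path contains two Nginx layers (the front Nginx and the Ingress), each was examined separately using ELK logs. The following Elasticsearch query selected requests to the ping endpoint that took between 1 s and 10 s during the observation window: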

{
  "bool": {
    "must": [
      {"match_all": {}},
      {"match_phrase": {"app_name": {"query": "xxxx"}}},
      {"match_phrase": {"path": {"query": "/app/v1/user/ping"}}},
      {"range": {"request_time": {"gte": 1, "lt": 10}}},
      {"range": {"@timestamp": {"gt": "2020-11-09 00:00:00", "lte": "2020-11-12 00:00:00", "format": "yyyy-MM-dd HH:mm:ss", "time_zone": "+08:00"}}}
    ]
  }
}

Data were aggregated by trace_id and split into three metrics: request_time(nginx), request_time(ingress), and their difference.
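A rough sketch of that aggregation step is shown below. It assumes each log record has already been reduced to a dict with trace_id, request_time, and a hypothetical layer field ("nginx" or "ingress") marking which proxy produced it; the real ELK documents may be organized differently.

```python
from typing import Dict, Iterable, NamedTuple


class Pair(NamedTuple):
    nginx: float    # request_time seen by the front Nginx
    ingress: float  # request_time seen by the Ingress
    diff: float     # nginx - ingress: time spent between the two layers


def join_by_trace_id(records: Iterable[dict]) -> Dict[str, Pair]:
    """Pair up Nginx and Ingress records that share a trace_id."""
    by_trace: Dict[str, Dict[str, float]] = {}
    for rec in records:
        layers = by_trace.setdefault(rec["trace_id"], {})
        layers[rec["layer"]] = float(rec["request_time"])

    pairs = {}
    for trace_id, layers in by_trace.items():
        # Keep only traces that were logged by both layers.
        if "nginx" in layers and "ingress" in layers:
            n, i = layers["nginx"], layers["ingress"]
            pairs[trace_id] = Pair(n, i, round(n - i, 6))
    return pairs
```

Traces that appear in only one layer are dropped, so the three metrics are always computed over the same set of requests.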

Statistical results (≈3000 records) showed:

[Figure 1: NGINX response time]

[Figure 2: Ingress response time]

[Figure 3: NGINX‑Ingress response time (the difference between the two)]

Result Analysis

Figure 1: Over half of the requests fall into the 1‑2 s range; longer latencies become rarer.
Figure 2: About a quarter of requests return within 0.1 s, another quarter take 1‑1.1 s, with the rest following a similar pattern.

Combining Figures 1 and 2 suggests that a portion of the delay originates on the Ingress side.

Figure 3: Roughly two‑thirds of requests have consistent response times, while one‑third experience an additional ~1 s delay.
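Distributions like the ones described in the three figures can be reproduced by bucketing latencies into fixed-width bins. The following is a minimal sketch under assumptions: the 0.1 s bin width is chosen to match the ranges quoted above, and the function name histogram is illustrative.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def histogram(latencies: Iterable[float], bin_width: float = 0.1) -> Dict[float, int]:
    """Count latencies per bin; keys are the lower edge of each bin in seconds."""
    counts: Counter = Counter()
    for t in latencies:
        # The small epsilon guards against float error,
        # e.g. 0.3 / 0.1 evaluating to 2.9999999999999996.
        counts[math.floor(t / bin_width + 1e-9)] += 1
    return {round(i * bin_width, 6): n for i, n in sorted(counts.items())}
```

With bins this narrow, a cluster of requests sitting almost exactly 1 s above the fast group stands out as a separate peak instead of being smeared into an average.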

5 Deeper Investigation – Packet Capture

Packet captures focused on the Ingress → uwsgi path. An example log entry, in which upstream_connect_time (1.288 s) accounts for most of request_time (1.403 s):

{
  "_source": {
    "INDEX": "51",
    "path": "/app/v1/media/",
    "user_agent": "okhttp/4.8.1",
    "upstream_connect_time": "1.288",
    "upstream_response_time": "1.400",
    "request": "POST /app/v1/media/ HTTP/1.0",
    "status": "200",
    "request_time": "1.403",
    "trace_id": "87bad3cf9d184df0:87bad3cf9d184df0:0:1"
  }
}
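To quantify how much of a request was spent establishing the upstream connection, the two fields can be compared directly. This is a sketch, assuming the log fields are strings as in the entry above; the helper names are illustrative. Nginx may log several comma- or colon-separated values (or "-") in upstream_connect_time when more than one upstream attempt was made, so the parser sums whatever numeric values are present.

```python
import re


def total_seconds(field: str) -> float:
    """Sum the numeric values in an Nginx timing field such as "0.001, 1.288"."""
    parts = [p.strip() for p in re.split(r"[,:]", field)]
    return sum(float(p) for p in parts if p and p != "-")


def connect_share(source: dict) -> float:
    """Fraction of request_time that was spent in upstream_connect_time."""
    total = total_seconds(source["request_time"])
    if total <= 0:
        return 0.0
    return total_seconds(source["upstream_connect_time"]) / total


entry = {
    "upstream_connect_time": "1.288",
    "upstream_response_time": "1.400",
    "request_time": "1.403",
}
share = connect_share(entry)
```

For the entry above the share comes out at roughly 0.92, meaning that nearly all of the 1.4 s was spent before uwsgi had even accepted the connection.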

[Figure: packet capture on the Ingress side]

[Figure: packet capture on the uwsgi side]

Packet Flow

Reviewing the TCP three‑way handshake in the captures revealed retransmissions but no actual packet loss: the original packets did arrive, only late. Most of the latency therefore came from delayed delivery rather than from dropped packets.

Not Only SYN‑ACK Delays

Random captures showed retransmissions of SYN‑ACK and FIN‑ACK packets, indicating probabilistic packet‑delay behavior.
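Spotting such retransmissions in a capture amounts to finding connection-control segments that appear more than once with the same sequence number. The sketch below works on packets already parsed into dicts; the record layout (ts, src, dst, flags, seq) and the tcpdump-style flag strings are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str, str, int]


def find_retransmissions(packets: Iterable[dict]) -> Dict[Key, List[float]]:
    """Return SYN/FIN-bearing segments seen more than once, with their timestamps."""
    seen: Dict[Key, List[float]] = defaultdict(list)
    for p in packets:
        # Only SYN ("S") and FIN ("F") segments are considered: pure ACKs
        # legitimately repeat the same sequence number.
        if "S" in p["flags"] or "F" in p["flags"]:
            key = (p["src"], p["dst"], p["flags"], p["seq"])
            seen[key].append(p["ts"])
    return {k: ts for k, ts in seen.items() if len(ts) > 1}
```

The gap between the first and last timestamp of each returned segment shows how long the sender waited before retransmitting.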

Summary of Packet Findings

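Packet delays, especially during TCP connection establishment, correlate with increased upstream_connect_time values observed in Nginx logs. The recurring ~1 s step is consistent with Linux's initial TCP retransmission timeout of 1 s: when a handshake packet is delayed long enough, the peer retransmits after about one second and the connection completes only then.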

Initial hypothesis: the extra time is spent in TCP handshake due to short‑lived connections; adding the $upstream_connect_time variable to metrics can help quantify this.

Further Work

Based on the hypothesis, the author modified wrk to record connection times (see PR https://github.com/wg/wrk/pull/447) to monitor backend service health.
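The idea of recording connection time separately from total request time can be illustrated with a few lines of Python. This is not the wrk patch itself, only a minimal sketch of the same measurement; the function name timed_connect and the default timeout are assumptions.

```python
import socket
import time


def timed_connect(host: str, port: int, timeout_s: float = 5.0) -> float:
    """Return the seconds taken to establish a TCP connection to host:port."""
    start = time.perf_counter()
    # create_connection returns once the TCP handshake has completed.
    with socket.create_connection((host, port), timeout=timeout_s):
        return time.perf_counter() - start
```

Running this repeatedly against a backend pod and looking for values near 1 s would reproduce the upstream_connect_time spikes without involving Nginx at all.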

6 Consulting Experts

After several dead‑ends, the author consulted senior K8s engineers who suggested checking host‑side latency and periodic kernel‑intensive tasks such as cgroup statistics, which can degrade network performance.

High host latency may be caused by frequent kernel‑level operations (e.g., cgroup reads) that increase CPU load and delay TCP handshakes.
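One simple way to check whether such reads are expensive on a given host is to time them from user space. The sketch below is an assumption-laden illustration: the function name is made up, and the file to read (for example /sys/fs/cgroup/memory/memory.stat on a cgroup v1 host) depends on the host's cgroup layout and is not taken from the original investigation.

```python
import time
from typing import List


def time_read(path: str, repeats: int = 5) -> List[float]:
    """Read `path` in full `repeats` times and return each read's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read()  # the kernel generates the file content during this call
        durations.append(time.perf_counter() - start)
    return durations
```

Reads of a small statistics file that take a noticeable fraction of a second, rather than well under a millisecond, would point to the kernel doing a lot of work per read.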

A kernel tracing tool (trace‑irqoff) was used to identify long periods during which interrupts were disabled.

Flame graphs from kubelet tracing confirmed that a large portion of time was spent reading kernel information.

Corresponding code snippets were captured (image omitted for brevity).

Final Diagnosis

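Excessive periodic tasks on the host caused kernel cache buildup, which slowed kernel operations, extended TCP handshake times, and degraded the user experience. The remediation was to lengthen the interval between those tasks and drop the kernel caches. The command below must be run as root; writing 3 frees the page cache as well as reclaimable dentries and inodes: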

sync && echo 3 > /proc/sys/vm/drop_caches
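To see how much memory the kernel is holding in caches before and after such a cleanup, the Cached and SReclaimable fields of /proc/meminfo can be compared. A minimal sketch follows; the helper names are illustrative, and reading these two fields is an assumption about what is worth watching, not a step from the original investigation.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def reclaimable_kib(meminfo: Dict[str, int]) -> int:
    """Page cache plus reclaimable slab (dentries, inodes and similar), in kB."""
    return meminfo.get("Cached", 0) + meminfo.get("SReclaimable", 0)
```

Calling `parse_meminfo(open("/proc/meminfo").read())` before and after the cleanup and comparing reclaimable_kib gives a quick indication of how much cache was actually released.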

7 Overall Summary

The investigation started from an application‑level latency symptom, progressed through network‑layer tracing, custom metric collection, packet analysis, and kernel‑level profiling, ultimately pinpointing host‑side kernel slowdown as the root cause. The author hopes this walkthrough helps others troubleshoot similar cluster issues.

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
