Investigation of 300‑Second Redis Timeout Issues in a Go Service
The article details how a Go service’s 300‑second Redis call timeout was traced to a gateway’s full‑NAT session‑table loss, and explains how targeted retries, proper timeout settings, and rate‑limiting can prevent similar cascading failures in distributed systems.
Online problem troubleshooting, although infrequent, is critical for engineers. This article shares a real‑world case where a Go service experienced two severe timeout problems: a Redis call that blocked the program for about 300 seconds and an upstream request that timed out between 120 ms and 200 ms.
1️⃣ The first step was to rule out network jitter. Coordination with the Redis team confirmed that Redis processing time is typically under 1 ms. The operations team reported occasional 10 Gbps NIC packet loss, but they believed retries would prevent long delays. (Later it turned out some retry packets were also lost.)
2️⃣ Next, the SDK was suspected. The library was upgraded to the latest version and more detailed debug logs were added. The logs revealed a read‑side block and the error `connection reset by peer`, suggesting the issue was not in the client library.
3️⃣ Various hypotheses were explored (packet loss, kernel parameters, Gateway upgrades). Multiple machines showed the same 300 s latency, and other departments reported similar Redis‑related timeouts. Because the problem was hard to reproduce, packet captures were taken on a few client machines.
4️⃣ Packet capture on the Redis proxy side showed the client continuously retransmitting (starting at TCP_RTO_MIN = 200 ms, then doubling to 400 ms, 800 ms, …) until the server finally sent a RST packet. During this period the server sent a keep‑alive probe after 75 seconds of idle time (the Redis keep‑alive interval) and the client ACKed it, yet the server kept resending keep‑alive probes, evidence that the client's packets were never reaching the server.
5️⃣ The service uses a VIP that points to a shared Gateway. By bypassing the Gateway and connecting directly to the real IP, the issue disappeared, pointing to the Gateway as the root cause. However, the operations team could not initially find any anomalies in the Gateway.
6️⃣ Further investigation revealed the Gateway operates in “fullnat” mode, which maintains two session tables (IN and OUT). If the IN entry is lost while the OUT entry remains, the client can still receive the server's packets but the server never receives the client's, exactly the one‑way asymmetry seen in the capture.
After confirming with the router vendor, the problem was attributed to the Gateway’s session handling in fullnat mode.
**Reflection on timeouts** – Improving infrastructure stability has diminishing returns; shaving a small amount of packet loss can cost far more than adding a low‑cost retry mechanism. A single retry can reduce an effective failure rate from 2 % to 0.04 %, and a second retry pushes it to 0.0008 %.
**Retry strategy discussion** – Placing retries only at the top of the call chain can cause a cascade of retries across the entire service mesh during a network glitch, leading to a chain‑wide “avalanche”. Instead, retries should be applied closer to the downstream services, combined with proper rate‑limiting and circuit‑breaking to prevent overload.
**Choosing timeout and retry counts** – The timeout and retry numbers should be balanced. For a service that needs five‑nines availability with a 98th‑percentile latency of 20 ms, a 20 ms timeout yields a 2 % failure rate. One retry reduces this to 0.04 % (99.96 % success); a second retry reaches 99.9992 %. Generally, no more than three retries are recommended, and timeout values should be set based on SLA requirements and observed latency percentiles.
Didi Tech
Official Didi technology account