Cache Instance Failure Incident Analysis and Root Cause Investigation
During a night-time outage, an XCache (Codis + Pika) instance hung when a massive write load triggered low-level write protection, and Sentinel switched masters. The proxy's accept queue then filled with timed-out sockets, blocking new connections. Scaling out the proxy layer and expanding capacity restored service, and follow-ups were filed for automation, better health checks, and queue-overflow alerts.
At around 21:00 on a dark night, while the author was buying food at a convenience store, an alarm reported that a cache instance had gone down and a service circuit‑breaker alert was triggered. He rushed home to start troubleshooting.
The cluster uses XCache (Codis + Pika). A check of the three key metrics showed that Sentinel had already performed a master-slave switch; the new master was receiving traffic and latency was below 1 ms, so the cluster appeared healthy. Yet the business still reported that it could not obtain cache connections and that requests were timing out.
Since the root cause was not yet clear, a rapid mitigation was performed: cluster scaling, master‑slave switch, and command circuit‑breaking. Because the business connects through a proxy, a new machine was added to expand the proxy layer. After the proxy expansion, alarm counts began to drop and the business gradually returned to normal.
Root Cause Analysis:
4.1 Why the instance appeared to be down – The process was still running, but logs showed a massive write load that triggered low‑level write protection, causing the instance to hang. Sentinel’s health check then failed, prompting the master‑slave switch.
4.2 Why business traffic did not recover – Monitoring revealed one machine with abnormal metrics. After the master node of a shard failed, Sentinel switched roles, causing clients to receive slot errors, disconnect from the proxy, and reconnect. A flood of concurrent connections overflowed the proxy’s accept queue, leading to dropped connections.
The accept queue size was 32,768, but only about 7,700 TCP connections were actually established. Many invalid connections accumulated in the accept queue because clients closed the socket after a 2‑second timeout while waiting to send the AUTH command. These closed sockets remained in the queue, consuming CPU and preventing the proxy from accepting new valid connections, creating a vicious cycle.
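The mechanics can be reproduced in miniature (a local loopback socket stands in for the proxy here; this is an illustrative sketch, not the actual proxy code): the TCP handshake completes and the connection sits in the listen backlog without the server ever calling accept(), so when the client gives up and closes, the server later accept()s a socket whose peer is already gone.

```python
import socket
import time

# Listener standing in for the proxy; the backlog lets handshakes
# complete before accept() is ever called.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(128)
addr = server.getsockname()

# Client connects: the three-way handshake completes and the
# connection enters the server's accept queue.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(addr)

# Client times out waiting to send AUTH and closes -- but the
# connection is still sitting in the server's accept queue.
client.close()
time.sleep(0.1)  # let the FIN reach the server

# When the busy proxy finally calls accept(), it gets back a socket
# whose peer is already gone: the first recv() returns EOF (b"").
conn, _ = server.accept()
conn.settimeout(1.0)
data = conn.recv(64)
print("recv returned:", data)  # b"" -> the peer is already gone
conn.close()
server.close()
```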
The client-proxy communication flow can be summarized as follows: the TCP connection is established and the client's socket (fdx) is placed in the proxy's accept queue; the client then sends AUTH, but if the proxy has not accepted the socket before the 2-second timeout expires, the client closes fdx, leaving an invalid fd in the queue. The proxy reports no error for these invalid fds, yet it still spawns coroutines to handle them, consuming resources and further preventing the acceptance of valid connections.
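One cheap way to shed such dead fds, sketched below with a hypothetical helper (the real Codis proxy is written in Go; Python is used here for brevity): immediately after accept(), do a non-blocking MSG_PEEK. A peer that closed while queued has already delivered its FIN, so the peek sees EOF (b"") and the fd can be closed on the spot instead of being handed to a handler coroutine.

```python
import socket
import time

def accept_or_discard(server: socket.socket):
    """Accept one connection; return it only if the peer is still alive.

    Hypothetical helper, not the actual proxy code: a non-blocking
    MSG_PEEK right after accept() distinguishes a live-but-quiet peer
    (BlockingIOError) from one that already closed (EOF, b"").
    """
    conn, _addr = server.accept()
    conn.setblocking(False)
    try:
        probe = conn.recv(1, socket.MSG_PEEK)
        if probe == b"":            # EOF: peer closed while queued
            conn.close()
            return None
    except ConnectionResetError:    # peer sent RST while queued
        conn.close()
        return None
    except BlockingIOError:
        pass                        # no data yet, but the peer is alive
    conn.setblocking(True)
    return conn

# Demo: one peer closes while waiting in the accept queue, one stays up.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(16)
addr = server.getsockname()

dead = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
dead.connect(addr)
dead.close()
live = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
live.connect(addr)
time.sleep(0.1)                     # let the FIN reach the server

first = accept_or_discard(server)   # the closed peer: discarded
second = accept_or_discard(server)  # the live peer: kept
print(first is None, second is not None)
```

On Linux the accept queue is served in FIFO order, so the demo accepts the dead connection first and discards it without ever dedicating a handler to it.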
Summary and Next Steps:
Automate fault recovery to reduce manual intervention time.
Improve health‑check mechanisms and control connection/QPS limits to mitigate risk.
Add alerting for TCP connection‑queue overflow.
Optimize the proxy network model to discard invalid fds directly.
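The queue-overflow alerting called for above can key off the kernel's TcpExt ListenOverflows counter, which increments each time a completed connection is dropped because a listener's accept queue is full. A minimal parser is sketched here against an abridged sample of /proc/net/netstat (the real file has many more counters; on a live host you would read the file and alert on the counter's delta between scrapes):

```python
def listen_overflows(netstat_text: str) -> int:
    """Extract the TcpExt ListenOverflows counter from /proc/net/netstat text.

    The file stores each section as a header line of counter names
    followed by a line of values, both prefixed with the section name.
    """
    lines = netstat_text.splitlines()
    for header, values in zip(lines, lines[1:]):
        if header.startswith("TcpExt:") and values.startswith("TcpExt:"):
            fields = dict(zip(header.split()[1:],
                              map(int, values.split()[1:])))
            return fields["ListenOverflows"]
    raise ValueError("TcpExt section not found")

# Abridged sample of /proc/net/netstat content.
SAMPLE = (
    "TcpExt: SyncookiesSent ListenOverflows ListenDrops\n"
    "TcpExt: 0 37 37\n"
)

count = listen_overflows(SAMPLE)
print("ListenOverflows =", count)
if count > 0:
    print("ALERT: accept queue overflowed; completed connections were dropped")
```

For ad-hoc inspection, `ss -lnt` also shows each listener's current accept-queue depth (Recv-Q) against its configured backlog (Send-Q).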
Ximalaya Technology Team
Official account of Ximalaya's technology team, sharing distilled technical experience and insights to grow together.