How We Traced a 48‑Hour Memory Leak in a Distributed Coordination Service
This article walks step by step through an investigation of repeated follower alerts in a Paxos‑based distributed coordination service. Followers were stalling in long Java GC pauses on machines under heavy memory pressure, and that pressure traced back to a memory leak in the front‑end Proxy. It also describes the rapid mitigation that restored system stability.
1. Problem Emergence
In late October 2019, multiple online alerts indicated that follower processes of a distributed coordination service repeatedly exited and re‑joined the quorum, e.g., a follower unexpectedly left the quorum at 14:04:28, restarted at 16:06:35, and so on.
2. System Architecture
The service uses a Paxos‑based consistency module with five master machines, tolerating up to two simultaneous failures. Although the alerts did not affect overall availability, the frequent follower anomalies posed a serious stability risk.
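The fault‑tolerance figure follows from majority arithmetic: a quorum needs more than half of the members, so five machines need three alive and can lose two. A minimal sketch of that arithmetic; the function names are illustrative and are not taken from the service's code:

```cpp
#include <cassert>

// Smallest number of members that forms a strict majority.
constexpr int MajoritySize(int cluster_size) {
    return cluster_size / 2 + 1;
}

// How many members may fail while a majority can still be formed.
constexpr int TolerableFailures(int cluster_size) {
    return cluster_size - MajoritySize(cluster_size);
}

// True if the members that are still alive can form a quorum.
inline bool HasQuorum(int cluster_size, int alive) {
    assert(cluster_size > 0 && alive >= 0 && alive <= cluster_size);
    return alive >= MajoritySize(cluster_size);
}
```

For a five‑member cluster, MajoritySize(5) is 3 and TolerableFailures(5) is 2, which is why a single flapping follower did not affect availability but did eat into the safety margin.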
3. Initial Investigation
Network metrics were normal, so logs were examined. Leader logs showed that at each alert time the leader actively closed the communication channel with the follower because the follower had not responded to heartbeat requests, causing the leader to deem the follower abnormal and remove it from the quorum.
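The leader's behavior can be pictured as a simple timeout check on the last heartbeat acknowledgement from each follower. The sketch below is only an illustration of that idea: the class name, the method names, and the timeout handling are assumptions, since the article does not show the service's actual implementation.

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical leader-side bookkeeping of heartbeat acknowledgements.
class HeartbeatMonitor {
public:
    explicit HeartbeatMonitor(std::chrono::milliseconds timeout)
        : timeout_(timeout) {}

    // Record that a follower answered a heartbeat at time `now`.
    void OnHeartbeatAck(const std::string& follower, Clock::time_point now) {
        last_ack_[follower] = now;
    }

    // Drop and return every follower whose last ack is older than the timeout.
    std::vector<std::string> EvictSilentFollowers(Clock::time_point now) {
        std::vector<std::string> evicted;
        for (auto it = last_ack_.begin(); it != last_ack_.end();) {
            if (now - it->second > timeout_) {
                evicted.push_back(it->first);
                it = last_ack_.erase(it);  // erase returns the next iterator
            } else {
                ++it;
            }
        }
        return evicted;
    }

private:
    std::chrono::milliseconds timeout_;
    std::map<std::string, Clock::time_point> last_ack_;
};
```

Under this model, a follower that is frozen for longer than the timeout looks identical to a dead one from the leader's side, even though its process is still running.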
4. Root Cause Analysis
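Further analysis revealed that the follower process was hanging because of a prolonged Java GC pause. The GC log showed an excessively long ParNew pause. ParNew young‑generation collections are Stop‑The‑World (STW): every application thread is suspended until the collection finishes, so the follower could not answer heartbeat requests while the pause lasted.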
The machine also suffered from high memory pressure: the front‑end Proxy process consumed over 66% of total memory, while the back‑end consistency process used about 30%.
OOM events for the Proxy process were observed, prompting a deeper memory‑leak investigation.
5. Deep Investigation of the Proxy Leak
Inspection with gdb and top showed that the unordered_map used for address caching was within its expected size, so it was ruled out and the source of the leak was still not obvious.
Advanced vtable analysis (based on tcmalloc) identified a massive leak of common::Closure<void, Env*> objects (over 1.6 billion instances).
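The idea behind this kind of analysis can be sketched as follows. Under the Itanium C++ ABI used by GCC and Clang on Linux, a polymorphic object with single inheritance stores its vtable pointer in its first word, so counting live heap blocks by first word groups them by dynamic type. The example is an illustration only: the types are stand‑ins, the block list is supplied by hand, and a real tool would instead enumerate live tcmalloc allocations and map each pointer value back to a symbol name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Stand-in polymorphic types; not the Proxy's real classes.
struct ClosureBase {
    virtual ~ClosureBase() = default;
    virtual void Run() = 0;
};
struct LeakedClosure : ClosureBase {
    int runs = 0;
    void Run() override { ++runs; }
};
struct OtherClosure : ClosureBase {
    void Run() override {}
};

// Read the first pointer-sized word of a block (the vptr under the
// ABI assumption stated above).
inline std::uintptr_t FirstWord(const void* block) {
    std::uintptr_t word = 0;
    std::memcpy(&word, block, sizeof(word));
    return word;
}

// Count blocks per first-word value; a dominant value points at the
// type that is accumulating.
inline std::map<std::uintptr_t, std::size_t> HistogramByFirstWord(
        const std::vector<const void*>& blocks) {
    std::map<std::uintptr_t, std::size_t> counts;
    for (const void* block : blocks) {
        ++counts[FirstWord(block)];
    }
    return counts;
}
```

In a leaking process, one first‑word value dominates the histogram, and resolving it to a vtable symbol names the class being leaked.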
$ grep Closure -r proxy | grep Env
proxy/io_handler.h: typedef common::Closure<void, Env*> CheckCall;
Log analysis showed a high volume of illegal access requests in which clients used an incorrect cluster name, generating thousands of error logs per second. On this error path the handler returned early without running or destroying the CheckCall object, so every rejected request leaked one closure.
6. Risk Mitigation
Two remediation options were considered:
Ask the business side to stop the erroneous access pattern.
Fix the bug in the Proxy code and roll out an upgrade.
Because upgrade windows were limited before a major sales event, the team chose the first option: they coordinated with the business team to deploy a hot‑fix that eliminated the accesses with the illegal cluster name, which immediately slowed the growth of the leak.
7. Permanent Fix
The long‑term solution involved modifying the Proxy so that even in error paths the CheckCall closure is executed and allowed to self‑destruct, adhering to a single‑exit principle. This fix was scheduled for release after the sales peak.
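The leak pattern and the single‑exit fix can be sketched as below. This is a simplified stand‑in rather than the Proxy's code: the real interface of common::Closure<void, Env*> is not shown in the article, so the CheckCall class, the Env fields, the live_count counter, and both handler functions here are assumptions made for illustration.

```cpp
#include <functional>
#include <string>
#include <utility>

struct Env {
    bool ok = false;  // hypothetical result flag
};

// Simplified self-deleting closure: it frees itself after running once.
class CheckCall {
public:
    static int live_count;  // instances allocated but not yet destroyed

    explicit CheckCall(std::function<void(Env*)> fn) : fn_(std::move(fn)) {
        ++live_count;
    }

    void Run(Env* env) {
        fn_(env);
        delete this;  // self-destruct; the object must be heap-allocated
    }

private:
    ~CheckCall() { --live_count; }  // private: only Run() may destroy it
    std::function<void(Env*)> fn_;
};
int CheckCall::live_count = 0;

// Leaky version: the error path returns before the closure runs.
void HandleRequestLeaky(const std::string& cluster, const std::string& expected,
                        CheckCall* done, Env* env) {
    if (cluster != expected) {
        env->ok = false;
        return;  // BUG: `done` never runs, so it never frees itself
    }
    env->ok = true;
    done->Run(env);
}

// Fixed version: one exit, and the closure runs on every path.
void HandleRequestFixed(const std::string& cluster, const std::string& expected,
                        CheckCall* done, Env* env) {
    env->ok = (cluster == expected);
    done->Run(env);  // the callback sees the error and the closure is freed
}
```

With the leaky handler, each request carrying a wrong cluster name leaves one more CheckCall alive. With the fixed handler the count returns to its previous value after every request, valid or not.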
8. Summary
Effective stability work requires meticulous monitoring of every alert, thorough root‑cause analysis, and prompt risk remediation; such disciplined practices are essential for building highly reliable distributed systems.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
