How We Traced a 48‑Hour Memory Leak in a Distributed Coordination Service

This article walks step by step through an investigation of repeated follower-process alerts in a Paxos-based distributed coordination service. The trail led from prolonged Java GC pauses on the follower to memory pressure caused by a memory leak in the front-end Proxy. The article also describes the rapid mitigation taken to restore system stability.

Alibaba Cloud Developer

Dec 20, 2019

1. Problem Emergence

In late October 2019, multiple production alerts showed follower processes of a distributed coordination service repeatedly exiting and rejoining the quorum. In one instance, a follower unexpectedly left the quorum at 14:04:28 and restarted at 16:06:35, and the same pattern recurred afterwards.

2. System Architecture

The service uses a Paxos‑based consistency module with five master machines, tolerating up to two simultaneous failures. Although the alerts did not affect overall availability, the frequent follower anomalies posed a serious stability risk.
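The fault-tolerance figure follows from majority-quorum arithmetic. The sketch below is only an illustration of that arithmetic; the function names are hypothetical and not taken from the service.

```cpp
#include <cstddef>

// Smallest number of members that forms a strict majority of n.
constexpr std::size_t MajoritySize(std::size_t n) { return n / 2 + 1; }

// Largest number of members that may fail while a majority survives.
constexpr std::size_t TolerableFailures(std::size_t n) {
  return n == 0 ? 0 : n - MajoritySize(n);
}

// True if `alive` healthy members out of n can still form a majority.
constexpr bool HasQuorum(std::size_t n, std::size_t alive) {
  return alive >= MajoritySize(n);
}

// Five members need three votes, so two may be down at once.
static_assert(MajoritySize(5) == 3, "majority of 5 is 3");
static_assert(TolerableFailures(5) == 2, "5 members tolerate 2 failures");
```

With five members, a single flapping follower leaves four healthy nodes, which is why availability was unaffected. A second and third concurrent failure would be the point at which the quorum is lost.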

3. Initial Investigation

Network metrics were normal, so the team turned to the logs. The leader's logs showed that at each alert time the leader had actively closed its communication channel with the follower. The follower had stopped responding to heartbeat requests, so the leader deemed it abnormal and removed it from the quorum.
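The leader-side behavior described here can be pictured as a timeout check over the last heartbeat acknowledgment from each follower. This is a minimal sketch under assumptions: `HeartbeatTracker`, its methods, and the timeout policy are hypothetical, and the real service's failure detector is not shown in the article.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical leader-side tracker of follower liveness.
class HeartbeatTracker {
 public:
  explicit HeartbeatTracker(Clock::duration timeout) : timeout_(timeout) {}

  // Record that `follower` acknowledged a heartbeat at time `now`.
  void OnHeartbeatAck(const std::string& follower, Clock::time_point now) {
    last_ack_[follower] = now;
  }

  // Return followers whose last ack is older than the timeout and stop
  // tracking them, mirroring "close the channel, remove from the quorum".
  std::vector<std::string> EvictSilent(Clock::time_point now) {
    std::vector<std::string> evicted;
    for (auto it = last_ack_.begin(); it != last_ack_.end();) {
      if (now - it->second > timeout_) {
        evicted.push_back(it->first);
        it = last_ack_.erase(it);
      } else {
        ++it;
      }
    }
    return evicted;
  }

  std::size_t TrackedCount() const { return last_ack_.size(); }

 private:
  Clock::duration timeout_;
  std::unordered_map<std::string, Clock::time_point> last_ack_;
};
```

Under this model a follower does not need to crash to be evicted. Any stall longer than the timeout, whatever its cause, looks the same to the leader.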

4. Root Cause Analysis

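Further analysis revealed that the follower process was hanging in a prolonged Java GC pause. The GC log showed an excessively long ParNew collection. ParNew is a Stop-The-World (STW) young-generation collector, so every application thread, including the one that answers the leader's heartbeats, was suspended for the duration of the pause.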
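Long pauses like this can be picked out of a GC log mechanically. The sketch below assumes the JDK 8 style log produced with -XX:+PrintGCDetails, where each collection line ends in a `[Times: user=... sys=..., real=... secs]` suffix. The function names and the threshold are hypothetical, and the article does not show the actual log lines.

```cpp
#include <regex>
#include <string>

// Extract the wall-clock pause ("real=N secs") from one GC log line.
// Returns -1.0 when the line carries no such field.
double RealPauseSeconds(const std::string& line) {
  static const std::regex kReal(R"(real=([0-9]+(?:\.[0-9]+)?) secs)");
  std::smatch match;
  if (!std::regex_search(line, match, kReal)) {
    return -1.0;
  }
  return std::stod(match[1].str());
}

// True if the line reports a ParNew collection whose wall-clock pause
// exceeded `threshold_secs` (for example, the heartbeat timeout).
bool IsLongParNewPause(const std::string& line, double threshold_secs) {
  return line.find("ParNew") != std::string::npos &&
         RealPauseSeconds(line) > threshold_secs;
}
```

Comparing the wall-clock figure rather than the CPU figures matters here: a collector that is starved of memory or CPU can spend little `user` time yet still hold the application stopped for a long `real` time.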

The machine also suffered from high memory pressure: the front‑end Proxy process consumed over 66% of total memory, while the back‑end consistency process used about 30%.

OOM events for the Proxy process were observed, prompting a deeper memory‑leak investigation.

5. Deep Investigation of the Proxy Leak

Inspection with gdb and top showed that the unordered_map used for address caching was within its expected size. That ruled out the cache as the culprit and left the source of the leak unclear.

A vtable-based analysis of the tcmalloc-managed heap then identified a massive leak of common::Closure<void, Env*> objects (over 1.6 billion instances). Searching the Proxy source for that type pointed to a single typedef:

$ grep Closure -r proxy | grep Env

proxy/io_handler.h:    typedef common::Closure<void, Env*>  CheckCall;
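The article does not show the vtable analysis itself. As a rough sketch of the underlying idea, assume the common ABI layout in which a polymorphic C++ object stores its vtable pointer in its first word. Counting live heap blocks by that first word then shows which type dominates the heap. How the list of live blocks is obtained (for example from a core dump) is outside this sketch, and `CountByFirstWord` and `MostCommonFirstWord` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

using WordCounts = std::unordered_map<std::uintptr_t, std::size_t>;

// Histogram live blocks by the value of their first pointer-sized word.
// Each block must be at least sizeof(std::uintptr_t) bytes long.
WordCounts CountByFirstWord(const std::vector<const void*>& blocks) {
  WordCounts counts;
  for (const void* block : blocks) {
    std::uintptr_t first_word = 0;
    // memcpy reads the raw bytes without violating aliasing rules.
    std::memcpy(&first_word, block, sizeof(first_word));
    ++counts[first_word];
  }
  return counts;
}

// Return the first-word value shared by the most blocks (0 if none).
std::uintptr_t MostCommonFirstWord(const WordCounts& counts) {
  std::uintptr_t best = 0;
  std::size_t best_count = 0;
  for (const auto& entry : counts) {
    if (entry.second > best_count) {
      best = entry.first;
      best_count = entry.second;
    }
  }
  return best;
}
```

The dominant value is only an address. In a debugger it still has to be resolved to a symbol, for instance with gdb's `info symbol`, before it names a type such as the Closure above.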

Log analysis showed a high volume of illegal access requests in which clients used an incorrect cluster name, generating thousands of error logs per second. On that error path the handler returned early without running or destroying the CheckCall object, so every rejected request leaked one closure.
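The pattern is easy to reproduce in miniature. The sketch below is not the Proxy's code: `CheckCall` here is a hypothetical stand-in for a self-deleting closure, `Env` is reduced to a single field, and a static counter makes the leak visible.

```cpp
#include <string>

// Hypothetical request context; the real Env is not shown in the article.
struct Env {
  std::string cluster;
};

// Stand-in for a heap-allocated closure that frees itself when run.
class CheckCall {
 public:
  static int live;  // instances created but not yet destroyed
  CheckCall() { ++live; }
  void Run(Env*) { delete this; }

 private:
  ~CheckCall() { --live; }  // private: only Run() can destroy the object
};
int CheckCall::live = 0;

// Leaky handler: the error path returns before `done` is ever run.
bool HandleRequestLeaky(Env* env, const std::string& expected,
                        CheckCall* done) {
  if (env->cluster != expected) {
    return false;  // BUG: `done` is neither run nor freed
  }
  done->Run(env);  // success path: closure runs and deletes itself
  return true;
}
```

Every request with a wrong cluster name leaves `CheckCall::live` one higher. At thousands of rejected requests per second, a per-request leak of this shape accumulates quickly.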

6. Risk Mitigation

Two remediation options were considered:

Ask the business side to stop the erroneous access pattern.

Fix the bug in the Proxy code and roll out an upgrade.

Because upgrade windows were limited ahead of a major sales event, the team chose the first option. They coordinated with the business team, which deployed a hot-fix that eliminated the illegal cluster-name accesses. The leak's growth slowed immediately.

7. Permanent Fix

The long‑term solution involved modifying the Proxy so that even in error paths the CheckCall closure is executed and allowed to self‑destruct, adhering to a single‑exit principle. This fix was scheduled for release after the sales peak.
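A single-exit handler of the kind described might look like the sketch below. It is written under assumptions: `CheckCall` is a hypothetical self-deleting closure built on std::function, `Env` carries a hypothetical `ok` flag for the outcome, and the Proxy's real types are not shown in the article.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical request context carrying the outcome to the callback.
struct Env {
  std::string cluster;
  bool ok;
};

// Hypothetical self-deleting closure: runs its callback, then frees itself.
class CheckCall {
 public:
  explicit CheckCall(std::function<void(Env*)> fn) : fn_(std::move(fn)) {}
  void Run(Env* env) {
    fn_(env);
    delete this;
  }

 private:
  ~CheckCall() = default;  // private: destruction happens only via Run()
  std::function<void(Env*)> fn_;
};

// Single-exit handler: errors are recorded in `env`, and the closure is
// run exactly once at the end, on success and failure alike.
bool HandleRequest(Env* env, const std::string& expected, CheckCall* done) {
  env->ok = (env->cluster == expected);
  if (env->ok) {
    // ... normal request processing would go here ...
  }
  done->Run(env);  // the only exit: closure always runs and self-destructs
  return env->ok;
}
```

Routing failures through the same exit also means the callback learns about the error through `env`, rather than never being called at all.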

8. Summary

Effective stability work requires meticulous monitoring of every alert, thorough root‑cause analysis, and prompt risk remediation; such disciplined practices are essential for building highly reliable distributed systems.
