How a Hidden gcache Memory Leak Undermined 99.9% Service Availability
A high‑traffic Go service intermittently dropped below 99.9% availability because of request timeouts. After extensive profiling, tracing, and heap analysis, the root cause turned out to be a memory leak in the third‑party gcache LFU implementation, and upgrading the library fixed it.
Background
Users reported that availability occasionally dipped below 99.9% for a service handling over 100k QPS, with requests timing out during peak traffic, especially on the appointment‑booking API.
The service is a large monolith using Redis, Memcached, MySQL, Taishan, Elasticsearch, MongoDB, Databus, etc., making root‑cause analysis challenging.
Hypotheses and Elimination
Initial guesses included business‑logic performance issues, abnormal traffic, system calls, CPU throttling, middleware problems, Go scheduler or GC anomalies. Traffic analysis showed regular patterns, so abnormal traffic was ruled out. CPU throttling was low, and reducing cache memory did not improve availability.
Trace data revealed both MySQL and Redis timeouts, but their query times were minimal, and no component showed clear faults.
Investigation Steps
Collect pprof data: CPU profiler for CPU usage, memory profiler for GC issues.
Enable GODEBUG=gctrace=1 to inspect GC behavior during availability dips.
Add fgprof to detect off‑CPU time (I/O, locks, timers, page swaps).
Capture Go trace to examine scheduler behavior, GC, pre‑emptions.
Use Linux strace to check system calls that might cause timeouts.
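As a minimal sketch of the first steps above, the standard pprof handlers can be mounted on a dedicated debug mux. The names newDebugMux and probe are hypothetical and the in-process self-check exists only to keep the example runnable. The source does not show how the service actually exposed its profiles. fgprof is a third‑party package and is left out here so the block depends only on the standard library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves the standard pprof endpoints.
// In a real service it would typically be served on an internal-only
// address, separate from user traffic.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	// Index also serves named profiles such as /debug/pprof/heap.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	// CPU profile, e.g. /debug/pprof/profile?seconds=30
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	// Execution trace, e.g. /debug/pprof/trace?seconds=5
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// probe issues an in-process request against the debug mux and returns
// the HTTP status code and the response body size.
func probe(path string) (int, int) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	newDebugMux().ServeHTTP(rec, req)
	return rec.Code, rec.Body.Len()
}

func main() {
	code, size := probe("/debug/pprof/heap?debug=1")
	fmt.Printf("heap profile: status=%d bytes=%d\n", code, size)
}
```

Against a running process, the same endpoints can be read with go tool pprof. GODEBUG=gctrace=1 is set as an environment variable when the process starts, and strace is attached from outside the process.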
Analysis
gctrace
The concurrent mark and scan phases took about 860 ms, which is unusually long, but this did not correlate clearly with the availability dips.
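GC behavior can also be sampled from inside the process, which makes it easier to line up with availability metrics. The sketch below reads runtime.MemStats. It is a complement to gctrace, not a replacement: MemStats reports cycle counts, cumulative stop‑the‑world pause time, and the GC CPU fraction, but not the per‑phase concurrent mark timings that gctrace prints. The type gcSnapshot and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gcSnapshot holds the GC counters of interest at one point in time.
type gcSnapshot struct {
	NumGC       uint32        // completed GC cycles
	PauseTotal  time.Duration // cumulative stop-the-world pause time
	CPUFraction float64       // fraction of available CPU used by GC since start
	HeapObjects uint64        // number of allocated heap objects
}

// readGC takes a snapshot of the runtime's GC statistics.
func readGC() gcSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return gcSnapshot{
		NumGC:       m.NumGC,
		PauseTotal:  time.Duration(m.PauseTotalNs),
		CPUFraction: m.GCCPUFraction,
		HeapObjects: m.HeapObjects,
	}
}

// forcedCycles runs one explicit collection and reports how many GC
// cycles completed between the two snapshots.
func forcedCycles() uint32 {
	before := readGC()
	runtime.GC() // blocks until the collection has finished
	after := readGC()
	return after.NumGC - before.NumGC
}

func main() {
	s := readGC()
	fmt.Printf("cycles=%d pause=%s gcCPU=%.4f objects=%d\n",
		s.NumGC, s.PauseTotal, s.CPUFraction, s.HeapObjects)
	fmt.Println("cycles during forced GC:", forcedCycles())
}
```

Exporting such snapshots periodically to the metrics system would allow GC cycles and heap object counts to be plotted next to the availability curve.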
strace
No abnormal system calls were observed; this hypothesis was discarded.
fgprof
No off‑CPU bottlenecks were detected.
Go trace
Repeated traces showed frequent “MARK ASSIST” events, which mean that allocating goroutines were being made to help the garbage collector with marking, a sign of heavy GC pressure.
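A trace like the ones described here can be captured programmatically with the runtime/trace package. The following is a small sketch under assumed names (captureTrace, allocate). The allocation loop is only a stand‑in for real request handling and is not guaranteed to produce mark assists on its own.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/trace"
)

// sink keeps allocations reachable so the garbage collector has work to do.
var sink [][]byte

// captureTrace records an execution trace while work runs and returns
// the size of the captured trace in bytes.
func captureTrace(work func()) (int, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return 0, err
	}
	work()
	trace.Stop() // flushes the remaining trace data into buf
	return buf.Len(), nil
}

// allocate is a stand-in workload that creates heap pressure.
func allocate() {
	for i := 0; i < 10000; i++ {
		sink = append(sink, make([]byte, 1024))
	}
	sink = nil
}

func main() {
	n, err := captureTrace(allocate)
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Println("captured trace bytes:", n)
}
```

Writing the buffer to a file and opening it with go tool trace shows GC phases and mark assist work on the goroutine timelines.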
Heap analysis
Heap snapshots showed that gRPC connections consumed a large amount of memory and that the gcache LFU implementation had created an excessive number of objects (over 1 million of roughly 3 million in total), which suggested a leak.
Resolution
The gcache LFU Get method was identified as the culprit, leaking memory (~100 MB, ~2.5% of total). Upgrading gcache from v0.0.1 to v0.0.2 eliminated the leak and restored stable availability.
Key takeaways: third‑party libraries can introduce subtle bugs; rigorous monitoring, automated profiling, and timely upgrades are essential for system stability.
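One way to approach the automated profiling mentioned in the takeaways is a watchdog that writes a heap profile whenever live heap exceeds a threshold, so that evidence is captured while the problem is happening. This is a rough sketch only: the function names and the 1 GiB threshold are hypothetical, and a production tool would add rate limiting, baselines, and persistent storage for the dumps.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"runtime"
	"runtime/pprof"
)

// dumpHeapIfAbove writes a heap profile to w when the live heap is
// larger than limitBytes. It reports whether a dump was attempted.
func dumpHeapIfAbove(w io.Writer, limitBytes uint64) (bool, error) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.HeapAlloc <= limitBytes {
		return false, nil
	}
	// debug=0 writes the compressed protobuf format read by go tool pprof.
	return true, pprof.Lookup("heap").WriteTo(w, 0)
}

// dumpedBytes runs the check against an in-memory buffer and returns
// whether a dump happened and how many bytes were written.
func dumpedBytes(limitBytes uint64) (bool, int) {
	var buf bytes.Buffer
	dumped, err := dumpHeapIfAbove(&buf, limitBytes)
	if err != nil {
		return dumped, 0
	}
	return dumped, buf.Len()
}

func main() {
	const limit = 1 << 30 // hypothetical 1 GiB threshold
	dumped, n := dumpedBytes(limit)
	fmt.Printf("dumped=%v bytes=%d\n", dumped, n)
}
```

Calling such a check from a ticker, and writing to a timestamped file instead of a buffer, would give a series of profiles to compare once an availability dip is reported.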
References
Analysis of Go Off‑CPU performance – https://colobu.com/2020/11/12/analyze-On-CPU-in-go/
Holmes design – https://mosn.io/blog/posts/mosn-holmes-design/
gcache repository – https://github.com/bluele/gcache
Deep dive into Go pprof – https://www.cnblogs.com/qcrao-2018/p/11832732.html
