How a Hidden gcache Memory Leak Undermined 99.9% Service Availability
A high‑traffic Go service suffered intermittent availability drops below 99.9% due to timeouts. After extensive profiling, tracing, and heap analysis, the root cause was identified as a memory‑leak bug in the third‑party gcache LFU implementation; upgrading the library fixed it.
Background
Users reported occasional availability dips below 99.9% for a service handling over 100k QPS, with timeouts during peak traffic, especially on the appointment‑booking API.
The service is a large monolith using Redis, Memcached, MySQL, Taishan, Elasticsearch, MongoDB, Databus, etc., making root‑cause analysis challenging.
Hypotheses and Elimination
Initial guesses included business‑logic performance issues, abnormal traffic, system calls, CPU throttling, middleware problems, and Go scheduler or GC anomalies. Traffic analysis showed regular patterns, so abnormal traffic was ruled out. CPU throttling was low, and reducing cache memory did not improve availability.
Trace data revealed both MySQL and Redis timeouts, but their query times were minimal, and no component showed clear faults.
Investigation Steps
Collect pprof data: a CPU profile for CPU usage and a heap profile for memory and GC issues.
Enable GODEBUG=gctrace=1 to inspect GC behavior during availability dips.
Add fgprof to detect off‑CPU time (I/O, locks, timers, page swaps).
Capture a Go trace to examine scheduler behavior, GC, and preemptions.
Use Linux strace to check for system calls that might cause timeouts.
Analysis
gctrace
Concurrent mark and scan phases consumed about 860 ms, unusually long, but showed no clear correlation with the availability drops.
strace
No abnormal system calls were observed; this hypothesis was discarded.
fgprof
No off‑CPU bottlenecks were detected.
Go trace
Repeated traces showed frequent “MARK ASSIST” events, indicating noticeable GC activity.
Heap analysis
Heap snapshots showed a large memory consumption by gRPC connections and an excessive number of objects created by the gcache LFU implementation (over 1 million objects out of 3 million total), suggesting a leak.
Resolution
The gcache LFU Get method was identified as the culprit, leaking memory (~100 MB, ~2.5% of total). Upgrading gcache from v0.0.1 to v0.0.2 eliminated the leak and restored stable availability.
Key takeaways: third‑party libraries can introduce subtle bugs; rigorous monitoring, automated profiling, and timely upgrades are essential for system stability.
References
Analysis of Go Off‑CPU performance – https://colobu.com/2020/11/12/analyze-On-CPU-in-go/
Holmes design – https://mosn.io/blog/posts/mosn-holmes-design/
gcache repository – https://github.com/bluele/gcache
Deep dive into Go pprof – https://www.cnblogs.com/qcrao-2018/p/11832732.html
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career, growing together.