
How a Hidden gcache Memory Leak Undermined 99.9% Service Availability

A high‑traffic Go service suffered intermittent availability drops below 99.9% due to timeouts. After extensive profiling, tracing, and heap analysis, the root cause was identified as a memory leak in the third‑party gcache LFU implementation, and the fix was an upgrade of the library.


Background

For a service handling over 100k QPS, availability occasionally dipped below 99.9% as requests timed out during peak traffic, especially on the appointment‑booking API.

The service is a large monolith using Redis, Memcached, MySQL, Taishan, Elasticsearch, MongoDB, Databus, etc., making root‑cause analysis challenging.

Hypotheses and Elimination

Initial guesses included business‑logic performance issues, abnormal traffic, system calls, CPU throttling, middleware problems, Go scheduler or GC anomalies. Traffic analysis showed regular patterns, so abnormal traffic was ruled out. CPU throttling was low, and reducing cache memory did not improve availability.

Trace data revealed both MySQL and Redis timeouts, but their query times were minimal, and no component showed clear faults.

Investigation Steps

- Collect pprof data: the CPU profiler for CPU usage, the memory profiler for GC issues.

- Enable GODEBUG=gctrace=1 to inspect GC behavior during availability dips.

- Add fgprof to detect off‑CPU time (I/O, locks, timers, page swaps).

- Capture a Go trace to examine scheduler behavior, GC, and preemptions.

- Use Linux strace to check for system calls that might cause timeouts.

Analysis

gctrace

The concurrent mark and scan phases consumed about 860 ms, which is unusually long, but showed no clear correlation with the availability dips.

strace

No abnormal system calls were observed; this hypothesis was discarded.

fgprof

No off‑CPU bottlenecks were detected.

Go trace

Repeated traces showed frequent “MARK ASSIST” events, indicating noticeable GC activity.

Heap analysis

Heap snapshots showed substantial memory held by gRPC connections and an excessive number of objects created by the gcache LFU implementation (over 1 million of the roughly 3 million objects on the heap), suggesting a leak.

Resolution

The gcache LFU Get method was identified as the culprit, leaking memory (~100 MB, ~2.5% of total). Upgrading gcache from v0.0.1 to v0.0.2 eliminated the leak and restored stable availability.

Key takeaways: third‑party libraries can introduce subtle bugs; rigorous monitoring, automated profiling, and timely upgrades are essential for system stability.

References

Analysis of Go Off‑CPU performance – https://colobu.com/2020/11/12/analyze-On-CPU-in-go/

Holmes design – https://mosn.io/blog/posts/mosn-holmes-design/

gcache repository – https://github.com/bluele/gcache

Deep dive into Go pprof – https://www.cnblogs.com/qcrao-2018/p/11832732.html

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
