Why Our 100k+ QPS Go Service Fell Below 99.9% Availability: GC, Trace, and the gcache Fix
After a high‑traffic Go service’s availability slipped below the 99.9% target, the team traced the issue through pprof, gctrace, strace, fgprof, and heap analysis, ultimately identifying a memory‑leak bug in the third‑party gcache LFU implementation and resolving it with an upgrade.
Background
A high‑traffic reservation service handling over 100,000 QPS experienced occasional availability dips below the 99.9% SLA, manifested as time‑outs on the reservation API during traffic peaks. The service is a large monolith that depends on Redis, Memcached, MySQL, Elasticsearch, MongoDB and a data bus.
Hypotheses and Elimination
Potential causes considered included inefficient business logic, abnormal traffic, system‑call latency, CPU throttling, middleware issues (Redis, MySQL), Go scheduler behavior, and GC pauses. Traffic patterns were regular, so abnormal traffic was ruled out. CPU throttling metrics stayed low, and reducing the cache size did not improve availability, so CPU throttling was excluded as well.
Investigation Methodology
The focus shifted to the Go runtime and system‑call behavior. The following techniques were applied:
Collect pprof CPU and memory profiles.
Enable GODEBUG=gctrace=1 to capture GC traces.
Add fgprof to detect off‑CPU time (I/O, locks, page swaps).
Capture Go trace files to observe scheduler activity.
Run strace on the process to inspect system calls.
Findings
GC trace
GC traces showed a concurrent mark‑and‑scan phase lasting roughly 860 ms, consuming most of the CPU time during the spikes. No direct correlation with the availability dip could be proven.
strace
System‑call traces contained only normal calls; no latency in syscalls was observed.
fgprof
Off‑CPU analysis did not reveal significant blocking, which was consistent with the strace results.
Go trace
More than 20 Go trace captures consistently showed “MARK ASSIST” events, meaning that application goroutines were being pulled in to help the GC mark phase during the problematic periods.
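Mark assist time can also be tracked numerically rather than by eye in the trace viewer. As a sketch, assuming Go 1.20 or later (where the runtime/metrics package exposes a mark-assist CPU metric), the hypothetical helper below reads the cumulative CPU seconds that goroutines have spent assisting the GC. On older toolchains the metric is unsupported and the helper reports ok == false.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// Metric name as documented by runtime/metrics (Go 1.20+).
const markAssistMetric = "/cpu/classes/gc/mark/assist:cpu-seconds"

// markAssistCPUSeconds (hypothetical helper) returns the estimated total CPU
// time goroutines have spent assisting the GC, and whether the running
// toolchain supports the metric.
func markAssistCPUSeconds() (seconds float64, ok bool) {
	sample := []metrics.Sample{{Name: markAssistMetric}}
	metrics.Read(sample)
	if sample[0].Value.Kind() != metrics.KindFloat64 {
		return 0, false // metric not available on this Go version
	}
	return sample[0].Value.Float64(), true
}

func main() {
	if v, ok := markAssistCPUSeconds(); ok {
		fmt.Printf("mark assist CPU time so far: %.6fs\n", v)
	} else {
		fmt.Println("mark assist metric not supported by this Go version")
	}
}
```

Sampling this value periodically and exporting the delta gives a time series that can be compared directly against the availability dips.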
Heap and object inspection
Heap snapshots showed large memory usage by gRPC connections. More critically, inuse_objects revealed that the LFU cache in the gcache library generated over 1 million objects, far exceeding the total object count of ~300 k, suggesting a memory‑leak scenario.
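A heap profile that supports the inuse_objects view can be produced with runtime/pprof. This is a minimal sketch; the helper name heapProfile and the file name heap.out are assumptions for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// heapProfile (hypothetical helper) returns a heap profile in pprof format.
func heapProfile() ([]byte, error) {
	runtime.GC() // run a collection first so the profile reflects current live objects
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := heapProfile()
	if err != nil {
		log.Fatal(err)
	}
	if err := os.WriteFile("heap.out", data, 0o644); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("wrote %d bytes to heap.out\n", len(data))
}
```

Opening the file with go tool pprof -sample_index=inuse_objects heap.out and running top ranks allocation sites by live object count instead of bytes. Counting objects is what surfaces a leak of many small items that looks unremarkable when ranked by memory size.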
Root Cause
The third‑party gcache library (LFU strategy) has a known issue (see https://github.com/bluele/gcache/issues/71) where the Get method retains pointers, leaking about 100 MB (≈2.5 % of total memory). Upgrading gcache from v0.0.1 to v0.0.2 eliminated the leak and restored service availability.
Remediation Steps
Update the dependency, e.g., go get github.com/bluele/gcache@v0.0.2, or edit go.mod accordingly.
Re‑run the profiling suite (pprof, GODEBUG, fgprof, Go trace, strace) to verify that GC pause times and object counts return to normal.
Continuously monitor inuse_objects and GC pause metrics after deployment.
Key Takeaways
Third‑party libraries can contain subtle bugs that impact production stability; enforce strict vetting, version monitoring, and automated issue tracking.
Comprehensive runtime observability (pprof, GODEBUG, fgprof, Go trace, strace) is essential for diagnosing transient latency spikes in high‑QPS services.
Automated snapshot collection based on abnormal metrics (e.g., MOSN Holmes) can capture evidence before it disappears, aiding rapid root‑cause analysis.
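The idea behind metric-triggered snapshots can be sketched with the standard library. The code below does not use Holmes or reproduce its API; the function names, the 2x ratio, and the output path are assumptions for illustration. It compares the live heap object count against a baseline and writes a heap profile only when the count is abnormally high.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"runtime"
	"runtime/pprof"
)

// exceeds reports whether current is more than ratio times a non-zero baseline.
func exceeds(current, baseline uint64, ratio float64) bool {
	return baseline > 0 && float64(current) > float64(baseline)*ratio
}

// dumpHeapIfAbnormal (hypothetical helper) writes a heap profile to path when
// the live object count exceeds ratio times baseline. It reports whether a
// profile was written.
func dumpHeapIfAbnormal(baseline uint64, ratio float64, path string) (bool, error) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if !exceeds(m.HeapObjects, baseline, ratio) {
		return false, nil
	}
	f, err := os.Create(path)
	if err != nil {
		return false, err
	}
	defer f.Close()
	return true, pprof.WriteHeapProfile(f)
}

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	baseline := m.HeapObjects

	// Simulate a leak by retaining many small objects.
	retained := make([]*[64]byte, 0, 200000)
	for i := 0; i < 200000; i++ {
		retained = append(retained, new([64]byte))
	}

	path := filepath.Join(os.TempDir(), "heap-abnormal.out")
	dumped, err := dumpHeapIfAbnormal(baseline, 2.0, path)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("profile written:", dumped)
	runtime.KeepAlive(retained)
}
```

In a real service the check would run on a ticker, with a cooldown between dumps so that a sustained anomaly does not fill the disk with profiles.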
References
Analyzing Go Off‑CPU Performance – https://colobu.com/2020/11/12/analyze-On-CPU-in-go/
Holmes design overview – https://mosn.io/blog/posts/mosn-holmes-design/
Holmes GitHub – https://github.com/mosn/holmes
gcache GitHub – https://github.com/bluele/gcache
Deep dive into Go pprof – https://www.cnblogs.com/qcrao-2018/p/11832732.html
dbaplus Community
