Why Our 100K+ QPS Go Service Fell Below 99.9% Availability – GC, Trace & GCache Fix

After a high‑traffic Go service’s availability slipped below the 99.9% target, the team traced the issue through pprof, gctrace, strace, fgprof, and heap analysis, ultimately identifying a memory‑leak bug in the third‑party gcache LFU implementation and resolving it with an upgrade.

dbaplus Community

Background

A high‑traffic reservation service handling over 100,000 QPS occasionally dipped below its 99.9% availability SLA. The dips showed up as timeouts on the reservation API during traffic peaks. The service is a large monolith that depends on Redis, Memcached, MySQL, Elasticsearch, MongoDB and a data bus.

Hypotheses and Elimination

Potential causes considered included inefficient business logic, abnormal traffic, system‑call latency, CPU throttling, middleware issues (Redis, MySQL), Go scheduler behavior, and GC pauses. Traffic patterns were regular, so abnormal traffic was ruled out. CPU throttling metrics were low and reducing cache size did not improve availability, so CPU throttling was also excluded.

Investigation Methodology

The focus shifted to the Go runtime and system‑call behavior. The following techniques were applied:

Collect pprof CPU and memory profiles.

Enable GODEBUG=gctrace=1 to capture GC traces.

Add fgprof to detect off‑CPU time (I/O, locks, page swaps).

Capture Go trace files to observe scheduler activity.

Run strace on the process to inspect system calls.
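As a rough illustration of the first and fourth items, the sketch below uses only the standard library to expose pprof handlers and to capture an execution trace in-process. The function names newDebugMux and captureTraceMillis are hypothetical, and the listen address in the comment is an assumption. fgprof is a third-party package and is left out so the example stays self-contained.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/pprof"
	"runtime/trace"
	"time"
)

// newDebugMux registers the standard pprof handlers on a dedicated mux so
// they can be served on an internal-only port (hypothetical helper).
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, etc.
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// captureTraceMillis records an execution trace for roughly ms milliseconds
// and returns the raw bytes, which can be written to a file and opened with
// "go tool trace".
func captureTraceMillis(ms int) ([]byte, error) {
	var buf bytes.Buffer
	if err := trace.Start(&buf); err != nil {
		return nil, err // for example, tracing is already enabled
	}
	time.Sleep(time.Duration(ms) * time.Millisecond)
	trace.Stop() // returns only after all trace data has been written
	return buf.Bytes(), nil
}

func main() {
	data, err := captureTraceMillis(50)
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of trace data\n", len(data))

	// In a real service (address is an assumption):
	//   go http.ListenAndServe("127.0.0.1:6060", newDebugMux())
	_ = newDebugMux()
}
```

The GC trace itself needs no code change: starting the binary with GODEBUG=gctrace=1 in its environment makes the runtime print one summary line per collection to standard error.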

Findings

GC trace

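GC traces showed a concurrent mark‑and‑scan phase of roughly 860 ms (wall‑clock time for a phase that runs alongside the application, not a stop‑the‑world pause), consuming most CPU time during the spikes. No direct correlation with the availability dip could be proven at this point.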
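In a gctrace line, the "ms clock" field lists three wall-clock durations separated by plus signs: stop-the-world sweep termination, concurrent mark and scan, and stop-the-world mark termination. The sketch below pulls those three numbers out of one line. The sample line is hypothetical and was written for this example; it is not taken from the incident logs, and parseGCClock is an illustrative helper name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// sampleLine is a made-up gctrace line used only for illustration.
const sampleLine = "gc 45 @120.5s 4%: 0.11+860+0.25 ms clock, " +
	"0.9+12/1700/3100+2.0 ms cpu, 3800->3950->2100 MB, 4000 MB goal, 8 P"

// parseGCClock extracts the three wall-clock phase durations (in ms) from a
// gctrace line: STW sweep termination, concurrent mark and scan, and STW
// mark termination.
func parseGCClock(line string) (sweepTermSTW, concurrentMark, markTermSTW float64, err error) {
	end := strings.Index(line, " ms clock")
	if end < 0 {
		return 0, 0, 0, fmt.Errorf("no clock field in %q", line)
	}
	start := strings.LastIndex(line[:end], ": ")
	if start < 0 {
		return 0, 0, 0, fmt.Errorf("no phase list in %q", line)
	}
	parts := strings.Split(line[start+2:end], "+")
	if len(parts) != 3 {
		return 0, 0, 0, fmt.Errorf("expected 3 phases, got %d", len(parts))
	}
	var vals [3]float64
	for i, p := range parts {
		if vals[i], err = strconv.ParseFloat(p, 64); err != nil {
			return 0, 0, 0, err
		}
	}
	return vals[0], vals[1], vals[2], nil
}

func main() {
	sweep, mark, term, err := parseGCClock(sampleLine)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("STW total: %.2f ms, concurrent mark: %.2f ms\n", sweep+term, mark)
}
```

Separating the two stop-the-world numbers from the concurrent one makes it easier to tell whether requests were frozen outright or were competing with the collector for CPU.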

strace

System‑call traces contained only normal calls; no latency in syscalls was observed.

fgprof

Off‑CPU analysis did not reveal significant blocking, confirming the strace results.

Go trace

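Repeated Go trace captures (more than 20) consistently showed “MARK ASSIST” events. A mark assist means an application goroutine is made to perform GC marking work because allocation is outpacing the background collector, so GC pressure was landing directly on request goroutines during the problematic periods.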
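Outside of trace captures, the CPU time spent in mark assists can also be read at runtime. The sketch below assumes Go 1.20 or later, where the runtime/metrics package exposes the two metric names used here; on older toolchains the helper reports ok=false instead of a value. readGCCPU is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

const (
	assistMetric  = "/cpu/classes/gc/mark/assist:cpu-seconds"
	gcTotalMetric = "/cpu/classes/gc/total:cpu-seconds"
)

// readGCCPU returns the cumulative estimated CPU seconds spent in mark
// assists and in GC overall. ok is false when the running toolchain does
// not support these metrics.
func readGCCPU() (assist, total float64, ok bool) {
	samples := []metrics.Sample{{Name: assistMetric}, {Name: gcTotalMetric}}
	metrics.Read(samples)
	for _, s := range samples {
		if s.Value.Kind() != metrics.KindFloat64 {
			return 0, 0, false
		}
	}
	return samples[0].Value.Float64(), samples[1].Value.Float64(), true
}

func main() {
	assist, total, ok := readGCCPU()
	if !ok {
		fmt.Println("GC CPU metrics not available on this Go version")
		return
	}
	fmt.Printf("mark assist: %.4fs of %.4fs total GC CPU\n", assist, total)
}
```

Both values are cumulative, so a monitoring loop would normally export the difference between two readings rather than the raw numbers.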

Heap and object inspection

Heap snapshots showed large memory usage by gRPC connections. More critically, inuse_objects revealed that the LFU cache in the gcache library generated over 1 million objects, far exceeding the total object count of ~300 k, suggesting a memory‑leak scenario.
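A heap profile of the kind inspected here can be produced with the standard runtime/pprof package. This is a minimal sketch: heapProfileBytes is a hypothetical helper, and the output file name is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
)

// heapProfileBytes forces a GC so the statistics are current, then returns
// the heap profile in the gzip-compressed format that "go tool pprof" reads.
func heapProfileBytes() ([]byte, error) {
	runtime.GC()
	var buf bytes.Buffer
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := heapProfileBytes()
	if err != nil {
		fmt.Println("heap profile failed:", err)
		return
	}
	// File name is illustrative.
	if err := os.WriteFile("heap.pb.gz", data, 0o644); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Printf("wrote %d bytes to heap.pb.gz\n", len(data))
}
```

The file can then be opened with go tool pprof -sample_index=inuse_objects heap.pb.gz, which ranks call sites by live object count instead of bytes. Object count is the view that matters for GC marking cost, since many small objects can be expensive to scan even when their total size is modest.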

Root Cause

The third‑party gcache library (LFU strategy) has a known issue (see https://github.com/bluele/gcache/issues/71) where the Get method retains pointers, leaking about 100 MB (≈2.5 % of total memory). Upgrading gcache from v0.0.1 to v0.0.2 eliminated the leak and restored service availability.
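To make this class of leak concrete, here is a deliberately simplified frequency list in which every read moves a key to the next frequency bucket. It is a generic illustration written for this article, not gcache's code, and the article does not establish that the gcache bug has this exact shape. All type and function names are hypothetical. With pruning disabled, emptied buckets stay linked and the structure grows with the number of reads rather than the number of keys.

```go
package main

import (
	"container/list"
	"fmt"
)

// bucket groups the keys that share one access frequency.
type bucket struct {
	freq int
	keys map[string]struct{}
}

// toyLFU is a toy frequency list, not a real cache.
type toyLFU struct {
	buckets *list.List               // ordered by ascending freq
	where   map[string]*list.Element // key -> its current bucket
	prune   bool                     // remove emptied buckets when true
}

func newToyLFU(prune bool) *toyLFU {
	return &toyLFU{buckets: list.New(), where: map[string]*list.Element{}, prune: prune}
}

// Set adds a new key at frequency 0.
func (c *toyLFU) Set(key string) {
	if _, ok := c.where[key]; ok {
		return
	}
	front := c.buckets.Front()
	if front == nil || front.Value.(*bucket).freq != 0 {
		front = c.buckets.PushFront(&bucket{freq: 0, keys: map[string]struct{}{}})
	}
	front.Value.(*bucket).keys[key] = struct{}{}
	c.where[key] = front
}

// Get moves the key to the bucket for freq+1, creating it if needed.
func (c *toyLFU) Get(key string) bool {
	cur, ok := c.where[key]
	if !ok {
		return false
	}
	b := cur.Value.(*bucket)
	next := cur.Next()
	if next == nil || next.Value.(*bucket).freq != b.freq+1 {
		next = c.buckets.InsertAfter(&bucket{freq: b.freq + 1, keys: map[string]struct{}{}}, cur)
	}
	delete(b.keys, key)
	next.Value.(*bucket).keys[key] = struct{}{}
	c.where[key] = next
	if c.prune && len(b.keys) == 0 {
		c.buckets.Remove(cur) // without this, empty buckets accumulate
	}
	return true
}

// bucketsAfter stores one key, reads it gets times, and reports how many
// buckets remain linked.
func bucketsAfter(prune bool, gets int) int {
	c := newToyLFU(prune)
	c.Set("k")
	for i := 0; i < gets; i++ {
		c.Get("k")
	}
	return c.buckets.Len()
}

func main() {
	fmt.Println("buckets without pruning:", bucketsAfter(false, 1000))
	fmt.Println("buckets with pruning:   ", bucketsAfter(true, 1000))
}
```

A pattern like this would explain how a cache's internal object count can exceed the number of entries it holds, and why the symptom appears in inuse_objects before it appears in total bytes.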

Remediation Steps

Update the dependency, e.g., go get github.com/bluele/gcache@v0.0.2, or edit go.mod accordingly.

Re‑run the profiling suite (pprof, GODEBUG, fgprof, Go trace, strace) to verify that GC pause times and object counts return to normal.

Continuously monitor inuse_objects and GC pause metrics after deployment.
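For the monitoring step, live object counts and stop-the-world pause totals are available from runtime.MemStats. The sketch below takes two snapshots and derives the average pause per GC cycle between them. gcSnapshot, takeSnapshot and avgPauseSince are hypothetical names, and how the values get exported to a metrics system is left open.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// gcSnapshot holds the few MemStats fields relevant to this incident.
type gcSnapshot struct {
	HeapObjects uint64        // live heap objects, comparable to inuse_objects
	NumGC       uint32        // completed GC cycles
	PauseTotal  time.Duration // cumulative stop-the-world pause time
}

// takeSnapshot reads the current runtime statistics. ReadMemStats briefly
// stops the world, so it should be called on a timer, not per request.
func takeSnapshot() gcSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return gcSnapshot{
		HeapObjects: m.HeapObjects,
		NumGC:       m.NumGC,
		PauseTotal:  time.Duration(m.PauseTotalNs),
	}
}

// avgPauseSince returns the mean stop-the-world pause per GC cycle between
// two snapshots, or 0 if no cycle completed in between.
func avgPauseSince(prev, cur gcSnapshot) time.Duration {
	cycles := cur.NumGC - prev.NumGC
	if cycles == 0 {
		return 0
	}
	return (cur.PauseTotal - prev.PauseTotal) / time.Duration(cycles)
}

func main() {
	before := takeSnapshot()
	runtime.GC() // force at least one cycle for the demo
	after := takeSnapshot()
	fmt.Println("live heap objects:", after.HeapObjects)
	fmt.Println("avg STW pause per cycle:", avgPauseSince(before, after))
}
```

Note that PauseTotalNs covers only stop-the-world pauses. Time spent in the concurrent mark phase and in mark assists does not appear there, so object count is the more direct signal for a leak of this kind.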

Key Takeaways

Third‑party libraries can contain subtle bugs that impact production stability; enforce strict vetting, version monitoring, and automated issue tracking.

Comprehensive runtime observability (pprof, GODEBUG, fgprof, Go trace, strace) is essential for diagnosing transient latency spikes in high‑QPS services.

Automated snapshot collection based on abnormal metrics (e.g., MOSN Holmes) can capture evidence before it disappears, aiding rapid root‑cause analysis.
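The idea behind the last takeaway can be sketched with the standard library alone. The example below does not use Holmes or its API; it only illustrates the trigger logic, with hypothetical names (shouldDump, watchOnce) and an assumed rule of "dump when live objects exceed a baseline by more than a given ratio".

```go
package main

import (
	"fmt"
	"io"
	"runtime"
	"runtime/pprof"
)

// shouldDump reports whether current exceeds baseline by more than ratio
// (0.25 means 25% above baseline). A zero baseline never triggers.
func shouldDump(current, baseline uint64, ratio float64) bool {
	if baseline == 0 {
		return false
	}
	return float64(current) > float64(baseline)*(1+ratio)
}

// watchOnce checks the live object count against the baseline and, if the
// threshold is crossed, writes a heap profile to w. It reports whether a
// profile was written.
func watchOnce(baseline uint64, ratio float64, w io.Writer) (bool, error) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if !shouldDump(m.HeapObjects, baseline, ratio) {
		return false, nil
	}
	return true, pprof.Lookup("heap").WriteTo(w, 0)
}

func main() {
	// A baseline of 1 guarantees the demo triggers; a real baseline would be
	// a moving average of recent readings.
	dumped, err := watchOnce(1, 0.25, io.Discard)
	fmt.Println("dumped:", dumped, "err:", err)
}
```

In practice watchOnce would run on a ticker, write to a timestamped file, and enforce a cool-down so that a sustained spike does not produce a profile on every tick.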

References

Analyzing Go Off‑CPU Performance – https://colobu.com/2020/11/12/analyze-On-CPU-in-go/

Holmes design overview – https://mosn.io/blog/posts/mosn-holmes-design/

Holmes GitHub – https://github.com/mosn/holmes

gcache GitHub – https://github.com/bluele/gcache

Deep dive into Go pprof – https://www.cnblogs.com/qcrao-2018/p/11832732.html
