
Investigation and Resolution of Service Availability Fluctuations in a High‑QPS Go Backend Service

An investigation of a 100k‑QPS Go monolith revealed that intermittent availability drops were caused by a memory leak in the third‑party gcache LFU implementation, which inflated GC work and produced long concurrent mark phases; upgrading gcache eliminated the leak and restored 0.999+ availability, underscoring the need for thorough observability and dependency monitoring.

Bilibili Tech

Background: A high‑traffic service (over 100,000 QPS) experienced occasional availability drops below the 0.999 threshold, causing timeouts on a critical reservation API during peak periods. The service is a large monolith that depends on many components, including Redis, MySQL, Elasticsearch, MongoDB, and the third‑party gcache caching library.

Initial hypotheses included business‑level performance issues, abnormal traffic, CPU throttling, middleware problems, Go scheduler or GC pauses, and system calls. Traffic analysis showed regular patterns, ruling out abnormal traffic spikes.

Trace analysis revealed that both MySQL and Redis occasionally timed out, but their query latencies (0.01 ms for MySQL, ~21 ms for Redis) were far below the 250 ms request quota, suggesting the bottleneck lay elsewhere.

Investigation steps:

Collected pprof profiles to examine CPU and memory usage.

Enabled GODEBUG=gctrace=1 to capture GC traces.

Added fgprof to detect off‑CPU activity.

Captured Go traces to observe scheduler behavior.

Used Linux strace to monitor system calls.

Findings:

GC trace showed a long concurrent mark and scan phase (~860 ms), which could consume CPU and delay goroutine scheduling.

strace and fgprof did not reveal abnormal system calls or off‑CPU blocking.

Repeated Go trace captures consistently displayed MARK ASSIST events, meaning user goroutines were being drafted into GC marking work — a sign that allocation was outpacing the collector and pointing to a GC‑related issue.

Heap analysis during low‑traffic periods showed that the gcache LFU implementation created over 1 million objects, far exceeding the expected count and consuming significant memory.
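The million-object finding came from heap profiles, but the same signal is available cheaply at runtime: `HeapObjects` in `runtime.MemStats` counts live heap objects, and a count that only grows across low-traffic windows is a leak tell. A minimal sketch (the helper name is illustrative):

```go
package main

import (
	"fmt"
	"runtime"
)

// heapObjects returns the number of live objects on the heap — a cheap
// gauge to log or export periodically. Monotonic growth across idle
// periods would have flagged the gcache LFU leak much earlier.
func heapObjects() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapObjects
}

func main() {
	fmt.Println("live heap objects:", heapObjects())
}
```

Note that `ReadMemStats` briefly stops the world, so on a 100k-QPS service it should be sampled on a coarse interval rather than per request.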

Root cause: A memory leak in the LFU strategy of the third‑party gcache library (github.com/bluele/gcache) that generated excessive pointer objects, leading to increased GC work and occasional latency spikes.

Resolution: Upgraded gcache from version v0.0.1 to v0.0.2, which fixes the LFU memory‑leak bug. After the upgrade, the service’s availability returned to the expected level.

Key takeaways:

Third‑party dependencies can introduce subtle bugs; establish standards and monitoring for library updates.

Comprehensive observability (system, component, and Go runtime metrics) and automated profiling (e.g., pprof, Holmes) are essential for rapid incident response.

References include articles on Go off‑CPU analysis, Holmes design, and deep dives into Go pprof.

Tags: Go, Garbage Collection, Tracing, gcache, Performance Debugging, Service Availability
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.