
Root Cause Analysis and Resolution of Service Availability Fluctuations in a High‑QPS Go Backend

This article walks through the systematic investigation of intermittent availability drops in a high‑throughput Go service: forming hypotheses, profiling with pprof, gctrace, strace, fgprof, and go tool trace, analyzing the heap, discovering a memory leak in the gcache LFU cache, and applying the fix.

Architect

Background : A high‑traffic service handling over 100k QPS saw its availability occasionally dip below 0.999 (99.9%), with a critical reservation API timing out during peak periods.

Initial Hypotheses : The candidate causes were slow business logic, abnormal traffic, CPU throttling, middleware (Redis, MySQL), the Go scheduler, and GC.

Investigation Steps :

Collected pprof data (CPU and memory) to check for CPU hotspots and GC anomalies.

Enabled GODEBUG=gctrace=1 to capture GC traces.

Added fgprof to detect off‑CPU time such as I/O, locks, or paging.

Captured Go execution traces (viewed with go tool trace) to examine scheduler behavior, GC, and goroutine preemption.

Used strace on the process to inspect system calls.
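As a rough illustration of the first and fourth steps, the sketch below shows one way a Go service could expose the standard pprof handlers and capture a short execution trace using only the standard library. The names `newDebugMux`, `captureTrace`, and `traceSize`, and the `localhost:6060` address, are assumptions made for this example and are not taken from the original service. fgprof is a third‑party package and is left out so that the block stays self‑contained.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/pprof"
	"runtime/trace"
	"time"
)

// newDebugMux registers the standard pprof handlers on a dedicated mux so
// they can be served on a private port instead of the public listener.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // index plus named profiles such as heap
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// captureTrace records an execution trace into w for duration d.
func captureTrace(w io.Writer, d time.Duration) error {
	if err := trace.Start(w); err != nil {
		return err
	}
	time.Sleep(d) // the service keeps handling requests during this window
	trace.Stop()
	return nil
}

// traceSize captures a trace in memory and reports how many bytes it produced.
func traceSize(d time.Duration) (int, error) {
	var buf bytes.Buffer
	if err := captureTrace(&buf, d); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

func main() {
	// In a real service the debug mux would be served for the process lifetime,
	// for example (address is an assumption):
	//   go http.ListenAndServe("localhost:6060", newDebugMux())
	n, err := traceSize(50 * time.Millisecond)
	if err != nil {
		fmt.Println("trace failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of trace data\n", n)
}
```

A trace written to a file this way can be opened with `go tool trace`, and the profiles served by the mux can be fetched with `go tool pprof`.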

Analysis Findings :

gctrace : The concurrent mark and sweep phases took ~860 ms, which was not clearly higher than during stable periods.

strace : No abnormal system calls were observed.

fgprof : No off‑CPU bottlenecks detected.

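Go trace : Repeated MARK ASSIST events appeared. Mark assists occur when allocating goroutines are made to help with GC marking because allocation is outpacing the background collector, so this pointed to significant GC pressure.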

Heap analysis : Objects created by the gcache LFU implementation numbered more than 1 million and accounted for a large share of in‑use objects and memory.
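The live object count behind the heap finding can also be watched from inside the process. The following is a minimal sketch, assuming a hypothetical `heapSnapshot` helper that reads `runtime.MemStats`; the `entry` type and `retain` function exist only to simulate a component that holds many small objects and do not come from the original service.

```go
package main

import (
	"fmt"
	"runtime"
)

// heapSnapshot keeps the few runtime.MemStats fields relevant to this kind
// of investigation.
type heapSnapshot struct {
	HeapObjects uint64 // number of allocated, not yet freed heap objects
	HeapInuse   uint64 // bytes in in-use heap spans
	NumGC       uint32 // completed GC cycles
}

// snapshot reads the current runtime memory statistics.
func snapshot() heapSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // briefly stops the world, so call it sparingly
	return heapSnapshot{HeapObjects: m.HeapObjects, HeapInuse: m.HeapInuse, NumGC: m.NumGC}
}

// entry stands in for a small cache bookkeeping object (hypothetical).
type entry struct{ payload [64]byte }

// retain allocates n entries and returns them so they stay reachable.
func retain(n int) []*entry {
	out := make([]*entry, n)
	for i := range out {
		out[i] = &entry{}
	}
	return out
}

func main() {
	before := snapshot()
	kept := retain(200_000)
	after := snapshot()

	fmt.Printf("heap objects: %d -> %d\n", before.HeapObjects, after.HeapObjects)
	fmt.Printf("heap in use:  %d -> %d bytes\n", before.HeapInuse, after.HeapInuse)
	fmt.Printf("GC cycles:    %d -> %d\n", before.NumGC, after.NumGC)
	runtime.KeepAlive(kept) // keep the entries live until after the second snapshot
}
```

Exporting numbers like these as metrics makes a steadily growing object count visible before it turns into GC pressure. A heap profile viewed with `go tool pprof -sample_index=inuse_objects` then shows which call sites own the objects.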

Root Cause : A memory leak in the gcache library’s LFU Get method (see https://github.com/bluele/gcache/issues/71) caused excessive object allocation, leading to GC pressure and request timeouts.
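To illustrate how a read path alone can leak memory in a frequency‑bucket LFU, here is a deliberately simplified toy. It is not gcache's code, and the linked issue remains the authority on the actual defect. Every name here (`lfu`, `freqBucket`, `pruneEmpty`) is hypothetical. The assumption being illustrated is that each Get moves a key into the bucket for the next access count, and that emptied buckets are never unlinked.

```go
package main

import (
	"container/list"
	"fmt"
)

// freqBucket groups the keys that share one access count.
type freqBucket struct {
	freq int
	keys map[string]struct{}
}

// lfu is a toy frequency index: no values, no eviction, not concurrency-safe.
type lfu struct {
	buckets    *list.List               // buckets in ascending freq order
	where      map[string]*list.Element // key -> bucket currently holding it
	pruneEmpty bool                     // unlink buckets once they become empty
}

func newLFU(pruneEmpty bool) *lfu {
	c := &lfu{buckets: list.New(), where: map[string]*list.Element{}, pruneEmpty: pruneEmpty}
	c.buckets.PushFront(&freqBucket{freq: 0, keys: map[string]struct{}{}})
	return c
}

// Set adds a new key to the permanent freq-0 bucket.
func (c *lfu) Set(key string) {
	if _, ok := c.where[key]; ok {
		return
	}
	front := c.buckets.Front()
	front.Value.(*freqBucket).keys[key] = struct{}{}
	c.where[key] = front
}

// Get records one access by moving the key to the freq+1 bucket.
func (c *lfu) Get(key string) bool {
	cur, ok := c.where[key]
	if !ok {
		return false
	}
	b := cur.Value.(*freqBucket)
	next := cur.Next()
	if next == nil || next.Value.(*freqBucket).freq != b.freq+1 {
		nb := &freqBucket{freq: b.freq + 1, keys: map[string]struct{}{}}
		next = c.buckets.InsertAfter(nb, cur)
	}
	delete(b.keys, key)
	next.Value.(*freqBucket).keys[key] = struct{}{}
	c.where[key] = next
	// Without this step every access to a hot key leaves an empty bucket behind.
	if c.pruneEmpty && b.freq != 0 && len(b.keys) == 0 {
		c.buckets.Remove(cur)
	}
	return true
}

func main() {
	leaky, fixed := newLFU(false), newLFU(true)
	leaky.Set("hot")
	fixed.Set("hot")
	for i := 0; i < 1000; i++ {
		leaky.Get("hot")
		fixed.Get("hot")
	}
	fmt.Println("buckets without pruning:", leaky.buckets.Len()) // prints 1001
	fmt.Println("buckets with pruning:   ", fixed.buckets.Len()) // prints 2
}
```

In this toy, a single hot key read 1000 times leaves 1001 buckets behind when empty buckets are kept, so the bookkeeping grows with read traffic rather than with cache size. That growth pattern is what makes a leak like this hard to spot from cache capacity settings alone.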

Resolution : Upgraded gcache from v0.0.1 to v0.0.2, eliminating the LFU memory leak and restoring service availability.

Summary & Lessons :

Bugs in third‑party libraries can critically affect system stability, so keep track of upstream issues and releases for the dependencies you ship.

Comprehensive observability (system, component, and Go runtime metrics) is essential for rapid root‑cause analysis.

Automated profiling (e.g., periodic pprof/trace collection) helps capture transient incidents.
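One possible shape for the automated collection mentioned in the last lesson is a background loop that writes a heap profile at a fixed interval. This is a sketch under stated assumptions: the function names, the file naming scheme, and the intervals are invented for illustration, and a production version would also need retention limits and probably a trigger based on latency or memory thresholds.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
	"time"
)

// dumpHeapProfile writes one heap profile into dir, named by timestamp.
func dumpHeapProfile(dir string, now time.Time) (string, error) {
	path := filepath.Join(dir, fmt.Sprintf("heap-%d.pb.gz", now.UnixNano()))
	f, err := os.Create(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	if err := pprof.WriteHeapProfile(f); err != nil {
		return "", err
	}
	return path, nil
}

// collectPeriodically dumps a heap profile every interval until ctx is done.
func collectPeriodically(ctx context.Context, dir string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-ticker.C:
			if _, err := dumpHeapProfile(dir, now); err != nil {
				fmt.Fprintln(os.Stderr, "heap profile failed:", err)
			}
		}
	}
}

// runDemo collects profiles into a temporary directory for the given total
// duration and returns how many files were written.
func runDemo(total, interval time.Duration) (int, error) {
	dir, err := os.MkdirTemp("", "profiles")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	ctx, cancel := context.WithTimeout(context.Background(), total)
	defer cancel()
	collectPeriodically(ctx, dir, interval)

	files, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	return len(files), nil
}

func main() {
	n, err := runDemo(200*time.Millisecond, 25*time.Millisecond)
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Printf("collected %d heap profiles\n", n)
}
```

In a long‑running service, `collectPeriodically` would run in its own goroutine with a much longer interval, so that a profile from shortly before a transient incident is already on disk when the investigation starts.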
