Online Service Alarm Handling and Performance Profiling in Go
This article presents a systematic, SOP‑driven approach to diagnosing online service alarms and performance issues in Go. It walks through a toolbox that includes pprof, trace, goroutine visualizers, perf, and eBPF, and recommends application‑level optimizations, system tuning, and continuous profiling to speed up root‑cause identification and reduce incident frequency.
Background : When an online service triggers an alarm or exhibits mysterious performance issues, systematic diagnosis is essential. This article shares a practical methodology and toolchain for rapid root‑cause identification.
Alarm Investigation Process : Establish a standard SOP to break down incidents, communicate involvement early, and prioritize quick mitigation (restart, rollback) while gathering service ownership and resource metrics.
SOP Documentation : A set of SOPs covering service call exceptions, latency spikes, circuit‑breaker issues, MySQL/Redis latency, CPU/memory anomalies, traffic surges, and common business problems. Each SOP includes owners, tool links, and a “no‑search, no‑ask” principle.
Performance Diagnosis Toolbox :
pprof – Go’s primary CPU and memory profiler. Use runtime/pprof for embedded services or net/http/pprof for HTTP endpoints. Examine cumulative (cum) and flat costs to locate hot functions.
trace – Capture runtime events (goroutine scheduling, GC pauses) via curl "host/debug/pprof/trace?seconds=10" > trace.out (quoting the URL so the shell does not interpret the query string) and analyze with go tool trace trace.out.
Goroutine visualization – Tools like divan/gotrace render execution graphs.
perf – System‑level profiling when pprof fails, showing symbol‑level hotspots.
eBPF – Dynamic, non‑intrusive tracing for kernel‑level insights; useful when Go‑level tools are insufficient.
Example Go code used with eBPF:
package main

import "fmt"

func main() {
	fmt.Println("Hello, BPF!")
}
# funclatency 'go:fmt.Println'
Tracing 1 functions for "go:fmt.Println"... Hit Ctrl-C to end.
^C
Function = fmt.Println [3041]
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 0 | |
8192 -> 16383 : 27 |****************************************|
16384 -> 32767 : 3 |**** |
Detaching...

Optimization Strategies :
Application Layer : Pool resources (sync.Pool), tighten lock scopes, replace heavy JSON libraries, follow fasthttp best practices.
System Layer : Upgrade Go version, tune OS parameters (swap, NUMA), refer to Red Hat tuning guides.
Continuous Profiling : Periodic pprof collection (cron) with archiving enables time‑range analysis and diffing. Tools like Conprof provide a UI for this workflow.
eBPF + Go : Leverage eBPF for low‑overhead tracing of function latency and call stacks when traditional profilers are unavailable.
Conclusion : Effective incident handling combines SOPs, robust tooling, and disciplined benchmarking. Continuous profiling and proactive code reviews further reduce incident frequency.
DeWu Technology