Why Is My Go Health‑Check So Slow? Diagnosing TCP Latency and GC Overhead
This article investigates why a Go‑based service health‑check system experiences high latency, examines differences from Nginx checks, runs experiments on physical machines and Docker, and explores goroutine scheduling, GOMAXPROCS, and garbage‑collection tuning to reduce average response time from 40 ms to under 10 ms.
Background
The health‑check system periodically sends TCP connection requests to target servers and removes a target from the registry after a certain number of consecutive failures. The observed latency ranged from a few milliseconds to several hundred milliseconds, averaging over 40 ms, which is unusually high for an internal TCP handshake.
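For context, here is a minimal sketch of this kind of probe; the address, interval, and failure threshold are illustrative, not the production values:

```go
package main

import (
	"log"
	"net"
	"time"
)

// probeOnce measures how long the TCP handshake to a single target takes.
func probeOnce(addr string, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return 0, err
	}
	conn.Close()
	return time.Since(start), nil
}

func main() {
	const maxFailures = 3 // consecutive failures before removal (illustrative)
	failures := 0
	for range time.Tick(5 * time.Second) { // check interval (illustrative)
		latency, err := probeOnce("10.0.0.1:80", time.Second)
		if err != nil {
			failures++
			if failures >= maxFailures {
				log.Printf("target unhealthy after %d consecutive failures; would be removed", failures)
			}
			continue
		}
		failures = 0
		log.Printf("handshake latency: %v", latency)
	}
}
```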
Why It Matters
Many services that currently rely on Nginx's built‑in active health checks need to migrate to this system. With Nginx's default check timeout of 50‑100 ms, the higher latency would routinely exceed the limit and cause healthy targets to be falsely removed. Simply raising the timeout is not a viable fix: a looser threshold could let real failures hide behind uneven load or node jitter.
Monitoring
All standard metrics (CPU, memory, disk, network) appeared normal; only the health‑check latency was abnormal.
Differences from Nginx
Nginx is written in C; our program is written in Go.
Nginx runs on bare metal; our program runs inside Docker containers.
Nginx checks a relatively small number of services, whereas our program may need to probe tens of thousands of targets per node.
Experiments
Two small experiments were conducted:
1. Deploying the health check on a physical machine and probing the same targets it checks from Docker: the physical machine saw only a few milliseconds of latency.
2. Deploying the service on another Docker host: latency was similarly low.
These results suggest the latency is tied to the scale of concurrent checks rather than to Docker itself, and that the Go implementation can, in principle, match Nginx's performance.
Suspecting Goroutine Scheduling
Each target check spawns its own goroutine, so the sheer number of goroutines might be causing scheduling overhead. Using the Go execution tracer (available since Go 1.5, exposed through the net/http/pprof endpoints), we collected 30 seconds of scheduling data:
```
curl -o trace.dump 'http://127.0.0.1:8600/debug/pprof/trace?seconds=30'
```

and then viewed it with:

```
go tool trace trace.dump
```

The trace showed that a single goroutine spent about 300 ms on scheduling over the 30‑second window.
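For reference, the /debug/pprof/trace endpoint used above becomes available once net/http/pprof is blank‑imported and an HTTP server is running; a minimal sketch (the port matches the curl command above, the rest is illustrative):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/*, including /debug/pprof/trace
)

func main() {
	// Serve the pprof endpoints on the port the curl command targets.
	go func() {
		log.Println(http.ListenAndServe("127.0.0.1:8600", nil))
	}()

	select {} // the health-check logic would run here instead of blocking forever
}
```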
GOMAXPROCS Issue
Inside containers, the Go runtime derives its processor count from the CPUs visible to the process, which reflects the host's core count rather than the container's cgroup CPU quota. The surplus of runtime processors leads to extra find‑runnable work and thread context switches.
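A quick way to see the mismatch is to print both values from inside a quota‑limited container; on, say, a 48‑core host with docker run --cpus=2, both still report 48:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Under a CFS quota (e.g. docker run --cpus=2), both values still report
	// the host's core count, because the runtime does not read cgroup limits.
	fmt.Println("NumCPU:    ", runtime.NumCPU())
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0)) // 0 reads the value without changing it
}
```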
The uber-go/automaxprocs library automatically sets GOMAXPROCS to match the container's CPU quota. After applying it, however, latency did not change noticeably, so the processor count was not the dominant factor here.
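Wiring it in is a one‑line blank import:

```go
package main

import (
	"fmt"
	"runtime"

	// The blank import runs automaxprocs' init(), which sets GOMAXPROCS to
	// match the container's CPU quota at startup.
	_ "go.uber.org/automaxprocs"
)

func main() {
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}
```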
Suspecting Garbage Collection
High check volume also drives up memory allocation and, with it, GC pressure. Tracing the goroutine that establishes connections revealed that GC pauses dominated its timeline, at times accounting for nearly all of its blocked time.
Two main contributors were identified:
Debug logging.
Metric reporting.
Disabling debug logs reduced average latency from 40 ms to 30 ms.
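The article's logging code is not shown; as a general sketch of why guarding (or disabling) debug logs relieves GC pressure, using the standard log/slog package:

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
	"os"
)

// logger is configured at Info level, so Debug output is suppressed.
var logger = slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{
	Level: slog.LevelInfo,
}))

func checkTarget(addr string, latencyMs float64) {
	// Guarding the call means the message is never formatted when debug
	// logging is off; the fmt.Sprintf allocation simply does not happen,
	// which is what matters at tens of thousands of checks per cycle.
	if logger.Enabled(context.Background(), slog.LevelDebug) {
		logger.Debug(fmt.Sprintf("probed %s in %.1fms", addr, latencyMs))
	}
}

func main() {
	checkTarget("10.0.0.1:80", 2.3)
}
```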
GC Parameter Tuning
The primary GC tuning knob in Go is GOGC, which sets how much the heap may grow relative to the live set before the next collection triggers. With the logging and metric‑reporting fixes in place, average latency stood at 10 ms; experiments with GOGC values from 100 to 1000 showed that GOGC=500 gave the best result, lowering it to 8 ms.
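GOGC can be set through the environment (GOGC=500 ./healthchecker, binary name illustrative) or programmatically at startup; a minimal sketch of the latter:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// SetGCPercent(500) lets the heap grow to ~5x the live set between
	// collections, trading memory for fewer GC cycles. It returns the
	// previous setting (100 by default, or whatever GOGC was set to).
	old := debug.SetGCPercent(500)
	fmt.Printf("GOGC: %d -> 500\n", old)
}
```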
Conclusion
Through a systematic investigation covering deployment environment (bare metal vs. container), goroutine scheduling, processor count, and GC behavior, the health‑check latency was reduced from an average of 40 ms to 8 ms, with the worst case dropping from 120 ms to 10 ms. Further gains are possible by scaling resources, but the current optimizations already meet the required performance.