
Detect and Fix Goroutine Leaks in Go with Context & pprof

This guide explains how Goroutine leaks cause hidden memory and CPU issues in long‑running Go health‑check tools, demonstrates how to reproduce the problem, shows step‑by‑step detection with pprof, and walks through a context‑based fix plus a production‑ready zero‑leak probe template with best‑practice code.

Does your health‑check tool's memory keep growing, and its CPU spike, after it has been running for a while? The culprit is likely Goroutine leaks. This article shows how to use context + pprof to pinpoint ghost Goroutines and provides a zero‑leak probe framework template.

Several Python‑oriented ops engineers have asked whether Go can replace Python in operations work. Whichever language you choose, the key is having reliable, efficient automation tools, and Go's high performance, strong concurrency, single‑binary deployment, and stability make it a popular choice for next‑generation ops tooling.

1. A Silent Failure: Your Tool Is Quietly Out of Control

Imagine you built a lightweight Go tool that periodically probes backend services. It runs thousands of times per day and works fine at first. After a few days, ops reports that a pod’s memory climbs from 50 MB to 1.2 GB and is OOM‑killed multiple times.

Inspecting the pod shows the process RSS skyrocketing right up to each OOM kill. The code looks simple, yet memory keeps leaking.

In Go, a Goroutine is a lightweight thread managed by the runtime. Because they are so cheap to start, developers often forget to manage their lifetimes. A Goroutine that blocks forever, for example while waiting on an HTTP request that never responds, becomes a “ghost” that never exits: it keeps holding its stack (at least 2 KB) and keeps every object it references alive. The GC cannot reclaim anything that is still reachable from a live Goroutine, so Goroutine leaks usually bring memory leaks with them.

2. Reproducing the Leak: A Bad Probe Example

The following code illustrates a typical leaky probe:

// leaky_probe.go - probe with leak risk
package main

import (
    "fmt"
    "net/http"
    "time"
)

func badProbe(target string) {
    // Issue 1: No timeout – if target never responds, the goroutine blocks forever
    resp, err := http.Get(target)
    if err != nil {
        fmt.Printf("Error probing %s: %v
", target, err)
        return
    }
    defer resp.Body.Close()
    // Issue 2: Body not read – connection cannot be reused, leading to fd leak
}

func main() {
    targets := []string{"http://slow-or-down-service-1", "http://another-unreachable-service"}
    for _, target := range targets {
        go badProbe(target) // each call may spawn a zombie goroutine
    }
    // Main goroutine sleeps then exits, but child goroutines may still be blocked
    time.Sleep(10 * time.Second)
}

The code has three fatal flaws:

No timeout mechanism – http.Get blocks indefinitely if the target does not respond.

Response body not consumed – without reading resp.Body, the underlying TCP connection cannot be returned to the pool, eventually exhausting file descriptors.

No context control – the main program cannot signal child tasks to stop, leading to uncontrolled concurrency.

When such code runs at high frequency (e.g., probing 100 services per second), thousands of blocked Goroutines can appear within minutes, causing rapid memory and connection‑count growth.
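If you want to reproduce the leak locally without a genuinely broken backend, a minimal sketch like the following pairs the leaky pattern with a test server that never answers (the hung handler and the count of 100 probes are illustrative assumptions, not part of the original probe code):

// leak_repro.go - drive leaky probes against a local server that never responds
package main

import (
    "fmt"
    "net/http"
    "net/http/httptest"
    "runtime"
    "time"
)

func main() {
    // A local backend whose handler blocks forever, simulating a hung service.
    hung := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        select {} // never returns, so a client without a timeout waits forever
    }))
    // hung.Close() is deliberately skipped: it would block on the stuck handlers.

    // Fire 100 leaky probes: no timeout, no context, body never read.
    for i := 0; i < 100; i++ {
        go func() {
            http.Get(hung.URL) // blocks indefinitely waiting for response headers
        }()
    }

    time.Sleep(2 * time.Second)
    fmt.Println("goroutines:", runtime.NumGoroutine()) // well above 100, and never shrinking
}

Running it and watching the count makes the accumulation obvious: every blocked probe, plus the connection goroutines behind it, simply piles up.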

3. Detecting Leaks with pprof

Go’s built‑in pprof tool is ideal for diagnosing this problem. Add the import:

import _ "net/http/pprof"

Start a minimal HTTP server inside the tool (listening on localhost is enough):

go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()
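The two fragments above will not compile on their own; a complete minimal wiring looks roughly like this (port 6060 follows the article's example, and the blank select stands in for the tool's real work):

// pprof_wiring.go - minimal pprof endpoint for a long-running tool
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    // Serve pprof on a loopback port, separate from any business traffic.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    select {} // placeholder: the tool's periodic probing would run here
}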

When you suspect a leak, capture a Goroutine snapshot:

# Grab all goroutine stacks
go tool pprof http://localhost:6060/debug/pprof/goroutine

# In the pprof interactive console
(pprof) top
(pprof) list badProbe

The output will show a large number of Goroutines stuck in functions such as net/http.(*persistConn).readLoop. Seeing hundreds of such entries confirms a leak and points to the offending function.
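pprof gives you the deep dive; a cheap continuous signal helps you notice the problem earlier. One hedged option, sketched below with an arbitrary interval and threshold, is to log runtime.NumGoroutine() periodically and alert when the count only ever grows:

// goroutine_watch.go - cheap leak signal: log the goroutine count periodically
package main

import (
    "log"
    "runtime"
    "time"
)

func watchGoroutines(interval time.Duration, warnAbove int) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for range ticker.C {
        if n := runtime.NumGoroutine(); n > warnAbove {
            log.Printf("WARNING: %d goroutines (threshold %d), possible leak", n, warnAbove)
        } else {
            log.Printf("goroutines: %d", n)
        }
    }
}

func main() {
    go watchGoroutines(30*time.Second, 500) // values are illustrative, tune for your tool
    select {}                               // placeholder for the tool's real work
}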

4. Fixing Leaks: Production‑Grade Probe Template

To eliminate the leak, combine four techniques: Context control, connection reuse, mandatory body consumption, and concurrent‑task management.

// robust_probe.go - zero‑leak production probe
package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "net/http"
    "time"
    "golang.org/x/sync/errgroup"
)

var httpClient = &http.Client{ // reuse a single client
    Timeout: 5 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     30 * time.Second,
    },
}

func robustProbe(ctx context.Context, target string) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
    if err != nil {
        return fmt.Errorf("create request: %w", err)
    }
    resp, err := httpClient.Do(req)
    if err != nil {
        return fmt.Errorf("probe %s failed: %w", target, err)
    }
    defer resp.Body.Close()
    // Must consume body even if we ignore the content
    _, _ = io.Copy(io.Discard, resp.Body)
    return nil
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    targets := []string{"http://service-a", "http://service-b", "http://unreachable-service"}
    g, ctx := errgroup.WithContext(ctx)
    for _, target := range targets {
        t := target // copy for the closure (needed before Go 1.22's per-iteration loop variables)
        g.Go(func() error { return robustProbe(ctx, t) })
    }
    if err := g.Wait(); err != nil {
        log.Printf("Some probes failed: %v", err)
    } else {
        log.Println("All probes completed.")
    }
}

This implementation provides four guarantees:

Context timeout and cancellation – all sub‑tasks share a root context; when it expires or is cancelled, every probe stops automatically.

HTTP client reuse – avoids creating a new connection for each request, saving resources.

Mandatory body consumption – ensures connections return to the pool, preventing file‑descriptor leaks.

errgroup for concurrency – collects errors centrally and simplifies synchronization compared to manual sync.WaitGroup.
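A real health‑check tool probes on a schedule rather than once. One way to apply the template in a long‑running process, assuming it sits in the same file as robustProbe above, is to give every probing round its own bounded context (the period and per‑round timeout are placeholder values):

// runProbeLoop runs a probing round every period, each round with its own timeout.
func runProbeLoop(targets []string, period, perRound time.Duration) {
    ticker := time.NewTicker(period)
    defer ticker.Stop()
    for range ticker.C {
        // A fresh context per round: a hung round can never outlive its budget.
        ctx, cancel := context.WithTimeout(context.Background(), perRound)
        g, ctx := errgroup.WithContext(ctx)
        for _, target := range targets {
            t := target
            g.Go(func() error { return robustProbe(ctx, t) })
        }
        if err := g.Wait(); err != nil {
            log.Printf("probe round finished with errors: %v", err)
        }
        cancel() // release the round's resources promptly (no defer inside the loop)
    }
}

Calling go runProbeLoop(targets, 30*time.Second, 10*time.Second) from main keeps the tool probing indefinitely while every round stays strictly bounded.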

5. Why Simpler Code Can Be More Dangerous

Short logic often skips timeout handling.

One‑off requests lead developers to ignore client reuse.

Internal tools may omit pprof monitoring, hiding leaks.

System stability is determined not by how complex the code is but by meticulous attention to detail: Goroutine lifetimes, connection management, and disciplined resource handling.

6. Actionable Steps You Can Take Today

Audit existing tools – search for anonymous go func() calls and verify they receive a context for cancellation.

Enable pprof – expose /debug/pprof in all long‑running Go services, guarded by an environment variable if needed (see the sketch after this list).

Adopt the template – extract the core logic from robust_probe.go into a reusable probe package for team‑wide use.
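For step 2, one way to gate the endpoint (the ENABLE_PPROF variable name and the loopback address are arbitrary choices, not a convention from this article):

// pprof_guard.go - expose /debug/pprof only when explicitly requested
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
    "os"
)

func maybeStartPprof() {
    if os.Getenv("ENABLE_PPROF") == "" {
        return // off by default in production
    }
    go func() {
        // Loopback only; reach it via kubectl port-forward or an SSH tunnel.
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}

func main() {
    maybeStartPprof()
    select {} // placeholder for the tool's real work
}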
