Investigation and Resolution of CPU Spike in a Kafka-Go Consumer Using pprof
Using Go’s pprof, the team traced a gradual CPU spike in a high‑throughput kafka‑go consumer to a saturated commit queue and repeatedly nested context values, which made every ctx.Done() and ctx.Value() lookup costly; removing the unnecessary trace‑id context injection (or deriving a fresh context each loop) resolved the issue and brought CPU usage under 2%.
The business team observed that a Kafka consumer component built on an internal framework gradually increased its CPU usage until the CPU was fully saturated, at which point the consumer could no longer process messages. The system handles roughly 7,000 messages per minute across 30 partitions, with six pods each running five consumer goroutines.
Initial monitoring showed that the consumer process alone was responsible for the CPU rise while memory usage remained stable. To pinpoint the cause, the team employed Go's built‑in pprof tool. They added a pprof HTTP listener to the test environment:
import (
	_ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux

	"fmt"
	"net/http"
	"os"
)

go func() {
	ip := "0.0.0.0:6060"
	if err := http.ListenAndServe(ip, nil); err != nil {
		fmt.Printf("start pprof failed on %s: %v\n", ip, err)
		os.Exit(1)
	}
}()

After collecting CPU and goroutine profiles (for example with go tool pprof against the /debug/pprof/profile endpoint), flame graphs revealed that most CPU time was spent in the CommitMessages and FetchMessage methods of the kafka-go library. Further tracing showed that goroutines were frequently blocked on ctx.Done(), indicating that the context objects were not being released promptly.
Source code inspection of kafka-go (v0.4.38) uncovered the following implementation of CommitMessages:
func (r *Reader) CommitMessages(ctx context.Context, msgs ...Message) error {
	if !r.useConsumerGroup() {
		return errOnlyAvailableWithGroup
	}

	var errch <-chan error
	creq := commitRequest{commits: makeCommits(msgs...)}

	if r.useSyncCommits() {
		ch := make(chan error, 1)
		errch, creq.errch = ch, ch
	}

	select {
	case r.commits <- creq:
	case <-ctx.Done():
		return ctx.Err()
	case <-r.stctx.Done():
		return io.ErrClosedPipe
	}

	if !r.useSyncCommits() {
		return nil
	}

	select {
	case <-ctx.Done():
		return ctx.Err()
	case err := <-errch:
		return err
	}
}

The analysis highlighted that the internal channel r.commits has a default capacity of 100 (controlled by config.QueueCapacity). At the observed rate of roughly 7,000 messages per minute, this queue quickly filled, so callers blocked in the select waiting on ctx.Done(). Combined with the deeply nested contexts described next, each evaluation of ctx.Done() was itself expensive, which sustained the CPU consumption.
Further investigation revealed that each message processing loop repeatedly wrapped the existing context with a new value for traceId :
ctx = context.WithValue(ctx, traceIdKey, traceValue)

This pattern created an ever-deeper chain of nested contexts. When the consumer later passed the same context to CommitMessages and FetchMessage, every call to ctx.Value() or ctx.Done() had to walk the entire chain, and the cost grew with each processed message, producing the observed CPU escalation.
To reproduce the issue, the following minimal program was provided:
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ch := make(chan int, 100)

	go func() {
		for {
			<-ch
			time.Sleep(10 * time.Millisecond) // simulate the business operation
		}
	}()

	ctx := context.Background()
	traceValue := 0
	for {
		traceValue++
		ctx = context.WithValue(ctx, "traceId", traceValue) // nest one more value layer each iteration
		select {
		case ch <- traceValue: // hand the value to the worker
		case <-ctx.Done():
			fmt.Println("ctx done")
			return
		}
	}
}

Running this program on Linux showed a slow but steady increase in CPU usage as the context chain deepened, matching the production symptom.
The fix was straightforward: either remove the unnecessary traceId injection or derive a fresh context from a stable base on each loop iteration instead of repeatedly re-wrapping the same variable. After applying the change, the consumer stabilized with CPU usage below 2% and no further message backlog.
Key takeaways include effective use of pprof for CPU/memory analysis, the importance of inspecting upstream open‑source code when troubleshooting, and careful management of context objects to avoid hidden performance penalties.
37 Interactive Technology Team
37 Interactive Technology Center