Investigation and Resolution of CPU Spike in a Kafka-Go Consumer Using pprof
Using Go’s pprof, the team traced a gradual CPU spike in a high‑throughput kafka‑go consumer to a saturated commit queue and repeatedly nested context values, which made every ctx.Done() and ctx.Value() lookup costly; removing the unnecessary trace‑id context injection (or deriving a fresh context each loop) resolved the issue and brought CPU usage under 2%.
The business team observed that a Kafka consumer component built on an internal framework gradually increased its CPU usage until the CPU was fully saturated, at which point the consumer could no longer process messages. The system handles roughly 7,000 messages per minute across 30 partitions, with six pods each running five consumer goroutines.
Initial monitoring showed that the consumer process alone was responsible for the CPU rise while memory usage remained stable. To pinpoint the cause, the team employed Go's built‑in pprof tool. They added a pprof HTTP listener to the test environment:
import (
	_ "net/http/pprof" // registers the /debug/pprof handlers on http.DefaultServeMux

	"fmt"
	"net/http"
	"os"
)

go func() {
	ip := "0.0.0.0:6060"
	if err := http.ListenAndServe(ip, nil); err != nil {
		fmt.Printf("start pprof failed on %s: %v\n", ip, err)
		os.Exit(1)
	}
}()

After collecting CPU and goroutine profiles (for example with go tool pprof against the /debug/pprof/profile endpoint), flame graphs revealed that most CPU time was spent in the CommitMessages and FetchMessage methods of the kafka-go library. Further tracing showed that goroutines were frequently blocked on ctx.Done(), indicating that the context objects were not being released promptly.
Source code inspection of kafka-go (v0.4.38) uncovered the following implementation of CommitMessages:
func (r *Reader) CommitMessages(ctx context.Context, msgs ...Message) error {
	if !r.useConsumerGroup() {
		return errOnlyAvailableWithGroup
	}

	var errch <-chan error
	creq := commitRequest{commits: makeCommits(msgs...)}

	if r.useSyncCommits() {
		ch := make(chan error, 1)
		errch, creq.errch = ch, ch
	}

	select {
	case r.commits <- creq:
	case <-ctx.Done():
		return ctx.Err()
	case <-r.stctx.Done():
		return io.ErrClosedPipe
	}

	if !r.useSyncCommits() {
		return nil
	}

	select {
	case <-ctx.Done():
		return ctx.Err()
	case err := <-errch:
		return err
	}
}

The analysis highlighted that the internal channel r.commits has a default capacity of 100 (controlled by config.QueueCapacity). At the observed rate of roughly 7,000 messages per minute, this queue quickly filled, so callers blocked in the select waiting on ctx.Done(). Combined with the deeply nested contexts described next, each evaluation of ctx.Done() was itself expensive, which sustained the CPU consumption.
Further investigation revealed that each message processing loop repeatedly wrapped the existing context with a new value for traceId :
ctx = context.WithValue(ctx, traceIdKey, traceValue)

This pattern created an ever-deeper chain of nested contexts. When the consumer later passed the same context to CommitMessages and FetchMessage, every call to ctx.Value() or ctx.Done() had to walk the entire chain, and the cost grew with each processed message, producing the observed CPU escalation.
To reproduce the issue, the following minimal program was provided:
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ch := make(chan int, 100)

	go func() {
		for {
			<-ch
			time.Sleep(10 * time.Millisecond) // simulate the business operation
		}
	}()

	ctx := context.Background()
	traceValue := 0
	for {
		traceValue++
		ctx = context.WithValue(ctx, "traceId", traceValue) // nest one more value layer each iteration
		select {
		case ch <- traceValue: // hand the value to the worker
		case <-ctx.Done():
			fmt.Println("ctx done")
			return
		}
	}
}

Running this program on Linux showed a slow but steady increase in CPU usage as the context chain deepened, matching the production symptom.
The fix was straightforward: either remove the unnecessary traceId injection or derive a fresh context from a stable base on each loop iteration instead of repeatedly re-wrapping the same variable. After applying the change, the consumer stabilized with CPU usage below 2% and no further message backlog.
Key takeaways include effective use of pprof for CPU/memory analysis, the importance of inspecting upstream open‑source code when troubleshooting, and careful management of context objects to avoid hidden performance penalties.
37 Interactive Technology Team
37 Interactive Technology Center