How to Diagnose Uneven CPU Usage in Java Services Using Kafka

This article walks through the symptoms, root cause analysis, and step‑by‑step solutions for uneven CPU usage across Java service instances, highlighting how mismatched Kafka partition counts and thread or GC issues can lead to load imbalance and how to resolve them.

Java Baker

Phenomenon

Multiple service instances exhibit uneven CPU usage: some consistently high, others low; restarting an instance does not immediately normalize the distribution.

Root Cause Identification

Comparing abnormal and normal instances shows identical interface QPS but divergent Kafka consumption QPS: high‑CPU instances consume Kafka messages at a high rate, while low‑CPU instances consume nothing at all.

Each Kafka topic is divided into partitions, and within a consumer group a partition can be consumed by only one instance at a time. When the number of service instances exceeds the number of partitions, only a subset of instances is assigned partitions; those instances process all the messages and run at high CPU, while instances without an assigned partition stay idle. When the topic has no new messages to consume, CPU usage evens out across instances.
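The arithmetic behind this can be sketched in a few lines. The class below is a simplified model, not Kafka's assignor code: it deals partitions out round-robin for a single topic and counts the instances left with nothing. The class and method names are hypothetical, and the 10 and 100 figures are taken from the deployment history in this article.

```java
/** Simplified model of how one topic's partitions are spread over a consumer group. */
final class PartitionAssignmentSketch {

    /**
     * Returns how many partitions each instance owns when partitions are dealt
     * out round-robin. Real assignors differ in which partition goes where, but
     * none of them gives one partition to two members of the same group.
     */
    static int[] partitionsPerInstance(int partitions, int instances) {
        int[] owned = new int[instances];
        for (int p = 0; p < partitions; p++) {
            owned[p % instances]++;
        }
        return owned;
    }

    /** Counts instances that received no partition and therefore consume nothing. */
    static int idleInstances(int partitions, int instances) {
        int idle = 0;
        for (int count : partitionsPerInstance(partitions, instances)) {
            if (count == 0) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        System.out.println(idleInstances(10, 10));   // prints 0
        System.out.println(idleInstances(10, 100));  // prints 90
    }
}
```

With 10 partitions and 100 instances, 90 instances have no consumption work, which matches the pattern of a few hot instances and many idle ones.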

Historical Changes

Initial deployment: low interface QPS, ~10 instances.

Added a scheduled consumption task; created the Kafka topic with partitions equal to the instance count (10), so imbalance was not observable.

Traffic growth led to scaling the service to ~100 instances, exposing the partition‑to‑instance mismatch.

Solutions

Quick Fix – Increase Kafka Partitions

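Adding partitions lets more instances receive work, which removes the immediate CPU imbalance. The change only affects newly produced messages; messages already written stay in their original partitions. For keyed messages, the key‑to‑partition mapping also changes once partitions are added, so per‑key ordering is not preserved across the change. The partition count does not scale automatically with the instance count, so the imbalance returns if the service is scaled out further.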
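One side effect worth checking before adding partitions is that keyed messages may start landing in a different partition than before. The sketch below illustrates the effect with a deliberately simplified partitioner (String.hashCode modulo the partition count). This is an assumption for illustration only: Kafka's default partitioner uses a different hash function, but the modulo step is what makes the mapping depend on the partition count. All names here are hypothetical.

```java
import java.util.List;

/** Illustrates how a hash-modulo partitioner remaps keys when partitions are added. */
final class KeyRemapSketch {

    /** Simplified stand-in for a hash partitioner: non-negative hash modulo partition count. */
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /** Counts keys whose target partition changes when the partition count changes. */
    static long movedKeys(List<String> keys, int before, int after) {
        return keys.stream()
                .filter(k -> partitionFor(k, before) != partitionFor(k, after))
                .count();
    }

    public static void main(String[] args) {
        // Hypothetical keys; real keys would come from the producer's data.
        List<String> keys = List.of("order-1", "order-2", "order-3", "order-4");
        System.out.println("keys remapped from 10 to 100 partitions: "
                + movedKeys(keys, 10, 100));
    }
}
```

If consumers rely on per-key ordering, messages for a remapped key produced before and after the change sit in different partitions and can be processed out of order.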

Permanent Fix – Split Services

Separate the consumer‑task service from the business API service. Keep the consumer service at a relatively fixed instance count n and provision Kafka partitions as a multiple of n (e.g., 2n) so each consumer receives an even share of partitions. The API service can scale independently with traffic.

Kafka Core Concepts Recap

Topic and Partition

A topic is a logical queue divided into partitions, which are the smallest unit of storage and of parallel consumption. More partitions allow more consumers to work in parallel. Within a partition messages are strictly ordered; across partitions there is no ordering guarantee. Producers can attach a key to a message: the partitioner hashes the key, so messages with the same key go to the same partition.

Consumer Group and Rebalance

A consumer group consists of cooperating consumer instances. At any moment a partition can be consumed by only one instance in the group. When consumers join or leave, or when partition counts change, a rebalance redistributes partitions among the instances.

Offset and Poll‑Interval Risk

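Each consumer group tracks a committed offset per partition. If the time between two consecutive poll() calls exceeds max.poll.interval.ms (default 5 minutes), typically because processing the records returned by one poll takes too long, the consumer is removed from the group. This triggers a rebalance, which can interrupt consumption and degrade performance.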
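A rough way to check the poll‑interval risk is to compare the worst‑case time needed for one poll() batch with max.poll.interval.ms. The sketch below assumes records are processed sequentially on the polling thread and uses a hypothetical per‑record time of one second. The property names max.poll.records and max.poll.interval.ms are real Kafka consumer settings (defaults 500 and 300000); the class and method names are illustrative.

```java
import java.util.Properties;

/** Back-of-the-envelope check that one poll() batch fits inside the poll interval. */
final class PollBudgetSketch {

    /** Worst-case time to process one batch sequentially, in milliseconds. */
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    /** True when the worst-case batch finishes before the poll interval expires. */
    static boolean fitsPollInterval(int maxPollRecords, long perRecordMillis, long maxPollIntervalMs) {
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) < maxPollIntervalMs;
    }

    /** Builds the two consumer overrides; merge them into the full consumer configuration. */
    static Properties consumerOverrides(int maxPollRecords, long maxPollIntervalMs) {
        Properties props = new Properties();
        props.setProperty("max.poll.records", Integer.toString(maxPollRecords));
        props.setProperty("max.poll.interval.ms", Long.toString(maxPollIntervalMs));
        return props;
    }

    public static void main(String[] args) {
        // Defaults: 500 records per poll, 300000 ms between polls.
        System.out.println(fitsPollInterval(500, 1_000, 300_000)); // prints false
        // A smaller batch keeps the same workload inside the interval.
        System.out.println(fitsPollInterval(100, 1_000, 300_000)); // prints true
    }
}
```

Lowering max.poll.records or raising max.poll.interval.ms both widen the margin; which one fits better depends on how slow a single record can get.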

General Troubleshooting Flow for Uneven CPU Usage

Collect metrics for the same recent fixed window (e.g., the last 5 minutes) on both abnormal and normal instances and compare them; the metrics that differ point to the root cause.

Step 1 – Compare Request Volume

Verify that load is balanced and that requests truly reach each instance. Check interface QPS, consumption QPS, and look for hotspots. Typical alternative causes include producer key skew that concentrates messages in a single partition and upstream load‑balancer failures.
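Key skew can be spotted by comparing the busiest partition with the average. The sketch below assumes per‑partition message counts for the comparison window have already been collected from monitoring; the counts in main are made up for illustration, and the class name is hypothetical.

```java
/** Flags producer key skew from per-partition message counts. */
final class PartitionSkewSketch {

    /**
     * Ratio of the busiest partition to the mean. A value near 1.0 means an even
     * spread; a value close to the partition count means one partition takes
     * almost all the traffic.
     */
    static double skewRatio(long[] messagesPerPartition) {
        long max = 0;
        long total = 0;
        for (long count : messagesPerPartition) {
            max = Math.max(max, count);
            total += count;
        }
        if (total == 0) {
            return 1.0; // no traffic, nothing to compare
        }
        double mean = (double) total / messagesPerPartition.length;
        return max / mean;
    }

    public static void main(String[] args) {
        // Hypothetical counts for a four-partition topic.
        System.out.println(skewRatio(new long[] {100, 100, 100, 100}));
        System.out.println(skewRatio(new long[] {970, 10, 10, 10}));
    }
}
```

In the second case one partition carries 97% of the messages, so whichever instance owns it will run hot regardless of how many instances exist.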

Step 2 – Compare Thread Status

If request volume is equal but CPU differs, examine thread pools and blocking conditions:

Insufficient RPC, custom, or database connection thread pool sizes.

Blocking on downstream services or storage timeouts.

High contention on CAS optimistic locks causing spin loops.

Deadlocks.
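A quick way to compare thread status between two instances is to take the same snapshot on each. The sketch below uses the standard java.lang.management API to count threads by state and to ask the JVM for deadlocked threads; it is a minimal example, and in practice the same information is usually read from a thread dump or from existing monitoring. The class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

/** Takes a coarse snapshot of thread states inside the running JVM. */
final class ThreadStateSnapshot {

    /** Counts live threads by state (RUNNABLE, BLOCKED, WAITING, ...). */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) {
                continue; // thread ended between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    /** Number of threads the JVM reports as deadlocked; 0 when there are none. */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("threads by state: " + countByState());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

Many BLOCKED or WAITING threads on the abnormal instance suggest pool exhaustion or downstream blocking, while many RUNNABLE threads with high CPU are more consistent with spinning.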

Step 3 – Compare GC State

If threads appear normal, inspect garbage collection:

Slower or more frequent GC cycles.

Presence of large objects.

Heap size adequacy.

JVM parameter settings.
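The GC checks above can be compared across instances with a small snapshot as well. This sketch reads cumulative collection counts, cumulative collection time, and heap occupancy through the standard java.lang.management beans. It is one possible approach under the assumption that code can be run inside the service; GC logs or an existing metrics agent give the same numbers. The class and method names are hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Reads cumulative GC activity and heap occupancy from the running JVM. */
final class GcSnapshot {

    /** Total collections across all collectors since JVM start. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds since JVM start. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Used heap as a fraction of the maximum (or of committed memory when no maximum is set). */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    public static void main(String[] args) {
        System.out.println("gc count: " + totalGcCount());
        System.out.println("gc time (ms): " + totalGcMillis());
        System.out.println("heap used fraction: " + heapUsedFraction());
    }
}
```

Because the counters are cumulative, sample them twice and subtract to get the rate for the comparison window.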

Step 4 – Restart Service

When none of the comparisons points to a cause, restart the affected instance; this can clear problems that originate in the container or host rather than in the service itself.

Summary

Design architectures that separate consumption tasks from online business logic.

Maintain comprehensive monitoring; when CPU usage is uneven, the fastest way to locate the root cause is to compare differences across instances.

Root‑cause steps: compare request volume, thread status, GC state, then consider restarting the service.
