How to Diagnose Uneven CPU Usage in Java Services Using Kafka
This article walks through the symptoms, root-cause analysis, and fixes for uneven CPU usage across instances of a Java service. The case study traces the imbalance to a Kafka topic with fewer partitions than service instances, then generalizes into a troubleshooting flow that covers request volume, thread state, and GC.
Phenomenon
Multiple service instances exhibit uneven CPU usage: some consistently high, others low; restarting an instance does not immediately normalize the distribution.
Root Cause Identification
Comparing abnormal and normal instances shows identical interface QPS but divergent Kafka consumption QPS. High-CPU instances are consuming Kafka messages, while low-CPU instances consume none.
Each Kafka topic is divided into partitions; a partition can be consumed by only one instance in a consumer group. When the number of service instances exceeds the number of partitions, only a subset receives partitions, processes messages, and experiences high CPU. Instances without assigned partitions remain idle. Once the topic is drained of new messages, CPU usage evens out.
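The effect can be illustrated with a small simulation. This is a minimal sketch, not Kafka's actual assignor: it spreads partitions round-robin over instances and counts how many instances end up with nothing. The class and method names are hypothetical. Real assignment strategies differ in which instance gets which partition, but the "one owner per partition within a group" rule means the number of busy instances can never exceed the partition count.

```java
final class PartitionAssignmentSketch {

    // Distributes partitions 0..partitions-1 over the instances round-robin
    // and returns how many partitions each instance owns.
    // Simplified stand-in for a real Kafka assignor (assumption).
    static int[] assign(int partitions, int instances) {
        int[] owned = new int[instances];
        for (int p = 0; p < partitions; p++) {
            owned[p % instances]++;
        }
        return owned;
    }

    // Counts the instances that received no partition and therefore stay idle.
    static int idleInstances(int partitions, int instances) {
        int idle = 0;
        for (int count : assign(partitions, instances)) {
            if (count == 0) {
                idle++;
            }
        }
        return idle;
    }

    public static void main(String[] args) {
        // The two deployment sizes described in this article.
        System.out.println("10 partitions, 10 instances, idle: " + idleInstances(10, 10));
        System.out.println("10 partitions, 100 instances, idle: " + idleInstances(10, 100));
    }
}
```

With 10 partitions and 100 instances, 90 instances own no partition, which matches the pattern of a few hot instances and many idle ones.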
Historical Changes
Initial deployment: low interface QPS, ~10 instances.
Added a scheduled consumption task; created the Kafka topic with partitions equal to the instance count (10), so imbalance was not observable.
Traffic growth led to scaling the service to ~100 instances, exposing the partition‑to‑instance mismatch.
Solutions
Quick Fix – Increase Kafka Partitions
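Adding partitions allows more instances to receive work; once the partition count is at least the instance count, every instance can own a partition and the immediate CPU imbalance disappears. The change only affects newly produced messages; existing messages stay in their original partitions. For keyed messages, the key-to-partition mapping can change after the increase, so per-key ordering may be disturbed during the transition. Partition counts can only be increased, not reduced, and they do not scale automatically with the instance count, so the fix has to be repeated whenever the service grows past the partition count.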
Permanent Fix – Split Services
Separate the consumer‑task service from the business API service. Keep the consumer service at a relatively fixed instance count n and provision Kafka partitions as a multiple of n (e.g., 2n) so each consumer receives an even share of partitions. The API service can scale independently with traffic.
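The sizing rule can be checked with simple arithmetic. The sketch below assumes partitions are spread as evenly as possible over the consumers, and the names are hypothetical. It reports the smallest and largest share any consumer can receive.

```java
final class ConsumerSizingSketch {

    // Smallest number of partitions a consumer owns under an even spread.
    static int minShare(int partitions, int consumers) {
        return partitions / consumers;
    }

    // Largest number of partitions a consumer owns under an even spread
    // (integer ceiling of partitions / consumers).
    static int maxShare(int partitions, int consumers) {
        return (partitions + consumers - 1) / consumers;
    }

    public static void main(String[] args) {
        int n = 10;              // hypothetical fixed consumer instance count
        int partitions = 2 * n;  // partitions provisioned as a multiple of n

        System.out.printf("%d consumers: %d to %d partitions each%n",
                n, minShare(partitions, n), maxShare(partitions, n));
        // If the consumer count is not a divisor of the partition count,
        // some consumers carry one partition more than others.
        System.out.printf("%d consumers: %d to %d partitions each%n",
                n - 1, minShare(partitions, n - 1), maxShare(partitions, n - 1));
    }
}
```

When the partition count is an exact multiple of the consumer count, the minimum and maximum shares are equal, which is the even split the fix aims for.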
Kafka Core Concepts Recap
Topic and Partition
A topic is a logical queue divided into partitions, the smallest unit for storage and parallel consumption. More partitions allow more consumers in a group to work in parallel. Within a partition messages are strictly ordered; across partitions no ordering is guaranteed. Producers influence placement through the message key: by default, messages with the same key are hashed to the same partition.
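Key-based placement can be sketched as "hash the key, take it modulo the partition count". The example below is an illustration only: it uses String.hashCode as a stand-in hash, whereas Kafka's default partitioner hashes the serialized key bytes with murmur2, so the concrete partition numbers will not match a real cluster. The class and method names are hypothetical.

```java
final class KeyPartitionSketch {

    // Maps a key to a partition index in [0, numPartitions).
    // String.hashCode is a stand-in for Kafka's murmur2 hash (assumption).
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        String key = "order-42"; // hypothetical message key

        // The same key always lands in the same partition, which is what
        // preserves per-key ordering.
        System.out.println(partitionFor(key, 10) == partitionFor(key, 10));

        // The mapping depends on the partition count, so a key may move
        // to a different partition after partitions are added.
        System.out.println("10 partitions: " + partitionFor(key, 10));
        System.out.println("12 partitions: " + partitionFor(key, 12));
    }
}
```

Because the partition count is part of the calculation, changing it can move a key to a different partition for newly produced messages.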
Consumer Group and Rebalance
A consumer group consists of cooperating consumer instances. At any moment a partition can be consumed by only one instance in the group. When consumers join or leave, or when partition counts change, a rebalance redistributes partitions among the instances.
Offset and Poll‑Interval Risk
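Each consumer group tracks a committed offset per partition. If the interval between two consecutive poll() calls exceeds max.poll.interval.ms (default 5 minutes), typically because processing the previously fetched batch took too long, the consumer is considered failed and is removed from the group. This triggers a rebalance that can degrade performance.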
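Since max.poll.interval.ms bounds the time between consecutive poll() calls, the budget applies to the whole batch returned by one poll, not only to a single record. A rough check, assuming records are processed sequentially on the polling thread, multiplies the batch size by the per-record processing time. The defaults used below (max.poll.records = 500, max.poll.interval.ms = 300000) are Kafka's documented consumer defaults; the class and method names are hypothetical.

```java
final class PollIntervalBudgetSketch {

    static final long DEFAULT_MAX_POLL_INTERVAL_MS = 300_000L; // 5 minutes
    static final int DEFAULT_MAX_POLL_RECORDS = 500;

    // Worst-case time to process one full batch, assuming sequential
    // processing on the polling thread (assumption).
    static long worstCaseBatchMs(int maxPollRecords, long perRecordMs) {
        return Math.multiplyExact((long) maxPollRecords, perRecordMs);
    }

    // True if a full batch could take longer than the allowed poll interval,
    // which would get the consumer removed from the group.
    static boolean risksEviction(int maxPollRecords, long perRecordMs, long maxPollIntervalMs) {
        return worstCaseBatchMs(maxPollRecords, perRecordMs) > maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long perRecordMs = 700L; // hypothetical measured processing time per record
        System.out.println("worst-case batch ms: "
                + worstCaseBatchMs(DEFAULT_MAX_POLL_RECORDS, perRecordMs));
        System.out.println("risk of eviction: "
                + risksEviction(DEFAULT_MAX_POLL_RECORDS, perRecordMs, DEFAULT_MAX_POLL_INTERVAL_MS));
    }
}
```

If the check reports a risk, the usual levers are a smaller max.poll.records, a larger max.poll.interval.ms, or faster per-record processing.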
General Troubleshooting Flow for Uneven CPU Usage
Collect metrics for the same recent fixed window (e.g., the last 5 minutes) on both abnormal and normal instances and compare them; the metrics that differ point to the root cause.
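One way to make the comparison mechanical is to flag instances whose value for a metric deviates from the fleet median by more than a tolerance. The sketch below is an assumption about how such a comparison could look, not a feature of any monitoring product; the instance names, the metric values, and the 20% tolerance are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class InstanceMetricDiffSketch {

    // Median of the given values; the array must not be empty.
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Returns, in name order, the instances whose metric differs from the
    // fleet median by more than tolerance * |median|.
    static List<String> outliers(Map<String, Double> metricByInstance, double tolerance) {
        List<String> result = new ArrayList<>();
        if (metricByInstance.isEmpty()) {
            return result;
        }
        double[] values = metricByInstance.values().stream()
                .mapToDouble(Double::doubleValue)
                .toArray();
        double med = median(values);
        double allowed = Math.abs(med) * tolerance;
        for (Map.Entry<String, Double> entry : new TreeMap<>(metricByInstance).entrySet()) {
            if (Math.abs(entry.getValue() - med) > allowed) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical consumption QPS per instance over the same 5-minute window.
        Map<String, Double> consumptionQps = Map.of(
                "instance-a", 100.0,
                "instance-b", 105.0,
                "instance-c", 98.0,
                "instance-d", 0.0);
        System.out.println("outliers: " + outliers(consumptionQps, 0.2));
    }
}
```

Running the same comparison for interface QPS, consumption QPS, thread counts, and GC time narrows down which dimension actually differs.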
Step 1 – Compare Request Volume
Verify that load is balanced and that requests actually reach every instance: check interface QPS and consumption QPS, and look for hotspots. Besides a partition shortage, typical causes include producer key skew that concentrates messages in a single partition and upstream load-balancer faults.
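Key skew shows up as one partition receiving far more messages than the rest. A simple indicator, sketched below under the assumption that per-partition message counts for the window are already available from monitoring, is the ratio of the busiest partition to the average. The names and sample numbers are hypothetical.

```java
final class PartitionSkewSketch {

    // Ratio of the busiest partition's message count to the mean count.
    // A value near 1.0 means an even spread; larger values indicate a hotspot.
    static double skewRatio(long[] messagesPerPartition) {
        if (messagesPerPartition.length == 0) {
            throw new IllegalArgumentException("at least one partition is required");
        }
        long total = 0;
        long max = 0;
        for (long count : messagesPerPartition) {
            total += count;
            max = Math.max(max, count);
        }
        if (total == 0) {
            return 1.0; // no traffic at all, so nothing is skewed
        }
        double mean = (double) total / messagesPerPartition.length;
        return max / mean;
    }

    public static void main(String[] args) {
        long[] even = {100, 100, 100, 100};  // hypothetical balanced topic
        long[] skewed = {970, 10, 10, 10};   // hypothetical hot key on partition 0
        System.out.println("even: " + skewRatio(even));
        System.out.println("skewed: " + skewRatio(skewed));
    }
}
```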
Step 2 – Compare Thread Status
If request volume is equal but CPU differs, examine thread pools and blocking conditions:
Undersized thread pools or connection pools (RPC, custom executors, database connections).
Threads blocked on slow downstream services or on storage timeouts.
Heavy contention on CAS-based optimistic locks, causing threads to spin.
Deadlocks.
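Thread state can be sampled from inside the JVM with the standard java.lang.management API, as an alternative to taking a thread dump with a tool such as jstack. The sketch below counts live threads per state and checks for deadlocks; the class and method names are hypothetical. Many BLOCKED or WAITING threads on a slow instance, or many RUNNABLE threads on a hot one, is a useful difference to compare across instances.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

final class ThreadStateSnapshotSketch {

    // Counts live threads in this JVM, grouped by thread state.
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    // Number of threads currently involved in a deadlock (0 if none).
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("threads by state: " + countByState());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```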
Step 3 – Compare GC State
If threads appear normal, inspect garbage collection:
Slower or more frequent GC cycles.
Presence of large objects.
Heap size adequacy.
JVM parameter settings.
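The GC checks above can be sampled with the standard java.lang.management beans. The sketch below reads cumulative collection counts, cumulative collection time, and current heap occupancy; the class and method names are hypothetical. The counters are totals since JVM start, so comparing instances fairly means taking two samples per instance and comparing the deltas over the same window.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class GcSnapshotSketch {

    // Total number of collections across all collectors since JVM start.
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    // Approximate accumulated collection time in milliseconds since JVM start.
    static long totalCollectionTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 if the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    // Used heap as a fraction of the maximum heap, or NaN if no maximum is defined.
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 if undefined
        return max > 0 ? (double) heap.getUsed() / max : Double.NaN;
    }

    public static void main(String[] args) {
        System.out.println("collections: " + totalCollections());
        System.out.println("collection time ms: " + totalCollectionTimeMs());
        System.out.println("heap used fraction: " + heapUsedFraction());
    }
}
```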
Step 4 – Restart Service
When no clear direction emerges, restarting the service may resolve underlying container or host issues.
Summary
Design architectures that separate consumption tasks from online business logic.
Maintain comprehensive monitoring; when CPU usage is uneven, the fastest way to locate the root cause is to compare differences across instances.
Root‑cause steps: compare request volume, thread status, GC state, then consider restarting the service.
Java Baker
Java architect and Raspberry Pi enthusiast, dedicated to writing high-quality technical articles; the same name is used across major platforms.
