How to Diagnose and Resolve Kafka Message Backlog Issues

This article explains what Kafka message backlog is, outlines the main reasons it occurs (producers outpacing consumers, slow consumer processing, and downstream bottlenecks), and provides practical steps for producer throttling, consumer scaling and logic improvements, and Kafka cluster enhancements to reduce the backlog.

mikechen

Definition of Kafka Message Backlog

Kafka message backlog, usually measured as consumer lag, is the situation where records are produced to a topic faster than a consumer group consumes and commits them. The broker retains the unconsumed records, so the gap between each partition's log end offset and the consumer group's committed offset keeps growing.
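The definition above reduces to simple per-partition arithmetic. The following is a minimal sketch, assuming the offsets have already been collected by some other means; the function name partition_lag and the sample values are hypothetical and not part of any Kafka client.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus committed offset.

    Both arguments map a partition number to an offset. A partition with
    no committed offset is treated as starting from offset 0, which is an
    assumption of this sketch (real consumers follow auto.offset.reset).
    """
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition, 0)
        # Clamp at zero so a stale end-offset reading never yields negative lag.
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical sample values for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1000}

lag_by_partition = partition_lag(end_offsets, committed)
total_lag = sum(lag_by_partition.values())
```

A total that keeps rising across successive samples is the signal that a backlog is forming; a single large value may only reflect a short burst.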

Root Causes

Producer rate higher than consumer capacity – sudden spikes or sustained high throughput from producers generate more records than the consumer group can process.

Consumer slowdown or failure – limited CPU, memory, or I/O on the consumer host; complex per‑message logic; frequent database calls; or thread contention that reduces the effective consumption rate.

Downstream system bottleneck – after processing, consumers forward data to another system (e.g., a database or micro‑service). If that downstream system cannot keep up, consumers block and the backlog grows.

Mitigation Strategies

1. Optimize the Producer Side

Apply rate‑limiting to keep the production rate below the consumer’s sustainable throughput. Common algorithms include:

Token‑bucket – allows bursts up to a configured bucket size while enforcing an average rate.

Leaky‑bucket – smooths traffic to a fixed output rate.
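As an illustration of the first algorithm, here is a minimal token-bucket sketch in plain Python. The class name TokenBucket, its parameters, and the injectable clock are assumptions made for this example; it is not taken from any Kafka client library.

```python
import time


class TokenBucket:
    """Token-bucket limiter: average `rate` tokens/second, bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = clock()

    def _refill(self):
        now = self.clock()
        elapsed = now - self.last_refill
        # Add tokens for the elapsed time, never exceeding the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def try_acquire(self, n=1):
        """Take n tokens if available; return False so the caller can wait or drop."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


# Deterministic demonstration with a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=5, clock=lambda: fake_now[0])

burst = [bucket.try_acquire() for _ in range(6)]       # bucket holds 5 tokens
fake_now[0] += 1.0                                     # one second passes
after_wait = [bucket.try_acquire() for _ in range(3)]  # 2 tokens were refilled
```

A producer would call try_acquire() before each send and pause briefly whenever it returns False. A leaky bucket differs in that it releases records at a fixed pace and does not let a burst through.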

Pre‑warm traffic before expected peaks by estimating the write volume and gradually ramping up producers.

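Consider back‑pressure mechanisms so that producers slow down instead of overwhelming the broker. In the Kafka producer, a bounded buffer.memory makes send() block for up to max.block.ms once the buffer is full, and broker‑side client quotas can cap a producer's byte rate. Settings such as acks=all and max.in.flight.requests.per.connection mainly govern durability and ordering; they slow a producer only as a side effect and are not a throttle in themselves.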
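Back-pressure in general means that a full buffer slows the sender down rather than letting work pile up without bound. The sketch below shows the idea with a bounded in-process queue from the Python standard library. It is an analogy only, not the Kafka producer's implementation, and the names send_buffer and submit are hypothetical.

```python
import queue

# Bounded buffer between the code that creates records and the code that sends them.
send_buffer = queue.Queue(maxsize=3)


def submit(record, max_block_seconds=0.05):
    """Enqueue a record, waiting up to max_block_seconds for free space.

    Returns True if the record was buffered and False if the buffer stayed
    full, in which case the caller should slow down, retry later, or shed load.
    """
    try:
        send_buffer.put(record, timeout=max_block_seconds)
        return True
    except queue.Full:
        return False


# With nothing draining the buffer, the fourth submit is pushed back.
results = [submit(f"record-{i}") for i in range(4)]

# Once the sender drains one record, there is room again.
drained = send_buffer.get()
accepted_after_drain = submit("record-4")
```

The design choice is the bound itself: an unbounded buffer hides the overload until memory runs out, while a bounded one surfaces it at the point where the application can still react.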

2. Optimize the Consumer Side

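Scale consumer instances – add members to the consumer group so that partitions are processed in parallel. Within a group, each partition is assigned to only one consumer, so instances beyond the partition count sit idle. Make sure the topic has at least as many partitions as consumer instances, and add partitions if necessary.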
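The partition-count constraint can be seen in a toy assignment. This is a simplified round-robin sketch, not the code of any of Kafka's built-in assignors; the function assign_round_robin and the consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Spread partitions over consumers; each partition has exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two consumers share four partitions evenly.
balanced = assign_round_robin(partitions, ["c1", "c2"])

# Six consumers for four partitions: the extra two receive nothing.
oversized = assign_round_robin(partitions, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle_consumers = [c for c, owned in oversized.items() if not owned]
```

In the oversized group the fifth and sixth consumers own no partitions, so adding them does nothing for throughput until the topic gains more partitions.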

Improve processing logic – reduce per‑message latency by:

Eliminating heavy computations inside the poll loop.

Caching frequently accessed data to avoid repeated DB lookups.

Off‑loading non‑critical work to asynchronous workers or thread pools.

Using batch fetch (max.poll.records) and batch writes to downstream systems.
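Two of the techniques above, caching lookups and batching downstream writes, can be sketched together. The example assumes a hypothetical slow lookup load_customer_tier and a hypothetical write_batch callback standing in for a database client; record shapes and sizes are illustrative only.

```python
from functools import lru_cache

lookup_calls = []  # records every simulated database round trip


@lru_cache(maxsize=1024)
def load_customer_tier(customer_id):
    """Hypothetical slow lookup; cached so each id costs one round trip."""
    lookup_calls.append(customer_id)
    return "gold" if customer_id % 2 == 0 else "standard"


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process(records, write_batch, batch_size=3):
    """Enrich records using the cache, then write them downstream in batches."""
    enriched = [(r["id"], load_customer_tier(r["customer"])) for r in records]
    for batch in chunked(enriched, batch_size):
        write_batch(batch)  # one downstream call per batch, not per record


written = []
records = [{"id": i, "customer": i % 2} for i in range(7)]
process(records, written.append)
```

Seven records trigger only two lookups and three downstream writes here. The trade-off is staleness: a cache is only appropriate for data that may safely lag behind its source for a while.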

Enable multithreaded consumption when the processing model permits. KafkaConsumer is not thread‑safe, so either run one consumer per thread or poll from a single thread and hand the records to a worker pool.
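The second pattern, one polling thread that hands records to a worker pool, might look roughly like this. The sketch uses plain dictionaries as a stand-in for polled records and does not call any Kafka client; handle, process_batch, and the record fields are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def handle(record):
    """Hypothetical per-record work; must be safe to run on several threads."""
    return record["value"].upper()


def process_batch(records, pool):
    """Fan a polled batch out to workers and wait for all of them.

    Returns the results plus the next offset to commit per partition
    (last processed offset + 1), computed only after every record is done.
    """
    results = list(pool.map(handle, records))  # preserves input order
    next_offsets = {}
    for record in records:
        partition = record["partition"]
        next_offsets[partition] = max(
            next_offsets.get(partition, 0), record["offset"] + 1
        )
    return results, next_offsets


# Stand-in for what one poll() call might return.
polled = [
    {"partition": 0, "offset": 10, "value": "a"},
    {"partition": 1, "offset": 4, "value": "b"},
    {"partition": 0, "offset": 11, "value": "c"},
]

with ThreadPoolExecutor(max_workers=4) as pool:
    results, to_commit = process_batch(polled, pool)
```

Waiting for the whole batch before computing offsets keeps commits from running ahead of unfinished work. Note that records from the same partition may be handled concurrently in this sketch, so it only suits workloads that do not depend on strict per-partition ordering.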

Profile the consumer application (e.g., Java Flight Recorder, async-profiler) to locate CPU or I/O hotspots and refactor the code accordingly.

3. Optimize the Kafka Cluster

Increase broker count to raise aggregate write/read throughput and provide more replication capacity.

Distribute topics and partitions evenly across brokers; use the kafka-reassign-partitions.sh tool for rebalancing.

Upgrade storage to SSDs to reduce write latency and improve log segment flush times.

Provision higher network bandwidth (10 GbE or better) to avoid replication bottlenecks, especially for high‑replication‑factor topics.

Adjust broker configuration parameters such as num.io.threads, socket.request.max.bytes, and replica.fetch.max.bytes to match workload characteristics.

In practice, select the combination of these techniques that matches the business scenario, system architecture, and resource constraints. Continuous monitoring of consumer lag (for example, the LAG column reported by kafka-consumer-groups.sh --describe, or the consumer's records-lag-max metric), broker health, and consumer throughput is essential to detect backlog early and apply corrective actions before it impacts downstream services.
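When monitoring shows a backlog, a rough estimate of how long it will take to clear helps decide whether scaling is needed. The sketch below assumes constant produce and consume rates, which real traffic rarely has; the function name and the sample figures are hypothetical.

```python
def drain_time_seconds(backlog, produce_rate, consume_rate):
    """Estimate seconds until lag reaches zero, or None if it never will.

    Rates are in records per second and assumed constant, which is a
    simplification of this sketch.
    """
    net_rate = consume_rate - produce_rate
    if backlog <= 0:
        return 0.0
    if net_rate <= 0:
        return None  # consumers are not gaining on producers
    return backlog / net_rate


catching_up = drain_time_seconds(
    backlog=600_000, produce_rate=4_000, consume_rate=5_000
)
falling_behind = drain_time_seconds(
    backlog=600_000, produce_rate=5_000, consume_rate=4_000
)
```

With these sample figures the first case drains in 600 seconds, while the second returns None because the consumers are slower than the producers and the backlog can only grow.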

