Master Kafka: Core Concepts, Metrics, and Troubleshooting Guide
This article explains Kafka's fundamental components, its version evolution, and the key monitoring metrics for producers, brokers, consumers, and ZooKeeper. It also provides step-by-step troubleshooting methods for common issues such as slow topic throughput and message backlog.
Kafka Core Concepts
Broker: a Kafka server node.
Cluster: a set of multiple brokers.
Message (Record): the data carrier in Kafka.
Producer: application that sends messages to Kafka.
Consumer: application that receives messages from Kafka.
Consumer Group: a group of consumers sharing the same Group ID; each message is delivered to only one consumer in the group, enabling load‑balanced parallel consumption.
Topic: a named category for messages; a cluster can host many topics, and each topic's partitions are distributed across the brokers.
Replica: each partition can have multiple replicas; if the leader fails, an in-sync follower becomes the new leader. The default replication factor is 1 (default.replication.factor), and the replication factor cannot exceed the number of brokers.
Partition: an ordered, immutable sequence of messages; a topic consists of one or more partitions, stored across one or more brokers. Message order within a partition matches the producer's send order.
Offset: a monotonically increasing, immutable position identifier for each message in a partition.
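Consumer offset: the position a consumer group has reached in a partition. The committed offset is the offset of the next message the group will read, i.e. one past the last message processed.
Backlog (consumer lag): the number of messages pending consumption, calculated per partition as (log end offset – committed consumer offset) and summed across partitions. A large or growing backlog indicates that consumers are blocked or are consuming more slowly than producers are writing.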
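Rebalance: when a consumer instance joins or leaves a group (for example because it fails), or when the subscribed partitions change, the partitions are automatically redistributed among the remaining consumers, ensuring high availability. Consumption of the affected partitions pauses while a rebalance is in progress.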
ZooKeeper: stores Kafka cluster metadata and coordinates controller election in ZooKeeper-mode clusters. KRaft mode, introduced in the 3.x line, replaces it.
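The backlog definition above can be made concrete with a small sketch. It assumes you already have, for each partition, the log end offset and the group's committed offset (for example from the LOG-END-OFFSET and CURRENT-OFFSET columns of kafka-consumer-groups.sh --describe). The function name partition_lag and the sample numbers are illustrative and do not come from the original article.

```python
from typing import Dict, Optional


def partition_lag(end_offsets: Dict[int, int],
                  committed: Dict[int, Optional[int]]) -> Dict[int, int]:
    """Return lag per partition: log end offset minus committed offset.

    Assumption for this sketch: a partition without a committed offset is
    counted as fully unconsumed starting at offset 0. Real behaviour depends
    on auto.offset.reset and on the partition's log start offset.
    """
    lag = {}
    for partition, end in end_offsets.items():
        done = committed.get(partition)
        if done is None:
            done = 0
        # Clamp at zero: the two offsets may be sampled at slightly different times.
        lag[partition] = max(end - done, 0)
    return lag


# Hypothetical snapshot for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1450, 1: 1200, 2: None}

lag = partition_lag(end_offsets, committed)
total_backlog = sum(lag.values())
print(lag, total_backlog)
```

Looking at lag per partition rather than only the total helps distinguish a single stuck consumer (one partition far behind) from a group that is too slow overall.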
Kafka Versions
0.x – early versions, starting from the Apache incubation releases.
1.x – optimized Streams API, improved observability and debugging, Java 9 support, SASL enhancements.
2.x – performance improvements, enhanced ACL support.
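3.x – introduces KRaft mode, which removes the ZooKeeper dependency (marked production-ready for new clusters in 3.3; ZooKeeper mode itself is only removed in 4.0), supports Java 17, deprecates Java 8, deprecates message formats v0 and v1, and brings further performance improvements.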
Recommended versions: 2.x and 3.x.
Kafka Metric Monitoring
Producer Metrics
Producers push messages to broker topics; if producers fail, consumers receive no new data.
Broker Metrics
All messages flow through brokers, making broker monitoring critical.
Consumer Metrics
Consumers are the endpoint of Kafka messages.
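ZooKeeper Metrics
ZooKeeper is a critical component of clusters running in ZooKeeper mode: if the ensemble loses quorum, the cluster cannot elect a controller or change metadata. Clusters running in KRaft mode do not depend on it.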
Typical Kafka Issues and Diagnosis
Slow Topic Throughput and Low Concurrency
Symptoms: high producer request latency and low throughput.
Common causes:
Insufficient network bandwidth causing IO wait.
Uncompressed messages leading to excessive traffic.
Missing or misconfigured batch settings.
Too few topic partitions causing broker backlog.
Low broker disk performance.
Too many partitions in total, which fragments disk writes across many log files and overloads the disks.
Diagnosis steps:
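Check the producer's average IO wait time (producer metric io-wait-time-ns-avg).
Verify the producer's compression ratio (compression-rate-avg); a value close to 1 means messages are effectively uncompressed.
Inspect the average request size (request-size-avg); if requests are small, increase batch.size and adjust linger.ms.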
Review partition count; increase to scale horizontally.
Monitor broker disk IO usage; consider scaling vertically or horizontally.
Ensure total partition count per broker stays within capacity limits.
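The producer-side checks above can be sketched as a small rule set. This is a minimal illustration, assuming you have already exported a snapshot of producer metrics into a dictionary keyed by the standard Kafka producer metric names io-wait-time-ns-avg, compression-rate-avg, and request-size-avg. The thresholds, the function name diagnose_producer, and the suggested override values are hypothetical starting points, not recommendations from the original article.

```python
from typing import Dict, List

# Hypothetical thresholds for illustration only; tune them for your workload.
MAX_IO_WAIT_NS = 5_000_000       # 5 ms average IO wait
MAX_COMPRESSION_RATE = 0.9       # compressed size / uncompressed size
MIN_REQUEST_SIZE_BYTES = 16_384  # smaller requests suggest weak batching


def diagnose_producer(metrics: Dict[str, float]) -> List[str]:
    """Map a producer metric snapshot to the article's diagnosis steps.

    Missing metrics are skipped instead of being reported as problems.
    """
    findings = []
    if metrics.get("io-wait-time-ns-avg", 0.0) > MAX_IO_WAIT_NS:
        findings.append("high IO wait: check network bandwidth")
    if metrics.get("compression-rate-avg", 0.0) > MAX_COMPRESSION_RATE:
        findings.append("little or no compression: set compression.type")
    if metrics.get("request-size-avg", float("inf")) < MIN_REQUEST_SIZE_BYTES:
        findings.append("small requests: raise batch.size or linger.ms")
    return findings


# Hypothetical snapshot from a slow producer.
snapshot = {
    "io-wait-time-ns-avg": 12_000_000.0,
    "compression-rate-avg": 1.0,
    "request-size-avg": 2_048.0,
}
findings = diagnose_producer(snapshot)
for finding in findings:
    print(finding)

# Illustrative producer overrides to try next; the keys are standard producer
# configuration names, the values are assumptions to validate under load.
suggested_overrides = {
    "compression.type": "lz4",
    "batch.size": 65_536,
    "linger.ms": 10,
}
```

Change one setting at a time and compare the same metrics before and after, since a larger batch.size or linger.ms trades a little latency for throughput.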
Topic Message Backlog
Symptoms: growing backlog reflected by increasing consumer group lag.
Common causes:
Increased producer message rate.
Consumer slowdown due to business changes.
Insufficient number of consumers.
Frequent consumer count changes triggering rebalance.
Brokers not receiving consumer offset commits, so the committed offsets stop advancing even though messages are being processed.
Diagnosis steps:
Check producer production rate metrics.
Verify consumer consumption rate metrics.
Use the Kafka command-line tools (for example kafka-consumer-groups.sh with --describe and --members) to compare the expected and actual consumer counts.
Observe consumer count fluctuations causing rebalance.
Confirm network stability between consumers and brokers; restart consumers if necessary.
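The first four diagnosis steps come down to simple arithmetic on rates and group size. The sketch below assumes you have sampled the production rate, the consumption rate (both in messages per second), and the group's member count from your monitoring system. The function names and sample values are illustrative and not part of the original article.

```python
from typing import List, Optional


def seconds_to_drain(backlog: int, produce_rate: float,
                     consume_rate: float) -> Optional[float]:
    """Estimate how long until the backlog clears at the current rates.

    Returns None when consumers are not faster than producers, which means
    the backlog will not shrink without adding capacity.
    """
    if backlog <= 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None
    return backlog / net_rate


def membership_changes(member_counts: List[int]) -> int:
    """Count how often the observed group size changed between samples.

    Every change implies at least one rebalance in that interval.
    """
    return sum(1 for prev, cur in zip(member_counts, member_counts[1:])
               if prev != cur)


# Hypothetical readings: 120,000 messages behind, 2,000 msg/s produced,
# 2,500 msg/s consumed.
print(seconds_to_drain(120_000, 2_000.0, 2_500.0))   # 240.0
print(seconds_to_drain(120_000, 3_000.0, 2_500.0))   # None

# Hypothetical member counts sampled once per minute.
print(membership_changes([6, 6, 5, 6, 6, 4]))        # 3
```

A None result points to the first three causes (higher production rate, slower consumers, too few consumers), while a high change count with otherwise adequate rates points to rebalance churn.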
Source: https://blog.51cto.com/u_11555417/6177911 (© original author)
MaGe Linux Operations
