Understanding Kafka’s UnderReplicatedPartitions Metric for Effective Monitoring
This article explains how to enable JMX for Kafka, retrieve and interpret key metrics such as UnderReplicatedPartitions, and troubleshoot common issues like broker failures, disk outages, and replica lag by examining metric values and related logs.
Enable remote JMX
Remote JMX can be enabled in any of three ways.
Set JMX_PORT in the environment before starting the broker:
JMX_PORT=9999 nohup bin/kafka-server-start.sh config/server.properties &
Or export JMX_PORT inside kafka-server-start.sh (add export JMX_PORT="9999" before the Java launch command).
Or, when launching Kafka from source in IntelliJ IDEA, add the standard JMX system properties to the run configuration:
-Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
In production, enable JMX security (authentication and SSL) to prevent unauthorized access.
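Once the broker is running, it helps to confirm that the JMX port is actually listening before pointing a client at it. The following is a minimal sketch, not part of Kafka: the helper name jmx_port_is_open is hypothetical, and it only checks that a TCP connection can be opened, not that a JMX agent answers on the other side.

```python
import socket


def jmx_port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This is only a reachability check; it does not speak the JMX/RMI protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


# Example call, using the host and port from the settings above:
#   jmx_port_is_open("127.0.0.1", 9999)
```

A False result usually means JMX_PORT was not set for the broker process, or a firewall blocks the port.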
Locate the JMX port
Once JMX is enabled, the broker registers its JMX port in the ZooKeeper node /brokers/ids/{brokerID}. An example of the registration JSON stored there:
{
"features": {},
"listener_security_protocol_map": {"PLAINTEXT": "PLAINTEXT"},
"endpoints": ["PLAINTEXT://localhost:9092"],
"jmx_port": 9999,
"port": 9092,
"host": "localhost",
"version": 5,
"timestamp": "1659670870502"
}
Connect with jconsole
Run the JDK tool:
shizhenzhen@localhost % jconsole
Enter the host and JMX port (local or remote). After connecting, select the MBeans tab to view all exposed metrics.
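For a remote connection, the host and port can be taken straight from the broker registration JSON shown earlier. The sketch below is an illustration under assumptions: the function name jmx_service_url is hypothetical, a negative or missing jmx_port is treated as "JMX not enabled", and the result is the standard JMX-over-RMI service URL, which jconsole accepts in its remote-process field alongside the plain host:port form.

```python
import json


def jmx_service_url(registration_json: str) -> str:
    """Build a JMX service URL from a broker registration JSON document."""
    info = json.loads(registration_json)
    jmx_port = info.get("jmx_port", -1)
    # Assumption: a missing or negative port means the broker has no JMX enabled.
    if jmx_port is None or jmx_port < 0:
        raise ValueError("broker did not register a JMX port")
    return f"service:jmx:rmi:///jndi/rmi://{info['host']}:{jmx_port}/jmxrmi"


# Trimmed copy of the registration JSON from the article.
sample = '{"jmx_port": 9999, "port": 9092, "host": "localhost", "version": 5}'
url = jmx_service_url(sample)
```

Here url is service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi, which can be pasted into jconsole or any other JMX client.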
Metric attributes
Each metric exposes a set of attributes:
RateUnit: time unit, always SECONDS.
EventType: e.g., messages for message‑related metrics.
Count: total number of events since the broker started.
MeanRate: average rate since the metric was created.
OneMinuteRate, FiveMinuteRate, FifteenMinuteRate: exponentially weighted moving averages over the respective time windows.
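To make the moving-average attributes concrete, here is a minimal sketch of an exponentially weighted moving average of an event rate. It assumes the scheme commonly used by JVM metrics libraries (a fixed 5-second tick and a smoothing factor derived from the window length); the class name Ewma and the constant TICK_SECONDS are hypothetical, and this is not Kafka's own code.

```python
import math

TICK_SECONDS = 5.0  # assumed interval between rate updates


class Ewma:
    """Exponentially weighted moving average of an event rate (events/second)."""

    def __init__(self, window_minutes: float):
        # Smoothing factor: a longer window gives a smaller alpha and slower reaction.
        self.alpha = 1.0 - math.exp(-TICK_SECONDS / (60.0 * window_minutes))
        self.rate = None      # current smoothed rate, None until the first tick
        self.uncounted = 0    # events recorded since the last tick

    def mark(self, n: int = 1) -> None:
        """Record n events."""
        self.uncounted += n

    def tick(self) -> float:
        """Fold the events of the last interval into the smoothed rate."""
        instant_rate = self.uncounted / TICK_SECONDS
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant_rate
        else:
            self.rate += self.alpha * (instant_rate - self.rate)
        return self.rate
```

With a steady input the smoothed rate equals the true rate; when traffic stops, a one-minute window decays quickly while a fifteen-minute window decays slowly, which is why the three attributes diverge after a traffic change.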
Example metric: MessagesInPerSec
Object name kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec reports the inbound message rate. The OneMinuteRate attribute is commonly used to obtain the per‑second average ingress speed.
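When an exact average over a chosen interval is preferred to the smoothed OneMinuteRate, it can be derived from two samples of the Count attribute. This is a minimal sketch: the function name rate_per_second is hypothetical, and the handling of a counter that went backwards (treated as a broker restart, since Count starts from zero at startup) is a simplifying assumption.

```python
def rate_per_second(count_start: int, count_end: int, elapsed_seconds: float) -> float:
    """Average events per second between two samples of a Count attribute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    if count_end < count_start:
        # Assumption: the counter restarted from zero (e.g. broker restart),
        # so only the events counted since the restart are known.
        count_start = 0
    return (count_end - count_start) / elapsed_seconds


# Two hypothetical samples taken 60 seconds apart.
average = rate_per_second(1000, 1600, 60.0)
```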
UnderReplicatedPartitions metric
Object name
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
counts the partitions led by this broker whose in‑sync replica set (ISR) is smaller than the assigned replica set (i.e., replicationFactor - isr.size > 0).
leaderPartitionsIterator.count(_.isUnderReplicated)
def isUnderReplicated: Boolean = isLeader && (assignmentState.replicationFactor - isrState.isr.size) > 0
The metric is a Gauge, so its value reflects the current number of such partitions rather than a cumulative count.
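The same predicate can be modeled outside the broker to see which partitions would be counted. The sketch below mirrors the Scala condition above in Python; the Partition class, its field names, and the sample data are hypothetical stand-ins for the broker's internal state.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    topic: str
    partition: int
    is_leader: bool       # True if this broker leads the partition
    replicas: List[int]   # assigned replica broker ids
    isr: List[int]        # in-sync replica broker ids

    @property
    def is_under_replicated(self) -> bool:
        # Same condition as the Scala snippet: leader and replicationFactor - isr.size > 0.
        return self.is_leader and (len(self.replicas) - len(self.isr)) > 0


def under_replicated_count(partitions: List[Partition]) -> int:
    """Number of partitions this broker would report as under-replicated."""
    return sum(1 for p in partitions if p.is_under_replicated)


# Hypothetical view from broker 1 after broker 3 has gone down.
partitions = [
    Partition("orders", 0, True, [1, 2, 3], [1, 2]),     # leader, ISR shrunk: counted
    Partition("orders", 1, True, [1, 2], [1, 2]),        # leader, fully in sync
    Partition("orders", 2, False, [2, 1, 3], [2, 1]),    # follower here: not counted
]
count = under_replicated_count(partitions)
```

Because only leaders are counted, each under-replicated partition is reported by exactly one broker, so summing the gauge across brokers gives a cluster-wide total.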
Problem analysis
Broker failure: When a broker goes down, the remaining brokers show a spike in UnderReplicatedPartitions because their leader partitions lose followers.
Disk problems: Offline or full log directories make replicas unavailable. The metric kafka.log:type=LogManager,name=OfflineLogDirectoryCount reports the number of offline directories. Individual directories can be inspected via kafka.log:type=LogManager,name=LogDirectoryOffline,logDirectory="...".
Performance bottlenecks: Slow follower replication (e.g., GC pauses or I/O saturation) causes followers to drop out of the ISR. To diagnose, check the GC log (kafkaServer-gc.log) and fetch errors in the broker log such as Error sending fetch request ... or Failed to connect within $socketTimeout ms.
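Scanning the broker log for the two fetch error messages mentioned above is a quick way to tell whether followers are struggling to reach their leaders. This is a minimal sketch: the pattern labels, the function name count_fetch_errors, and the sample log lines are hypothetical, and the regular expression assumes $socketTimeout is rendered as a number of milliseconds.

```python
import re
from collections import Counter

# Patterns built from the two messages quoted in the article.
FETCH_ERROR_PATTERNS = {
    "fetch_request_error": re.compile(r"Error sending fetch request"),
    "connect_timeout": re.compile(r"Failed to connect within \d+ ms"),
}


def count_fetch_errors(lines):
    """Count log lines matching each known fetch error pattern."""
    counts = Counter()
    for line in lines:
        for label, pattern in FETCH_ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample_log = [
    "INFO Error sending fetch request (details omitted)",
    "WARN Failed to connect within 30000 ms",
    "INFO some unrelated message",
    "INFO Error sending fetch request (details omitted)",
]
errors = count_fetch_errors(sample_log)
```

A rising count that lines up in time with an UnderReplicatedPartitions spike points toward network or leader-side trouble rather than a disk failure.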
Remediation
Increase replica.lag.time.max.ms (default 10 s in older releases, 30 s since Kafka 2.5) to give followers more time before they are removed from the ISR.
Increase num.replica.fetchers (default 1) to raise I/O parallelism for follower fetchers.
Monitor OfflineLogDirectoryCount and per‑directory LogDirectoryOffline metrics to detect and recover offline log directories.
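Before changing either setting, it is worth checking what the broker is currently configured with. The sketch below reads the two keys from the text of a server.properties file and falls back to the defaults stated above (30 s and 1). The function name effective_replica_settings is hypothetical, and the parser is deliberately simplified: it handles only key=value lines and ignores line continuations and other separators allowed by the Java properties format.

```python
# Defaults as stated in the article (30 s is the newer default).
DEFAULTS = {
    "replica.lag.time.max.ms": "30000",
    "num.replica.fetchers": "1",
}


def effective_replica_settings(properties_text: str) -> dict:
    """Return the two replication settings, using defaults when a key is absent."""
    settings = dict(DEFAULTS)
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() in settings:
            settings[key.strip()] = value.strip()
    return {key: int(value) for key, value in settings.items()}


# Hypothetical server.properties content.
sample_properties = """
# replication tuning
num.replica.fetchers=4
log.dirs=/tmp/kafka-logs
"""
current = effective_replica_settings(sample_properties)
```

In this sample, current reports four fetcher threads and the default lag limit, so only replica.lag.time.max.ms would still be a candidate for tuning.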
ShiZhen AI
