
Understanding Kafka’s UnderReplicatedPartitions Metric for Effective Monitoring

This article explains how to enable JMX for Kafka, retrieve and interpret key metrics such as UnderReplicatedPartitions, and troubleshoot common issues like broker failures, disk outages, and replica lag by examining metric values and related logs.

ShiZhen AI

Enable remote JMX

Set JMX_PORT in the environment before starting the broker:

JMX_PORT=9999 nohup bin/kafka-server-start.sh config/server.properties &

Alternatively, export JMX_PORT inside kafka-server-start.sh: add export JMX_PORT="9999" before the line that launches the broker (the final exec of kafka-run-class.sh).

When running the broker from the Kafka source code in IntelliJ IDEA, add the standard JMX system properties as VM options in the run configuration:

-Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false

In production, enable JMX authentication and SSL (the last two properties above turn both off) to prevent unauthorized access.

Locate the JMX port

Once JMX is enabled, the broker includes its JMX port in the registration it writes to the ZooKeeper node /brokers/ids/{brokerID}. In the example registration below, the jmx_port field holds the registered port:

{
  "features": {},
  "listener_security_protocol_map": {"PLAINTEXT": "PLAINTEXT"},
  "endpoints": ["PLAINTEXT://localhost:9092"],
  "jmx_port": 9999,
  "port": 9092,
  "host": "localhost",
  "version": 5,
  "timestamp": "1659670870502"
}
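If you need to discover JMX endpoints programmatically, this registration is enough. The sketch below is a hypothetical helper (BrokerJmxLocator and jmxServiceUrl are illustrative names, not part of Kafka) that extracts host and jmx_port with regular expressions and builds the RMI connector URL that jconsole also connects to. It assumes the JSON has already been read from ZooKeeper and treats a missing or non-positive jmx_port, or a missing host, as "JMX not enabled"; a production tool should use a real JSON parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: turns a broker registration (the JSON stored under
// /brokers/ids/{brokerID}) into a JMX service URL. Regex-based to stay
// dependency-free; a real tool should use a proper JSON parser.
final class BrokerJmxLocator {
    private static final Pattern HOST = Pattern.compile("\"host\"\\s*:\\s*\"([^\"]+)\"");
    private static final Pattern JMX_PORT = Pattern.compile("\"jmx_port\"\\s*:\\s*(-?\\d+)");

    // Returns the RMI connector URL, or empty when host or jmx_port is missing
    // or the port is not positive (assumed to mean JMX was not enabled).
    static Optional<String> jmxServiceUrl(String registrationJson) {
        Matcher host = HOST.matcher(registrationJson);
        Matcher port = JMX_PORT.matcher(registrationJson);
        if (!host.find() || !port.find()) {
            return Optional.empty();
        }
        int jmxPort = Integer.parseInt(port.group(1));
        if (jmxPort <= 0) {
            return Optional.empty();
        }
        return Optional.of("service:jmx:rmi:///jndi/rmi://" + host.group(1) + ":" + jmxPort + "/jmxrmi");
    }

    public static void main(String[] args) {
        // Trimmed copy of the registration shown above.
        String json = "{\"endpoints\": [\"PLAINTEXT://localhost:9092\"], \"jmx_port\": 9999,"
                + " \"port\": 9092, \"host\": \"localhost\", \"version\": 5}";
        System.out.println(jmxServiceUrl(json).orElse("JMX not enabled"));
    }
}
```

For the registration above this prints service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi, the same endpoint you would otherwise type into jconsole as localhost:9999.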

Connect with jconsole

Run the JDK tool jconsole from a terminal:

jconsole

Enter the broker's host and JMX port (local or remote). After connecting, open the MBeans tab to browse all exposed metrics.
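jconsole is convenient for browsing; for scripted checks, the same connection can be opened with the standard javax.management.remote API. A minimal sketch, assuming a broker on 127.0.0.1 started with JMX_PORT=9999. The optional user and password are placeholders for when JMX authentication is enabled as recommended for production. It reads the Value attribute of the UnderReplicatedPartitions gauge discussed later in this article; UrpProbe and its methods are hypothetical names.

```java
import java.net.MalformedURLException;
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical probe: reads UnderReplicatedPartitions from one broker over remote JMX.
final class UrpProbe {
    static final String URP_MBEAN = "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions";

    // Standard RMI connector URL for a JVM started with -Dcom.sun.management.jmxremote.port.
    static JMXServiceURL serviceUrl(String host, int port) {
        try {
            return new JMXServiceURL("service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid JMX endpoint " + host + ":" + port, e);
        }
    }

    static ObjectName urpName() {
        try {
            return new ObjectName(URP_MBEAN);
        } catch (MalformedObjectNameException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads the gauge's Value attribute. Pass user == null when JMX authentication is off.
    static Number readUrp(String host, int port, String user, String password) throws Exception {
        Map<String, Object> env = new HashMap<>();
        if (user != null) {
            env.put(JMXConnector.CREDENTIALS, new String[] {user, password});
        }
        try (JMXConnector connector = JMXConnectorFactory.connect(serviceUrl(host, port), env)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            return (Number) conn.getAttribute(urpName(), "Value");
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumes a local broker started with JMX_PORT=9999 and no JMX authentication.
        Number urp = readUrp("127.0.0.1", 9999, null, null);
        System.out.println("UnderReplicatedPartitions = " + urp);
        if (urp.intValue() > 0) {
            System.out.println("Some partitions led by this broker have a shrunken ISR");
        }
    }
}
```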

Metric attributes

Rate metrics (meters) such as MessagesInPerSec expose the following attributes:

RateUnit: the time unit of the rates, always SECONDS.

EventType: what is being counted, e.g. messages for message-related metrics.

Count: total number of events since the broker started.

MeanRate: average rate since the metric was created.

OneMinuteRate, FiveMinuteRate, FifteenMinuteRate: exponentially weighted moving averages over the respective time windows.
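To make the moving-average attributes concrete, here is a simplified exponentially weighted moving average in the style of the Yammer Metrics meters that back these broker MBeans. It is a sketch under stated assumptions: a fixed 5-second tick and a smoothing factor alpha = 1 - exp(-tick / 60 / windowMinutes), applied as rate += alpha * (instantRate - rate). The real library also handles concurrency and ticks lazily; Ewma is a hypothetical class name.

```java
// Simplified EWMA rate in the style of the Yammer Metrics meters behind Kafka's
// OneMinuteRate / FiveMinuteRate / FifteenMinuteRate attributes (sketch, single-threaded).
final class Ewma {
    static final double TICK_SECONDS = 5.0; // assumed tick interval

    private final double alpha;
    private double ratePerSecond;
    private boolean initialized;
    private long uncounted;

    Ewma(double windowMinutes) {
        // Smoothing factor: older ticks contribute exponentially less, decaying over windowMinutes.
        this.alpha = 1.0 - Math.exp(-TICK_SECONDS / 60.0 / windowMinutes);
    }

    // Records events that happened since the last tick.
    void mark(long events) {
        uncounted += events;
    }

    // Called once per TICK_SECONDS: folds the latest instantaneous rate into the average.
    void tick() {
        double instantRate = uncounted / TICK_SECONDS;
        uncounted = 0;
        if (initialized) {
            ratePerSecond += alpha * (instantRate - ratePerSecond);
        } else {
            ratePerSecond = instantRate; // the first tick seeds the average
            initialized = true;
        }
    }

    double ratePerSecond() {
        return ratePerSecond;
    }

    public static void main(String[] args) {
        Ewma m1 = new Ewma(1);
        Ewma m15 = new Ewma(15);
        // Two minutes at 100 messages per 5-second tick (20 msg/s), then one quiet minute.
        for (int i = 0; i < 36; i++) {
            long events = i < 24 ? 100 : 0;
            m1.mark(events);
            m15.mark(events);
            m1.tick();
            m15.tick();
        }
        System.out.printf("OneMinuteRate=%.2f FifteenMinuteRate=%.2f%n", m1.ratePerSecond(), m15.ratePerSecond());
    }
}
```

Under a constant load all three windows converge to the same value; after the load changes, the one-minute average reacts fastest and the fifteen-minute average slowest, which is why OneMinuteRate is the usual choice for "current" throughput.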

Example metric: MessagesInPerSec

The MBean kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec reports the broker's inbound message rate across all topics. Its OneMinuteRate attribute is commonly used as the per-second ingress rate, smoothed over roughly the last minute.
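Reading OneMinuteRate programmatically only needs an MBeanServerConnection and the object name above. So that the sketch runs without a broker, it registers a hypothetical read-only stand-in MBean (FixedMeter) under the real object name in the local JVM and reads it back through the same code path; against a real broker you would pass connector.getMBeanServerConnection() from a JMXConnector instead. MessagesInReader and its methods are illustrative names.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.Attribute;
import javax.management.AttributeList;
import javax.management.AttributeNotFoundException;
import javax.management.DynamicMBean;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanInfo;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

final class MessagesInReader {
    static final String MESSAGES_IN = "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec";

    // Works with any MBeanServerConnection: a remote broker, or the local JVM as in the demo below.
    static double oneMinuteRate(MBeanServerConnection conn) throws Exception {
        Object value = conn.getAttribute(new ObjectName(MESSAGES_IN), "OneMinuteRate");
        return ((Number) value).doubleValue();
    }

    // Hypothetical read-only stand-in for the broker's meter, used only so the sketch runs without Kafka.
    static final class FixedMeter implements DynamicMBean {
        private final Map<String, Object> values = new LinkedHashMap<>();

        FixedMeter(long count, double oneMinuteRate) {
            values.put("Count", count);
            values.put("OneMinuteRate", oneMinuteRate);
        }

        @Override
        public Object getAttribute(String name) throws AttributeNotFoundException {
            if (!values.containsKey(name)) {
                throw new AttributeNotFoundException(name);
            }
            return values.get(name);
        }

        @Override
        public AttributeList getAttributes(String[] names) {
            AttributeList list = new AttributeList();
            for (String name : names) {
                if (values.containsKey(name)) {
                    list.add(new Attribute(name, values.get(name)));
                }
            }
            return list;
        }

        @Override
        public void setAttribute(Attribute attribute) {
            throw new UnsupportedOperationException("read-only");
        }

        @Override
        public AttributeList setAttributes(AttributeList attributes) {
            return new AttributeList(); // nothing is writable
        }

        @Override
        public Object invoke(String action, Object[] params, String[] signature) {
            throw new UnsupportedOperationException(action);
        }

        @Override
        public MBeanInfo getMBeanInfo() {
            MBeanAttributeInfo[] attrs = {
                new MBeanAttributeInfo("Count", "long", "events since creation", true, false, false),
                new MBeanAttributeInfo("OneMinuteRate", "double", "1-minute EWMA, per second", true, false, false),
            };
            return new MBeanInfo(FixedMeter.class.getName(), "stand-in meter", attrs, null, null, null);
        }
    }

    // Registers the stand-in under the real object name in this JVM, reads it back, then unregisters it.
    static double readFromLocalStandIn(double rate) {
        try {
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName(MESSAGES_IN);
            mbs.registerMBean(new FixedMeter(0L, rate), name);
            try {
                return oneMinuteRate(mbs); // MBeanServer extends MBeanServerConnection
            } finally {
                mbs.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("OneMinuteRate = " + readFromLocalStandIn(1234.5));
    }
}
```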

UnderReplicatedPartitions metric

Object name

kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions

This gauge counts the partitions led by this broker whose in-sync replica set (ISR) is smaller than the assigned replica set, i.e. replicationFactor - isr.size > 0. In the broker's Scala source it is computed as:

leaderPartitionsIterator.count(_.isUnderReplicated)

def isUnderReplicated: Boolean = isLeader && (assignmentState.replicationFactor - isrState.isr.size) > 0

Because the metric is a Gauge, its Value attribute reflects the current number of such partitions, not a cumulative count.
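The same rule transcribed into Java, as a sketch: PartitionView is a hypothetical stand-in for Kafka's internal partition state that keeps only the fields the check needs, so the counting logic can be tried in isolation.

```java
import java.util.Arrays;
import java.util.List;

// Java transcription (sketch) of the broker-side rule behind UnderReplicatedPartitions.
final class UnderReplication {
    // Hypothetical stand-in for Kafka's per-partition state; only the fields the check needs.
    static final class PartitionView {
        final String topicPartition;
        final boolean isLeader;      // is this broker the partition leader?
        final int replicationFactor; // size of the assigned replica list
        final int isrSize;           // current size of the in-sync replica set

        PartitionView(String topicPartition, boolean isLeader, int replicationFactor, int isrSize) {
            this.topicPartition = topicPartition;
            this.isLeader = isLeader;
            this.replicationFactor = replicationFactor;
            this.isrSize = isrSize;
        }

        // Same condition as the Scala isUnderReplicated: leader only, ISR smaller than the assignment.
        boolean isUnderReplicated() {
            return isLeader && (replicationFactor - isrSize) > 0;
        }
    }

    // Equivalent of leaderPartitionsIterator.count(_.isUnderReplicated): the gauge's current value.
    static long underReplicatedPartitions(List<PartitionView> partitions) {
        return partitions.stream().filter(PartitionView::isUnderReplicated).count();
    }

    public static void main(String[] args) {
        List<PartitionView> onThisBroker = Arrays.asList(
                new PartitionView("orders-0", true, 3, 3),   // healthy leader
                new PartitionView("orders-1", true, 3, 2),   // leader, one follower out of the ISR
                new PartitionView("orders-2", false, 3, 1)); // follower here, so not counted on this broker
        System.out.println("UnderReplicatedPartitions = " + underReplicatedPartitions(onThisBroker));
    }
}
```

Here main prints 1: only orders-1 is both led by this broker and short of an in-sync replica. orders-2 is counted, if at all, by whichever broker leads it.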

Problem analysis

Broker failure: when a broker goes down, the surviving brokers show a spike in UnderReplicatedPartitions, because the partitions they lead are now missing the replica hosted on the failed broker from their ISR.

Disk problems: offline or full log directories make the replicas stored on them unavailable. The metric kafka.log:type=LogManager,name=OfflineLogDirectoryCount reports the number of offline directories, and individual directories can be inspected via:

kafka.log:type=LogManager,name=LogDirectoryOffline,logDirectory="..."

Performance bottlenecks: when followers replicate too slowly (for example because of long GC pauses or saturated disk I/O), they drop out of the ISR. To diagnose, check the GC log (kafkaServer-gc.log) and the broker logs for fetch errors such as "Error sending fetch request ..." or "Failed to connect within $socketTimeout ms".
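For the disk case, the per-directory gauges can be listed with a wildcard ObjectName query instead of naming each directory. A sketch, assuming each LogDirectoryOffline gauge exposes a Value of 1 for an offline directory and 0 otherwise, and that the directory is carried in the logDirectory key as shown above. OfflineLogDirs and its methods are hypothetical names: collect and offlineCount need a live MBeanServerConnection (for example from JMXConnectorFactory.connect), while offline is plain logic that can be checked without a broker. The directory paths in main are made-up sample readings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper: finds offline log directories via the per-directory gauges.
final class OfflineLogDirs {
    static final String OFFLINE_COUNT = "kafka.log:type=LogManager,name=OfflineLogDirectoryCount";
    static final String PER_DIR_PATTERN = "kafka.log:type=LogManager,name=LogDirectoryOffline,*";

    // Broker-wide count, useful as the primary alerting signal.
    static int offlineCount(MBeanServerConnection conn) throws Exception {
        return ((Number) conn.getAttribute(new ObjectName(OFFLINE_COUNT), "Value")).intValue();
    }

    // logDirectory -> gauge value, for every LogDirectoryOffline MBean the broker exposes.
    static Map<String, Number> collect(MBeanServerConnection conn) throws Exception {
        Map<String, Number> byDir = new TreeMap<>();
        for (ObjectName name : conn.queryNames(new ObjectName(PER_DIR_PATTERN), null)) {
            String dir = name.getKeyProperty("logDirectory");
            if (dir == null) {
                continue;
            }
            if (dir.startsWith("\"")) {
                dir = ObjectName.unquote(dir); // the key value may be a quoted ObjectName value
            }
            byDir.put(dir, (Number) conn.getAttribute(name, "Value"));
        }
        return byDir;
    }

    // Pure logic: directories whose gauge is non-zero (assumed 1 = offline, 0 = online).
    static List<String> offline(Map<String, Number> byDir) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Number> entry : byDir.entrySet()) {
            if (entry.getValue().intValue() != 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical readings, as collect(...) might return them for a broker with two log dirs.
        Map<String, Number> readings = new TreeMap<>();
        readings.put("/data/kafka-logs-1", 0);
        readings.put("/data/kafka-logs-2", 1);
        System.out.println("Offline log directories: " + offline(readings));
    }
}
```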

Remediation

Increase replica.lag.time.max.ms (default 10 s; raised to 30 s in Kafka 2.5) to give followers more time before they are removed from the ISR.

Increase num.replica.fetchers (default 1), the number of fetcher threads used to replicate from each source broker, to raise I/O parallelism on the follower side.

Monitor OfflineLogDirectoryCount and the per-directory LogDirectoryOffline metrics to detect offline log directories quickly, so the affected disks can be repaired or replaced.
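Why raising replica.lag.time.max.ms reduces ISR churn is easiest to see with a simplified model of the leader's check. The sketch assumes the rule is: a follower is removed from the ISR when it is behind the leader's log end offset and its last caught-up time is older than replica.lag.time.max.ms. It ignores the scheduling, locking and epoch handling of the real broker code; Follower and IsrLagCheck are hypothetical names, and the offsets and timestamps in main are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the time-based ISR rule that replica.lag.time.max.ms controls (sketch).
final class IsrLagCheck {
    // Hypothetical snapshot of what the leader tracks per follower.
    static final class Follower {
        final int brokerId;
        final long logEndOffset;
        final long lastCaughtUpTimeMs; // last time the follower had fetched up to the leader's end offset

        Follower(int brokerId, long logEndOffset, long lastCaughtUpTimeMs) {
            this.brokerId = brokerId;
            this.logEndOffset = logEndOffset;
            this.lastCaughtUpTimeMs = lastCaughtUpTimeMs;
        }
    }

    // Out of sync: behind the leader AND not caught up within the allowed lag time.
    static boolean isOutOfSync(Follower f, long leaderEndOffset, long nowMs, long replicaLagTimeMaxMs) {
        return f.logEndOffset != leaderEndOffset && nowMs - f.lastCaughtUpTimeMs > replicaLagTimeMaxMs;
    }

    // Broker ids the leader would drop from the ISR under this model.
    static List<Integer> outOfSync(List<Follower> followers, long leaderEndOffset, long nowMs, long maxLagMs) {
        List<Integer> ids = new ArrayList<>();
        for (Follower f : followers) {
            if (isOutOfSync(f, leaderEndOffset, nowMs, maxLagMs)) {
                ids.add(f.brokerId);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        long now = 100_000L;
        long leaderEnd = 5_000L;
        List<Follower> followers = Arrays.asList(
                new Follower(2, 5_000L, 99_000L),  // fully caught up
                new Follower(3, 4_200L, 80_000L)); // behind, last caught up 20 s ago (e.g. after a long GC pause)
        System.out.println("10 s lag limit drops: " + outOfSync(followers, leaderEnd, now, 10_000L));
        System.out.println("30 s lag limit drops: " + outOfSync(followers, leaderEnd, now, 30_000L));
    }
}
```

With a 10 s limit the follower that stalled for 20 s is dropped, and its partition counts toward UnderReplicatedPartitions on the leader; with 30 s it stays in the ISR. The flip side is that a follower that is genuinely stuck also stays in the ISR longer before the metric reflects it.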

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
