Kafka Configuration, Monitoring, and Performance Optimization Best Practices

This article summarizes practical Kafka best‑practice guidelines covering hardware sizing, OS and JVM tuning, disk layout choices, replica and controller settings, broker and topic evaluation, as well as producer and consumer configuration, monitoring metrics, and strategies to prevent data loss.

Big Data Technology & Architecture

May 20, 2019

Kafka Basic Configuration and Performance Optimization

This section outlines the essential configuration parameters for a Kafka cluster, focusing on hardware, OS, JVM, and disk layout.

Hardware Requirements

Ensure sufficient CPU, memory, and network capacity; use multiple dedicated disks for Kafka data.

OS Tuning

Page cache should hold all active segments.

File descriptor limit: 100k+.

Disable swapping to avoid disk I/O latency.

TCP and JVM tuning (JDK 8 with G1 GC, 6‑8 GB heap).
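As a minimal sketch of checking the file-descriptor guideline above on a Unix-like host, the following uses Python's standard `resource` module. The 100,000 threshold comes from the guideline; the helper name `fd_limit_ok` is illustrative and not part of any Kafka tooling.

```python
import resource

MIN_OPEN_FILES = 100_000  # threshold taken from the guideline above


def fd_limit_ok(soft_limit, minimum=MIN_OPEN_FILES):
    """Return True if the soft open-file limit meets the minimum."""
    # RLIM_INFINITY means "no limit", which always satisfies the minimum.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= minimum


if __name__ == "__main__":
    # Limits of the current process; run this as the user that starts the broker.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard} ok={fd_limit_ok(soft)}")
```

The check only reports the limit of the process that runs it, so it is most useful when executed under the same account and service manager that launch the broker.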

Kafka Disk Storage

Prefer multiple disks (JBOD) or RAID‑10.

JBOD drawbacks: a single-disk failure shuts down the broker (on Kafka versions before 1.0), there are no consistency guarantees across disks, and the directory hierarchy is more complex.

RAID‑10 advantages: tolerates one disk failure, balanced load, better performance.

Recommended file systems: EXT4 or XFS, on SSDs where possible.

Basic Monitoring

Key health metrics include CPU load, network traffic, file‑handle usage, disk space and I/O, GC activity, and ZooKeeper status.

Kafka Replica Configuration and Monitoring

Replication Settings

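The leader maintains the in-sync replica set (ISR).

replica.lag.time.max.ms: default 10000 ms (raised to 30000 ms in Kafka 2.5); a follower that lags for longer than this is removed from the ISR.

num.replica.fetchers: default 1; the number of fetcher threads used to replicate from each source broker.

min.insync.replicas: the minimum number of in-sync replicas that must acknowledge a write when the producer uses acks=all; this is what backs the producer's durability guarantee.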

Under‑Replicated Partitions

Monitor kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. Common causes include broker, controller, ZooKeeper, or network failures. Remedies involve adjusting ISR settings or expanding the broker pool.

Controller

Manages partition lifecycle.

Avoid controller ZooKeeper session timeouts, which are associated with ISR jitter, poor ZooKeeper performance, long GC pauses, and network I/O problems.

Metrics: kafka.controller:type=KafkaController,name=ActiveControllerCount (exactly one broker in the cluster should report 1) and LeaderElectionRate.

Unclean Leader Election

The unclean.leader.election.enable setting (default true before Kafka 0.11, false since) allows a non-ISR replica to become leader, favoring availability over correctness. Monitor kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec.
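The replica and controller metrics in this section can be combined into a simple health check. The sketch below assumes the metric values have already been collected per broker (for example from JMX); the function name and the sample values are hypothetical.

```python
def replica_health_alerts(under_replicated, active_controllers,
                          unclean_elections_per_sec):
    """Return alert strings for the replica and controller metrics.

    under_replicated: dict of broker id -> UnderReplicatedPartitions value
    active_controllers: dict of broker id -> ActiveControllerCount value
    unclean_elections_per_sec: cluster-wide UncleanLeaderElectionsPerSec rate
    """
    alerts = []
    for broker_id, count in sorted(under_replicated.items()):
        if count > 0:
            alerts.append(
                f"broker {broker_id}: {count} under-replicated partitions")
    # Exactly one broker in the cluster should report itself as controller.
    total_controllers = sum(active_controllers.values())
    if total_controllers != 1:
        alerts.append(
            f"expected 1 active controller, found {total_controllers}")
    if unclean_elections_per_sec > 0:
        alerts.append("unclean leader elections occurred; possible data loss")
    return alerts


# Hypothetical sample values for a three-broker cluster.
sample = replica_health_alerts(
    under_replicated={1: 0, 2: 3, 3: 0},
    active_controllers={1: 0, 2: 1, 3: 0},
    unclean_elections_per_sec=0.0,
)
print(sample)  # ['broker 2: 3 under-replicated partitions']
```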

Broker Configuration

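log.retention.{ms,minutes,hours} and log.retention.bytes: data retention by time and by size.

message.max.bytes and replica.fetch.max.bytes: the maximum message size and the replica fetch size; size them together so that the largest accepted message can still be replicated.

delete.topic.enable: default false before Kafka 1.0, true since.

unclean.leader.election.enable: set to false to avoid data loss.

min.insync.replicas=2: the number of in-sync replicas that must acknowledge a write from a producer using acks=all.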

Other important settings: replica.lag.time.max.ms, replica.fetch.response.max.bytes, zookeeper.session.timeout.ms=30000 (30 s), num.io.threads=8.
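To make the broker settings above concrete, here is a small sketch that renders them as `server.properties` lines. The setting names are Kafka's own; the retention value is an assumed placeholder, and the helper `to_properties` is illustrative only.

```python
# Broker settings discussed in this section.
broker_settings = {
    "log.retention.hours": 168,                 # assumed value: 7 days
    "unclean.leader.election.enable": "false",  # avoid data loss
    "min.insync.replicas": 2,
    "zookeeper.session.timeout.ms": 30000,      # 30 s
    "num.io.threads": 8,
}


def to_properties(settings):
    """Render a dict as key=value lines in insertion order."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


print(to_properties(broker_settings))
```

Keeping the settings in one structure like this makes it easier to review them side by side before they are written to each broker.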

Kafka Cluster Evaluation

Broker: keep fewer than 2,000 partitions per broker and each partition below 25 GB.

Scale brokers based on data retention period and traffic volume.

Maintain disk usage < 60 % and network usage < 75 %.

Ensure balanced partition distribution and avoid resource exhaustion.
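The sizing limits above translate into a rough broker-count estimate. This is a back-of-the-envelope sketch, not a capacity-planning tool: the input figures are hypothetical, and it assumes that the per-broker partition limit counts replicas and that retained data is spread evenly across brokers.

```python
import math


def brokers_for_disk(ingest_mb_per_sec, retention_hours, replication_factor,
                     disk_gb_per_broker, max_disk_utilization=0.60):
    """Brokers needed so retained data stays under the disk usage ceiling."""
    retained_gb = (ingest_mb_per_sec * 3600 * retention_hours
                   * replication_factor / 1024)
    usable_gb = disk_gb_per_broker * max_disk_utilization
    return math.ceil(retained_gb / usable_gb)


def brokers_for_partitions(total_partition_replicas, max_per_broker=2000):
    """Brokers needed to stay under the per-broker partition limit."""
    return math.ceil(total_partition_replicas / max_per_broker)


# Hypothetical workload: 50 MB/s ingest, 72 h retention, replication factor 3,
# 8,000 GB of disk per broker, and 9,000 partition replicas in total.
by_disk = brokers_for_disk(50, 72, 3, 8000)
by_partitions = brokers_for_partitions(9000)
print(by_disk, by_partitions, max(by_disk, by_partitions))  # 8 5 8
```

The larger of the two results is the binding constraint; network usage (the 75 % ceiling) would need a similar check against measured traffic.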

Broker Monitoring

PartitionCount, LeaderCount, IsrExpandsPerSec.

Message in/out rates, NetworkProcessorAvgIdlePercent, RequestHandlerAvgIdlePercent.

Topic Evaluation

Partition count should match or exceed the maximum consumer threads.

Keep partition size below 25 GB and plan for future growth.

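For keyed topics, plan the partition count up front, because adding partitions later changes which partition a key maps to; consider automating partition expansion.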

Partition Sizing Guidelines

Determine required partitions by comparing target throughput (T) with per‑partition producer (P) and consumer (C) throughput; choose the larger of T/P or T/C.

More partitions increase throughput but also raise file‑descriptor usage, potential unavailability, latency, and client memory consumption.
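The sizing rule above, the larger of T/P and T/C, can be sketched as a small helper. The throughput figures in the example are made up; in practice P and C would come from benchmarks on the target hardware.

```python
import math


def required_partitions(target_mb_s, producer_mb_s_per_partition,
                        consumer_mb_s_per_partition):
    """Return max(T/P, T/C), rounded up to a whole partition count."""
    by_producer = target_mb_s / producer_mb_s_per_partition
    by_consumer = target_mb_s / consumer_mb_s_per_partition
    return math.ceil(max(by_producer, by_consumer))


# T = 200 MB/s, P = 20 MB/s per partition, C = 50 MB/s per partition.
print(required_partitions(200, 20, 50))  # 10
```

Here the producer side is the bottleneck (200/20 = 10 versus 200/50 = 4), so ten partitions are needed to reach the target.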

Producer Configuration, Tuning, and Monitoring

Quotas

Set byte‑rate limits for produce/fetch requests to protect against abusive clients.

Producer Settings

Use the Java client and kafka-producer-perf-test.sh for benchmarking.

Key parameters: batch.size, linger.ms, max.in.flight.requests.per.connection (default 5), compression.type, acks.

Avoid large messages to reduce memory pressure and broker load.

Performance Tuning

If throughput is below network capacity, increase threads, raise batch.size, add more producer instances, or add partitions.

When acks=-1 (equivalent to acks=all) adds latency, increase num.replica.fetchers so that followers catch up faster.

For cross‑data‑center traffic, enlarge socket and OS TCP buffers.
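A common starting point for sizing those buffers is the bandwidth-delay product of the link. The sketch below computes it; the link speed and round-trip time are hypothetical, and the result is only an initial value for settings such as the broker's `socket.send.buffer.bytes` and `socket.receive.buffer.bytes`, the clients' `send.buffer.bytes` and `receive.buffer.bytes`, and the OS limits (`net.core.rmem_max` and `net.core.wmem_max` on Linux).

```python
def bandwidth_delay_product_bytes(bandwidth_mbit_per_sec, rtt_ms):
    """Bytes that can be in flight on a link: bandwidth times round-trip time."""
    bytes_per_sec = bandwidth_mbit_per_sec * 1_000_000 / 8
    return int(bytes_per_sec * rtt_ms / 1000)


# Hypothetical link: 1 Gbit/s between data centers with a 50 ms round trip.
print(bandwidth_delay_product_bytes(1000, 50))  # 6250000
```

A buffer smaller than this value cannot keep the link full, which is why default buffer sizes tuned for a local network tend to cap cross-data-center throughput.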

Producer Monitoring

Metrics: batch-size-avg, compression-rate-avg, waiting-threads, buffer-available-bytes, record-queue-time-max, record-send-rate, records-per-request-avg.

Consumer Configuration, Tuning, and Monitoring

Consumer Settings

Test with kafka-consumer-perf-test.sh.

Key parameters: fetch.min.bytes, fetch.max.wait.ms, max.poll.interval.ms, max.poll.records (default 500), session.timeout.ms.

Monitor consumer lag via records-lag-max or bin/kafka-consumer-groups.sh; tools like LinkedIn’s Burrow are useful.

Reduce lag by analyzing GC or hangs, adding consumer threads, or increasing partitions.
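Consumer lag is the gap between a partition's log end offset and the group's committed offset. The sketch below computes it from two offset maps; the offsets are sample data, and treating a partition with no committed offset as starting from 0 is a simplifying assumption.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return a dict of partition -> lag (messages not yet consumed)."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset starts from 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1100}
lags = partition_lag(end_offsets, committed)
print(lags)                # {0: 0, 1: 100, 2: 900}
print(max(lags.values()))  # 900
```

Looking at lag per partition, rather than only the maximum, helps distinguish a single stuck consumer from a group that is uniformly too slow.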

Offset Management

Committed offsets are stored in the internal topic __consumer_offsets; set offsets.topic.replication.factor=3 and offsets.retention.minutes=1440.

Prefer asynchronous or manual commits for faster offset updates.

Ensuring No Data Loss

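Producer: block.on.buffer.full=true (deprecated since Kafka 0.9 in favor of max.block.ms), retries=Integer.MAX_VALUE (retries is an int setting, so Long.MAX_VALUE is out of range), acks=all, max.in.flight.requests.per.connection=1, and close the producer gracefully on shutdown.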

Broker: replication factor ≥ 3, min.insync.replicas=2, disable unclean leader election.

Consumer: disable automatic offset commits (enable.auto.commit=false) and commit offsets only after the messages have been processed.
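The checklist in this section can be expressed as a small configuration audit. This is a sketch under stated assumptions: it takes `enable.auto.commit` to be the consumer setting for disabling automatic commits, compares values as strings against the checklist, and uses a hypothetical helper name, `durability_violations`.

```python
# Settings the checklist asks for, per role.
REQUIRED = {
    "producer": {
        "acks": "all",
        "max.in.flight.requests.per.connection": "1",
    },
    "broker": {
        "unclean.leader.election.enable": "false",
        "min.insync.replicas": "2",
    },
    "consumer": {
        "enable.auto.commit": "false",
    },
}


def durability_violations(role, config):
    """Return a list describing settings that differ from the checklist."""
    violations = []
    for key, expected in REQUIRED[role].items():
        actual = str(config.get(key, "<unset>"))
        # acks=-1 is another spelling of acks=all.
        if key == "acks" and actual == "-1":
            actual = "all"
        if actual != expected:
            violations.append(f"{key}: expected {expected}, got {actual}")
    return violations


print(durability_violations("producer", {"acks": "1"}))
```

An empty list means the configuration matches the checklist; it does not prove the absence of data loss, since replication factor and shutdown handling are outside what this check can see.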

References: Apache Kafka Best Practices, translated guides, topic/partition sizing articles, and file‑descriptor handling resources.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
