Kafka Configuration, Monitoring, and Performance Optimization Best Practices
This article summarizes practical Kafka best‑practice guidelines covering hardware sizing, OS and JVM tuning, disk layout choices, replica and controller settings, broker and topic evaluation, as well as producer and consumer configuration, monitoring metrics, and strategies to prevent data loss.
Kafka Basic Configuration and Performance Optimization
This section outlines the essential configuration parameters for a Kafka cluster, focusing on hardware, OS, JVM, and disk layout.
Hardware Requirements
Ensure sufficient CPU, memory, and network capacity; use multiple dedicated disks for Kafka data.
OS Tuning
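Leave enough memory for the OS page cache to hold all active log segments.
Raise the file descriptor limit to 100,000 or more.
Disable swapping so that broker memory is never paged out to disk, which would add disk I/O latency.
Tune TCP settings and the JVM (JDK 8 with the G1 garbage collector and a 6-8 GB heap).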
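The file descriptor guideline can be checked with a few lines of Python. This is a minimal sketch that assumes a Unix-like host; it inspects the limit of the process that runs it, so it only reflects the broker's limit when started by the same user and shell as the broker. The function name and the constant are illustrative.

```python
import resource

# Threshold taken from the guideline above (100k+ descriptors).
RECOMMENDED_NOFILE = 100_000


def nofile_limit_ok(minimum: int = RECOMMENDED_NOFILE) -> bool:
    """Return True when this process's soft open-file limit meets the minimum."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which always satisfies the check.
    return soft == resource.RLIM_INFINITY or soft >= minimum


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"soft={soft} hard={hard} ok={nofile_limit_ok()}")
```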
Kafka Disk Storage
Prefer multiple disks (JBOD) or RAID‑10.
JBOD drawbacks: a single disk failure can shut the broker down (releases before Kafka 1.0 did not tolerate a failed log directory), data consistency is not guaranteed, and the directory hierarchy is more complex.
RAID‑10 advantages: tolerates one disk failure, balanced load, better performance.
Recommended file systems: ext4 or XFS, on SSDs.
Basic Monitoring
Key health metrics include CPU load, network traffic, file‑handle usage, disk space and I/O, GC activity, and ZooKeeper status.
Kafka Replica Configuration and Monitoring
Replication Settings
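The leader maintains the in-sync replica set (ISR).
replica.lag.time.max.ms: default 10000 ms in older releases (30000 ms since Kafka 2.5); a follower is removed from the ISR if it lags behind for longer than this.
num.replica.fetchers: default 1; the number of fetcher threads used to replicate from each source broker.
min.insync.replicas: the minimum number of in-sync replicas that must acknowledge a write when the producer uses acks=all, which is how producers obtain a durability guarantee.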
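The ISR rule can be illustrated with a small model. This is a simplified sketch, not broker code: it assumes each follower is described only by the timestamp at which it last caught up with the leader, and the function and argument names are hypothetical.

```python
# Value quoted in the text above; treat it as configurable.
REPLICA_LAG_TIME_MAX_MS = 10_000


def in_sync_replicas(leader_id, last_caught_up_ms, now_ms,
                     max_lag_ms=REPLICA_LAG_TIME_MAX_MS):
    """Return the broker ids that remain in the ISR.

    last_caught_up_ms maps a follower's broker id to the time (in ms) at
    which it last caught up with the leader's log end.
    """
    isr = [leader_id]  # the leader is always part of its own ISR
    for broker_id, caught_up in sorted(last_caught_up_ms.items()):
        if now_ms - caught_up <= max_lag_ms:
            isr.append(broker_id)
    return isr


# Follower 2 caught up 5 s ago and stays; follower 3 is 20 s behind and is dropped.
print(in_sync_replicas(1, {2: 95_000, 3: 80_000}, now_ms=100_000))
```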
Under‑Replicated Partitions
Monitor kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. Common causes include broker, controller, ZooKeeper, or network failures. Remedies involve adjusting ISR settings or expanding the broker pool.
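A partition counts as under-replicated when its ISR is smaller than its assigned replica list. The sketch below applies that definition to a plain dictionary; the data shape is an assumption made for illustration and does not match any particular admin API.

```python
def under_replicated(partitions):
    """Return the (topic, partition) keys whose ISR is smaller than the replica set.

    partitions maps (topic, partition) -> {"replicas": [...], "isr": [...]}.
    """
    return sorted(
        tp for tp, state in partitions.items()
        if len(state["isr"]) < len(state["replicas"])
    )


# Hypothetical cluster snapshot.
snapshot = {
    ("orders", 0): {"replicas": [1, 2, 3], "isr": [1, 2, 3]},
    ("orders", 1): {"replicas": [2, 3, 1], "isr": [2]},
}
print(under_replicated(snapshot))
```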
Controller
Manages partition lifecycle.
Prevent the controller's ZooKeeper session from timing out; related factors include ISR jitter, ZooKeeper performance, long GC pauses, and network I/O.
Metrics: kafka.controller:type=KafkaController,name=ActiveControllerCount (exactly one broker in the cluster should report 1) and kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs.
Unclean Leader Election
The unclean.leader.election.enable setting (default true before Kafka 0.11, false since) allows a non-ISR replica to become leader, favoring availability over correctness. Monitor kafka.controller:type=ControllerStats,name=UncleanLeaderElectionsPerSec.
Broker Configuration
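log.retention.{ms,minutes,hours} and log.retention.bytes: data retention by time and by size.
message.max.bytes and replica.fetch.max.bytes: the largest message the broker accepts and the fetch size used by replicas; keep them consistent so that large messages can still be replicated.
delete.topic.enable: default false in older releases (true since Kafka 1.0).
unclean.leader.election.enable: set to false to avoid data loss.
min.insync.replicas=2: the number of in-sync replicas that must acknowledge a write when producers use acks=all.
Other important settings: replica.lag.time.max.ms, replica.fetch.response.max.bytes, zookeeper.session.timeout.ms=30000 (30 s), num.io.threads=8.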
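Several of these settings interact, so a small consistency check can catch mistakes before a rollout. The rules below are conservative rules of thumb chosen for this sketch rather than checks performed by Kafka itself, and the function name is hypothetical.

```python
def check_broker_config(cfg, replication_factor):
    """Return a list of warnings for a dict of broker settings."""
    warnings = []
    # Conservative rule: replicas should be able to fetch the largest message.
    if cfg["replica.fetch.max.bytes"] < cfg["message.max.bytes"]:
        warnings.append("replica.fetch.max.bytes < message.max.bytes")
    # With min.insync.replicas == replication factor, one broker outage
    # blocks every acks=all write.
    if cfg["min.insync.replicas"] >= replication_factor:
        warnings.append("min.insync.replicas leaves no room for a broker outage")
    if cfg["unclean.leader.election.enable"]:
        warnings.append("unclean leader election is enabled")
    return warnings


# Hypothetical settings for a topic with three replicas.
safe = {
    "message.max.bytes": 1_000_000,
    "replica.fetch.max.bytes": 1_048_576,
    "min.insync.replicas": 2,
    "unclean.leader.election.enable": False,
}
print(check_broker_config(safe, replication_factor=3))
```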
Kafka Cluster Evaluation
Broker: keep fewer than 2,000 partitions per broker and keep each partition under 25 GB.
Scale brokers based on data retention period and traffic volume.
Maintain disk usage < 60 % and network usage < 75 %.
Ensure balanced partition distribution and avoid resource exhaustion.
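The thresholds above are easy to turn into an automated check. The following sketch assumes that partition count, disk usage, and network usage have already been collected by some monitoring system; the names are illustrative.

```python
# Limits quoted in the guidelines above.
MAX_PARTITIONS_PER_BROKER = 2000
MAX_DISK_USAGE = 0.60
MAX_NETWORK_USAGE = 0.75


def broker_warnings(partition_count, disk_usage, network_usage):
    """Return warnings for one broker; usage values are fractions from 0 to 1."""
    warnings = []
    if partition_count >= MAX_PARTITIONS_PER_BROKER:
        warnings.append(f"{partition_count} partitions on one broker")
    if disk_usage >= MAX_DISK_USAGE:
        warnings.append(f"disk usage at {disk_usage:.0%}")
    if network_usage >= MAX_NETWORK_USAGE:
        warnings.append(f"network usage at {network_usage:.0%}")
    return warnings


print(broker_warnings(2500, 0.70, 0.50))
```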
Broker Monitoring
PartitionCount, LeaderCount, IsrExpandsPerSec.
Message in/out rates, NetworkProcessorAvgIdlePercent, RequestHandlerAvgIdlePercent.
Topic Evaluation
Partition count should match or exceed the maximum consumer threads.
Keep partition size around 25 GB and plan for future growth.
Use keyed topics and consider automatic partition expansion.
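The first two guidelines can be combined into a single estimate: take enough partitions to keep each one near 25 GB, but never fewer than the number of consumer threads. The growth factor in this sketch is a hypothetical planning parameter, not a Kafka setting.

```python
import math

# Target partition size from the guideline above (25 GB, taken here as GiB).
TARGET_PARTITION_BYTES = 25 * 1024**3


def partitions_for_topic(retained_bytes, max_consumer_threads, growth_factor=1.0):
    """Estimate a partition count from retained data volume and consumer parallelism."""
    by_size = math.ceil(retained_bytes * growth_factor / TARGET_PARTITION_BYTES)
    return max(by_size, max_consumer_threads, 1)


# 500 GiB retained, 12 consumer threads, planning for 50% growth.
print(partitions_for_topic(500 * 1024**3, 12, growth_factor=1.5))
```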
Partition Sizing Guidelines
Determine required partitions by comparing target throughput (T) with per‑partition producer (P) and consumer (C) throughput; choose the larger of T/P or T/C.
More partitions increase throughput but also raise file‑descriptor usage, potential unavailability, latency, and client memory consumption.
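The throughput rule amounts to max(T/P, T/C), rounded up to a whole number of partitions. A minimal sketch, assuming all three figures are measured in the same unit (for example MB/s):

```python
import math


def partitions_for_throughput(target, per_partition_produce, per_partition_consume):
    """Return the partition count needed to reach the target throughput.

    target is T, per_partition_produce is P, per_partition_consume is C.
    """
    by_producer = math.ceil(target / per_partition_produce)
    by_consumer = math.ceil(target / per_partition_consume)
    return max(by_producer, by_consumer)


# T = 1000 MB/s, P = 50 MB/s, C = 100 MB/s: the producer side dominates.
print(partitions_for_throughput(1000, 50, 100))
```

P and C are best measured on the actual hardware, for instance with the perf-test scripts mentioned in the producer and consumer sections.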
Producer Configuration, Tuning, and Monitoring
Quotas
Set byte‑rate limits for produce/fetch requests to protect against abusive clients.
Producer Settings
Use the Java client and kafka-producer-perf-test.sh for benchmarking.
Key parameters: batch.size, linger.ms, max.in.flight.requests.per.connection (default 5), compression.type, acks.
Avoid large messages to reduce memory pressure and broker load.
Performance Tuning
If throughput is below network capacity, increase threads, raise batch.size, add more producer instances, or add partitions.
When acks=-1 causes latency, increase num.replica.fetchers.
For cross‑data‑center traffic, enlarge socket and OS TCP buffers.
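A common starting point for sizing buffers on long-distance links is the bandwidth-delay product: the number of bytes that can be in flight on the link at once. The sketch below computes it; using the result for the Kafka socket buffer settings (such as send.buffer.bytes and receive.buffer.bytes on clients) is a rule of thumb assumed here, not a recommendation from the source.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_sec, rtt_ms):
    """Return the bytes in flight on a link with the given bandwidth and round-trip time."""
    # bits/s * ms / (8 bits per byte * 1000 ms per s)
    return bandwidth_bits_per_sec * rtt_ms // 8000


# A 1 Gbit/s link with an 80 ms round trip.
print(bandwidth_delay_product_bytes(1_000_000_000, 80))
```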
Producer Monitoring
Metrics: batch-size-avg, compression-rate-avg, waiting-threads, buffer-available-bytes, record-queue-time-max, record-send-rate, records-per-request-avg.
Consumer Configuration, Tuning, and Monitoring
Consumer Settings
Test with kafka-consumer-perf-test.sh.
Key parameters: fetch.min.bytes, fetch.max.wait.ms, max.poll.interval.ms, max.poll.records (default 500), session.timeout.ms.
Monitor consumer lag via records-lag-max or bin/kafka-consumer-groups.sh; tools like LinkedIn’s Burrow are useful.
Reduce lag by analyzing GC or hangs, adding consumer threads, or increasing partitions.
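Consumer lag for a partition is the distance between the log end offset and the group's committed offset. The sketch below computes it from two plain dictionaries; the input shape is an assumption, and treating a partition without a committed offset as starting from 0 is a simplification that ignores the log start offset.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Return the lag per partition as log end offset minus committed offset."""
    lag = {}
    for tp, end in log_end_offsets.items():
        committed = committed_offsets.get(tp, 0)  # simplification: no commit means 0
        lag[tp] = max(end - committed, 0)
    return lag


def max_lag(log_end_offsets, committed_offsets):
    """Return the largest lag across all partitions, or 0 when there are none."""
    lags = partition_lag(log_end_offsets, committed_offsets)
    return max(lags.values(), default=0)


# Hypothetical offsets for a two-partition topic.
ends = {("orders", 0): 1200, ("orders", 1): 900}
commits = {("orders", 0): 1150, ("orders", 1): 400}
print(partition_lag(ends, commits), max_lag(ends, commits))
```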
Offset Management
Offsets are stored in the internal topic __consumer_offsets; set offsets.topic.replication.factor=3 and offsets.retention.minutes=1440.
Prefer asynchronous or manual commits for faster offset updates.
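With manual commits, the order of processing and committing decides what happens after a crash. The sketch below uses a hypothetical in-memory stand-in for a partition (not a real Kafka client) to show one common pattern: commit only after a whole batch has been handled, so that a failure causes reprocessing rather than loss.

```python
class InMemoryLog:
    """Hypothetical stand-in for one partition plus its committed offset."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0  # next offset to read after a restart

    def poll(self, max_records):
        """Return up to max_records (offset, value) pairs from the committed position."""
        indexed = list(enumerate(self.records))
        return indexed[self.committed:self.committed + max_records]

    def commit(self, next_offset):
        self.committed = next_offset


def consume_at_least_once(log, handle, max_records=500):
    """Process every record, committing only after a whole batch succeeded."""
    while True:
        batch = log.poll(max_records)
        if not batch:
            return
        for _offset, value in batch:
            handle(value)  # if this raises, nothing in the batch is committed
        log.commit(batch[-1][0] + 1)


seen = []
log = InMemoryLog(["a", "b", "c"])
consume_at_least_once(log, seen.append, max_records=2)
print(seen, log.committed)
```

The trade-off is that a record may be handled more than once after a failure, so handlers should be idempotent.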
Ensuring No Data Loss
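Producer: block.on.buffer.full=true (a legacy setting; newer clients bound blocking with max.block.ms instead), retries=Integer.MAX_VALUE (retries is a 32-bit integer setting), acks=all, max.in.flight.requests.per.connection=1, and a graceful shutdown that closes the producer so that buffered records are flushed.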
Broker: replication factor ≥ 3, min.insync.replicas=2, disable unclean leader election.
Consumer: disable automatic offset commits (enable.auto.commit=false) and commit offsets only after the records have been processed.
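The durability checklist can be captured as three small property sets. This sketch makes a few assumptions: it uses Java-client property names, writes retries as the largest 32-bit integer because that setting is an int in the Java client, leaves out block.on.buffer.full because newer clients no longer have it, expresses "commit offsets only after processing" as enable.auto.commit=false, and uses default.replication.factor to express the replication-factor rule. The helper name is hypothetical.

```python
# Producer side: wait for all in-sync replicas and retry indefinitely.
PRODUCER = {
    "acks": "all",
    "retries": 2_147_483_647,  # Integer.MAX_VALUE
    "max.in.flight.requests.per.connection": 1,
}

# Broker side: three copies, two of which must acknowledge each write.
BROKER = {
    "default.replication.factor": 3,
    "min.insync.replicas": 2,
    "unclean.leader.election.enable": "false",
}

# Consumer side: the application commits offsets itself, after processing.
CONSUMER = {
    "enable.auto.commit": "false",
}


def to_properties(cfg):
    """Render a settings dict as sorted key=value lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(cfg.items()))


print(to_properties(PRODUCER))
```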
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
