20 Proven Kafka Best Practices for High‑Throughput Clusters

This guide presents New Relic’s 20 practical best‑practice recommendations—covering partitions, consumers, producers, and brokers—to help engineers design, tune, and monitor Apache Kafka deployments for reliable, high‑throughput performance.

dbaplus Community
dbaplus Community
dbaplus Community
20 Proven Kafka Best Practices for High‑Throughput Clusters

Apache Kafka is a widely adopted distributed streaming platform used by large companies such as New Relic, Uber, and Square to build scalable, high‑throughput, and highly reliable real‑time data pipelines. In production, Kafka clusters can process over 15 million messages per second and approach 1 Tbps of data aggregation, but operating at this scale introduces complexity.

Why Kafka Can Be Complex

If consumers cannot keep up, messages may be lost before they are processed. Limitations in automated data retention, high‑volume publish‑subscribe patterns, and scaling challenges can also affect system stability.

Four Focus Areas

Partitions

Consumers

Producers

Brokers

Kafka Fundamentals

Kafka stores records (messages) in topics, which are divided into partitions. Each partition has a leader and one or more replicas on follower brokers, providing fault tolerance. Consumers read from topic partitions, often organized into consumer groups that balance load. Lag occurs when consumer processing rates fall behind production rates, and can be estimated by time = messages / (consume_rate - produce_rate).

Best Practices for Partitions

Measure partition data rate : Calculate data rate as average message size × messages per second to size storage and determine required consumer performance.

Use random partitioning unless a specific architecture requires otherwise : Avoid hot partitions that cause bottlenecks and uneven disk usage.

For detailed partition‑selection strategies, see the New Relic article linked in the source.

Best Practices for Consumers

Upgrade any consumer version older than Kafka 0.10 to avoid ZooKeeper‑based coordination bugs and rebalance storms.

Increase socket buffer sizes (e.g., receive.buffer.bytes to 8–16 MB for 10 Gbps networks) to handle high‑throughput data streams.

Design consumers with back‑pressure mechanisms; use fixed‑size off‑heap buffers in JVM environments to prevent excessive garbage collection.

Monitor JVM GC pauses, as long pauses can cause ZooKeeper session loss and trigger unwanted rebalances.

Best Practices for Producers

Configure acknowledgments ( acks or request.required.acks) to ensure messages are persisted before returning success.

Set a high retries value (potentially Integer.MAX_VALUE) for zero‑tolerance loss scenarios.

Tune buffer.memory and batch.size based on producer data rate, partition count, and available memory.

Avoid overly large buffers that increase GC pressure if a producer stalls.

Instrument applications to track metrics such as messages produced, average message size, and buffer usage.

Best Practices for Brokers

Enable topic compression and tune log.cleaner.dedupe.buffer.size and log.cleaner.threads to manage CPU and memory usage.

Monitor network throughput (TX/RX), disk I/O, disk space, and CPU utilization; plan capacity accordingly.

Distribute partition leaders across brokers to avoid overloading a single leader; remember a leader typically consumes four times the network I/O of a follower.

Watch for ISR shrinkage, under‑replicated partitions, and unpreferred leaders as early warning signs.

Adjust Apache Log4j settings for broker logging; retain logs for post‑mortem analysis while managing disk consumption.

Disable automatic topic creation and implement cleanup policies for unused topics.

Provide sufficient memory for high‑throughput brokers to keep data in OS cache and reduce disk reads.

Isolate high‑SLO topics onto dedicated broker subsets to limit blast‑radius of failures.

When using older clients, employ a format‑conversion service on brokers to handle newer message formats.

Avoid assuming that local‑host testing reflects production performance; replication factor and network latency differ significantly.

Conclusion

Applying these 20 recommendations can help teams operate Kafka clusters more efficiently and reliably. For deeper operational guidance, consult the Kafka documentation’s “Operations” section and the referenced New Relic engineering blog.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Kafkaperformance tuningHigh ThroughputPartitionsBrokersConsumersProducers
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.