20 Proven Kafka Best Practices to Scale High‑Throughput Streams

This article presents New Relic’s 20 best‑practice recommendations for Apache Kafka, covering partitions, consumers, producers, and brokers, to help engineers design, configure, and monitor high‑throughput, reliable streaming pipelines at scale.

Programmer DD

May 5, 2019

Apache Kafka is a popular distributed streaming platform used by companies such as New Relic, Uber, and Square to build scalable, high‑throughput, and reliable real‑time data pipelines. New Relic’s production Kafka clusters process over 15 million messages per second, with an aggregate data rate approaching 1 Tbps.

While Kafka simplifies stream processing, large‑scale deployments can become complex: consumers may fall behind, automatic data retention limits can affect performance, and high‑throughput publish‑subscribe patterns can strain the system.

To reduce this complexity, New Relic shares 20 best‑practice guidelines organized into four areas: Partitions, Consumers, Producers, and Brokers.

Best Practices for Partitions

Understand partition data rate – calculate it as average message size multiplied by messages per second; this determines required storage and the minimum performance a consumer must support.
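The arithmetic can be sketched in a few lines. The message size, message rate, retention window, and replication factor below are made-up inputs, and the helper names are hypothetical; only the "size times rate" formula comes from the text.

```python
def partition_data_rate(avg_message_bytes: int, messages_per_second: int) -> int:
    """Bytes per second flowing into one partition."""
    return avg_message_bytes * messages_per_second


def retained_bytes(rate_bytes_per_s: int, retention_seconds: int, replication_factor: int = 1) -> int:
    """Disk needed to hold the retention window, summed across all replicas."""
    return rate_bytes_per_s * retention_seconds * replication_factor


# Illustrative inputs: 1 KB messages at 5,000 msg/s, kept for one day, three replicas.
rate = partition_data_rate(avg_message_bytes=1_000, messages_per_second=5_000)
storage = retained_bytes(rate, retention_seconds=24 * 3600, replication_factor=3)

print(f"{rate / 1e6:.1f} MB/s per partition")  # 5.0 MB/s per partition
print(f"{storage / 1e12:.3f} TB retained")     # 1.296 TB retained
```

The same rate is also the floor for consumer throughput: a consumer that cannot sustain it on a partition will fall behind.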

Prefer random partition assignment when writing topics unless architectural constraints dictate otherwise, to avoid hot partitions that cause consumer bottlenecks, uneven disk usage, and complex leader balancing.
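A small simulation shows why a skewed key distribution produces a hot partition. This is a sketch under stated assumptions: the workload is invented, and CRC32 is only a stand-in hash (the Java client hashes keys with murmur2).

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8


def keyed_partition(key: str) -> int:
    # Stand-in hash for illustration; not the hash a real Kafka client uses.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def simulate(keys, use_keys: bool, rng: random.Random) -> Counter:
    """Count how many messages land on each partition."""
    counts = Counter()
    for key in keys:
        if use_keys:
            counts[keyed_partition(key)] += 1
        else:
            counts[rng.randrange(NUM_PARTITIONS)] += 1
    return counts


rng = random.Random(0)
# Hypothetical workload: 70% of the messages share a single key.
messages = ["tenant-big"] * 7_000 + [f"tenant-{i}" for i in range(3_000)]

keyed = simulate(messages, use_keys=True, rng=rng)
spread = simulate(messages, use_keys=False, rng=rng)
print("keyed :", max(keyed.values()), "messages on the busiest partition")
print("random:", max(spread.values()), "messages on the busiest partition")
```

With keys, the busiest partition carries at least the 7,000 messages of the dominant key; with random assignment, each partition ends up near 1,250.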

Best Practices for Consumers

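Upgrade consumers running versions older than Kafka 0.10. The 0.8.x consumers rely on ZooKeeper for group coordination and are prone to rebalance storms, in which partition ownership is shuffled continually and no consumer makes real progress.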

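Increase consumer socket buffer sizes for high‑throughput networks. The setting is receive.buffer.bytes in 0.10.x consumers and socket.receive.buffer.bytes in 0.8.x consumers; the defaults are too small for high‑bandwidth links, where 8–16 MB is a reasonable target.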
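One way to sanity-check a buffer size is against the link's bandwidth-delay product. That rule of thumb is an assumption brought in here, not something the article states, and the link figures are invented; the dictionary only illustrates the property name from the text.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes that can be in flight on the link during one round trip."""
    return bandwidth_bits_per_s * rtt_ms // 8_000  # bits -> bytes, ms -> s


MIB = 1024 * 1024

# Hypothetical link: 10 Gbps with a 5 ms round-trip time.
bdp = bandwidth_delay_product_bytes(10_000_000_000, rtt_ms=5)

consumer_overrides = {
    # 8 MiB is the low end of the range suggested above.
    "receive.buffer.bytes": 8 * MIB,
}

print(bdp)                                                # 6250000
print(consumer_overrides["receive.buffer.bytes"] >= bdp)  # True
```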

Design consumers with back‑pressure, using fixed‑size buffers (e.g., Disruptor pattern) and off‑heap memory to avoid JVM garbage‑collection pauses.
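The idea can be sketched with a bounded blocking queue: when processing falls behind, the fetch loop blocks instead of buffering without limit. This is a simpler stand-in for illustration, not a Disruptor and not off-heap; the names and the tiny capacity are hypothetical.

```python
import queue
import threading

BUFFER_CAPACITY = 4   # fixed-size buffer, kept small so the effect is visible
SENTINEL = None       # marks the end of the stream

buffer = queue.Queue(maxsize=BUFFER_CAPACITY)
processed = []
max_depth = 0


def fetch_loop(records):
    """Stands in for the poll loop: put() blocks while the buffer is full."""
    global max_depth
    for record in records:
        buffer.put(record)  # blocking here is the back-pressure
        max_depth = max(max_depth, buffer.qsize())
    buffer.put(SENTINEL)


def worker():
    """Drains the buffer until the sentinel arrives."""
    while True:
        record = buffer.get()
        if record is SENTINEL:
            break
        processed.append(record * 2)  # placeholder for real processing


fetcher = threading.Thread(target=fetch_loop, args=(range(100),))
consumer = threading.Thread(target=worker)
fetcher.start()
consumer.start()
fetcher.join()
consumer.join()
print(len(processed), "records processed; buffer depth never exceeded", BUFFER_CAPACITY)
```

Because the buffer has a hard upper bound, memory use stays flat no matter how far the worker lags.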

Monitor and mitigate long GC pauses, which can cause ZooKeeper session loss or prolonged rebalances.

Best Practices for Producers

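Configure acknowledgments (acks) so producers know when messages are safely written to broker partitions: acks=0 does not wait for any acknowledgment, acks=1 waits for the partition leader only, and acks=all waits for all in‑sync replicas.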

Set retries to a high value (e.g., Integer.MAX_VALUE) for zero‑tolerance data‑loss applications.
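The two producer settings above can be expressed as a small configuration plus a check that flags risky values. This is a minimal sketch: the property names are the Java producer's, while the helper function and its warning text are hypothetical.

```python
INT_MAX = 2**31 - 1  # numeric value of Java's Integer.MAX_VALUE

producer_config = {
    "acks": "all",       # wait for the in-sync replicas to acknowledge
    "retries": INT_MAX,  # keep retrying transient send failures
}


def durability_warnings(config: dict) -> list:
    """Return warnings for settings that trade durability for speed."""
    warnings = []
    if str(config.get("acks")) not in ("all", "-1"):
        warnings.append("acks is not 'all': a leader failure can lose acknowledged writes")
    if int(config.get("retries", 0)) == 0:
        warnings.append("retries is 0: a transient error drops the message")
    return warnings


print(durability_warnings(producer_config))              # []
print(durability_warnings({"acks": "1", "retries": 0}))  # two warnings
```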

Tune buffer memory and batch size based on message size, rate, partition count, and available heap; larger buffers improve throughput but increase GC pressure.
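Because the producer batches per partition, partition count drives memory use. A rough lower bound, assuming one full batch per partition and the Java client defaults for batch.size and buffer.memory, can be sketched as follows; the helper name and partition counts are illustrative.

```python
DEFAULT_BATCH_SIZE = 16_384         # bytes; Java client default for batch.size
DEFAULT_BUFFER_MEMORY = 33_554_432  # bytes; Java client default for buffer.memory


def min_buffer_memory(batch_size: int, partitions: int) -> int:
    """Bytes needed to hold one full batch for every partition at once."""
    return batch_size * partitions


for partitions in (100, 3_000):
    needed = min_buffer_memory(DEFAULT_BATCH_SIZE, partitions)
    verdict = "fits within" if needed <= DEFAULT_BUFFER_MEMORY else "exceeds"
    print(f"{partitions} partitions: {needed} bytes, {verdict} the default buffer.memory")
```

At 3,000 partitions the estimate (about 49 MB) already exceeds the 32 MB default, which is the kind of case where both settings need to be revisited together with available heap.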

Instrument producers to track metrics such as messages produced, average size, and bytes used.
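A minimal sketch of such instrumentation, assuming the application wraps its own send call; the class and its fields are hypothetical and not part of any Kafka client.

```python
from dataclasses import dataclass


@dataclass
class ProducerStats:
    messages: int = 0
    total_bytes: int = 0

    def record(self, payload: bytes) -> None:
        """Call once per message handed to the producer."""
        self.messages += 1
        self.total_bytes += len(payload)

    @property
    def average_size(self) -> float:
        """Mean payload size in bytes, or 0.0 before any message is recorded."""
        return self.total_bytes / self.messages if self.messages else 0.0


stats = ProducerStats()
for payload in (b"alpha", b"beta", b"gamma-delta"):
    stats.record(payload)  # in real code this sits next to the send call

print(stats.messages, stats.total_bytes, round(stats.average_size, 2))  # 3 20 6.67
```

Average size and message rate from these counters feed directly into the partition data rate calculation.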

Best Practices for Brokers

Allocate sufficient memory and CPU for log compaction; tune log.cleaner.dedupe.buffer.size and log.cleaner.threads carefully.
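To see how the two log cleaner settings interact, the sketch below estimates how many unique keys one cleaner thread can track per pass. It rests on assumptions that should be checked against the Kafka version in use: the dedupe buffer is split evenly across cleaner threads, each entry takes 24 bytes, and the buffer is filled to 90%.

```python
BYTES_PER_ENTRY = 24      # assumed: 16-byte key hash plus 8-byte offset
LOAD_FACTOR_PERCENT = 90  # assumed default buffer load factor


def keys_per_cleaner_pass(dedupe_buffer_bytes: int, cleaner_threads: int) -> int:
    """Approximate unique keys one cleaner thread can track in a single pass."""
    per_thread = dedupe_buffer_bytes // cleaner_threads
    return per_thread // BYTES_PER_ENTRY * LOAD_FACTOR_PERCENT // 100


BUFFER = 128 * 1024 * 1024  # illustrative 128 MiB dedupe buffer

one_thread = keys_per_cleaner_pass(BUFFER, cleaner_threads=1)
four_threads = keys_per_cleaner_pass(BUFFER, cleaner_threads=4)
print(one_thread, four_threads)
```

Under these assumptions, adding threads without growing the buffer shrinks each thread's share, so the two settings are best tuned together.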

Monitor network throughput (TX/RX), disk I/O, disk space, and CPU usage; capacity planning is essential for cluster health.

Distribute partition leaders wisely; leaders handle more network I/O and disk reads than followers.

Watch ISR shrinkage, under‑replicated partitions, and unpreferred leaders as early warning signs.
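Two of these signals can be derived from partition metadata alone. The sketch below assumes a hypothetical snapshot structure; it relies on the convention that the first broker in a partition's replica list is its preferred leader.

```python
from typing import NamedTuple


class PartitionState(NamedTuple):
    topic: str
    partition: int
    leader: int
    replicas: tuple  # assigned broker ids; the first is the preferred leader
    isr: tuple       # broker ids currently in sync


def under_replicated(parts):
    """Partitions whose in-sync replica set is smaller than the assignment."""
    return [p for p in parts if len(p.isr) < len(p.replicas)]


def unpreferred_leaders(parts):
    """Partitions led by a broker other than the preferred one."""
    return [p for p in parts if p.leader != p.replicas[0]]


# Hypothetical snapshot of cluster metadata.
snapshot = [
    PartitionState("events", 0, leader=1, replicas=(1, 2, 3), isr=(1, 2, 3)),
    PartitionState("events", 1, leader=3, replicas=(2, 3, 1), isr=(3, 1)),
    PartitionState("events", 2, leader=3, replicas=(3, 1, 2), isr=(3, 1, 2)),
]

print([(p.topic, p.partition) for p in under_replicated(snapshot)])     # [('events', 1)]
print([(p.topic, p.partition) for p in unpreferred_leaders(snapshot)])  # [('events', 1)]
```

In this snapshot both checks flag the same partition, which is typical after a broker drops out: its replicas leave the ISR and leadership moves elsewhere.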

Adjust Log4j settings rather than disabling broker logging entirely.

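Disable automatic topic creation (auto.create.topics.enable=false) and define a cleanup policy for unused topics, so that stale topics do not leave behind metadata the cluster must keep managing.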
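A configuration audit can catch brokers that still allow automatic topic creation. The sketch below parses a properties excerpt; the sample text and helper names are hypothetical, while auto.create.topics.enable is the broker property, which defaults to true when absent.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def auto_create_enabled(props: dict) -> bool:
    # An absent property means the broker default (true) applies.
    return props.get("auto.create.topics.enable", "true").lower() == "true"


sample = """
# hypothetical excerpt from a broker's server.properties
auto.create.topics.enable=false
log.retention.hours=168
"""

props = parse_properties(sample)
print(auto_create_enabled(props))  # False
print(auto_create_enabled({}))     # True
```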

Provide ample memory for high‑throughput brokers to keep data in OS cache and avoid disk reads.
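A back-of-the-envelope estimate helps size that memory. The sketch assumes a commonly cited rule of thumb, keeping roughly 30 seconds of writes in the page cache, which is not stated in the text above; the write rate is invented.

```python
def page_cache_bytes(write_bytes_per_s: int, buffer_seconds: int = 30) -> int:
    """Memory needed to keep the last `buffer_seconds` of writes cached."""
    return write_bytes_per_s * buffer_seconds


# Hypothetical broker taking 200 MB/s of writes.
needed = page_cache_bytes(200 * 1_000_000)
print(f"{needed / 1e9:.1f} GB of page cache")  # 6.0 GB of page cache
```

Consumers that stay within that window read from memory; those that lag further force disk reads.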

Consider isolating high‑throughput topics onto separate broker subsets for large clusters.

Avoid pairing older clients with newer topic message formats (and vice versa) where possible; brokers must convert the format on the client's behalf, which adds load.

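Do not assume that tests against a broker on localhost reflect production performance. Loopback latency is negligible, and a topic with a replication factor of 1 skips the acknowledgment delays that replication adds in production.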

For deeper knowledge, consult the “Operations” section of the official Kafka documentation and attend Confluent’s online discussions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
