
20 Proven Kafka Best Practices for High‑Throughput Enterprise Deployments

This article presents 20 practical best‑practice recommendations—from partition sizing and consumer tuning to producer configuration and broker optimization—to help engineers build scalable, reliable, and high‑throughput Apache Kafka clusters for large‑scale applications.

Java Backend Technology

Introduction

Apache Kafka is a popular distributed streaming platform used by large companies such as New Relic, Uber, and Square to build scalable, high‑throughput, and highly reliable real‑time data pipelines.

When deployed at scale, Kafka can become complex; consumers may fall behind, data retention limits can affect performance, and high‑throughput publish‑subscribe patterns may strain the system.

This article shares 20 best‑practice recommendations from New Relic for handling high‑throughput Kafka clusters.

Quick Overview of Kafka Concepts and Architecture

Kafka is a distributed messaging system with built‑in redundancy, elasticity, high throughput, and scalability.

Key terminology includes:

Message : a record consisting of a key, a value, and optional headers.

Producer : publishes messages to topics, choosing partitioning strategy (e.g., round‑robin or key‑based).

Broker : a node in a Kafka cluster.

Topic : a category of messages that consumers subscribe to.

Topic Partition : a subdivision of a topic; each partition is an ordered log whose messages are identified by offsets, and it is replicated across brokers (one leader plus followers).

Offset : a monotonically increasing integer that uniquely identifies a message within a partition.

Consumer : reads messages from topic partitions.

Consumer group : a logical grouping of consumers that share partition assignments and perform load‑balancing.

Lag : the difference between the latest offset and the offset a consumer has processed.
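The lag definition above is simple arithmetic on two offsets. A minimal sketch, assuming both offsets have already been obtained for a single partition; the class and method names are hypothetical and are not part of the Kafka client API:

```java
// Hypothetical helper for illustration; not part of the Kafka client API.
final class ConsumerLag {

    /** Lag for one partition: messages written but not yet processed by the consumer. */
    static long lag(long latestOffset, long consumerOffset) {
        // Clamp at zero in case the two offsets were sampled at slightly different times.
        return Math.max(0L, latestOffset - consumerOffset);
    }

    public static void main(String[] args) {
        // Latest offset 1500, consumer has processed up to 1200.
        System.out.println(lag(1_500L, 1_200L)); // prints 300
    }
}
```

A lag that keeps growing over time suggests the consumer cannot keep up with the partition's data rate.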

1. Practices for Partitions

Understand partition data rate : calculate data rate as average message size × messages per second to provision adequate storage and ensure consumer capacity.

Use random partitioning unless required otherwise : avoid hot partitions that can cause consumer bottlenecks, uneven disk usage, and complex leader balancing.
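The data-rate formula from the first item above (average message size × messages per second) can be worked through as follows. This is a rough sketch: the message size, message rate, and retention period are made-up example inputs, and the retention estimate is an assumption that ignores replication, compression, and index overhead.

```java
// Hypothetical helper for illustration; all input numbers are example assumptions.
final class PartitionDataRate {

    /** Data rate = average message size x messages per second. */
    static long bytesPerSecond(long avgMessageBytes, long messagesPerSecond) {
        return Math.multiplyExact(avgMessageBytes, messagesPerSecond);
    }

    /** Rough disk space needed to retain data for the given period (single replica, uncompressed). */
    static long retentionBytes(long bytesPerSecond, long retentionSeconds) {
        return Math.multiplyExact(bytesPerSecond, retentionSeconds);
    }

    public static void main(String[] args) {
        long rate = bytesPerSecond(1_024L, 5_000L);            // 1 KiB messages at 5,000 msg/s
        long oneDay = retentionBytes(rate, 24L * 60L * 60L);   // 24 hours of retention
        System.out.println(rate);    // prints 5120000
        System.out.println(oneDay);  // prints 442368000000
    }
}
```

The same per-second figure is also the minimum sustained throughput a consumer of that partition would need in order not to fall behind.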

2. Practices for Consumers

Upgrade consumers older than Kafka 0.10 : 0.8.x consumers rely on ZooKeeper for group coordination, and known bugs can cause long‑running or failed rebalances (rebalance storms).

Tune socket buffers : increase receive.buffer.bytes (named socket.receive.buffer.bytes in 0.8.x) to 8–16 MB for high‑bandwidth networks; if memory is limited, consider about 1 MB.

Design high‑throughput consumers with back‑pressure : use fixed‑size buffers, preferably off‑heap, to prevent JVM GC pauses.

Watch for GC impact : long GC pauses can cause dropped ZooKeeper sessions and consumer‑group rebalances; brokers with long GC pauses likewise risk dropping out of the cluster.
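The back‑pressure item above can be illustrated with a fixed‑size buffer between the fetch loop and the processing threads. This is only a sketch under stated assumptions: BoundedMessageBuffer is a hypothetical class, and it uses an on‑heap ArrayBlockingQueue for brevity, whereas the recommendation above prefers off‑heap storage.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical fixed-size buffer for illustration (on-heap; off-heap storage is not shown).
final class BoundedMessageBuffer {
    private final ArrayBlockingQueue<byte[]> queue;

    BoundedMessageBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Returns false when the buffer is full: the signal for the fetch loop to stop pulling data. */
    boolean tryEnqueue(byte[] message) {
        return queue.offer(message);
    }

    /** Returns the next buffered message, or null when the buffer is empty. */
    byte[] poll() {
        return queue.poll();
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        BoundedMessageBuffer buffer = new BoundedMessageBuffer(2);
        System.out.println(buffer.tryEnqueue(new byte[] {1})); // true
        System.out.println(buffer.tryEnqueue(new byte[] {2})); // true
        System.out.println(buffer.tryEnqueue(new byte[] {3})); // false: buffer is full
        buffer.poll();                                         // a worker drains one message
        System.out.println(buffer.tryEnqueue(new byte[] {3})); // true again
    }
}
```

In a real consumer, a false return could be the point at which the fetch loop pauses fetching (for example with the consumer's pause and resume methods) until workers have drained the buffer; that wiring is left out here.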

3. Practices for Producers

Configure acknowledgments (acks) : ensure producers know whether messages are persisted to broker partitions.

Set retries appropriately : increase retries (potentially to Integer.MAX_VALUE) for zero‑tolerance data loss scenarios.

Tune buffer.memory and batch.size : adjust based on producer data rate, number of partitions, and available memory.

Monitor producer metrics : track messages produced, average size, and total bytes sent.
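The producer settings named above could be collected as follows. This is a hedged sketch, not a recommended configuration: the configuration keys are standard producer settings, but the class name is hypothetical and the buffer.memory and batch.size values are arbitrary examples that would need tuning against the actual data rate, partition count, and available memory.

```java
import java.util.Properties;

// Hypothetical holder for illustration; the numeric values are example assumptions.
final class ProducerSettings {

    static Properties reliableProducerProps() {
        Properties props = new Properties();
        // Wait for all in-sync replicas to acknowledge each write.
        props.setProperty("acks", "all");
        // Retry indefinitely for workloads that cannot tolerate data loss.
        props.setProperty("retries", String.valueOf(Integer.MAX_VALUE));
        // Total memory for buffering unsent records (example: 64 MiB).
        props.setProperty("buffer.memory", String.valueOf(64L * 1024 * 1024));
        // Maximum size of a per-partition batch (example: 64 KiB).
        props.setProperty("batch.size", String.valueOf(64 * 1024));
        return props;
    }

    public static void main(String[] args) {
        reliableProducerProps().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

To construct an actual producer, these properties would still need bootstrap.servers and the key and value serializers, which are omitted here.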

4. Practices for Brokers

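Budget memory and CPU for compacted topics : log compaction consumes broker heap and CPU cycles, and a failing cleaner can let compacted partitions grow without bound; tune log.cleaner.dedupe.buffer.size and log.cleaner.threads carefully, since both affect broker heap usage.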

Monitor network throughput, disk I/O, and CPU usage : plan capacity to maintain overall cluster performance.

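Balance leader distribution : ensure leaders have sufficient network resources; with replication factor 3, a leader uses at least four times the network I/O of a follower, because it receives the data, sends two copies to followers, and serves consumers.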

Watch ISR shrinks, under‑replicated partitions, and unpreferred leaders : these indicate potential performance bottlenecks.

Adjust Log4j settings : keep broker logs for post‑mortem analysis while managing disk consumption.

Disable automatic topic creation or establish a cleanup policy for unused topics : for example, treat a topic as defunct and remove it if it receives no messages for a defined number of days.

Provide enough memory for high‑throughput brokers : avoid excessive disk reads by keeping data in OS cache.

Isolate topics to subsets of brokers in large, high‑throughput clusters : for example, if several OLTP systems share one cluster, placing each system's topics on a distinct set of brokers limits the impact of a failure.

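Avoid mixing old clients with newer message formats (and vice versa) : the broker must convert message formats on behalf of the client, which adds load to the broker.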

Avoid testing only on single‑node clusters : replication factor of 1 and loopback interfaces do not reflect production performance.
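The leader‑versus‑follower network cost mentioned under "Balance leader distribution" above can be reproduced with a simplified model. This is a sketch under assumptions that are not stated in the article: every produced byte counts as one unit, a follower only receives that one unit, and all consumers read from the leader. The class and method names are hypothetical.

```java
// Hypothetical simplified model for illustration; real traffic depends on the workload.
final class LeaderNetworkCost {

    /** Network I/O units a follower handles per unit of produced data (it only receives a copy). */
    static final int FOLLOWER_IO_UNITS = 1;

    /** Network I/O units a partition leader handles per unit of produced data. */
    static int leaderIoUnits(int replicationFactor, int consumerGroups) {
        if (replicationFactor < 1 || consumerGroups < 0) {
            throw new IllegalArgumentException("replicationFactor must be >= 1 and consumerGroups >= 0");
        }
        int inbound = 1;                          // data received from the producer
        int toFollowers = replicationFactor - 1;  // one copy sent to each follower
        int toConsumers = consumerGroups;         // one copy sent to each consumer group
        return inbound + toFollowers + toConsumers;
    }

    public static void main(String[] args) {
        // Replication factor 3 with a single consumer group.
        System.out.println(leaderIoUnits(3, 1) / FOLLOWER_IO_UNITS); // prints 4
    }
}
```

Each additional consumer group adds another unit to the leader's share, which is why uneven leader placement can saturate a broker's network long before its followers are busy.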

Conclusion

Applying the above recommendations can help you operate Kafka more effectively. For deeper knowledge, refer to the “Operations” section of the official Kafka documentation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
