20 Proven Kafka Best Practices for High‑Throughput Enterprise Deployments
This article presents 20 practical best‑practice recommendations, covering partition sizing, consumer tuning, producer configuration, and broker optimization, to help engineers build scalable, reliable, high‑throughput Apache Kafka clusters for large‑scale applications.
Introduction
Apache Kafka is a popular distributed streaming platform used by large companies such as New Relic, Uber, and Square to build scalable, high‑throughput, and highly reliable real‑time data pipelines.
When deployed at scale, Kafka can become complex; consumers may fall behind, data retention limits can affect performance, and high‑throughput publish‑subscribe patterns may strain the system.
This article shares 20 best‑practice recommendations from New Relic for handling high‑throughput Kafka clusters.
Quick Overview of Kafka Concepts and Architecture
Kafka is a distributed messaging system that provides built‑in data redundancy and resiliency while remaining high‑throughput and scalable.
Key terminology includes:
Message : a record consisting of a key, a value, and optional headers.
Producer : publishes messages to topics and chooses a partitioning strategy (e.g., round‑robin or key‑based).
Broker : a node in a Kafka cluster.
Topic : a category of messages that consumers subscribe to.
Topic Partition : a subdivision of a topic; each partition is an ordered log whose messages are addressed by offset, and it is stored as one or more replicas (one leader plus followers).
Offset : a monotonically increasing integer that uniquely identifies a message within a partition.
Consumer : reads messages from topic partitions.
Consumer group : a logical grouping of consumers that share partition assignments and perform load‑balancing.
Lag : the difference between the latest offset in a partition and the offset the consumer has processed; a growing lag means the consumer cannot keep up with the producers.
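The lag definition above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: the offsets are hypothetical values supplied by hand, and treating a partition with no committed offset as fully unread (starting from offset 0) is an assumption made for the example.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return per-partition lag: latest offset minus the consumer's committed offset."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        # Assumption: a partition with no committed offset is counted from offset 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1200, 2: 900}
committed = {0: 1500, 1: 1150}

lag = consumer_lag(end_offsets, committed)
print(lag)                # {0: 0, 1: 50, 2: 900}
print(sum(lag.values()))  # 950
```

Tracking the per-partition values, rather than only the total, makes it easier to spot a single slow or stuck partition.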
1. Practices for Partitions
Understand partition data rate : calculate each partition's data rate as average message size × messages per second. This rate determines how much disk space a given retention period requires and how fast a single consumer must read to avoid lagging.
Use random partitioning unless required otherwise : avoid hot partitions that can cause consumer bottlenecks, uneven disk usage, and complex leader balancing.
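As a rough worked example of the data-rate calculation above, the following Python sketch estimates the write rate of one partition and the disk needed to retain it. The message size, message rate, retention window, and the idea of multiplying by the replication factor to cover all replicas are illustrative assumptions, not figures from the article.

```python
def partition_data_rate(avg_message_bytes, messages_per_second):
    """Bytes per second written to a single partition."""
    return avg_message_bytes * messages_per_second


def retention_bytes(data_rate_bytes_per_s, retention_hours, replication_factor=1):
    """Disk needed to hold one partition's retention window across all replicas."""
    return data_rate_bytes_per_s * retention_hours * 3600 * replication_factor


# Hypothetical workload: 1 KiB messages at 2,000 messages/s, 24 h retention, 3 replicas.
rate = partition_data_rate(1024, 2000)
needed = retention_bytes(rate, retention_hours=24, replication_factor=3)

print(rate)    # 2048000
print(needed)  # 530841600000
```

The same per-partition rate is also the minimum sustained read throughput a consumer of that partition needs in order not to fall behind.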
2. Practices for Consumers
Upgrade consumers older than Kafka 0.10 : older versions rely on ZooKeeper for group coordination and suffer from rebalance storms.
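Tune socket buffers : the consumer's default receive buffer is too small for high‑throughput environments. Increase receive.buffer.bytes (named socket.receive.buffer.bytes in the old 0.8.x consumer) to 8–16 MB on high‑bandwidth networks; if memory is scarce, consider 1 MB.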
Design high‑throughput consumers with back‑pressure : use fixed‑size buffers, preferably off‑heap, to prevent JVM GC pauses.
Watch for GC impact : long GC pauses on a consumer can cause dropped ZooKeeper sessions or consumer‑group rebalances; the same applies to brokers, which risk dropping out of the cluster if a pause lasts too long.
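The back‑pressure idea from the consumer practices above can be sketched with a fixed‑size buffer between a fetching thread and a processing thread. This is a minimal Python illustration under stated assumptions: `run_pipeline` and its parameters are hypothetical names, the messages are an in‑memory list rather than a Kafka fetch, and the off‑heap aspect is specific to the JVM and is not modeled here.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the input


def run_pipeline(messages, buffer_size=4):
    """Pass messages from a fetcher to a worker through a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)  # fixed size: this is the back-pressure
    processed = []
    peak_depth = 0

    def fetcher():
        nonlocal peak_depth
        for message in messages:
            buffer.put(message)  # blocks while the buffer is full
            peak_depth = max(peak_depth, buffer.qsize())
        buffer.put(SENTINEL)

    def worker():
        while True:
            message = buffer.get()
            if message is SENTINEL:
                return
            processed.append(message)  # stand-in for real processing

    threads = [threading.Thread(target=fetcher), threading.Thread(target=worker)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed, peak_depth


processed, peak_depth = run_pipeline(list(range(100)), buffer_size=4)
print(len(processed), peak_depth <= 4)  # 100 True
```

Because the fetcher blocks when the buffer is full, memory use stays bounded no matter how far the worker falls behind; the cost shows up as consumer lag instead of as heap growth.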
3. Practices for Producers
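Configure acknowledgments (acks) : the acks setting determines when a write is confirmed, so that producers know whether messages were actually persisted to the partition on the broker. acks=0 waits for no acknowledgment, acks=1 waits for the leader only, and acks=all waits for all in‑sync replicas.
Set retries appropriately : where data loss cannot be tolerated, raise retries, potentially to Integer.MAX_VALUE.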
Tune buffer.memory and batch.size : adjust based on producer data rate, number of partitions, and available memory.
Monitor producer metrics : track messages produced, average size, and total bytes sent.
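The producer settings discussed above can be collected in one place. The sketch below builds them as a plain Python dictionary keyed by the Java client's property names (acks, retries, buffer.memory, batch.size) and renders them in properties‑file form. The helper names and every numeric value are illustrative assumptions, not recommendations from the article; the right sizes depend on the producer's data rate, partition count, and available memory.

```python
INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


def producer_settings(no_data_loss=True,
                      buffer_memory_bytes=64 * 1024 * 1024,
                      batch_size_bytes=64 * 1024):
    """Build a hypothetical producer configuration keyed by Java client property names."""
    return {
        # acks=all waits for all in-sync replicas; acks=1 waits for the leader only.
        "acks": "all" if no_data_loss else "1",
        # Assumption: 3 is an arbitrary small retry count for loss-tolerant producers.
        "retries": INT_MAX if no_data_loss else 3,
        "buffer.memory": buffer_memory_bytes,
        "batch.size": batch_size_bytes,
    }


def to_properties(settings):
    """Render settings as key=value lines, sorted by key."""
    return "\n".join(f"{key}={value}" for key, value in sorted(settings.items()))


print(to_properties(producer_settings()))
# acks=all
# batch.size=65536
# buffer.memory=67108864
# retries=2147483647
```

Keeping the values in a single function makes it easier to review the durability settings (acks, retries) separately from the throughput settings (buffer.memory, batch.size).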
4. Practices for Brokers
Budget memory and CPU for compacted topics : log compaction consumes both heap and CPU cycles on the brokers, so configure the log cleaner parameters (log.cleaner.dedupe.buffer.size, log.cleaner.threads) wisely.
Monitor network throughput, disk I/O, and CPU usage : plan capacity to maintain overall cluster performance.
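Balance leader distribution : ensure leaders have sufficient network resources. With a replication factor of 3, for example, a leader receives the partition data, sends two copies to followers, and serves every consumer of that partition, so it uses at least four times the network I/O of a follower.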
Watch ISR shrinks, under‑replicated partitions, and unpreferred leaders : these indicate potential performance bottlenecks.
Adjust Log4j settings : keep broker logs for post‑mortem analysis while managing disk consumption.
Disable automatic topic creation or establish a clear cleanup policy for unused topics : for example, treat a topic as defunct and remove it once it has received no messages for a defined number of days.
Provide enough memory for high‑throughput brokers : avoid excessive disk reads by keeping data in OS cache.
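Isolate high‑throughput workloads across broker subsets : when several systems (e.g., multiple OLTP services) share a cluster, place each system's topics on a distinct subset of brokers to limit the impact of a failure.
Avoid mixing old clients with newer topic message formats (and vice versa) : the brokers must convert the format on behalf of the client, which places extra load on them.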
Avoid testing only on single‑node clusters : replication factor of 1 and loopback interfaces do not reflect production performance.
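Several of the broker checks above (leader balance, under‑replicated partitions, non‑preferred leaders) can be derived from partition metadata. The following Python sketch assumes that metadata has already been fetched into plain dictionaries; the field names and the sample data are hypothetical. It relies on two Kafka conventions: a partition is under‑replicated when its in‑sync replica set is smaller than its replica list, and the preferred leader is the first broker in the replica list.

```python
def partition_health(partitions):
    """Summarize under-replicated partitions, non-preferred leaders, and leader counts."""
    under_replicated = []
    non_preferred = []
    leaders_per_broker = {}
    for p in partitions:
        name = f'{p["topic"]}-{p["partition"]}'
        if len(p["isr"]) < len(p["replicas"]):
            under_replicated.append(name)  # some replica has fallen out of the ISR
        if p["leader"] != p["replicas"][0]:
            non_preferred.append(name)     # leader is not the first listed replica
        leaders_per_broker[p["leader"]] = leaders_per_broker.get(p["leader"], 0) + 1
    return {
        "under_replicated": under_replicated,
        "non_preferred_leaders": non_preferred,
        "leaders_per_broker": leaders_per_broker,
    }


# Hypothetical metadata for one topic on brokers 1, 2, and 3.
sample = [
    {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3], "leader": 1},
    {"topic": "orders", "partition": 1, "replicas": [2, 3, 1], "isr": [2, 3], "leader": 2},
    {"topic": "orders", "partition": 2, "replicas": [3, 1, 2], "isr": [3, 1, 2], "leader": 1},
]

report = partition_health(sample)
print(report["under_replicated"])       # ['orders-1']
print(report["non_preferred_leaders"])  # ['orders-2']
print(report["leaders_per_broker"])     # {1: 2, 2: 1}
```

In this sample, broker 3 leads no partitions while broker 1 leads two, which is the kind of imbalance the leader‑distribution practice is meant to catch.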
Conclusion
Applying the above recommendations can help you operate Kafka more effectively. For deeper knowledge, refer to the “Operations” section of the official Kafka documentation.
