How We Boosted Kafka Throughput by 35% with Filebeat Tuning and Compression Tricks

This case study details how a high‑traffic Kafka logging cluster was optimized by analyzing low compression ratios, tuning Filebeat parameters, adjusting memory queues and round‑robin settings, and validating the changes through gray‑scale tests, resulting in up to 35% higher throughput and significant resource savings.

ITPUB

Background

The Kafka cluster that stores all application logs handles TB‑level traffic per minute, ingests petabytes of data per day and processes more than one trillion messages per day.

Terminology

Broker : a node in the Kafka cluster.

Network idle rate : the average fraction of time the network thread‑pool threads spend idle. A value below 0.3 indicates a saturated cluster, which can cause message loss on the produce side and consumer backlog.

Request queue : the queue where client produce/consume requests are buffered before the server processes them.
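The network idle rate definition above can be sketched as a small check. This is a minimal illustration, assuming the per-thread idle fractions have already been collected; the function names and sample values are hypothetical, and only the 0.3 threshold comes from the text.

```python
from statistics import mean

# Threshold taken from the definition above; everything else is illustrative.
SATURATION_THRESHOLD = 0.3


def network_idle_rate(idle_fractions):
    """Average idle fraction across the network threads (0.0 to 1.0)."""
    return mean(idle_fractions)


def is_saturated(idle_fractions, threshold=SATURATION_THRESHOLD):
    """True when the average idle fraction falls below the threshold."""
    return network_idle_rate(idle_fractions) < threshold


if __name__ == "__main__":
    # Hypothetical samples for a busy and a healthy broker.
    print(is_saturated([0.10, 0.20, 0.30]))
    print(is_saturated([0.55, 0.60, 0.70]))
```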

Problem Statement and Optimization Goal

During holiday traffic peaks the cluster’s I/O and storage usage surge while the network idle rate falls, slowing server responses and causing consumer lag. The low compression ratio leads to high disk utilization (average 70 %, peak 87 %) and a sharp increase in client request volume. The goal was to raise the size of the compressed batches and thereby reduce request count, traffic and CPU consumption without affecting business read/write latency.

Investigation and Tuning

Filebeat bulk parameters

bulk_flush_frequency: 0   # default: send immediately, no wait
bulk_max_size: 2048       # default: maximum events per Kafka request

Raising bulk_flush_frequency above the default, first to 0.1 s and then to 0.2 s, produced negligible improvement.

Filebeat memory queue

queue.mem.events: 4096
queue.mem.flush.min_events: 2048
queue.mem.flush.timeout: 1s

Setting flush.timeout to 5 s improved compression when the number of partitions was low, but the benefit disappeared at production scale.
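One plausible reading of this result can be sketched with a simplified model of the memory queue: a flush happens when flush.min_events are buffered or when flush.timeout elapses, whichever comes first, and the flushed events are then spread evenly over the topic's partitions. The event rate and partition counts below are hypothetical, and the model is an assumption for illustration, not Filebeat's actual implementation.

```python
def events_per_flush(event_rate, min_events=2048, timeout_s=1.0):
    """Events handed to the output per flush in a simplified queue model.

    The queue flushes once min_events are buffered or timeout_s has
    elapsed, whichever happens first.
    """
    buffered_at_timeout = int(event_rate * timeout_s)
    return min(min_events, buffered_at_timeout)


def events_per_partition(flush_size, partitions):
    """Average share per partition when events are spread evenly."""
    return flush_size / partitions


if __name__ == "__main__":
    rate = 300  # hypothetical events per second from one Filebeat instance
    for timeout in (1.0, 5.0):
        flush = events_per_flush(rate, timeout_s=timeout)
        for partitions in (4, 200):  # hypothetical small and large topics
            share = events_per_partition(flush, partitions)
            print(f"timeout={timeout}s partitions={partitions}: "
                  f"flush={flush} events, per partition={share:.1f}")
```

Under these assumptions a longer timeout enlarges the flush, but with hundreds of partitions each partition still receives only a handful of events per request, so the compressed batches stay small.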

Round‑robin group_events

The round_robin.group_events parameter defines how many events are sent to the same partition before the partitioner switches. Raising this value aggregates more events into a single batch, enlarges batch size and improves the Snappy compression ratio.
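The effect described above can be illustrated with a toy model of a grouped round-robin partitioner. This is a sketch under stated assumptions, not Filebeat's or the Kafka client's code: the function names, the batch of 2048 events and the 100-partition topic are all hypothetical.

```python
from collections import Counter


def assign_partitions(num_events, num_partitions, group_events=1):
    """Return events per partition for one batch under grouped round-robin.

    The partitioner keeps sending to the current partition until
    group_events events have been sent, then moves to the next one.
    """
    counts = Counter()
    partition = 0
    sent_to_current = 0
    for _ in range(num_events):
        if sent_to_current == group_events:
            partition = (partition + 1) % num_partitions
            sent_to_current = 0
        counts[partition] += 1
        sent_to_current += 1
    return counts


def summarize(counts):
    """(number of partitions touched, largest per-partition batch)."""
    return len(counts), max(counts.values())


if __name__ == "__main__":
    for group in (1, 512):
        touched, largest = summarize(assign_partitions(2048, 100, group))
        print(f"group_events={group}: {touched} partitions, "
              f"largest batch {largest} events")
```

With group_events set to 1 the batch is scattered over all 100 partitions at about 20 events each; with 512 it lands on 4 partitions at 512 events each, which gives the compressor far more data per batch.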

Snappy compression test

Compressed size was measured for batches of 1 KB, 5 KB, 20 KB, 50 KB, 100 KB and 200 KB messages. Compression efficiency increased markedly after 50 messages, with the compressed size roughly doubling as batch size grew.
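A test of this kind can be approximated with a short script. The sketch below makes several assumptions that are not from the original: it uses zlib from the Python standard library as a stand-in for Snappy (so absolute ratios will differ), it counts batch sizes in messages, and it generates synthetic log lines with hypothetical field names.

```python
import random
import zlib


def make_log_lines(count, seed=0):
    """Generate synthetic, log-like messages (hypothetical format)."""
    rng = random.Random(seed)
    levels = ["INFO", "WARN", "ERROR"]
    lines = []
    for i in range(count):
        line = (
            f"2024-01-01T00:00:{i % 60:02d}Z {rng.choice(levels)} "
            f"service=order-api request_id={rng.getrandbits(64):016x} "
            f"latency_ms={rng.randint(1, 500)} status=200"
        )
        lines.append(line.encode("utf-8"))
    return lines


def compression_ratio(messages):
    """Raw bytes divided by compressed bytes for one batch."""
    raw = b"\n".join(messages)
    return len(raw) / len(zlib.compress(raw))  # zlib stands in for Snappy


def ratios_by_batch_size(batch_sizes=(1, 5, 20, 50, 100, 200)):
    """Compression ratio for each batch size, in messages per batch."""
    lines = make_log_lines(max(batch_sizes))
    return {n: compression_ratio(lines[:n]) for n in batch_sizes}


if __name__ == "__main__":
    for size, ratio in ratios_by_batch_size().items():
        print(f"{size:>4} messages per batch: ratio {ratio:.2f}")
```

The general pattern holds for any dictionary-based compressor: a single small message offers little repetition to exploit, while a larger batch of similar log lines shares timestamps, field names and values.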

Validation and Rollout

Gray‑scale tests on representative topics showed:

Production request count reduced by ~30 %.

Traffic reduced by 30‑40 %.

After full rollout the cluster exhibited:

CPU usage dropped from 36 % to 22 %.

Topic production request count fell by 42 %.

Overall traffic decreased by 20 %.

Performance Impact

Per‑minute client request volume decreased by over one billion, traffic fell by 35 %, and the cluster’s maximum throughput rose from 26 billion to 33 billion messages per minute – a capacity increase of roughly 35 %.

Future Work

As business volume grows, Kafka data will continue to expand, increasing pressure on I/O, network, storage and CPU. Ongoing work will balance compression ratio against latency, refine monitoring metrics, and fine‑tune bulk_flush_frequency, bulk_max_size, flush.timeout and round_robin.group_events to sustain capacity while keeping data delay within acceptable limits.
