Optimizing Kafka Production at Qunar Travel: Reducing CPU Usage by 2000 Cores
This article is a case study of how Qunar Travel identified and resolved Kafka production bottlenecks—through metric monitoring, thread and flush parameter tuning, and Filebeat batch adjustments—resulting in a 2000‑core CPU reduction, higher network idle rates, and lower resource consumption across three clusters.
Background: Qunar Travel operates a 145-node Kafka log cluster (3 TB SSD, 40 cores, 128 GB RAM per node) handling massive traffic: 1.3 TB/min, 20 billion messages/min, 1.5 PB/day, and over 2 trillion messages/day. During the 2023 Spring Festival peak, the cluster suffered severe performance degradation.
Terminology: Definitions are provided for Kafka broker, network idle rate, request queue, and Kubernetes concepts such as pod and sidecar.
Production Pain Points: (1) inability to sustain peak load, causing consumer backlog and production failures; (2) network idle rate dropping below 0.4, indicating thread-pool saturation.
Investigation : Log analysis showed no errors. Hardware metrics (network, disk, memory, CPU, TCP connections) were within limits, pointing to service‑level bottlenecks.
Parameter Optimization: Raised num.io.threads from 32 to 128 and kept num.network.threads at 64. Also tested log.flush.interval.messages=10000 and log.flush.interval.ms=1000, which had no effect on the idle rate.
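The broker-side changes above can be sketched as a server.properties fragment. The values come from the article; the comments and surrounding context are assumptions, and defaults vary by Kafka version:

```properties
# server.properties sketch of the tuning described above
num.io.threads=128        # raised from 32 to relieve request-handler saturation
num.network.threads=64    # left at its existing value
# Flush-interval settings that were tested but showed no effect on the idle rate:
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```

Raising num.io.threads adds request-handler threads that service the request queue; when that pool saturates, the network idle rate collapses even though hardware metrics look healthy.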
Monitoring Enhancements: Added JMX metrics such as RequestQueueSize, ResponseQueueSize, log flush duration, broker produce request count, P99 latency for produce/fetch, and TotalProduceRequestsPerSec to pinpoint bottlenecks.
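These metrics map onto standard Kafka broker MBeans; a sketch of the JMX object names one would poll (the attribute names, e.g. 99thPercentile, follow Kafka's usual metrics conventions, and exact names can vary by version):

```text
kafka.network:type=RequestChannel,name=RequestQueueSize
kafka.network:type=RequestChannel,name=ResponseQueueSize,processor=*
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce        (attribute: 99thPercentile)
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer  (attribute: 99thPercentile)
kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec
```

Watching RequestQueueSize alongside the idle rate is what distinguishes a saturated handler pool from a genuinely underpowered broker.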
Filebeat Tuning: Set bulk_flush_frequency to 0.1 (default 0) and bulk_max_size to 1024 (default 2048) so Filebeat accumulates fuller batches instead of flushing immediately; this cut the per-topic Kafka request count by roughly 10×.
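As a sketch, the change corresponds to a filebeat.yml fragment like the following. The two batching values are from the article; the hosts and topic entries are placeholders, and bulk_flush_frequency is assumed to be in seconds:

```yaml
# filebeat.yml sketch of the Kafka-output batching change
output.kafka:
  hosts: ["kafka-broker:9092"]   # placeholder broker address
  topic: "app-logs"              # placeholder topic name
  bulk_flush_frequency: 0.1      # default 0 = flush immediately; wait to fill batches
  bulk_max_size: 1024            # default 2048; cap on events per Kafka request
```

The win comes almost entirely from bulk_flush_frequency: with the default of 0, each small burst of log lines becomes its own produce request, which multiplies broker-side request handling.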
Validation & Rollout: Gray-scale tests on representative topics cut production request count to one-tenth, raised the idle rate by 0.02–0.06, and lowered per-node CPU usage by ~5%. After full rollout, cluster CPU usage fell from 55% to 32% (saving 1334 cores) and request volume fell from 6 billion/min to 2.3 billion/min.
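The 1334-core figure is consistent with the stated cluster size; a quick back-of-the-envelope check:

```python
# Sanity check of the reported core savings:
# 145 nodes x 40 cores each, with CPU utilization falling from 55% to 32%.
nodes, cores_per_node = 145, 40
total_cores = nodes * cores_per_node          # 5800 cores in the cluster
saved = total_cores * (0.55 - 0.32)           # 23-point utilization drop
print(round(saved))                           # 1334
```

So the per-cluster saving of 1334 cores, extended across the three clusters, lines up with the ~2000-core total reported below.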
Overall Impact: Across three Kafka clusters the optimizations saved roughly 2000 CPU cores, raised the average network idle rate from 0.72 to 0.93, lowered average network traffic from 2.5 GB to 2 GB, and decreased average disk usage from 44% to 35%.
Future Work: Plans include adding disks and upgrading network cards to further improve performance.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers, sharing cutting-edge technology trends and offering mid-to-senior technical professionals a free venue for discussion.