Optimizing Kafka Production at Qunar Travel: Reducing CPU Usage by 2000 Cores
This article is a case study of how Qunar Travel identified and resolved Kafka production bottlenecks—through metric monitoring, thread and flush parameter tuning, and Filebeat batch adjustments—resulting in a 2000‑core CPU reduction, higher network idle rates, and lower resource consumption across three clusters.
Background: Qunar Travel operates a 145-node Kafka log cluster (3 TB SSD, 40 cores, 128 GB RAM per node) handling massive traffic: 1.3 TB/min, 20 billion messages/min, 1.5 PB/day, and over 2 trillion messages/day. During the 2023 Spring Festival peak, the cluster suffered severe performance degradation.
Terminology: Definitions are provided for Kafka broker, network idle rate, request queue, and Kubernetes concepts such as pod and sidecar.
Production Pain Points: (1) inability to sustain peak load, causing consumer backlog and production failures; (2) network idle rate dropping below 0.4, indicating thread-pool saturation.
Investigation : Log analysis showed no errors. Hardware metrics (network, disk, memory, CPU, TCP connections) were within limits, pointing to service‑level bottlenecks.
Parameter Optimization: Raised num.io.threads from 32 to 128 and kept num.network.threads at 64. Also tested log.flush.interval.messages=10000 and log.flush.interval.ms=1000, which had no effect on the idle rate.
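The broker-side changes above can be sketched as a server.properties fragment. The values come from the article; the comments and surrounding context are assumptions, and defaults vary by Kafka version:

```properties
# server.properties sketch of the tuning described above
num.io.threads=128        # raised from 32 to relieve request-handler saturation
num.network.threads=64    # left at its existing value
# Flush-interval settings that were tested but showed no effect on the idle rate:
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```

Raising num.io.threads adds request-handler threads that service the request queue; when that pool saturates, the network idle rate collapses even though hardware metrics look healthy.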
Monitoring Enhancements: Added JMX metrics such as RequestQueueSize, ResponseQueueSize, log flush duration, broker produce request count, P99 latency for produce/fetch, and TotalProduceRequestsPerSec to pinpoint bottlenecks.
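These metrics map onto standard Kafka broker MBeans; a sketch of the JMX object names one would poll (the attribute names, e.g. 99thPercentile, follow Kafka's usual metrics conventions, and exact names can vary by version):

```text
kafka.network:type=RequestChannel,name=RequestQueueSize
kafka.network:type=RequestChannel,name=ResponseQueueSize,processor=*
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce        (attribute: 99thPercentile)
kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer  (attribute: 99thPercentile)
kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec
```

Watching RequestQueueSize alongside the idle rate is what distinguishes a saturated handler pool from a genuinely underpowered broker.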
Filebeat Tuning: Set bulk_flush_frequency to 0.1 (default 0) and bulk_max_size to 1024 (default 2048) so Filebeat accumulates fuller batches instead of flushing immediately; this cut the per-topic Kafka request count by roughly 10×.
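As a sketch, the change corresponds to a filebeat.yml fragment like the following. The two batching values are from the article; the hosts and topic entries are placeholders, and bulk_flush_frequency is assumed to be in seconds:

```yaml
# filebeat.yml sketch of the Kafka-output batching change
output.kafka:
  hosts: ["kafka-broker:9092"]   # placeholder broker address
  topic: "app-logs"              # placeholder topic name
  bulk_flush_frequency: 0.1      # default 0 = flush immediately; wait to fill batches
  bulk_max_size: 1024            # default 2048; cap on events per Kafka request
```

The win comes almost entirely from bulk_flush_frequency: with the default of 0, each small burst of log lines becomes its own produce request, which multiplies broker-side request handling.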
Validation & Rollout: Gray-scale tests on representative topics cut production request count to one-tenth, raised the idle rate by 0.02–0.06, and lowered per-node CPU usage by ~5%. After full rollout, cluster CPU usage fell from 55% to 32% (saving 1334 cores) and request volume fell from 6 billion/min to 2.3 billion/min.
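The 1334-core figure is consistent with the stated cluster size; a quick back-of-the-envelope check:

```python
# Sanity check of the reported core savings:
# 145 nodes x 40 cores each, with CPU utilization falling from 55% to 32%.
nodes, cores_per_node = 145, 40
total_cores = nodes * cores_per_node          # 5800 cores in the cluster
saved = total_cores * (0.55 - 0.32)           # 23-point utilization drop
print(round(saved))                           # 1334
```

So the per-cluster saving of 1334 cores, extended across the three clusters, lines up with the ~2000-core total reported below.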
Overall Impact: Across three Kafka clusters the optimizations saved roughly 2000 CPU cores, raised the average network idle rate from 0.72 to 0.93, lowered average network traffic from 2.5 GB to 2 GB, and decreased average disk usage from 44% to 35%.
Future Work: Plans include adding disks and upgrading network cards to further improve performance.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers, sharing cutting-edge technology trends and offering mid-to-senior technical professionals a free venue for discussion.