How Qunar Travel Cut 2000 CPU Cores by Optimizing Kafka Production

This case study details how Qunar Travel's engineering team analyzed Kafka production bottlenecks during peak traffic, added targeted monitoring, tuned thread and batch parameters, and validated the changes through gray‑scale tests, ultimately saving about 2000 CPU cores across three clusters while reducing request volume and improving network and disk utilization.

ITPUB

Background

Qunar Travel runs a Kafka log cluster with 145 nodes, each equipped with a 3 TB SSD, 40 CPU cores, and 128 GB RAM. During the Chinese New Year peak the cluster processes 1.3 TB per minute, 20 billion messages per minute, and 1.5 PB of data daily.

Terminology

Broker – a Kafka node.

Network idle rate – the average fraction of time that threads in the network thread pool are idle; a value of 1 means completely idle, while a value below 0.3 indicates a performance bottleneck.

Request queue – client requests waiting to be processed.
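
To make the network idle rate threshold concrete, here is a minimal sketch that flags brokers whose idle rate is below 0.3. The broker names and readings are hypothetical, not data from the article. In stock Kafka this figure is usually read from the NetworkProcessorAvgIdlePercent metric, though the article does not say which metric the team used.

```python
# Threshold taken from the definition above: below 0.3 signals a bottleneck.
BOTTLENECK_THRESHOLD = 0.3

def saturated_brokers(idle_by_broker, threshold=BOTTLENECK_THRESHOLD):
    """Return names of brokers whose network idle rate is below the threshold,
    ordered from least idle (most saturated) to most idle."""
    flagged = [(name, idle) for name, idle in idle_by_broker.items() if idle < threshold]
    return [name for name, _ in sorted(flagged, key=lambda item: item[1])]

# Hypothetical readings for illustration only.
sample = {"broker-1": 0.72, "broker-2": 0.05, "broker-3": 0.28, "broker-4": 1.0}
print(saturated_brokers(sample))
```

A check like this is mainly useful as an alerting rule: a handful of flagged brokers suggests uneven load, while most of the cluster being flagged suggests the ceiling described in the pain points below.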

Kubernetes terms

Kubernetes – container orchestration platform.

Pod – the smallest scheduling unit; may contain one or more containers. In this case Filebeat runs as a sidecar container inside the pod.

Production pain points

Pain point 1: During the 2023 New Year stress test the cluster could not keep up; some clients experienced consumption backlog and production failures.

Pain point 2: Network idle rate dropped below 0.4, with many machines reaching an idle rate close to 0, indicating the cluster had hit its performance ceiling.

Optimization process

Adding nodes would be costly, and hardware metrics were not saturated, so the team focused on Kafka‑level tuning.

Log inspection

No obvious errors appeared in server logs for machines with low idle rates.

Hardware inspection

Network traffic, disk I/O, memory, and CPU usage were all below their limits. CPU usage even decreased during the test, which pointed to a bottleneck in the Kafka service itself rather than in the hardware.

Parameter tuning

The team increased num.io.threads from 32 to 128 and left num.network.threads at 64. Adjusting log.flush.interval.messages and log.flush.interval.ms did not affect the idle rate.

Monitoring expansion

The team added monitoring for the following JMX metrics: RequestQueueSize, ResponseQueueSize, log‑flush duration, broker produce request count, P99 produce/fetch latency, and total fetch/produce requests per second.
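
For readers who want to collect the same signals, the sketch below maps the metrics listed above to JMX object names. The names follow the standard Apache Kafka monitoring documentation; the article does not list the exact MBeans the team used, so the mapping is an assumption. The `parse_object_name` helper is hypothetical and only splits a name into its domain and key properties.

```python
# MBean names per the standard Apache Kafka monitoring docs (assumed, not
# taken from the article). Newer Kafka versions may add a version= tag to
# the RequestsPerSec names.
BROKER_MBEANS = {
    "request queue size": "kafka.network:type=RequestChannel,name=RequestQueueSize",
    "response queue size": "kafka.network:type=RequestChannel,name=ResponseQueueSize",
    "log flush rate and time": "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs",
    "produce request rate": "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce",
    "produce total time": "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce",
    "fetch total time": "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchConsumer",
    "total produce requests": "kafka.server:type=BrokerTopicMetrics,name=TotalProduceRequestsPerSec",
    "total fetch requests": "kafka.server:type=BrokerTopicMetrics,name=TotalFetchRequestsPerSec",
}

def parse_object_name(object_name):
    """Split 'domain:key=value,key=value' into (domain, {key: value})."""
    domain, _, props = object_name.partition(":")
    return domain, dict(pair.split("=", 1) for pair in props.split(","))

for label, mbean in BROKER_MBEANS.items():
    domain, props = parse_object_name(mbean)
    print(f"{label:<24} {domain:<14} {props['name']}")
```

The timing MBeans are histograms, so P99 latency is typically read from their 99th-percentile attribute rather than from a separate metric.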

Filebeat batch parameters

The team identified two key Filebeat settings: bulk_flush_frequency (how long to wait before sending a batch) and bulk_max_size (the maximum number of messages per batch). After testing several combinations, they settled on bulk_flush_frequency=0.1 and bulk_max_size=1024, which reduced the produce request count roughly tenfold.
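
The interaction between the two settings can be illustrated with a toy model. This is a sketch under stated assumptions, not Filebeat's actual batching logic: a batch is sent when it is full or when the flush interval elapses, whichever comes first, and a flush interval of 0 is pessimistically treated as one request per event. The function name, the event rates, and the reading of 0.1 as seconds are all assumptions.

```python
def produce_requests_per_second(events_per_second, bulk_max_size, flush_interval_s):
    """Estimate produce requests/s for a single producer under a toy model.

    A batch goes out when it holds bulk_max_size events or when
    flush_interval_s has elapsed, whichever happens first.
    """
    if events_per_second <= 0:
        return 0.0
    if flush_interval_s <= 0:
        # Pessimistic assumption: no waiting means one request per event.
        return float(events_per_second)
    # Request rate if every batch fills up before the timer fires.
    size_bound = events_per_second / bulk_max_size
    # Request rate if the timer fires first (never more than one per event).
    time_bound = min(float(events_per_second), 1.0 / flush_interval_s)
    return max(size_bound, time_bound)

# Hypothetical producer emitting 100 events per second.
print(produce_requests_per_second(100, 1024, 0))    # no batching delay
print(produce_requests_per_second(100, 1024, 0.1))  # timer-limited batching
print(produce_requests_per_second(100_000, 1024, 0.1))  # size-limited batching
```

In this model the flush interval dominates for low-volume producers and the batch size dominates for high-volume ones, which is why both settings need to be tuned together. The size of the reduction depends entirely on the assumed event rate, so the numbers here do not reproduce the team's measurement.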

Test results

With the new parameters, request‑queue size and log‑flush time decreased, network traffic fell, and CPU usage dropped. SNAPPY compression tests showed that the compression ratio remained stable for batch sizes above 50, further saving network and disk bandwidth.
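
The effect of batch size on compression ratio can be explored with a small experiment. The sketch below uses zlib from the Python standard library as a stand-in because Snappy needs a third-party package; absolute ratios will differ from Snappy's, and the synthetic log lines are made up. Only the general shape (poor ratio for tiny batches, better for larger ones) is meant to carry over.

```python
import json
import zlib

def make_log_line(i):
    """Build a synthetic JSON log line; field names and values are invented."""
    return json.dumps({
        "ts": 1700000000 + i,
        "level": "INFO",
        "service": "order-api",
        "msg": f"request {i} handled in {(i * 7) % 200} ms",
    })

def compression_ratio(batch_size):
    """Uncompressed bytes divided by compressed bytes for one batch."""
    raw = "\n".join(make_log_line(i) for i in range(batch_size)).encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

for size in (1, 10, 50, 200, 1000):
    print(f"batch={size:>4}  ratio={compression_ratio(size):.2f}")
```

Running the same loop against real log samples and the actual codec is the more reliable way to find where the ratio levels off for a given workload.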

Verification and rollout

Gray‑scale tests on representative topics showed produce request count fell to one‑tenth of the original, idle rate increased by 0.02‑0.06, and partition‑CPU usage dropped about 5 %.

After full release, average CPU usage fell from 55 % to 32 % (saving 1334 cores), network idle rate rose from 0.72 to 0.93, client request volume dropped from 6 billion/min to 2.3 billion/min, disk usage fell from 44 % to 35 %, and network traffic decreased from 2.5 G to 2 G.

Overall impact

Across three Kafka clusters the optimizations saved roughly 2000 CPU cores (log cluster 1334, databus 470, PUB 206). The remaining bottlenecks are now network and disk; plans include adding disks and upgrading NICs to further boost performance.
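
The reported savings can be checked with simple arithmetic. This sketch assumes the 55 % and 32 % figures are averages over every core in the 145-node log cluster described in the Background section; all other numbers are copied from the article.

```python
# Log cluster size, from the Background section.
NODES = 145
CORES_PER_NODE = 40
total_cores = NODES * CORES_PER_NODE

# Average CPU usage before and after the rollout, in whole percent.
cpu_before_pct, cpu_after_pct = 55, 32
log_cluster_saved = total_cores * (cpu_before_pct - cpu_after_pct) // 100

# Per-cluster savings as reported in the article.
saved_by_cluster = {"log": 1334, "databus": 470, "PUB": 206}
total_saved = sum(saved_by_cluster.values())

# Client request volume before and after, in billions per minute.
requests_before, requests_after = 6.0, 2.3
request_drop = 1 - requests_after / requests_before

print(total_cores, log_cluster_saved, total_saved, f"{request_drop:.0%}")
```

Under that assumption the 23-point drop in CPU usage across 5800 cores reproduces the 1334-core figure, the three clusters sum to 2010 cores, and client request volume falls by about 62 %.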
