How Qunar Cut Kafka CPU Usage by 2000 Cores: A Deep Dive into Performance Optimization
This case study details how Qunar’s engineering team analyzed and resolved severe Kafka production bottlenecks during peak traffic by adding targeted monitoring, tuning thread and filebeat parameters, and validating changes through gray‑scale tests, ultimately saving 2000 CPU cores across three clusters.
Background
Qunar operates a Kafka logging cluster of 145 nodes, each equipped with a 3 TB SSD, 40 CPU cores and 128 GB RAM. The cluster handles massive traffic—1.3 TB per minute, 20 billion messages per minute, and over 1.5 PB daily (more than 2 trillion messages).
Key Terminology
Broker : a single Kafka node.
Network idle rate : the average idle proportion of the network thread pool; values near 1 indicate low load, while below 0.3 signal a bottleneck.
Request queue : the queue where client requests wait before being processed.
Kubernetes : the container orchestration platform used to run the filebeat sidecar for log collection.
Production Pain Points
During the 2023 Spring Festival stress test, the cluster could not sustain load; some clients experienced consumption backlog and production failures.
Peak periods required manual load‑balancing because the network idle rate dropped below 0.4, indicating severe pressure on the service.
Optimization Process
The team first confirmed that adding more nodes would be costly and that hardware resources were not fully saturated, so the focus shifted to Kafka‑level improvements.
1. Log Investigation
No obvious errors were found in server logs, so the issue could not be pinpointed there.
2. Hardware Metrics Review
Network traffic rose on ingress during the test, but egress remained stable.
Disk I/O usage increased but never hit the ceiling.
Memory usage stayed flat.
CPU usage paradoxically decreased, confirming a service‑level bottleneck.
3. Thread Parameter Tuning
The default values num.io.threads=3 and num.network.threads=8 were already increased to num.io.threads=128 and num.network.threads=64, but this yielded no further improvement.
4. Monitoring Expansion
Additional JMX metrics were collected, including:
RequestQueueSize, ResponseQueueSize
Log flush duration
BROKER produce request count
Produce latency P999, FetchFollower latency P999, FetchConsumer latency P999
TotalFetchRequestsPerSec, TotalProduceRequestsPerSec
5. Flush‑Time Verification
Adjusting log.flush.interval.messages and log.flush.interval.ms did not affect the network idle rate.
6. Disk I/O Test
Switching from a single 3 TB SSD to a dual‑disk setup doubled theoretical I/O capacity, yet the idle rate remained unchanged (0.67 → 0.68).
7. Request‑Queue Analysis
When the request queue approached its limit, the network idle rate and CPU usage both dropped, reproducing the production symptom.
8. Filebeat Parameter Optimization
Two filebeat settings were missing: bulk_flush_frequency (default 0, no wait) and bulk_max_size (default 2048). Tests identified an optimal configuration:
bulk_flush_frequency: 0.1</code>
<code>bulk_max_size: 1024This reduced Kafka topic request counts by roughly tenfold without data loss.
9. Compression Tests
Using SNAPPY compression, batch sizes above 50 messages yielded stable compression ratios, lowering both network traffic and disk usage.
Validation and Rollout
Gray‑scale tests on representative topics showed production request counts dropped to one‑tenth, node idle rates rose by 0.02‑0.06, and CPU usage per node fell by about five percentage points.
After a full rollout, the following improvements were observed:
CPU usage fell from 52% to 34%.
Request volume decreased from 6 billion to 3.5 billion.
Client request count dropped from 6 billion to 2.3 billion.
Disk utilization fell from 44% to 35%.
Network traffic reduced from 2.5 GB/s to 2 GB/s.
Overall, the three Kafka clusters saved a total of 2000 CPU cores.
Future Actions
Disk: add an additional 3 TB SSD to each node (current disks are 3 TB, peak usage reaches 85%).
Network: upgrade NICs from 10 Gbps to 25 Gbps or dual‑channel 20 Gbps, as peak outbound traffic exceeds 4 Gbps per node.
These upgrades are expected to boost cluster performance by more than 50% without changing CPU or memory.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
