Big Data 14 min read

How Qunar Cut Kafka CPU Usage by 2000 Cores: A Deep Dive into Performance Optimization

This case study details how Qunar’s engineering team analyzed and resolved severe Kafka production bottlenecks during peak traffic by adding targeted monitoring, tuning thread and filebeat parameters, and validating changes through gray‑scale tests, ultimately saving 2000 CPU cores across three clusters.

dbaplus Community

Jul 23, 2024

How Qunar Cut Kafka CPU Usage by 2000 Cores: A Deep Dive into Performance Optimization

Background

Qunar operates a Kafka logging cluster of 145 nodes, each equipped with a 3 TB SSD, 40 CPU cores and 128 GB RAM. The cluster handles massive traffic—1.3 TB per minute, 20 billion messages per minute, and over 1.5 PB daily (more than 2 trillion messages).

Key Terminology

Broker : a single Kafka node.

Network idle rate : the average idle proportion of the network thread pool; values near 1 indicate low load, while below 0.3 signal a bottleneck.

Request queue : the queue where client requests wait before being processed.

Kubernetes : the container orchestration platform used to run the filebeat sidecar for log collection.

Production Pain Points

During the 2023 Spring Festival stress test, the cluster could not sustain load; some clients experienced consumption backlog and production failures.

Peak periods required manual load‑balancing because the network idle rate dropped below 0.4, indicating severe pressure on the service.

Optimization Process

The team first confirmed that adding more nodes would be costly and that hardware resources were not fully saturated, so the focus shifted to Kafka‑level improvements.

1. Log Investigation

No obvious errors were found in server logs, so the issue could not be pinpointed there.

2. Hardware Metrics Review

Network traffic rose on ingress during the test, but egress remained stable.

Disk I/O usage increased but never hit the ceiling.

Memory usage stayed flat.

CPU usage paradoxically decreased, confirming a service‑level bottleneck.

3. Thread Parameter Tuning

The default values num.io.threads=3 and num.network.threads=8 were already increased to num.io.threads=128 and num.network.threads=64, but this yielded no further improvement.

4. Monitoring Expansion

Additional JMX metrics were collected, including:

RequestQueueSize, ResponseQueueSize

Log flush duration

BROKER produce request count

Produce latency P999, FetchFollower latency P999, FetchConsumer latency P999

TotalFetchRequestsPerSec, TotalProduceRequestsPerSec

5. Flush‑Time Verification

Adjusting log.flush.interval.messages and log.flush.interval.ms did not affect the network idle rate.

6. Disk I/O Test

Switching from a single 3 TB SSD to a dual‑disk setup doubled theoretical I/O capacity, yet the idle rate remained unchanged (0.67 → 0.68).

7. Request‑Queue Analysis

When the request queue approached its limit, the network idle rate and CPU usage both dropped, reproducing the production symptom.

8. Filebeat Parameter Optimization

Two filebeat settings were missing: bulk_flush_frequency (default 0, no wait) and bulk_max_size (default 2048). Tests identified an optimal configuration:

bulk_flush_frequency: 0.1</code>
<code>bulk_max_size: 1024

This reduced Kafka topic request counts by roughly tenfold without data loss.

9. Compression Tests

Using SNAPPY compression, batch sizes above 50 messages yielded stable compression ratios, lowering both network traffic and disk usage.

Validation and Rollout

Gray‑scale tests on representative topics showed production request counts dropped to one‑tenth, node idle rates rose by 0.02‑0.06, and CPU usage per node fell by about five percentage points.

After a full rollout, the following improvements were observed:

CPU usage fell from 52% to 34%.

Request volume decreased from 6 billion to 3.5 billion.

Client request count dropped from 6 billion to 2.3 billion.

Disk utilization fell from 44% to 35%.

Network traffic reduced from 2.5 GB/s to 2 GB/s.

Overall, the three Kafka clusters saved a total of 2000 CPU cores.

Future Actions

Disk: add an additional 3 TB SSD to each node (current disks are 3 TB, peak usage reaches 85%).

Network: upgrade NICs from 10 Gbps to 25 Gbps or dual‑channel 20 Gbps, as peak outbound traffic exceeds 4 Gbps per node.

These upgrades are expected to boost cluster performance by more than 50% without changing CPU or memory.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Monitoring Performance Optimization Big Data jmx filebeat

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.