Big Data 22 min read

How Meituan Scaled Kafka to 15k Nodes and Overcame Latency & Management Challenges

Meituan’s data platform uses Kafka as a unified cache and distribution layer, and as the cluster grew beyond 15,000 brokers and daily traffic exceeded 30 PB, the team tackled slow nodes, read/write latency, large‑scale cluster management, full‑link monitoring, service lifecycle, and future cloud‑native deployment challenges.

dbaplus Community

Aug 23, 2022

How Meituan Scaled Kafka to 15k Nodes and Overcame Latency & Management Challenges

Current State and Challenges

Kafka is the streaming storage layer of Meituan’s data platform, caching and distributing logs from system, client and database sources to downstream consumers such as offline ODS, real‑time jobs, DataLink and OLAP analysis. The deployment exceeds 15,000 machines, with single clusters up to 2,000 nodes, daily traffic >30 PB and peak throughput >400 M messages/s.

Key challenges:

Slow brokers (TP99 read/write latency >300 ms) caused by load imbalance, insufficient PageCache, and a consumer thread‑model that mixes real‑time and back‑track reads.

Management complexity at massive scale: topic interference, missing broker‑level metrics, delayed fault detection, and rack‑level failures.

Read/Write Latency Optimizations

Application‑Layer Improvements

Disk balancing : A Rebalancer component continuously computes per‑disk usage, generates partition‑migration plans, submits them to Zookeeper’s Reassign node, and monitors progress.

Pipeline‑style migration : New partitions can be submitted while earlier ones are still in flight, avoiding the “batch‑lock” of the native Kafka migrator.

Migration cancellation : Administrators can abort an ongoing migration, preventing PageCache pollution and allowing immediate partition expansion.

Fetcher isolation : ISR followers share a dedicated fetcher thread, non‑ISR followers use a separate fetcher. This prevents large back‑track reads from blocking low‑latency real‑time reads.

Consumer async fetch : Introduce a pool of asynchronous fetch threads that pull ready data into a CompleteQueue and limit concurrent partitions per thread to avoid GC/OOM spikes.

System‑Layer Improvements

RAID cache acceleration : Use RAID controller cache to coalesce small random writes into larger sequential blocks, improving HDD random‑write performance.

Cgroup & NUMA isolation : Deploy Kafka on dedicated physical cores, pin all hyper‑threads to the same NUMA node, and keep Kafka separate from CPU‑intensive Flink/Storm workloads to eliminate CPU‑memory contention.

SSD‑based tiered storage : Two designs were evaluated – kernel‑level cache (OpenCAS/FlashCache) and a Kafka‑level tiered storage. The latter was chosen because it can explicitly route recent logs to SSD while older logs stay on HDD, avoiding PageCache pollution.

Large‑Scale Cluster Management Optimizations

Isolation strategy :

Business‑level isolation – each major business (e.g., delivery, in‑store, selection) runs in an independent Kafka cluster.

Role isolation – brokers, controllers and Zookeeper are deployed on separate machines.

Priority isolation – VIP clusters host high‑availability topics with extra redundancy.

Full‑link monitoring : Collect metrics and logs from producers, brokers, controllers and consumers. Visualize request latency breakdown (RequestQueueTime, LocalTime, RemoteTime, ResponseTime) to pinpoint bottlenecks quickly.

Service lifecycle management : Automatic state transitions from service start to machine retirement, linking service and machine status. Manual state changes are prohibited; only the automation system can trigger transitions.

TOR disaster recovery : Enforce rack‑aware replica placement so that replicas of the same partition never share a rack, guaranteeing availability after a rack failure.

Future Outlook

Planned work focuses on increasing robustness through finer‑grained isolation (client‑side fault avoidance, multi‑queue request segregation), hot‑swap deployments, and network‑level back‑pressure. Meituan also aims to decouple Kafka from mixed‑workload deployments, targeting the same throughput with roughly one‑quarter of the current machine count, and to explore cloud‑native streaming storage solutions.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

kafka Read‑Write Latency SSD Caching Large-Scale Cluster

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.