How Meituan Optimized Kafka for Massive Scale: Reducing Latency and Managing Clusters
This article details Meituan's real‑world challenges with a 15,000‑node Kafka deployment and explains the application‑layer and system‑layer optimizations—such as disk balancing, migration pipeline acceleration, fetcher isolation, RAID acceleration, cgroup isolation, and an SSD‑based cache—that together dramatically cut read/write latency and simplify large‑scale cluster management.
Preface
Kafka plays a role of unified data cache and distribution in Meituan's data platform. With growing data volume and cluster size, Kafka faces increasing challenges. This article shares the practical challenges and targeted optimizations.
1. Current Situation and Challenges
1.1 Current Situation
Kafka is used as the streaming storage layer, caching and distributing logs from system, client, and business databases to downstream consumers for offline, real‑time, log‑center, and OLAP use. The cluster exceeds 15,000 machines, with single clusters up to 2,000 machines, daily message volume over 30 PB and peak 4 billion messages/second.
1.2 Challenges
Two main challenges:
Slow nodes affecting read/write (TP99 > 300 ms). Causes: load imbalance, PageCache capacity limits, consumer thread‑model defects.
Complexity of managing large‑scale clusters: topic interference, insufficient broker metrics, slow fault detection, rack‑level failures.
2. Read/Write Latency Optimizations
Optimizations are divided into application‑layer and system‑layer.
2.1 Overview
Diagram shows factors and corresponding solutions.
2.2 Application Layer
Issues: broker load imbalance, inefficient data migration, consumer single‑thread model.
Disk balancing via Rebalancer generates migration plans based on target and current disk usage, submits to Zookeeper, and monitors execution.
Migration pipeline acceleration, migration cancellation, and Fetcher isolation to separate migration traffic from real‑time reads.
Consumer asynchronous pulling: introduce async fetch threads, limit concurrency, avoid single‑thread bottleneck.
Pipeline acceleration allows later partitions to be submitted even if an earlier one is blocked.
Migration cancellation prevents long‑tail partitions from blocking read/write requests.
Fetcher isolation separates ISR followers from non‑ISR followers so that real‑time reads are not slowed by large back‑track reads.
Consumer asynchronous pulling introduces a background thread that continuously fetches ready data and places it into a complete queue, reducing latency caused by the native single‑thread NIO model.
2.3 System Layer
Raid card acceleration improves HDD random‑write performance by caching and merging writes into larger blocks.
Cgroup isolation prevents resource contention between IO‑intensive Kafka and CPU‑intensive Flink/Storm, and keeps Kafka on a single NUMA node.
SSD‑based cache architecture stores hot segments on SSD, keeps older data on HDD without writing back to SSD, and defines three segment states (OnlyCache, Cached, WithoutCache). An asynchronous thread syncs SSD data to HDD, and space‑threshold eviction removes the oldest data when SSD fills.
3. Large‑Scale Cluster Management Optimizations
3.1 Isolation Strategies
Business isolation (separate clusters per business), role isolation (separate brokers, controllers, Zookeeper), and priority isolation (VIP clusters for high‑availability topics).
3.2 End‑to‑End Monitoring
Collect metrics and logs from core components, locate bottlenecks via full‑link monitoring, and automate fault detection and handling.
3.3 Service Lifecycle Management
Automated lifecycle management links service and machine states, prevents manual state changes, and ensures consistent status updates.
3.4 TOR Disaster Recovery
Guarantee that replicas of a partition reside in different racks so that a rack failure does not cause data loss.
4. Future Outlook
Future work will focus on improving robustness, narrowing fault domains with finer isolation, enabling client‑side node avoidance, multi‑queue request isolation, hot‑swap, network back‑pressure, and exploring cloud‑native deployment while reducing machine count.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
