Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability
This article details Meituan's large‑scale Kafka deployment—over 15,000 machines and petabyte‑level daily traffic—its operational challenges such as slow nodes, load imbalance, and resource contention, and the comprehensive read/write latency, system‑level, and cluster‑management optimizations implemented to improve performance and reliability.
1. Current Situation and Challenges Meituan's Kafka cluster serves as the unified data cache and distribution layer for the data platform, with more than 15,000 machines overall and clusters up to 2,000 machines; daily message volume exceeds 30 PB and peaks at over 400 million messages per second. The rapid growth brings two major challenges: (1) slow nodes that increase read/write latency, caused by load imbalance, insufficient PageCache, and consumer thread‑model defects; (2) the complexity of managing a massive cluster, including topic interference, inadequate broker‑level metrics, delayed fault detection, and rack‑level failures.
2. Read/Write Latency Optimizations
Application Layer
Disk balancing – prioritize idle disks for partition migration to avoid hot spots.
Migration pipeline – generate, submit, and monitor migration plans via a Rebalancer component.
Migration cancel – allow administrators to abort problematic migrations, preventing PageCache pollution and enabling timely partition expansion.
Fetcher isolation – separate fetcher threads for ISR followers and non‑ISR followers to protect real‑time reads from large back‑log reads.
Consumer async – introduce asynchronous fetch threads to continuously pull ready data, limiting concurrency to avoid GC/ OOM.
System Layer
Raid card acceleration – use RAID cache to merge random writes into larger sequential blocks, improving HDD random‑write performance.
Cgroup isolation – dedicate physical cores to Kafka and keep all hyper‑threads on the same NUMA node, eliminating CPU contention with Flink/Storm.
SSD‑based cache architecture – store recent segments on SSD (Only‑Cache, Cached, Without‑Cache states), asynchronously sync to HDD, and avoid SSD‑to‑HDD write‑back that would pollute PageCache.
3. Large‑Scale Cluster Management Optimizations
Isolation strategy – business‑level isolation (separate clusters per business), role isolation (dedicated brokers, controllers, Zookeeper), and priority isolation (VIP clusters for critical topics).
Full‑link monitoring – collect metrics from all Kafka components, visualize end‑to‑end latency breakdown, and quickly locate bottlenecks such as remote replication delays.
Service lifecycle management – automated state transitions from service start to machine retirement, with machine‑service state coupling and prohibition of manual changes.
TOR disaster recovery – ensure replicas of the same partition reside on different racks, guaranteeing availability even when an entire rack fails.
4. Future Outlook The team plans to further improve robustness through finer‑grained isolation, client‑side fault avoidance, multi‑queue request segregation, hot‑swap support, and network‑level back‑pressure. Additionally, they aim to reduce the hardware footprint while maintaining throughput and explore cloud‑native deployment of the streaming storage service.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
