Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

This article details Meituan's large‑scale Kafka deployment—over 15,000 machines and petabyte‑level daily traffic—its operational challenges such as slow nodes, load imbalance, and resource contention, and the comprehensive read/write latency, system‑level, and cluster‑management optimizations implemented to improve performance and reliability.

Architecture Digest
Architecture Digest
Architecture Digest
Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

1. Current Situation and Challenges Meituan's Kafka cluster serves as the unified data cache and distribution layer for the data platform, with more than 15,000 machines overall and clusters up to 2,000 machines; daily message volume exceeds 30 PB and peaks at over 400 million messages per second. The rapid growth brings two major challenges: (1) slow nodes that increase read/write latency, caused by load imbalance, insufficient PageCache, and consumer thread‑model defects; (2) the complexity of managing a massive cluster, including topic interference, inadequate broker‑level metrics, delayed fault detection, and rack‑level failures.

2. Read/Write Latency Optimizations

Application Layer

Disk balancing – prioritize idle disks for partition migration to avoid hot spots.

Migration pipeline – generate, submit, and monitor migration plans via a Rebalancer component.

Migration cancel – allow administrators to abort problematic migrations, preventing PageCache pollution and enabling timely partition expansion.

Fetcher isolation – separate fetcher threads for ISR followers and non‑ISR followers to protect real‑time reads from large back‑log reads.

Consumer async – introduce asynchronous fetch threads to continuously pull ready data, limiting concurrency to avoid GC/ OOM.

System Layer

Raid card acceleration – use RAID cache to merge random writes into larger sequential blocks, improving HDD random‑write performance.

Cgroup isolation – dedicate physical cores to Kafka and keep all hyper‑threads on the same NUMA node, eliminating CPU contention with Flink/Storm.

SSD‑based cache architecture – store recent segments on SSD (Only‑Cache, Cached, Without‑Cache states), asynchronously sync to HDD, and avoid SSD‑to‑HDD write‑back that would pollute PageCache.

3. Large‑Scale Cluster Management Optimizations

Isolation strategy – business‑level isolation (separate clusters per business), role isolation (dedicated brokers, controllers, Zookeeper), and priority isolation (VIP clusters for critical topics).

Full‑link monitoring – collect metrics from all Kafka components, visualize end‑to‑end latency breakdown, and quickly locate bottlenecks such as remote replication delays.

Service lifecycle management – automated state transitions from service start to machine retirement, with machine‑service state coupling and prohibition of manual changes.

TOR disaster recovery – ensure replicas of the same partition reside on different racks, guaranteeing availability even when an entire rack fails.

4. Future Outlook The team plans to further improve robustness through finer‑grained isolation, client‑side fault avoidance, multi‑queue request segregation, hot‑swap support, and network‑level back‑pressure. Additionally, they aim to reduce the hardware footprint while maintaining throughput and explore cloud‑native deployment of the streaming storage service.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsPerformance OptimizationKafkaCluster Management
Architecture Digest
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.