
Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

This article details Meituan's large‑scale Kafka deployment, describing the current state, performance challenges such as slow nodes and disk imbalance, and the comprehensive optimizations applied—including read/write latency reductions, migration pipelines, fetcher isolation, SSD caching, RAID acceleration, cgroup isolation, full‑link monitoring, service lifecycle management, and TOR disaster recovery—to improve reliability and prepare for future growth.

Java Architect Essentials

Kafka serves as the unified data cache and distribution layer in Meituan's data platform. It runs on more than 15,000 machines, with over 2,000 per cluster, and handles daily traffic exceeding 30 PB at peak rates of 4 × 10⁸ messages per second.

1. Current Status and Challenges

Slow nodes (brokers whose TP99 read/write latency exceeds 300 ms) arise from load imbalance, limited PageCache capacity, and defects in the consumer thread model. Large-scale cluster management faces cross-topic interference, insufficient broker-level metrics, delayed fault detection, and rack-level failures.
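To make the slow-node criterion concrete, the sketch below flags a broker whose 99th-percentile latency exceeds the 300 ms threshold the article cites. The class, method names, and nearest-rank percentile choice are illustrative assumptions, not Meituan's actual monitoring code.

```java
import java.util.*;

// Hypothetical slow-node check: TP99 request latency above 300 ms.
public class SlowNodeDetector {
    static final double TP99_THRESHOLD_MS = 300.0;

    // Nearest-rank percentile: sort samples, take the value at ceil(0.99 * n) - 1.
    static double tp99(List<Double> latenciesMs) {
        List<Double> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(0.99 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }

    static boolean isSlowNode(List<Double> latenciesMs) {
        return tp99(latenciesMs) > TP99_THRESHOLD_MS;
    }

    public static void main(String[] args) {
        // 100 samples: 98 fast requests plus two 500 ms outliers -> TP99 = 500 ms.
        List<Double> samples = new ArrayList<>();
        for (int i = 0; i < 98; i++) samples.add(20.0);
        samples.add(500.0);
        samples.add(500.0);
        System.out.println("tp99=" + tp99(samples) + " slow=" + isSlowNode(samples));
    }
}
```

In production the samples would come from per-broker request metrics rather than an in-memory list, but the threshold logic is the same.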

2. Read/Write Latency Optimizations

The team split the problem into an application layer and a system layer. Application-level fixes include disk load balancing, a three-step partition migration plan (plan generation, submission, verification), pipeline acceleration, migration cancellation, and fetcher isolation. System-level improvements address PageCache pollution, poor HDD random-read performance, and resource contention through RAID-card acceleration and cgroup isolation. On the consumer side, asynchronous processing introduces a dedicated fetch thread so that slow message processing no longer blocks the single NIO network thread.
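The dedicated-fetch-thread idea can be sketched with plain JDK concurrency: one thread does nothing but fetch and enqueue records, while a worker pool processes them, so processing stalls never block the fetch path. The Kafka poll() call is replaced by a synthetic record source to keep the example self-contained; all names here are hypothetical, not Meituan's implementation.

```java
import java.util.concurrent.*;

// Illustrative "dedicated fetch thread" consumer pattern.
public class AsyncFetchConsumer {
    public static int run(int totalRecords, int workers) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
        CountDownLatch done = new CountDownLatch(totalRecords);

        // Fetch thread: a real consumer would call consumer.poll() here and
        // enqueue each ConsumerRecord instead of a synthetic string.
        Thread fetcher = new Thread(() -> {
            for (int i = 0; i < totalRecords; i++) {
                try { queue.put("record-" + i); }
                catch (InterruptedException e) { return; }
            }
        });
        fetcher.start();

        // Worker pool: processing happens off the fetch path.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                while (done.getCount() > 0) {
                    String rec = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (rec != null) done.countDown();  // "process" the record
                }
                return null;
            });
        }

        done.await();           // wait until every record has been processed
        pool.shutdownNow();
        fetcher.join();
        return totalRecords;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("processed=" + run(1000, 4));
    }
}
```

The bounded queue also gives natural back-pressure: if workers fall behind, the fetch thread blocks on put() instead of buffering unboundedly.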

3. Large-Scale Cluster Management Optimizations

Isolation strategies separate business domains, broker/controller roles, and priority tiers (VIP clusters). Full-link monitoring collects end-to-end metrics to pinpoint latency sources, while service lifecycle management keeps service and machine states in sync, reducing manual errors. TOR disaster recovery places the replicas of each partition on different racks, preserving availability during rack failures.
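The TOR disaster-recovery invariant can be checked mechanically: a partition survives a rack failure only if its replica set spans at least two racks. The checker below is a minimal sketch under that assumption; the broker-to-rack map and names are illustrative, not Meituan's tooling.

```java
import java.util.*;

// Sketch of a rack-awareness check for TOR disaster recovery.
public class RackAwarenessChecker {
    // True if the partition's replicas live under at least two distinct racks.
    static boolean survivesRackFailure(List<Integer> replicaBrokers,
                                       Map<Integer, String> brokerRack) {
        Set<String> racks = new HashSet<>();
        for (int broker : replicaBrokers) {
            racks.add(brokerRack.get(broker));
        }
        return racks.size() >= 2;
    }

    public static void main(String[] args) {
        Map<Integer, String> brokerRack = Map.of(
                1, "tor-a", 2, "tor-a", 3, "tor-b", 4, "tor-c");

        // Both replicas under tor-a: one switch failure loses the partition.
        System.out.println(survivesRackFailure(List.of(1, 2), brokerRack));
        // Replicas spread across tor-a and tor-b: a rack failure leaves a copy.
        System.out.println(survivesRackFailure(List.of(1, 3), brokerRack));
    }
}
```

Open-source Kafka expresses the same idea through the broker.rack setting, which the partition assigner uses to spread replicas across racks at creation time.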

4. Future Outlook

Ongoing work focuses on increasing robustness through finer-grained isolation, client-side fault avoidance, hot-swap capabilities, and network back-pressure. The team also plans to reduce the hardware footprint while maintaining throughput and to explore cloud-native deployment of the streaming service.

Tags: Performance Optimization, Big Data, Kafka, Cluster Management, Latency Reduction, Meituan
Written by

Java Architect Essentials

Committed to sharing quality articles and tutorials that help Java programmers progress from junior developer through mid-level to senior architect, curating high-quality learning resources, interview questions, videos, and projects from across the internet.
