
Meituan Kafka at Scale: Challenges and Optimizations for Latency, Cluster Management, and Reliability

This article details Meituan's large‑scale Kafka deployment, describing the current state, performance challenges such as slow nodes and disk imbalance, and the comprehensive optimizations applied—including read/write latency reductions, migration pipelines, fetcher isolation, SSD caching, RAID acceleration, cgroup isolation, full‑link monitoring, service lifecycle management, and TOR disaster recovery—to improve reliability and prepare for future growth.

Java Architect Essentials

Kafka serves as the unified data cache and distribution layer in Meituan's data platform. It runs on more than 15,000 machines, with over 2,000 per cluster, and handles daily traffic exceeding 30 PB at peak rates of 4 × 10⁸ messages per second.

1. Current Status and Challenges

Slow nodes (brokers whose TP99 read/write latency exceeds 300 ms) arise from load imbalance, limited PageCache capacity, and defects in the consumer thread model. Large-scale cluster management faces cross-topic interference, insufficient broker-level metrics, delayed fault detection, and rack-level failures.
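To make the slow-node criterion concrete, the sketch below flags a broker whose 99th-percentile latency exceeds the 300 ms threshold the article cites. The class, method names, and nearest-rank percentile choice are illustrative assumptions, not Meituan's actual monitoring code.

```java
import java.util.*;

// Hypothetical slow-node check: TP99 request latency above 300 ms.
public class SlowNodeDetector {
    static final double TP99_THRESHOLD_MS = 300.0;

    // Nearest-rank percentile: sort samples, take the value at ceil(0.99 * n) - 1.
    static double tp99(List<Double> latenciesMs) {
        List<Double> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(0.99 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }

    static boolean isSlowNode(List<Double> latenciesMs) {
        return tp99(latenciesMs) > TP99_THRESHOLD_MS;
    }

    public static void main(String[] args) {
        // 100 samples: 98 fast requests plus two 500 ms outliers -> TP99 = 500 ms.
        List<Double> samples = new ArrayList<>();
        for (int i = 0; i < 98; i++) samples.add(20.0);
        samples.add(500.0);
        samples.add(500.0);
        System.out.println("tp99=" + tp99(samples) + " slow=" + isSlowNode(samples));
    }
}
```

In production the samples would come from per-broker request metrics rather than an in-memory list, but the threshold logic is the same.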

2. Read/Write Latency Optimizations

The team split the problem into an application layer and a system layer. Application-level fixes include disk load balancing, a three-step partition migration plan (plan generation, submission, verification), pipeline acceleration, migration cancellation, and fetcher isolation. System-level improvements address PageCache pollution, poor HDD random-read performance, and resource contention through RAID-card acceleration and cgroup isolation. On the consumer side, asynchronous processing introduces a dedicated fetch thread so that slow message processing no longer blocks the single NIO network thread.
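The dedicated-fetch-thread idea can be sketched with plain JDK concurrency: one thread does nothing but fetch and enqueue records, while a worker pool processes them, so processing stalls never block the fetch path. The Kafka poll() call is replaced by a synthetic record source to keep the example self-contained; all names here are hypothetical, not Meituan's implementation.

```java
import java.util.concurrent.*;

// Illustrative "dedicated fetch thread" consumer pattern.
public class AsyncFetchConsumer {
    public static int run(int totalRecords, int workers) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
        CountDownLatch done = new CountDownLatch(totalRecords);

        // Fetch thread: a real consumer would call consumer.poll() here and
        // enqueue each ConsumerRecord instead of a synthetic string.
        Thread fetcher = new Thread(() -> {
            for (int i = 0; i < totalRecords; i++) {
                try { queue.put("record-" + i); }
                catch (InterruptedException e) { return; }
            }
        });
        fetcher.start();

        // Worker pool: processing happens off the fetch path.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.submit(() -> {
                while (done.getCount() > 0) {
                    String rec = queue.poll(100, TimeUnit.MILLISECONDS);
                    if (rec != null) done.countDown();  // "process" the record
                }
                return null;
            });
        }

        done.await();           // wait until every record has been processed
        pool.shutdownNow();
        fetcher.join();
        return totalRecords;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("processed=" + run(1000, 4));
    }
}
```

The bounded queue also gives natural back-pressure: if workers fall behind, the fetch thread blocks on put() instead of buffering unboundedly.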

3. Large-Scale Cluster Management Optimizations

Isolation strategies separate business domains, broker/controller roles, and priority tiers (VIP clusters). Full-link monitoring collects end-to-end metrics to pinpoint latency sources, while service lifecycle management keeps service and machine states in sync, reducing manual errors. TOR disaster recovery places the replicas of each partition on different racks, preserving availability during rack failures.
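The TOR disaster-recovery invariant can be checked mechanically: a partition survives a rack failure only if its replica set spans at least two racks. The checker below is a minimal sketch under that assumption; the broker-to-rack map and names are illustrative, not Meituan's tooling.

```java
import java.util.*;

// Sketch of a rack-awareness check for TOR disaster recovery.
public class RackAwarenessChecker {
    // True if the partition's replicas live under at least two distinct racks.
    static boolean survivesRackFailure(List<Integer> replicaBrokers,
                                       Map<Integer, String> brokerRack) {
        Set<String> racks = new HashSet<>();
        for (int broker : replicaBrokers) {
            racks.add(brokerRack.get(broker));
        }
        return racks.size() >= 2;
    }

    public static void main(String[] args) {
        Map<Integer, String> brokerRack = Map.of(
                1, "tor-a", 2, "tor-a", 3, "tor-b", 4, "tor-c");

        // Both replicas under tor-a: one switch failure loses the partition.
        System.out.println(survivesRackFailure(List.of(1, 2), brokerRack));
        // Replicas spread across tor-a and tor-b: a rack failure leaves a copy.
        System.out.println(survivesRackFailure(List.of(1, 3), brokerRack));
    }
}
```

Open-source Kafka expresses the same idea through the broker.rack setting, which the partition assigner uses to spread replicas across racks at creation time.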

4. Future Outlook

Ongoing work focuses on increasing robustness through finer-grained isolation, client-side fault avoidance, hot-swap capabilities, and network back-pressure. The team also plans to reduce the hardware footprint while maintaining throughput and to explore cloud-native deployment of the streaming service.

Tags: Performance Optimization, Big Data, Kafka, Cluster Management, Latency Reduction, Meituan
Written by

Java Architect Essentials

Committed to sharing quality articles and tutorials that help Java programmers progress from junior developer through mid-level to senior architect, curating high-quality learning resources, interview questions, videos, and projects from across the internet.
