How Meituan Scales Kafka to 7,500 Nodes: Real-World Optimizations and Lessons
Meituan’s data platform runs Kafka on over 7,500 machines, handling daily traffic exceeding 21 PB, and tackles latency, slow nodes, and massive cluster management through layered optimizations—including disk balancing, pipeline acceleration, fetcher isolation, cgroup isolation, SSD caching, isolation strategies, full‑link monitoring, lifecycle management, and TOR disaster recovery.
Scale and Challenges
Meituan’s data platform runs Kafka on >7,500 servers, with the largest single cluster containing 1,500 machines. Daily traffic exceeds 21 PB and 11.3 trillion records, peaking at 2.46 × 10⁸ messages / s. Kafka serves as a storage layer that caches and distributes binlog, user‑behavior, and business logs to downstream systems (ODS, real‑time analytics, DataLink, OLAP).
Two critical challenges arise at this scale:
Slow nodes – brokers whose 99th‑percentile read/write latency (tp99) exceeds 300 ms. Causes include uneven disk usage (hotspots), insufficient page‑cache capacity, and consumer thread‑model defects that distort latency metrics.
Cluster‑level operational complexity – traffic bursts across topics, incomplete broker‑level metrics, delayed fault detection, and rack‑level failures that can render partitions unavailable.
Read/Write Latency Optimizations
Application‑Layer Improvements
Disk balancing – a rebalancer component monitors broker disk usage, generates partition‑migration plans, and submits them to Zookeeper’s reassign node. Migration proceeds in batches.
Pipeline acceleration – allows later partitions to be submitted while an earlier one is stalled, eliminating long‑tail migration delays.
Migration cancellation – an admin command aborts long‑running migrations, preventing page‑cache pollution and enabling topic expansion during peak traffic.
Fetcher isolation – separates fetchers for ISR followers and non‑ISR followers so that heavy back‑track reads do not degrade real‑time reads.
Consumer Asynchrony
Kafka’s native NIO consumer uses a single thread, causing stalls when one broker responds faster than others. An asynchronous fetch thread continuously pulls ready data into a CompleteQueue, decoupling request issuance from consumption and limiting concurrent partitions to avoid GC/OOM pressure.
System‑Layer Enhancements
RAID‑0 cache cards – add hardware cache and merge small random writes into larger sequential blocks before hitting HDD, improving random‑write latency.
cgroup & CPU isolation – dedicate physical cores to Kafka and bind all hyper‑threads to the same NUMA node, preventing CPU‑intensive Flink/Storm jobs from contending for cache and eliminating cross‑NUMA latency.
SSD‑Based Tiered Caching Architecture
Two design options were evaluated: Direct‑IO (unsupported by Java) and an SSD tier between memory and HDD. The chosen architecture stores recent log segments on SSD ("only‑cache" state) while older segments reside on HDD ("cached" or "withoutCache" states). Background threads asynchronously sync SSD data to HDD; SSD space is reclaimed based on age thresholds. Replicas may optionally write to SSD, and reads from HDD never write back to SSD, avoiding cache pollution.
Large‑Scale Cluster Management Optimizations
Isolation strategy – business‑level isolation (separate clusters per major product), role‑level isolation (dedicated brokers, controllers, Zookeeper nodes), and priority‑level isolation (VIP clusters with extra redundancy).
Full‑link monitoring – collects fine‑grained metrics from all Kafka components, builds a unified dashboard, and automatically detects slow nodes or rack‑level failures. Some faults are auto‑handled via Kafka‑Manager events; others (e.g., zombie ZK nodes) trigger manual alerts.
Service lifecycle management – an automated state machine links service and machine lifecycles, eliminating ambiguous status semantics and preventing manual mis‑operations.
TOR disaster recovery – improves replica placement to ensure replicas of a partition are spread across different racks, guaranteeing partition availability even when an entire rack fails.
Future Outlook
Planned work includes:
Higher availability via client‑side fault avoidance and multi‑queue server isolation.
Quorum‑write optimization to strengthen durability (current ack=1 writes to all replicas).
Unified stream‑batch storage architecture such as “Kafka on HDFS”.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
