Operations 23 min read

How Meituan Scales Kafka to 7,500 Nodes: Real-World Optimizations and Lessons

Meituan’s data platform runs Kafka on over 7,500 machines, handling daily traffic exceeding 21 PB, and tackles latency, slow nodes, and massive cluster management through layered optimizations—including disk balancing, pipeline acceleration, fetcher isolation, cgroup isolation, SSD caching, isolation strategies, full‑link monitoring, lifecycle management, and TOR disaster recovery.

dbaplus Community

Mar 22, 2022

How Meituan Scales Kafka to 7,500 Nodes: Real-World Optimizations and Lessons

Scale and Challenges

Meituan’s data platform runs Kafka on >7,500 servers, with the largest single cluster containing 1,500 machines. Daily traffic exceeds 21 PB and 11.3 trillion records, peaking at 2.46 × 10⁸ messages / s. Kafka serves as a storage layer that caches and distributes binlog, user‑behavior, and business logs to downstream systems (ODS, real‑time analytics, DataLink, OLAP).

Two critical challenges arise at this scale:

Slow nodes – brokers whose 99th‑percentile read/write latency (tp99) exceeds 300 ms. Causes include uneven disk usage (hotspots), insufficient page‑cache capacity, and consumer thread‑model defects that distort latency metrics.

Cluster‑level operational complexity – traffic bursts across topics, incomplete broker‑level metrics, delayed fault detection, and rack‑level failures that can render partitions unavailable.

Read/Write Latency Optimizations

Application‑Layer Improvements

Disk balancing – a rebalancer component monitors broker disk usage, generates partition‑migration plans, and submits them to Zookeeper’s reassign node. Migration proceeds in batches.

Pipeline acceleration – allows later partitions to be submitted while an earlier one is stalled, eliminating long‑tail migration delays.

Migration cancellation – an admin command aborts long‑running migrations, preventing page‑cache pollution and enabling topic expansion during peak traffic.

Fetcher isolation – separates fetchers for ISR followers and non‑ISR followers so that heavy back‑track reads do not degrade real‑time reads.

Consumer Asynchrony

Kafka’s native NIO consumer uses a single thread, causing stalls when one broker responds faster than others. An asynchronous fetch thread continuously pulls ready data into a CompleteQueue, decoupling request issuance from consumption and limiting concurrent partitions to avoid GC/OOM pressure.

System‑Layer Enhancements

RAID‑0 cache cards – add hardware cache and merge small random writes into larger sequential blocks before hitting HDD, improving random‑write latency.

cgroup & CPU isolation – dedicate physical cores to Kafka and bind all hyper‑threads to the same NUMA node, preventing CPU‑intensive Flink/Storm jobs from contending for cache and eliminating cross‑NUMA latency.

SSD‑Based Tiered Caching Architecture

Two design options were evaluated: Direct‑IO (unsupported by Java) and an SSD tier between memory and HDD. The chosen architecture stores recent log segments on SSD ("only‑cache" state) while older segments reside on HDD ("cached" or "withoutCache" states). Background threads asynchronously sync SSD data to HDD; SSD space is reclaimed based on age thresholds. Replicas may optionally write to SSD, and reads from HDD never write back to SSD, avoiding cache pollution.

Large‑Scale Cluster Management Optimizations

Isolation strategy – business‑level isolation (separate clusters per major product), role‑level isolation (dedicated brokers, controllers, Zookeeper nodes), and priority‑level isolation (VIP clusters with extra redundancy).

Full‑link monitoring – collects fine‑grained metrics from all Kafka components, builds a unified dashboard, and automatically detects slow nodes or rack‑level failures. Some faults are auto‑handled via Kafka‑Manager events; others (e.g., zombie ZK nodes) trigger manual alerts.

Service lifecycle management – an automated state machine links service and machine lifecycles, eliminating ambiguous status semantics and preventing manual mis‑operations.

TOR disaster recovery – improves replica placement to ensure replicas of a partition are spread across different racks, guaranteeing partition availability even when an entire rack fails.

Future Outlook

Planned work includes:

Higher availability via client‑side fault avoidance and multi‑queue server isolation.

Quorum‑write optimization to strengthen durability (current ack=1 writes to all replicas).

Unified stream‑batch storage architecture such as “Kafka on HDFS”.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

kafka Data Platform Large-scale systems

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.