Optimizing Kafka for Large-Scale Data Platforms at Meituan
The article details Meituan's massive Kafka deployment—over 15,000 machines handling more than 30 PB of daily data—its performance and management challenges, and the comprehensive application‑layer, system‑layer, and hybrid‑layer optimizations Meituan implemented to reduce read/write latency and improve large‑scale cluster reliability.
1. Current State and Challenges
Kafka serves as the unified data cache and distribution layer in Meituan's data platform, ingesting logs from system, client, and business databases and forwarding them to downstream consumers such as ODS warehouses, real‑time compute jobs, DataLink sync, and OLAP analysis.
The cluster comprises more than 15,000 machines, with a single cluster reaching over 2,000 machines. Daily message volume exceeds 30 PB, with peak throughput of over 400 million messages per second. As the cluster grew, two major challenge categories emerged.
1.1 Slow Nodes Impacting Read/Write
Slow nodes are defined as brokers whose 99th‑percentile read/write latency (TP99) exceeds 300 ms. Causes include:
Load imbalance leading to hot disks or I/O utilization spikes on a subset of disks.
Insufficient PageCache capacity (e.g., an 80 GB cache can hold only ~8 minutes of data at 170 MB/s), causing cache misses for older data.
Consumer client thread‑model defects: when multiple partitions of a consumer reside on the same broker, TP90 may stay below 100 ms, but spreading partitions across different brokers can push TP90 above 1 s.
1.2 Complexity of Large‑Scale Cluster Management
Four problem classes were identified:
Inter‑topic interference—traffic spikes on one topic or back‑fill reads on a consumer can destabilize the whole cluster.
Broker‑level metrics are insufficient for root‑cause analysis.
Fault detection is slow and remediation costly.
Rack‑level failures can render partitions unavailable.
2. Read/Write Latency Optimizations
Optimizations are divided into application‑layer and system‑layer improvements, followed by a hybrid SSD cache architecture.
2.1 Overview
The diagram (Fig 2‑1) maps latency‑affecting factors to corresponding solutions.
2.2 Application Layer
① Disk Balancing
Imbalanced disk usage creates hot spots that raise real‑time latency (TP99 > 300 ms) and waste cluster capacity. Meituan introduced a Rebalancer that generates migration plans based on target and current disk usage reported by Kafka Monitor, submits the plan to Zookeeper’s Reassign node, and lets the Controller propagate the plan to all brokers. The Rebalancer then monitors execution, moving partitions from overloaded disks to under‑utilized ones to achieve a balanced three‑partitions‑per‑disk state.
② Migration Optimization
Pipeline acceleration: allows new partition migration batches to be submitted even while a previous batch is blocked, reducing overall migration time.
Migration cancel: provides an admin command to abort ongoing migrations, preventing PageCache pollution and allowing partition expansion to complete.
Fetcher isolation: separates ISR followers from non‑ISR followers into distinct fetcher threads, so heavy back‑fill reads do not slow down real‑time reads.
③ Consumer Asynchrony
The native Kafka consumer uses a single NIO thread, causing response‑time growth when some partitions become ready earlier than others. Meituan added an asynchronous fetch thread that continuously pulls ready data into a CompleteQueue, limiting concurrent fetches to avoid GC and OOM, and thus decoupling request latency from the single‑thread bottleneck.
2.3 System Layer
① RAID Card Acceleration
HDD random‑write performance is poor. By inserting a RAID controller with its own cache, write requests are merged into larger blocks, fully utilizing HDD sequential bandwidth and preserving random‑write performance.
② Cgroup Isolation
Kafka (IO‑intensive) and Flink/Storm (CPU‑intensive) are co‑located. Physical‑core contention and cross‑NUMA latency (≈40 ns) caused Kafka latency spikes when CPU load surged. Meituan’s new isolation policy gives Kafka exclusive cores and keeps all its hyper‑threads on the same NUMA node, eliminating cross‑NUMA delays and CPU‑core competition.
2.4 Hybrid Layer – SSD New Cache Architecture
When PageCache is exhausted, ZeroCopy falls back to disk reads, polluting the cache and slowing both back‑fill and real‑time consumers.
Two design families were evaluated:
OS‑level cache (DirectIO not supported in Java; OpenCAS/FlashCache) – transparent but does not fully address Kafka’s access pattern.
Application‑level tiered storage – hot data on SSD, cold data on HDD, with explicit offset‑based reads.
Meituan chose the latter: logs are written to SSD for recent segments and to HDD for older segments. Each segment can be in one of three states: Only‑Cache (on SSD only), Cached (present on both SSD and HDD), or Without‑Cache (only on HDD). Implementation steps:
Place segments on devices according to their age.
Run a background thread to sync SSD data to HDD.
When SSD usage reaches a threshold, delete the oldest segments.
Replica configuration can enable or disable SSD writes per availability needs.
Reads from HDD never write back to SSD, preventing cache pollution.
Additional refinements include syncing only inactive segments and rate‑limiting SSD‑to‑HDD transfers to protect overall I/O.
3. Large‑Scale Cluster Management Optimizations
3.1 Isolation Strategy
Three‑dimensional isolation:
Business isolation – each major business (e.g., delivery, in‑store, selection) runs on an independent Kafka cluster.
Role isolation – brokers, controllers, and Zookeeper are deployed on separate machines.
Priority isolation – high‑availability topics are placed in VIP clusters with extra resource redundancy.
3.2 Full‑Link Monitoring
Broker‑level latency metrics (TP99/TP999) no longer reflect end‑to‑end user experience, and fault detection is delayed. Meituan built a full‑link monitoring system that collects metrics and logs from all core components, visualizes the end‑to‑end latency breakdown, and automatically alerts on anomalies. An example shows RemoteTime dominating the latency, indicating data replication as the bottleneck.
3.3 Service Lifecycle Management
To eliminate ambiguous state semantics and manual errors, a lifecycle management mechanism ties service state to machine state, driven solely by automated operations and prohibiting manual changes.
3.4 TOR Disaster Recovery
Replica placement is enforced across different racks so that a rack failure does not render a partition unavailable.
4. Future Outlook
While latency reductions have been substantial, robustness remains a focus. Planned work includes finer‑grained isolation to shrink fault domains, client‑side avoidance of faulty nodes, multi‑queue request isolation, hot decommission, network‑level back‑pressure and rate‑limiting, and reducing the machine footprint (targeting one‑quarter of current machines) without sacrificing traffic capacity. Additionally, Meituan is exploring cloud‑native migration of its streaming storage services.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
