Big Data 25 min read

Optimizing Kafka for Large-Scale Data Platforms at Meituan

The article details Meituan's massive Kafka deployment—over 15,000 machines handling more than 30 PB of daily data—its performance and management challenges, and the comprehensive application‑layer, system‑layer, and hybrid‑layer optimizations Meituan implemented to reduce read/write latency and improve large‑scale cluster reliability.

Meituan Technology Team
Meituan Technology Team
Meituan Technology Team
Optimizing Kafka for Large-Scale Data Platforms at Meituan

1. Current State and Challenges

Kafka serves as the unified data cache and distribution layer in Meituan's data platform, ingesting logs from system, client, and business databases and forwarding them to downstream consumers such as ODS warehouses, real‑time compute jobs, DataLink sync, and OLAP analysis.

The cluster comprises more than 15,000 machines, with a single cluster reaching over 2,000 machines. Daily message volume exceeds 30 PB, with peak throughput of over 400 million messages per second. As the cluster grew, two major challenge categories emerged.

1.1 Slow Nodes Impacting Read/Write

Slow nodes are defined as brokers whose 99th‑percentile read/write latency (TP99) exceeds 300 ms. Causes include:

Load imbalance leading to hot disks or I/O utilization spikes on a subset of disks.

Insufficient PageCache capacity (e.g., an 80 GB cache can hold only ~8 minutes of data at 170 MB/s), causing cache misses for older data.

Consumer client thread‑model defects: when multiple partitions of a consumer reside on the same broker, TP90 may stay below 100 ms, but spreading partitions across different brokers can push TP90 above 1 s.

1.2 Complexity of Large‑Scale Cluster Management

Four problem classes were identified:

Inter‑topic interference—traffic spikes on one topic or back‑fill reads on a consumer can destabilize the whole cluster.

Broker‑level metrics are insufficient for root‑cause analysis.

Fault detection is slow and remediation costly.

Rack‑level failures can render partitions unavailable.

2. Read/Write Latency Optimizations

Optimizations are divided into application‑layer and system‑layer improvements, followed by a hybrid SSD cache architecture.

2.1 Overview

The diagram (Fig 2‑1) maps latency‑affecting factors to corresponding solutions.

2.2 Application Layer

① Disk Balancing

Imbalanced disk usage creates hot spots that raise real‑time latency (TP99 > 300 ms) and waste cluster capacity. Meituan introduced a Rebalancer that generates migration plans based on target and current disk usage reported by Kafka Monitor, submits the plan to Zookeeper’s Reassign node, and lets the Controller propagate the plan to all brokers. The Rebalancer then monitors execution, moving partitions from overloaded disks to under‑utilized ones to achieve a balanced three‑partitions‑per‑disk state.

② Migration Optimization

Pipeline acceleration: allows new partition migration batches to be submitted even while a previous batch is blocked, reducing overall migration time.

Migration cancel: provides an admin command to abort ongoing migrations, preventing PageCache pollution and allowing partition expansion to complete.

Fetcher isolation: separates ISR followers from non‑ISR followers into distinct fetcher threads, so heavy back‑fill reads do not slow down real‑time reads.

③ Consumer Asynchrony

The native Kafka consumer uses a single NIO thread, causing response‑time growth when some partitions become ready earlier than others. Meituan added an asynchronous fetch thread that continuously pulls ready data into a CompleteQueue, limiting concurrent fetches to avoid GC and OOM, and thus decoupling request latency from the single‑thread bottleneck.

2.3 System Layer

① RAID Card Acceleration

HDD random‑write performance is poor. By inserting a RAID controller with its own cache, write requests are merged into larger blocks, fully utilizing HDD sequential bandwidth and preserving random‑write performance.

② Cgroup Isolation

Kafka (IO‑intensive) and Flink/Storm (CPU‑intensive) are co‑located. Physical‑core contention and cross‑NUMA latency (≈40 ns) caused Kafka latency spikes when CPU load surged. Meituan’s new isolation policy gives Kafka exclusive cores and keeps all its hyper‑threads on the same NUMA node, eliminating cross‑NUMA delays and CPU‑core competition.

2.4 Hybrid Layer – SSD New Cache Architecture

When PageCache is exhausted, ZeroCopy falls back to disk reads, polluting the cache and slowing both back‑fill and real‑time consumers.

Two design families were evaluated:

OS‑level cache (DirectIO not supported in Java; OpenCAS/FlashCache) – transparent but does not fully address Kafka’s access pattern.

Application‑level tiered storage – hot data on SSD, cold data on HDD, with explicit offset‑based reads.

Meituan chose the latter: logs are written to SSD for recent segments and to HDD for older segments. Each segment can be in one of three states: Only‑Cache (on SSD only), Cached (present on both SSD and HDD), or Without‑Cache (only on HDD). Implementation steps:

Place segments on devices according to their age.

Run a background thread to sync SSD data to HDD.

When SSD usage reaches a threshold, delete the oldest segments.

Replica configuration can enable or disable SSD writes per availability needs.

Reads from HDD never write back to SSD, preventing cache pollution.

Additional refinements include syncing only inactive segments and rate‑limiting SSD‑to‑HDD transfers to protect overall I/O.

3. Large‑Scale Cluster Management Optimizations

3.1 Isolation Strategy

Three‑dimensional isolation:

Business isolation – each major business (e.g., delivery, in‑store, selection) runs on an independent Kafka cluster.

Role isolation – brokers, controllers, and Zookeeper are deployed on separate machines.

Priority isolation – high‑availability topics are placed in VIP clusters with extra resource redundancy.

3.2 Full‑Link Monitoring

Broker‑level latency metrics (TP99/TP999) no longer reflect end‑to‑end user experience, and fault detection is delayed. Meituan built a full‑link monitoring system that collects metrics and logs from all core components, visualizes the end‑to‑end latency breakdown, and automatically alerts on anomalies. An example shows RemoteTime dominating the latency, indicating data replication as the bottleneck.

3.3 Service Lifecycle Management

To eliminate ambiguous state semantics and manual errors, a lifecycle management mechanism ties service state to machine state, driven solely by automated operations and prohibiting manual changes.

3.4 TOR Disaster Recovery

Replica placement is enforced across different racks so that a rack failure does not render a partition unavailable.

4. Future Outlook

While latency reductions have been substantial, robustness remains a focus. Planned work includes finer‑grained isolation to shrink fault domains, client‑side avoidance of faulty nodes, multi‑queue request isolation, hot decommission, network‑level back‑pressure and rate‑limiting, and reducing the machine footprint (targeting one‑quarter of current machines) without sacrificing traffic capacity. Additionally, Meituan is exploring cloud‑native migration of its streaming storage services.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

KafkaData PlatformCluster ManagementMeituanFull‑Link MonitoringRead‑Write LatencySSD Cache
Meituan Technology Team
Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.