Optimizing Kafka at Meituan: Challenges and Solutions for a Large‑Scale Data Platform
This article details Meituan's use of Kafka as a unified data cache and distribution layer, outlines the challenges of massive scale and latency, and presents comprehensive optimizations across application, system, and cluster management layers, including disk balancing, migration acceleration, fetcher isolation, and full‑link monitoring.
Introduction
Kafka serves as the unified data cache and distribution component in Meituan's data platform. With rapid data growth and expanding cluster size, Kafka faces increasingly severe challenges.
1. Current Status and Challenges
Meituan's Kafka deployment exceeds 15,000 machines, with single clusters of more than 2,000 nodes. The platform processes over 30 PB of data daily, peaking at more than 400 million messages per second. The two main challenges are slow nodes that inflate read/write latency and the operational complexity of managing clusters at this scale.
2. Read/Write Latency Optimizations
2.1 Overview
Latency issues are addressed at both the application and system layers. Solutions include disk rebalancing, migration pipeline acceleration, migration cancellation, fetcher isolation, and an asynchronous consumer model.
2.2 Application Layer
Disk imbalance: generate and submit rebalancing plans via a Rebalancer component to evenly distribute partitions.
Migration optimization: pipeline acceleration, migration cancellation, and fetcher isolation to prevent migration from blocking real‑time reads.
Consumer asynchronous model: introduce an async fetch thread to pull data continuously, avoiding the single‑thread bottleneck of the native Kafka consumer.
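The asynchronous consumer model above can be sketched as a dedicated fetch thread that continuously pulls batches into a bounded queue, so record processing never stalls the network path. This is a minimal, self-contained illustration with a simulated fetch source; the class, `fetchBatch`, and the queue capacity are hypothetical, not Meituan's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncFetchConsumer {
    // Bounded queue decouples the fetch thread from the processing thread
    // and applies backpressure when processing falls behind.
    private final BlockingQueue<List<String>> batches = new LinkedBlockingQueue<>(100);
    private final AtomicInteger processed = new AtomicInteger();

    // Simulated network fetch; in a real consumer this would wrap poll().
    private List<String> fetchBatch(int seq) {
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < 10; i++) batch.add("msg-" + seq + "-" + i);
        return batch;
    }

    public int run(int numBatches) throws InterruptedException {
        Thread fetcher = new Thread(() -> {
            try {
                for (int seq = 0; seq < numBatches; seq++) {
                    batches.put(fetchBatch(seq)); // blocks when queue is full
                }
            } catch (InterruptedException ignored) { }
        });
        fetcher.start();
        // Processing loop runs concurrently with fetching.
        for (int n = 0; n < numBatches; n++) {
            List<String> batch = batches.take();
            processed.addAndGet(batch.size());
        }
        fetcher.join();
        return processed.get();
    }
}
```

The point of the pattern is that fetch latency and processing latency overlap instead of adding up, which is what removes the single-thread bottleneck of the stock consumer loop.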
2.3 System Layer
PageCache pollution mitigation.
HDD random‑read performance improvement using RAID cards.
Cgroup isolation to avoid resource contention between I/O‑intensive Kafka and CPU‑intensive Flink/Storm workloads.
SSD‑based cache architecture to separate hot data on SSD from cold data on HDD, preventing cache pollution.
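The hot/cold separation idea can be sketched as a tiered read path: offsets above a "hot" watermark are served from an SSD-backed store, while lagging consumers reading old offsets go straight to HDD and never evict hot data. All names here are illustrative, and the watermark is a stand-in for whatever policy the real system uses:

```java
import java.util.Map;
import java.util.TreeMap;

public class TieredLogReader {
    private final TreeMap<Long, String> ssdStore = new TreeMap<>(); // hot segments
    private final TreeMap<Long, String> hddStore = new TreeMap<>(); // cold segments
    private final long hotWatermark; // offsets >= watermark are considered hot

    public TieredLogReader(long hotWatermark) { this.hotWatermark = hotWatermark; }

    public void append(long offset, String record) {
        if (offset >= hotWatermark) ssdStore.put(offset, record);
        else hddStore.put(offset, record);
    }

    // Hot reads hit the SSD tier; cold catch-up reads hit the HDD tier
    // directly instead of polluting the cache shared with hot consumers.
    public String read(long offset) {
        Map<Long, String> tier = offset >= hotWatermark ? ssdStore : hddStore;
        return tier.get(offset);
    }

    public String tierOf(long offset) {
        return offset >= hotWatermark ? "SSD" : "HDD";
    }
}
```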
3. Large‑Scale Cluster Management Optimizations
3.1 Isolation Strategy
Business isolation: separate Kafka clusters per major business (e.g., delivery, in‑store, selection).
Role isolation: deploy brokers, controllers, and Zookeeper on distinct machines.
Priority isolation: place high‑availability topics on dedicated VIP clusters.
3.2 Full‑Link Monitoring
Collect and monitor metrics from all Kafka components. When a client request slows, the full‑link view pinpoints the bottleneck (e.g., RemoteTime, RequestQueueTime) and enables rapid fault detection.
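Kafka's broker metrics already decompose a request's total time into stages such as RequestQueueTime, LocalTime, RemoteTime, ResponseQueueTime, and ResponseSendTime; a full-link view essentially surfaces the dominant stage for a slow request. A minimal sketch of that bottleneck lookup (the sample numbers are invented for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RequestLatencyBreakdown {
    // Given per-stage timings (ms) for one request, return the slowest stage.
    // Stage names follow Kafka's request-time metric naming.
    public static String bottleneck(Map<String, Long> stageMillis) {
        return stageMillis.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("none");
    }

    public static void main(String[] args) {
        Map<String, Long> stages = new LinkedHashMap<>();
        stages.put("RequestQueueTime", 2L);
        stages.put("LocalTime", 5L);
        stages.put("RemoteTime", 180L); // e.g. waiting on follower replication
        stages.put("ResponseQueueTime", 1L);
        stages.put("ResponseSendTime", 3L);
        System.out.println(bottleneck(stages));
    }
}
```

In practice a dominant RemoteTime points at replication or slow followers, while a dominant RequestQueueTime points at an overloaded broker, which is exactly the kind of triage the full-link view enables.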
4. Service Lifecycle Management
Introduce a lifecycle management mechanism that ties service state to machine state, automating status changes and preventing manual errors.
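One way to picture such a mechanism is a state machine that only permits legal transitions, so an operator cannot, say, take a broker straight from serving traffic to offline. The states and transition table below are illustrative; Meituan's actual state model is not public:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class BrokerLifecycle {
    // Hypothetical lifecycle states for a broker machine.
    public enum State { INIT, ONLINE, MAINTENANCE, DECOMMISSIONING, OFFLINE }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.INIT, EnumSet.of(State.ONLINE));
        ALLOWED.put(State.ONLINE, EnumSet.of(State.MAINTENANCE, State.DECOMMISSIONING));
        ALLOWED.put(State.MAINTENANCE, EnumSet.of(State.ONLINE, State.DECOMMISSIONING));
        ALLOWED.put(State.DECOMMISSIONING, EnumSet.of(State.OFFLINE));
        ALLOWED.put(State.OFFLINE, EnumSet.noneOf(State.class));
    }

    private State state = State.INIT;

    // Rejects transitions that a manual change could otherwise trigger by mistake.
    public boolean transition(State target) {
        if (!ALLOWED.get(state).contains(target)) return false;
        state = target;
        return true;
    }

    public State state() { return state; }
}
```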
5. TOR Disaster Recovery
Ensure replicas of the same partition reside on different racks, guaranteeing availability even if an entire rack fails.
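The rack-spreading constraint can be sketched as interleaving brokers rack by rack and then striding through that list per partition, so consecutive replica picks land on different racks. This is a simplified sketch assuming equally sized racks and a replica count no larger than the rack count; Kafka's real rack-aware assignment also balances leadership and load:

```java
import java.util.ArrayList;
import java.util.List;

public class RackAwareAssigner {
    // racks: list of racks, each a list of broker IDs.
    // Returns, for each partition, the brokers holding its replicas.
    public static List<List<String>> assign(List<List<String>> racks,
                                            int partitions, int replicas) {
        // Interleave brokers rack by rack so consecutive picks alternate racks.
        List<String> ordered = new ArrayList<>();
        int maxLen = racks.stream().mapToInt(List::size).max().orElse(0);
        for (int i = 0; i < maxLen; i++)
            for (List<String> rack : racks)
                if (i < rack.size()) ordered.add(rack.get(i));

        List<List<String>> assignment = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            List<String> rs = new ArrayList<>();
            for (int r = 0; r < replicas; r++)
                rs.add(ordered.get((p + r) % ordered.size()));
            assignment.add(rs);
        }
        return assignment;
    }
}
```

With three racks of two brokers each and three replicas per partition, every partition's replica set spans all three racks, so losing one top-of-rack switch leaves two live replicas.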
Future Outlook
Future work focuses on improving robustness through finer‑grained isolation, client‑side fault avoidance, multi‑queue request segregation, hot‑swap capabilities, and cloud‑native deployment of Kafka.