
How Meituan Optimized Kafka for Massive Scale: Reducing Latency and Managing Clusters

This article describes the challenges Meituan faced in running a Kafka deployment of more than 15,000 nodes, and explains the application‑layer and system‑layer optimizations (disk balancing, migration pipeline acceleration, fetcher isolation, RAID acceleration, cgroup isolation, and an SSD‑based cache) that together cut read/write latency and simplify large‑scale cluster management.

Java High-Performance Architecture

Preface

Kafka serves as the unified data caching and distribution layer in Meituan's data platform. As data volume and cluster size grow, Kafka faces increasingly serious challenges. This article describes those challenges and the targeted optimizations made in response.

1. Current Situation and Challenges

1.1 Current Situation

Kafka is used as the streaming storage layer: it caches logs from systems, clients, and business databases and distributes them to downstream consumers for offline computing, real‑time computing, the log center, and OLAP. The deployment as a whole exceeds 15,000 machines, with the largest single cluster reaching 2,000 machines. Daily message volume exceeds 30 PB, with a peak of 4 billion messages per second.

1.2 Challenges

Two main challenges:

Slow nodes degrade reads and writes. A slow node is a broker whose read/write TP99 latency exceeds 300 ms. The causes are load imbalance, PageCache capacity limits, and defects in the consumer thread model.

Large‑scale clusters are complex to manage: topics interfere with one another, broker metrics are insufficient, fault detection is slow, and failures can occur at the rack level.
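The slow-node criterion above (TP99 > 300 ms) can be sketched in a few lines. This is only an illustration: the function names, the nearest-rank percentile method, and the shape of the input (a mapping from broker id to latency samples in milliseconds) are assumptions, not details from the article.

```python
SLOW_TP99_MS = 300.0  # threshold quoted in the article


def tp99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slow_brokers(samples_by_broker, threshold_ms=SLOW_TP99_MS):
    """Return the sorted ids of brokers whose TP99 exceeds the threshold."""
    return sorted(
        broker
        for broker, samples in samples_by_broker.items()
        if samples and tp99(samples) > threshold_ms
    )
```

A single outlier in 100 samples does not flag a broker under this rule, while two outliers do, because the 99th-ranked sample is then itself slow.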

2. Read/Write Latency Optimizations

Optimizations are divided into application‑layer and system‑layer.

2.1 Overview

A diagram in the original article maps each latency factor to its corresponding solution.

2.2 Application Layer

Issues: broker load imbalance, inefficient data migration, consumer single‑thread model.

Disk balancing: a Rebalancer generates migration plans from the target and current disk usage, submits them to Zookeeper, and monitors their execution.

Migration optimizations: pipeline acceleration, migration cancellation, and Fetcher isolation, the last of which separates migration traffic from real‑time reads.

Consumer asynchronous pulling: asynchronous fetch threads with limited concurrency remove the single‑thread bottleneck.
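To make the disk-balancing step more concrete, here is a minimal sketch of how a migration plan might be derived from current disk usage. It is a simple greedy heuristic chosen for illustration, not Meituan's Rebalancer: the function name, the input layout, and the "never overshoot the midpoint" rule are all assumptions, and submitting the plan to Zookeeper is left out entirely.

```python
def plan_migrations(disks):
    """Greedy sketch. disks: {disk_id: {partition: size}} -> [(partition, src, dst)]."""
    layout = {d: dict(parts) for d, parts in disks.items()}  # work on a copy
    plan = []
    while True:
        used = {d: sum(parts.values()) for d, parts in layout.items()}
        src = max(used, key=used.get)  # fullest disk
        dst = min(used, key=used.get)  # emptiest disk
        gap = used[src] - used[dst]
        # Only consider partitions that would not overshoot the midpoint,
        # so every move strictly reduces the imbalance and the loop terminates.
        candidates = [
            (size, p) for p, size in layout[src].items() if 0 < size <= gap / 2
        ]
        if not candidates:
            return plan
        size, partition = max(candidates)  # move the largest safe partition
        layout[dst][partition] = layout[src].pop(partition)
        plan.append((partition, src, dst))
```

With three disks holding 80, 20, and 0 units, this sketch moves the 40-unit partition to the empty disk and a 10-unit partition to the second disk, ending at 30, 30, and 40. A production planner would also have to account for replica placement constraints and migration cost, which this sketch ignores.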

Pipeline acceleration lets later partitions in a migration plan be submitted even while an earlier one is still blocked, so a single slow partition no longer stalls the rest of the plan.
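The benefit of pipelining can be illustrated with a small simulation. It assumes, as a simplification not stated in the article, that the non-pipelined mode submits partitions in fixed-size batches and waits for the whole batch, while the pipelined mode refills a slot as soon as any migration finishes. The function names and the `width` parameter are hypothetical.

```python
import heapq


def batch_makespan(durations, width):
    """Submit `width` partitions at a time and wait for the whole batch."""
    return sum(
        max(durations[i:i + width]) for i in range(0, len(durations), width)
    )


def pipeline_makespan(durations, width):
    """Keep up to `width` migrations in flight; refill a slot when one ends."""
    in_flight = []  # min-heap of finish times
    now = 0
    for d in durations:
        if len(in_flight) == width:
            now = heapq.heappop(in_flight)  # earliest finisher frees a slot
        heapq.heappush(in_flight, now + d)
    return max(in_flight, default=0)
```

For durations `[10, 1, 1, 1, 1, 1]` and a width of 2, the batched mode takes 12 time units because the first batch waits on the slow partition, while the pipelined mode takes 10 because the short migrations proceed alongside it.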

Migration cancellation prevents long‑tail partitions from blocking read/write requests.

Fetcher isolation handles ISR followers and non‑ISR followers separately, so that real‑time reads are not slowed down by large backtracking reads of historical data.
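The routing decision behind this kind of isolation can be sketched as follows. The pool names, the `FollowerReplica` type, and the ISR lookup table are hypothetical; this only shows the idea of classifying followers by ISR membership, not how Meituan changed the broker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FollowerReplica:
    partition: str
    broker_id: int


def assign_fetcher_pool(replica, isr_by_partition):
    """Pick a pool name based on whether the follower is in the partition's ISR."""
    in_sync = replica.broker_id in isr_by_partition.get(replica.partition, set())
    return "isr-fetcher" if in_sync else "lagging-fetcher"


def split_by_pool(replicas, isr_by_partition):
    """Group followers so lagging ones never share a pool with in-sync ones."""
    pools = {"isr-fetcher": [], "lagging-fetcher": []}
    for replica in replicas:
        pools[assign_fetcher_pool(replica, isr_by_partition)].append(replica)
    return pools
```

Because each pool would be served by its own threads, a follower that is catching up on a large backlog can only delay other lagging followers, not the in-sync replication path.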

Consumer asynchronous pulling adds a background thread that continuously fetches data as it becomes ready and places it into a queue of completed fetches, which reduces the latency caused by the native single‑threaded NIO model.
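The pattern described above (a background fetch thread feeding a queue that the consumer polls) can be sketched generically. This is not Kafka client code: `AsyncFetcher`, `fetch_fn`, and `max_buffered` are hypothetical names, and the bounded queue stands in for whatever mechanism limits how far the fetcher runs ahead.

```python
import queue
import threading


class AsyncFetcher:
    """Background thread that prefetches batches into a bounded queue."""

    def __init__(self, fetch_fn, max_buffered=4):
        # fetch_fn is a hypothetical callable returning a batch, or None when done.
        self._fetch_fn = fetch_fn
        # Bounded queue: the fetch thread blocks when the consumer falls behind.
        self._complete = queue.Queue(maxsize=max_buffered)
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            batch = self._fetch_fn()
            self._complete.put(batch)  # blocks while the queue is full
            if batch is None:  # sentinel: source exhausted
                return

    def poll(self, timeout=None):
        """Return the next batch, or None once the source is exhausted."""
        return self._complete.get(timeout=timeout)
```

The consumer thread only ever takes already-fetched data from the queue, so the time it spends processing one batch overlaps with the network fetch of the next.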

2.3 System Layer

RAID card acceleration improves HDD random‑write performance by caching writes and merging them into larger blocks.
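The write-merging idea can be illustrated in software, although a real RAID controller does this in its own cache. The sketch below assumes writes are given as hypothetical `(offset, length)` pairs and coalesces any that touch or overlap into a single larger write.

```python
def merge_writes(writes):
    """Coalesce (offset, length) writes that are contiguous or overlapping."""
    merged = []
    for offset, length in sorted(writes):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # Extends or overlaps the previous range: grow it in place.
            start, prev_len = merged[-1]
            end = max(start + prev_len, offset + length)
            merged[-1] = (start, end - start)
        else:
            merged.append((offset, length))
    return merged
```

Three adjacent 4 KiB writes arriving out of order become one 12 KiB sequential write, which is the kind of access pattern an HDD handles far better than scattered small writes.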

Cgroup isolation prevents resource contention between IO‑intensive Kafka and CPU‑intensive Flink/Storm jobs on the same machine, and pins Kafka to a single NUMA node.

The SSD‑based cache architecture stores hot segments on SSD and keeps older data on HDD. Data read from HDD is not written back to SSD. Each segment is in one of three states: OnlyCache, Cached, or WithoutCache. An asynchronous thread syncs SSD data to HDD, and when SSD usage reaches a space threshold, the oldest data is evicted.
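The three segment states suggest a small state machine. The sketch below assumes the following meanings, inferred from the state names rather than spelled out in this article: OnlyCache means the segment exists only on SSD, Cached means it is on both SSD and HDD, and WithoutCache means it has been evicted from SSD and lives only on HDD. The class and method names are hypothetical, and `sync_to_hdd` is a synchronous stand-in for the asynchronous sync thread.

```python
from collections import OrderedDict
from enum import Enum


class SegmentState(Enum):
    ONLY_CACHE = "OnlyCache"        # assumed: on SSD only, not yet synced
    CACHED = "Cached"               # assumed: on SSD and on HDD
    WITHOUT_CACHE = "WithoutCache"  # assumed: evicted from SSD, HDD only


class SsdCache:
    def __init__(self, ssd_capacity):
        self.ssd_capacity = ssd_capacity
        self.segments = OrderedDict()  # name -> [size, state], oldest first

    def append_segment(self, name, size):
        """New data always lands on SSD first."""
        self.segments[name] = [size, SegmentState.ONLY_CACHE]

    def sync_to_hdd(self):
        """Stand-in for the asynchronous SSD-to-HDD sync thread."""
        for entry in self.segments.values():
            if entry[1] is SegmentState.ONLY_CACHE:
                entry[1] = SegmentState.CACHED

    def ssd_used(self):
        return sum(
            size
            for size, state in self.segments.values()
            if state is not SegmentState.WITHOUT_CACHE
        )

    def evict(self):
        """Drop the oldest synced segments from SSD until usage fits capacity."""
        for entry in self.segments.values():
            if self.ssd_used() <= self.ssd_capacity:
                break
            if entry[1] is SegmentState.CACHED:  # never drop unsynced data
                entry[1] = SegmentState.WITHOUT_CACHE

    def state(self, name):
        return self.segments[name][1]
```

Note that there is no transition from WithoutCache back to Cached, which mirrors the rule that data read from HDD is not written back to SSD.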

3. Large‑Scale Cluster Management Optimizations

3.1 Isolation Strategies

Three isolation strategies are applied: business isolation (separate clusters per business), role isolation (brokers, controllers, and Zookeeper deployed separately), and priority isolation (VIP clusters for topics with high availability requirements).

3.2 End‑to‑End Monitoring

Collect metrics and logs from core components, locate bottlenecks via full‑link monitoring, and automate fault detection and handling.

3.3 Service Lifecycle Management

Automated lifecycle management links service and machine states, prevents manual state changes, and ensures consistent status updates.

3.4 TOR Disaster Recovery

TOR (top‑of‑rack) disaster recovery guarantees that the replicas of a partition reside in different racks, so that a rack failure does not cause data loss.
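A rack-aware placement rule of this kind can be sketched as follows. This is a simplified round-robin over racks written for illustration; it is neither Kafka's built-in rack-aware assignment nor Meituan's implementation, and the function names and the `start` rotation parameter are assumptions.

```python
def place_replicas(brokers_by_rack, replication_factor, start=0):
    """Pick one broker from each of `replication_factor` distinct racks."""
    racks = sorted(brokers_by_rack)
    if replication_factor > len(racks):
        raise ValueError("need at least one rack per replica")
    chosen = []
    for i in range(replication_factor):
        rack = racks[(start + i) % len(racks)]  # rotate racks per partition
        brokers = brokers_by_rack[rack]
        chosen.append(brokers[start % len(brokers)])
    return chosen


def racks_are_distinct(replicas, rack_of):
    """True if no two replicas share a rack."""
    racks = [rack_of[b] for b in replicas]
    return len(set(racks)) == len(racks)
```

Passing the partition index as `start` spreads leaders and followers across racks and brokers. When every replica sits in a different rack, losing one rack removes at most one replica of any partition.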

4. Future Outlook

Future work will focus on improving robustness, narrowing fault domains with finer isolation, enabling client‑side node avoidance, multi‑queue request isolation, hot‑swap, network back‑pressure, and exploring cloud‑native deployment while reducing machine count.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
