How Meituan Scales Real‑Time Computing with Flink: Architecture, Challenges & Solutions
This article summarizes Meituan’s real‑time computing platform, detailing its layered architecture built on Kafka, Flink on YARN, state management, resource isolation, fault tolerance, monitoring, and the Petra metric aggregation system, while highlighting the challenges faced and the solutions implemented to achieve high‑throughput, low‑latency stream processing at massive scale.
Meituan Real‑Time Computing Platform Overview
The platform consists of a data‑cache layer where all log data is collected via a unified log collection system into Kafka, which serves as the central data‑transfer hub supporting both offline and real‑time workloads.
Above the cache layer sits the engine layer offering real‑time computation engines such as Storm (standalone) and Flink (running on YARN). Supporting services include real‑time storage (HBase, Redis, Elasticsearch) for intermediate states, results, and dimension data.
The top layer provides diverse tools for data developers, including job hosting, tuning, diagnostics, monitoring, alerting, data search, and permission management. Meituan is also building a metadata center to store schemas and meta‑information, serving as the “brain” of the real‑time system.
Current Platform Status
Meituan now runs nearly ten thousand jobs on a cluster with thousands of nodes, processing trillions of messages per day and peaking at tens of millions of messages per second.
Pain Points and Problems
Precision issues with Storm’s at‑least‑once semantics and its batch‑oriented Trident state management, leading to latency bottlenecks.
State management challenges affecting consistency, performance, and fault recovery.
Limited expressive power for real‑time computations, especially for exact calculations and windowed operations.
High development and debugging costs on a large‑scale distributed engine.
Flink Exploration Focus
Exactly‑once computation capability.
Advanced state management.
Window/Join/time handling.
SQL/Table API support.
Flink Practice at Meituan
Stability Practices
Resource Isolation
Isolation is applied per scenario and business, considering peak periods, reliability, latency requirements, and application importance. Strategies include YARN labeling with physical node isolation and separating offline DataNode resources from real‑time compute nodes.
Intelligent Scheduling
Beyond CPU and memory, scheduling also accounts for disk I/O, disk capacity, and network bandwidth to avoid resource contention and improve overall efficiency.
Fault Tolerance
Implemented JobManager HA and automatic job restart. Added checkpoint‑based recovery and Kafka consumer retry mechanisms to handle node/network failures and leader switches.
Disaster Recovery
Multi‑data‑center deployment.
Hot standby for Kafka streams.
Platformization
Job Management
A unified UI allows users to submit jobs, configure lifecycle settings, set alerts, and monitor latency, all integrated into the real‑time platform.
Monitoring & Alerting
Enhanced monitoring includes latency alerts, job status alerts, and customizable metric‑based alerts.
Optimization & Diagnosis
Unified logging and metric collection for all engines.
Conditional log search for developers.
Custom time‑range metric queries.
Configurable alerts based on logs and metrics.
Ecosystem Construction
Adaptations for different MQs ensure minimal impact on online services, with configurations for permission management and metric collection.
Flink Applications at Meituan
Petra Real‑Time Metric Aggregation
Petra aggregates business‑time metrics across multiple dimensions (application, channel, data center) with sub‑second latency, supporting composite indicators such as transaction success rate.
Key technical considerations include:
Exactly‑once guarantees via Flink checkpointing.
Mitigating data skew through hotspot key hashing.
Handling late‑arriving data with configurable lateness and window settings.
MLX Machine Learning Platform
Supports both offline and near‑real‑time feature extraction for machine‑learning models, feeding features into training clusters for search, recommendation, and other use cases.
Future Outlook
Planned improvements focus on unified state management (SQL‑based), performance optimization for large state (especially RocksDB backend), and deeper SQL support (concurrency, query optimization). Additionally, Meituan aims to integrate batch and stream processing via a unified SQL API.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
