Big Data 18 min read

How Meituan Scales Real‑Time Computing with Flink: Architecture, Challenges & Solutions

This article summarizes Meituan’s real‑time computing platform, detailing its layered architecture built on Kafka, Flink on YARN, state management, resource isolation, fault tolerance, monitoring, and the Petra metric aggregation system, while highlighting the challenges faced and the solutions implemented to achieve high‑throughput, low‑latency stream processing at massive scale.

21CTO

Oct 19, 2018

How Meituan Scales Real‑Time Computing with Flink: Architecture, Challenges & Solutions

Meituan Real‑Time Computing Platform Overview

The platform consists of a data‑cache layer where all log data is collected via a unified log collection system into Kafka, which serves as the central data‑transfer hub supporting both offline and real‑time workloads.

Above the cache layer sits the engine layer offering real‑time computation engines such as Storm (standalone) and Flink (running on YARN). Supporting services include real‑time storage (HBase, Redis, Elasticsearch) for intermediate states, results, and dimension data.

The top layer provides diverse tools for data developers, including job hosting, tuning, diagnostics, monitoring, alerting, data search, and permission management. Meituan is also building a metadata center to store schemas and meta‑information, serving as the “brain” of the real‑time system.

Current Platform Status

Meituan now runs nearly ten thousand jobs on a cluster with thousands of nodes, processing trillions of messages per day and peaking at tens of millions of messages per second.

Pain Points and Problems

Precision issues with Storm’s at‑least‑once semantics and its batch‑oriented Trident state management, leading to latency bottlenecks.

State management challenges affecting consistency, performance, and fault recovery.

Limited expressive power for real‑time computations, especially for exact calculations and windowed operations.

High development and debugging costs on a large‑scale distributed engine.

Flink Exploration Focus

Exactly‑once computation capability.

Advanced state management.

Window/Join/time handling.

SQL/Table API support.

Flink Practice at Meituan

Stability Practices

Resource Isolation

Isolation is applied per scenario and business, considering peak periods, reliability, latency requirements, and application importance. Strategies include YARN labeling with physical node isolation and separating offline DataNode resources from real‑time compute nodes.

Intelligent Scheduling

Beyond CPU and memory, scheduling also accounts for disk I/O, disk capacity, and network bandwidth to avoid resource contention and improve overall efficiency.

Fault Tolerance

Implemented JobManager HA and automatic job restart. Added checkpoint‑based recovery and Kafka consumer retry mechanisms to handle node/network failures and leader switches.

Disaster Recovery

Multi‑data‑center deployment.

Hot standby for Kafka streams.

Platformization

Job Management

A unified UI allows users to submit jobs, configure lifecycle settings, set alerts, and monitor latency, all integrated into the real‑time platform.

Monitoring & Alerting

Enhanced monitoring includes latency alerts, job status alerts, and customizable metric‑based alerts.

Optimization & Diagnosis

Unified logging and metric collection for all engines.

Conditional log search for developers.

Custom time‑range metric queries.

Configurable alerts based on logs and metrics.

Ecosystem Construction

Adaptations for different MQs ensure minimal impact on online services, with configurations for permission management and metric collection.

Flink Applications at Meituan

Petra Real‑Time Metric Aggregation

Petra aggregates business‑time metrics across multiple dimensions (application, channel, data center) with sub‑second latency, supporting composite indicators such as transaction success rate.

Key technical considerations include:

Exactly‑once guarantees via Flink checkpointing.

Mitigating data skew through hotspot key hashing.

Handling late‑arriving data with configurable lateness and window settings.

MLX Machine Learning Platform

Supports both offline and near‑real‑time feature extraction for machine‑learning models, feeding features into training clusters for search, recommendation, and other use cases.

Future Outlook

Planned improvements focus on unified state management (SQL‑based), performance optimization for large state (especially RocksDB backend), and deeper SQL support (concurrency, query optimization). Additionally, Meituan aims to integrate batch and stream processing via a unified SQL API.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Monitoring Big Data Flink Real-time Streaming State Management Resource Isolation

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.