Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Architecture Digest

Apr 9, 2016

Meituan is a data-driven internet service that generates massive volumes of logs every day from user clicks, browsing, and orders. These logs are aggregated, processed, and mined to support recommendation, search, and strategic decisions, so choosing an efficient data-processing engine is crucial to overall production efficiency.

Meituan initially relied on Hive SQL running on the MapReduce engine, but that model proved inadequate as data-processing and analytical needs evolved. In 2014 the company adopted Spark, deployed as Spark on YARN. Spark has since become the primary compute engine: the ratio of Spark to MapReduce jobs is 4:1, and ETL workloads run ten times faster.

To promote Spark usage, Meituan built an interactive development platform on Apache Zeppelin and extended it with user authentication, audit logs, permission management, and resource isolation. The platform supports three interpreters (Spark with Scala, PySpark, and Spark SQL) for data exploration, code debugging, and collaborative development.

Meituan also packaged the common logic for exporting Hive data to the Tair key-value store into a reusable Spark ETL template. Users specify the source Hive tables, the target Tair keys and values, and the execution parameters. The template runs with conservative defaults (2 CPU cores and 2 GB of memory per executor, up to 100 executors) and exposes only settings that are safe to tune.
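The idea of conservative defaults plus a small set of safe tunables can be sketched as follows. This is a minimal illustration, not Meituan's template: the function name, the whitelist, the script name `hive_to_tair.py`, and its command-line flags are all hypothetical, and mapping "up to 100 executors" onto Spark's dynamic-allocation cap is an assumption. Only the Spark property names themselves are real.

```python
from typing import Dict, List, Optional

# Defaults from the text: 2 cores and 2 GB per executor, at most 100 executors.
# Expressing the executor cap through dynamic allocation is an assumption.
DEFAULTS: Dict[str, str] = {
    "spark.executor.cores": "2",
    "spark.executor.memory": "2g",
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.maxExecutors": "100",
}

# Hypothetical whitelist: the only settings a template user may override.
TUNABLE = {"spark.executor.memory", "spark.dynamicAllocation.maxExecutors"}


def build_submit_args(
    source_table: str,
    key_column: str,
    value_column: str,
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Build a spark-submit command line for a Hive-to-Tair export job."""
    conf = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in TUNABLE:
            raise ValueError(f"setting not exposed by the template: {key}")
        conf[key] = value

    args = ["spark-submit", "--master", "yarn"]
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    # hive_to_tair.py and its flags are placeholders for the real job script.
    args += [
        "hive_to_tair.py",
        "--source-table", source_table,
        "--key-column", key_column,
        "--value-column", value_column,
    ]
    return args
```

A caller could raise executor memory with `build_submit_args("dw.user_profile", "user_id", "profile_json", {"spark.executor.memory": "4g"})`, while an attempt to override a setting outside the whitelist raises `ValueError`.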

A Spark-based user-feature platform aggregates and consolidates feature data across business lines. It aggregates in two layers, first within each business and then across businesses. This yields a ten-fold performance boost over MapReduce, and the platform also provides visualized feature statistics, monitoring, and alerting.
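The two-layer aggregation can be illustrated in plain Python. This is a sketch of the data flow only, under assumed names: the record layout `(user_id, business, feature, value)`, the summing rule, and the `all.` prefix for cross-business totals are hypothetical. On the real platform each layer would run as a distributed Spark aggregation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (user_id, business, feature, value).
Record = Tuple[str, str, str, float]


def aggregate_within_business(
    records: Iterable[Record],
) -> Dict[Tuple[str, str, str], float]:
    """Layer 1: sum each feature per (business, user)."""
    layer1: Dict[Tuple[str, str, str], float] = defaultdict(float)
    for user_id, business, feature, value in records:
        layer1[(business, user_id, feature)] += value
    return dict(layer1)


def aggregate_across_businesses(
    layer1: Dict[Tuple[str, str, str], float],
) -> Dict[str, Dict[str, float]]:
    """Layer 2: merge per-business results into one feature map per user."""
    users: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (business, user_id, feature), value in layer1.items():
        features = users[user_id]
        # Keep the per-business value and accumulate a cross-business total.
        features[f"{business}.{feature}"] = value
        features[f"all.{feature}"] = features.get(f"all.{feature}", 0.0) + value
    return dict(users)
```

Splitting the work this way keeps the first layer independent per business line, so one business can be recomputed without touching the others.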

The data-mining platform builds on the feature store to transform, normalize, discretize, and select features, so that models can be developed quickly with Spark MLlib or Python libraries. Models are stored in a custom format and deployed as scheduled Spark jobs, with alerts based on an accuracy threshold.
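Two of the feature steps named above, normalization and discretization, can be sketched in plain Python. The article does not say which methods the platform uses; min-max scaling and equal-width binning are chosen here purely as common examples, and the function names are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_equal_width(values: Sequence[float], bins: int) -> List[int]:
    """Assign each value to one of `bins` equal-width buckets, numbered from 0."""
    if bins < 1:
        raise ValueError("bins must be at least 1")
    normalized = min_max_normalize(values)
    # The maximum value normalizes to 1.0, so clamp it into the last bucket.
    return [min(int(v * bins), bins - 1) for v in normalized]
```

For a distributed equivalent, Spark MLlib provides transformers such as `MinMaxScaler` and `Bucketizer` in `pyspark.ml.feature`.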

For interactive user-behavior analysis, Meituan built a system that serves internal product managers and operators. The requirements were self-service queries, sub-minute response times, and visualization. Spark Core and Spark SQL process billions of Hive records per day, and after extensive tuning of operators, shuffle, data skew, and memory, more than 90% of jobs finish within 5 minutes.
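The article lists data skew among the tuning areas without describing the fix. One widely used remedy is two-stage aggregation with salted keys, sketched below in plain Python. It is shown as a general technique, not as Meituan's documented approach, and the function name and parameters are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, Iterable


def salted_count(keys: Iterable[str], num_salts: int, seed: int = 0) -> Dict[str, int]:
    """Count keys in two stages so that one hot key is split across several groups."""
    rng = random.Random(seed)

    # Stage 1: attach a random salt to every key. In a distributed job the
    # (key, salt) pairs for a hot key land on up to num_salts different tasks
    # instead of a single overloaded one.
    partial: Counter = Counter()
    for key in keys:
        partial[(key, rng.randrange(num_salts))] += 1

    # Stage 2: drop the salt and combine the much smaller partial counts.
    total: Counter = Counter()
    for (key, _salt), count in partial.items():
        total[key] += count
    return dict(total)
```

The result is the same as a direct count; the benefit in a distributed job is that the expensive first stage is spread evenly, and only the small partial results are merged per key.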

In the SEM (search-engine marketing) domain, Meituan built the Medusa service on Spark. Medusa offers a SQL-like interface for keyword management with a low barrier to entry, along with high-performance, scalable execution. It achieves high availability through a pure-function design, executor scaling, and robust logging to Hive.

Overall, Meituan's Spark adoption, spanning ETL, feature engineering, model training, interactive analytics, and SEM services, has reduced duplicated effort, accelerated development cycles, and delivered significant performance and reliability improvements across the company.