
Meitu Data Platform Architecture and Practices

Meitu's data platform serves dozens of apps with 500 million monthly active users and nearly 200 billion daily events. It combines the Arachnia log-collection system, Kafka ingestion, multi-layer storage (HDFS, MongoDB, HBase, Elasticsearch), offline Hive/MapReduce processing, and real-time Storm/Flink/Naix pipelines, supported by data-development tooling, a staged evolution toward scalability, and robust security and query-validation mechanisms.


Meitu's data technology team presented the 11th Meitu Tech Salon, describing the company's data platform that supports dozens of apps (Meipai, MeituPic, BeautyCam, etc.) with 500 million monthly active users and nearly 200 billion daily events.

The platform was built to meet diverse business needs such as personalized recommendation, search, reporting, anti‑fraud, and advertising across many product lines.

Overall Architecture

The platform consists of data collection, storage, computation, and application layers. Data collection is handled by a custom log‑collection system called Arachnia and an App SDK, which feed data into a collector that writes to Kafka.

The storage layer uses HDFS, MongoDB, HBase, Elasticsearch, etc., while offline computation relies on Hive & MapReduce and real‑time processing uses Storm, Flink, and a proprietary bitmap system Naix.

Data development tools include a data workshop, data bus, task scheduler, and various visualization platforms (A/B testing, channel tracking, user profiling).

Stage‑wise Development

The platform evolved through three stages: (1) rapid integration of third‑party analytics, (2) building internal data pipelines to improve development efficiency, and (3) scaling for higher performance, lower latency, and cost efficiency.

From 0 to 1 – Arachnia

Arachnia replaced third‑party services, rsync‑based log collection, and ad‑hoc Python scripts. It provides at‑least‑once reliability, multi‑IDC aggregation, low resource consumption, and unique message IDs generated from inode‑based file hashes.

The system uses a coordinator to manage transaction IDs: steps that complete successfully are committed exactly once, while failed steps are replayed, which yields at-least-once delivery end to end.
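The inode-based file identity and offset-derived message IDs described above can be sketched as follows. This is a minimal illustration, not Arachnia's actual implementation; the function names and the MD5/offset scheme are assumptions.

```python
import hashlib
import os

def file_identity(path: str) -> str:
    """Derive a stable identity from device + inode, so a log file that
    is renamed or rotated keeps the same identity and is not re-read
    from the beginning. (Hypothetical sketch of the inode-based hash.)"""
    st = os.stat(path)
    return hashlib.md5(f"{st.st_dev}:{st.st_ino}".encode()).hexdigest()[:16]

def message_id(file_hash: str, offset: int) -> str:
    """Unique, replay-safe message ID: file identity plus byte offset.
    Replaying the same file region reproduces the same IDs, letting
    downstream consumers deduplicate."""
    return f"{file_hash}:{offset}"
```

Because the identity survives renames, log rotation does not trigger duplicate collection, and deterministic message IDs let at-least-once delivery be deduplicated downstream.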

Kafka → HDFS Pipeline

Data from Kafka is ingested by an ETL service that supports multiple formats (JSON, Avro, custom delimiters), fault‑tolerant replay, configurable HDFS partitioning, and custom business logic (validation, filtering, injection).
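A format-pluggable parse stage like the one described can be sketched with a parser registry; each topic would be configured with a parser name. The registry pattern, field names, and the `etl_record` hook are illustrative assumptions, not the service's real API.

```python
import json
from typing import Callable, Dict

# Registry mapping a configured format name to a parser function.
PARSERS: Dict[str, Callable[[bytes], dict]] = {}

def parser(name: str):
    """Decorator registering a parser under a format name."""
    def register(fn):
        PARSERS[name] = fn
        return fn
    return register

@parser("json")
def parse_json(raw: bytes) -> dict:
    return json.loads(raw)

@parser("delimited")
def parse_delimited(raw: bytes, sep: str = "\t") -> dict:
    # Custom-delimiter format; the field layout here is assumed.
    ts, uid, event = raw.decode().split(sep)
    return {"ts": ts, "uid": uid, "event": event}

def etl_record(raw: bytes, fmt: str) -> dict:
    rec = PARSERS[fmt](raw)
    # Business hooks (validation, filtering, field injection) would
    # run here before the record is written to HDFS.
    return rec
```

New formats are then a configuration change plus one registered function, rather than a new pipeline.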

To mitigate data skew, small partitions are merged into a single input split, while large partitions are split across multiple mappers, achieving balanced processing.
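The split-planning idea above can be sketched with a greedy pass: small partitions are packed together until they reach a target split size, while a large partition is fanned out across multiple mapper-sized chunks. The function name and target-size heuristic are assumptions for illustration.

```python
def plan_splits(partition_sizes: dict, target: int) -> list:
    """Greedy sketch of skew-aware split planning.
    Returns a list of splits; each split is a list of
    (partition, chunk_index) pairs processed by one mapper."""
    splits, bucket, bucket_size = [], [], 0
    for name, size in sorted(partition_sizes.items(), key=lambda kv: kv[1]):
        if size >= target:
            # Large partition: fan out across ceil(size / target) mappers.
            for chunk in range(-(-size // target)):
                splits.append([(name, chunk)])
        else:
            # Small partition: pack into the current merged split.
            bucket.append((name, 0))
            bucket_size += size
            if bucket_size >= target:
                splits.append(bucket)
                bucket, bucket_size = [], 0
    if bucket:
        splits.append(bucket)
    return splits
```

With sizes `{"a": 10, "b": 20, "c": 250}` and a 100-unit target, `a` and `b` share one merged split while `c` is spread over three, so no single mapper is starved or overloaded.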

Databus

Databus, built on Storm, provides rule‑based real‑time data distribution, allowing downstream services to subscribe only to relevant events.
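The subscription model can be sketched as predicate-based routing: each downstream service registers a rule, and an event is delivered only to subscribers whose rule matches. This is a single-process illustration of the idea, not the Storm topology itself; all names are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]

class Databus:
    """Minimal sketch of rule-based event distribution."""

    def __init__(self) -> None:
        self.routes: List[Tuple[Callable[[Event], bool], Callable[[Event], None]]] = []

    def subscribe(self, rule: Callable[[Event], bool],
                  sink: Callable[[Event], None]) -> None:
        """Register a downstream sink with a matching rule."""
        self.routes.append((rule, sink))

    def publish(self, event: Event) -> None:
        """Deliver the event only to sinks whose rule matches."""
        for rule, sink in self.routes:
            if rule(event):
                sink(event)
```

A recommendation service could subscribe with `lambda e: e.get("type") == "click"` and never see unrelated traffic, which is the point of rule-based distribution: downstream systems pay only for the events they need.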

Platform Stability and Security

After raw data was opened to business teams, the platform faced resource-hungry and unauthorized queries. Validation rules were therefore added at the HiveSQL parsing stage (ANTLR generates an AST, which is then converted to a QueryBlock) to reject unsafe statements before execution.
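The production system applies these rules to the parsed AST/QueryBlock; as a deliberately simplified, string-level sketch of the same policy idea (the specific rules and function names here are assumptions):

```python
import re
from typing import Tuple

# Naive stand-ins for AST-level rules: each pattern rejects a class
# of statements the platform considers unsafe or too expensive.
UNSAFE = [
    (re.compile(r"\bselect\s+\*\s+from\b", re.I), "SELECT * is not allowed"),
    (re.compile(r"\bdrop\s+table\b", re.I), "DROP TABLE is not allowed"),
]

def validate(sql: str) -> Tuple[bool, str]:
    """Return (ok, reason). A real implementation walks the ANTLR AST
    rather than matching the SQL text."""
    for pattern, reason in UNSAFE:
        if pattern.search(sql):
            return False, reason
    # Example resource rule: require a WHERE filter to avoid full scans.
    if " where " not in f" {sql.lower()} ":
        return False, "query must include a WHERE filter"
    return True, "ok"
```

Working on the AST instead of raw text avoids false positives (e.g. `*` inside a string literal) and lets rules reason about partitions, joins, and output sizes.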

Cluster upgrades (Hive 0.13 → 2.1, Hadoop 2.4 → 2.7) and HA deployments for HiveServer/MetaStore were performed. Migration from Hive‑on‑MapReduce to Hive‑on‑Spark is in progress.

Security enhancements include API authentication via a unified CA service for OneDataAPI and cluster‑wide authorization using Apache Ranger.
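A token-based scheme of the kind a CA service enables might look like the following HMAC sketch: the CA issues each caller a secret, requests carry a timestamped signature, and OneDataAPI verifies it before serving data. The signing scheme and parameter names are assumptions, not the actual OneDataAPI protocol.

```python
import hashlib
import hmac
import time

def sign(secret: bytes, caller: str, ts: int) -> str:
    """Signature over caller identity and timestamp, computed by the
    client with the CA-issued secret."""
    return hmac.new(secret, f"{caller}:{ts}".encode(), hashlib.sha256).hexdigest()

def verify(secret: bytes, caller: str, ts: int, sig: str,
           max_age: int = 300) -> bool:
    """Server-side check: reject stale requests, then compare
    signatures in constant time."""
    if abs(time.time() - ts) > max_age:
        return False
    return hmac.compare_digest(sign(secret, caller, ts), sig)
```

The timestamp bound limits replay of captured requests, while `hmac.compare_digest` avoids timing side channels during comparison.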

Key Takeaways

Understand business scale and diversity before building a data platform.

Focus on data quality and platform stability (complete pipelines, reliable collection, fault‑tolerant processing).

Continuously optimize cost and resources as the platform grows.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

data engineering · Big Data · Streaming · Data Platform · Hive · ETL
Written by

Meitu Technology

Curating Meitu's technical expertise, valuable case studies, and innovation insights. We deliver quality technical content to foster knowledge sharing between Meitu's tech team and outstanding developers worldwide.
