
How Meitu Built a Scalable Big Data Platform for Billions of Daily Events

Meitu's data platform, serving dozens of apps and handling billions of daily events, combines custom log collection, multi‑layer storage, offline and real‑time processing, and open data services to support personalization, anti‑fraud, analytics, and business growth.

dbaplus Community

Aug 27, 2018

Data Application Scenarios

Meitu’s suite of apps (e.g., Meipai, Meitu XiuXiu, Meitu Camera) generates ~200 billion events per day from 500 million active users. This data supports personalized recommendation, search, reporting, anti‑fraud, A/B testing, channel tracking, and advertising.

Overall Architecture

The platform is organized into four layers:

Collection layer: the Arachnia log‑collection service and AppSDK gather client logs; DataX handles batch data integration; a custom crawler platform imports public data.

Storage layer: HDFS for raw files, MongoDB for document stores, HBase for wide‑column data, Elasticsearch for search.

Computation layer: offline jobs run on Hive + MapReduce (later Hive‑on‑Spark); real‑time streams are processed by Storm, Flink, and a proprietary bitmap system (Naix).

Application layer: data workshop, data bus, task scheduler, and visualization platforms (A/B testing, channel tracking, user profiling, etc.).

Platform Evolution Stages

Stage 1: Use free third‑party analytics; only basic metrics are available, with no access to raw data.

Stage 2: Open raw data and compute resources to business lines, enabling self‑service data development.

Stage 3: Emphasize query speed, real‑time latency, and resource efficiency as data volume and cluster size grow.

Log Collection System (Arachnia)

Requirements:

Automated deployment and upgrade of agents.

At‑least‑once delivery guarantee.

Aggregation across multiple IDC sites.

Minimal impact on client resources.

Arachnia consists of a central coordinator, per‑IDC agents, and a collector. Each collection transaction is assigned a globally unique txid. Agents generate a fileID (hash of inode + file header) and a MsgID (agentID + fileID + offset) to support deduplication and downstream cleaning.
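The ID scheme can be illustrated with a minimal Python sketch. It is not Arachnia's code: the function names, the SHA‑1 hash, the 256‑byte header length, and the colon‑separated MsgID layout are all assumptions made for illustration. Only the ingredients (inode + file header, agentID + fileID + offset) come from the description above.

```python
import hashlib
import os


def file_id(path: str, header_bytes: int = 256) -> str:
    """Hash the inode number together with the first bytes of the file.

    The inode stays the same when a log file is renamed during rotation,
    and the header helps tell a reused inode apart from the original file.
    header_bytes=256 is an assumed value, not taken from the article.
    """
    inode = os.stat(path).st_ino
    with open(path, "rb") as f:
        header = f.read(header_bytes)
    digest = hashlib.sha1()
    digest.update(str(inode).encode("ascii"))
    digest.update(b":")
    digest.update(header)
    return digest.hexdigest()


def msg_id(agent_id: str, fid: str, offset: int) -> str:
    """Compose a message ID from agent ID, file ID and byte offset (assumed layout)."""
    return f"{agent_id}:{fid}:{offset}"


def deduplicate(messages):
    """Yield (msg_id, payload) pairs, dropping IDs that were already seen.

    With at-least-once delivery a message may arrive more than once;
    a stable message ID lets a downstream step discard the repeats.
    """
    seen = set()
    for mid, payload in messages:
        if mid in seen:
            continue
        seen.add(mid)
        yield mid, payload
```

Because the ID depends only on the agent, the file, and the offset, a retransmitted line carries the same MsgID as its first copy, which is what makes downstream deduplication possible.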

Kafka → HDFS Ingestion

Collected logs are pushed to Kafka. A MapReduce job reads each Kafka partition, parses and validates records, then writes them to HDFS according to configurable partition rules (e.g., by date, app, event type). Offsets of processed partitions are persisted in MySQL to enable incremental re‑runs.
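As a rough sketch of the parse, validate, and partition step, the Python below groups records by a target directory and tracks the next offset to persist. Everything specific here is assumed for illustration: the JSON record format, the field names `ts`, `app`, and `event`, the `/data/logs` base path, and the directory layout. The real job runs as MapReduce and stores offsets in MySQL; this version keeps the offset in a plain variable.

```python
import json
from datetime import datetime, timezone

REQUIRED_FIELDS = ("ts", "app", "event")  # hypothetical schema


def parse_and_validate(line: str):
    """Return the parsed record, or None if it is malformed or incomplete."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if not all(key in record for key in REQUIRED_FIELDS):
        return None
    if not isinstance(record["ts"], (int, float)):
        return None
    return record


def target_dir(record: dict, base: str = "/data/logs") -> str:
    """Apply an assumed partition rule: date, then app, then event type."""
    ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return f"{base}/dt={ts:%Y-%m-%d}/app={record['app']}/event={record['event']}"


def process_batch(messages, committed_offset: int):
    """Group valid records by target directory.

    messages is an iterable of (offset, line) pairs from one Kafka partition.
    Offsets below committed_offset were handled by an earlier run and are
    skipped. Returns (records grouped by directory, next offset to persist).
    """
    grouped = {}
    next_offset = committed_offset
    for offset, line in messages:
        if offset < committed_offset:
            continue
        record = parse_and_validate(line)
        if record is not None:
            grouped.setdefault(target_dir(record), []).append(record)
        next_offset = offset + 1
    return grouped, next_offset
```

Persisting `next_offset` only after the batch has been written is what allows an incremental re‑run to resume from the last committed position instead of starting over.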

To mitigate data skew:

Small partitions are merged into a single input split so one mapper can process multiple partitions.

Large partitions are split across several mappers.
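The two skew rules can be sketched as a small split planner. This is an illustrative approximation, not the actual input‑format code: the function name, the `target` size per mapper, and the use of message counts (rather than bytes) are assumptions.

```python
def plan_splits(pending: dict, target: int):
    """Plan input splits so each mapper handles roughly `target` messages.

    pending maps a partition ID to its number of unprocessed messages.
    Each split is a list of (partition, start, end) ranges, where start
    and end are relative to the partition's last committed offset.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    splits = []
    current, current_size = [], 0
    for partition, size in sorted(pending.items()):
        if size >= target:
            # Large partition: cut it into ranges of at most `target` messages,
            # one range per mapper.
            for start in range(0, size, target):
                splits.append([(partition, start, min(start + target, size))])
            continue
        if current and current_size + size > target:
            # The combined split is full; close it and start a new one.
            splits.append(current)
            current, current_size = [], 0
        if size > 0:
            # Small partition: pack it into the current combined split.
            current.append((partition, 0, size))
            current_size += size
    if current:
        splits.append(current)
    return splits
```

For example, with `target=100`, three partitions holding 10, 20, and 30 messages end up in one combined split, while a partition holding 250 messages is cut into three ranges.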

A two‑stage write strategy is used: mapper 1 writes to a temporary directory; mapper 2 appends the temporary files to the final HDFS target. This allows fast re‑processing of a failed batch without re‑reading an entire day’s data.
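One way to read the two‑stage write is sketched below, using the local filesystem as a stand‑in for HDFS. The function names, file naming, and directory layout are hypothetical; the point is only that stage one output is isolated in a temporary directory, so a failed batch can be discarded and redone without touching data already appended to the target.

```python
import os
import shutil


def stage_one(records, tmp_dir: str, mapper_id: int) -> str:
    """Write one mapper's output to its own file in a temporary directory."""
    os.makedirs(tmp_dir, exist_ok=True)
    path = os.path.join(tmp_dir, f"part-{mapper_id}.tmp")
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(record + "\n")
    return path


def stage_two(tmp_dir: str, target_file: str) -> None:
    """Append every temporary file to the target file, then remove the temp dir."""
    parent = os.path.dirname(target_file)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(target_file, "a", encoding="utf-8") as out:
        for name in sorted(os.listdir(tmp_dir)):
            with open(os.path.join(tmp_dir, name), encoding="utf-8") as src:
                shutil.copyfileobj(src, out)
    shutil.rmtree(tmp_dir)
```

If stage one fails, the temporary directory for that batch can simply be deleted and the batch re‑run; the target file is only modified once all temporary files are complete.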

Real‑time Distribution (Databus)

Databus is a Storm‑based topology that lets business teams define custom filtering rules. It consumes raw streams from Kafka, applies rule‑based matching, and forwards only the required subset to downstream Kafka clusters, reducing unnecessary data consumption.
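The rule‑matching idea can be illustrated without Storm. In the sketch below, the `Rule` class, its equality‑only matching, and the topic names are assumptions for illustration; the article does not describe Databus's actual rule syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """A hypothetical filter rule: forward an event to `topic` when every
    listed field equals the expected value."""
    topic: str
    equals: dict = field(default_factory=dict)

    def matches(self, event: dict) -> bool:
        return all(event.get(key) == value for key, value in self.equals.items())


def route(event: dict, rules) -> list:
    """Return the downstream topics whose rules match this event."""
    return [rule.topic for rule in rules if rule.matches(event)]


rules = [
    Rule(topic="meipai-play", equals={"app": "meipai", "event": "play"}),
    Rule(topic="all-logins", equals={"event": "login"}),
]
```

An event such as `{"app": "meipai", "event": "play"}` would be forwarded only to `meipai-play`, and an event matching no rule would be forwarded nowhere, so each downstream consumer reads only the subset it asked for.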

Stability and Security Enhancements

Cluster upgrades:

Hive upgraded from 0.13 to 2.1; Hadoop from 2.4 to 2.7.

HA deployment for HiveServer2 and MetaStore (multiple nodes, failover).

Migrated execution engine from Hive‑on‑MapReduce to Hive‑on‑Spark.

Internal patches applied via a private branch to back‑port community fixes.

Security:

Unified data access via OneDataAPI protected by a CA‑issued token; only authenticated services can query data.

Cluster‑wide authorization enforced by Apache Ranger for Kafka, HBase, Hadoop, etc.

Key Lessons Learned

Assess business scale, the number of business lines, and data demand before committing to a platform; large‑scale, multi‑line businesses benefit most.

Prioritize data quality (completeness, timeliness, unique identifiers) and collection reliability (at‑least‑once, multi‑IDC aggregation).

Continuously monitor and optimize resource consumption, cost, and access controls as usage grows.

