
How Meitu Scaled Its Billion-User Data Analytics: Architecture Evolution and Lessons

This article explains how Meitu built and evolved a large‑scale data statistics platform to handle billions of users, detailing the challenges of growing data volume, the architectural shifts from simple scripts to Hadoop, and the design of modular components for job management, scheduling, execution, and future expansion.

21CTO

Sep 25, 2017


Business and Technology Collision

Meitu serves billions of users across many apps, generating massive volumes of data that product, operations, and marketing teams rely on for feature optimization, behavior analysis, and performance tracking. This rapid growth created ever-increasing demand for data statistics and analysis.

Architecture Evolution

Initially, each app logged data locally and used rsync to copy the logs to a central node, where simple shell or PHP scripts run from crontab aggregated the data into MySQL for reporting. This approach quickly ran into storage capacity limits, computation bottlenecks, and high maintenance costs.

Transition to a Big Data Stack

Meitu introduced a data‑collection system that streamed logs to HDFS, deployed a Hadoop cluster for distributed storage and computation, and leveraged Hive to replace custom aggregation scripts, dramatically improving scalability and developer productivity.

Current Pain Points

High cost of understanding business requirements and data sources.

Remaining repetitive code for query, aggregation, and storage.

Operational overhead of packaging and deploying jobs.

Limited personal growth for engineers stuck in repetitive tasks.

Platform Design

To address these issues, Meitu built a platform with three core modules:

JobManager

Manages metadata for statistical tasks, provides a UI for business users to configure required metrics, and integrates data‑warehouse information.

Scheduler

Acts as a centralized scheduler, supporting priority‑based, time‑based, and workflow‑based dispatch of tasks.
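
The three dispatch modes can be combined in one small loop. The following is a minimal sketch, not Meitu's scheduler: `ScheduledTask`, `dispatch_order`, and the integer `run_at` timestamps are hypothetical names chosen for illustration, and it assumes that any dependency not currently due has already completed.

```python
import heapq
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass(order=True)
class ScheduledTask:
    # Ordering key: earlier run_at first, then the lower priority number first.
    run_at: int
    priority: int
    name: str = field(compare=False)
    depends_on: tuple = field(default=(), compare=False)


def dispatch_order(tasks, now):
    """Return the names of due tasks in the order they would be dispatched."""
    # Time-based: only tasks whose scheduled time has arrived are candidates.
    due = {t.name: t for t in tasks if t.run_at <= now}
    # Workflow-based: a task waits for its dependencies that are also due.
    # Assumption: dependencies that are not due have already finished.
    graph = {n: {d for d in t.depends_on if d in due} for n, t in due.items()}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    # Priority-based: among unblocked tasks, pop the smallest ordering key.
    ready = [due[n] for n in sorter.get_ready()]
    heapq.heapify(ready)
    order = []
    while ready:
        task = heapq.heappop(ready)
        order.append(task.name)
        sorter.done(task.name)  # unblocks tasks that depended on this one
        for n in sorter.get_ready():
            heapq.heappush(ready, due[n])
    return order
```

A real scheduler would dispatch to workers and track completion asynchronously; this sketch only shows how time, priority, and dependency constraints interact when choosing the next task.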

JobExecutor

Instantiates query plugins (e.g., Hive), executes queries, performs filtering and dimensional aggregation, and writes results to the appropriate storage layer.
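
One common way to let an executor instantiate plugins by name is a small registry. The sketch below is an assumption about how such a mechanism could look, not the platform's actual code: `QueryPlugin`, `PLUGINS`, `register`, `StaticQuery`, and `execute` are all hypothetical, and `StaticQuery` returns canned rows in place of a real engine such as Hive.

```python
from typing import Dict, List, Type

Row = Dict[str, object]


class QueryPlugin:
    """Base class: a plugin turns a query statement into rows."""

    def run(self, statement: str) -> List[Row]:
        raise NotImplementedError


# Hypothetical registry: plugin name -> plugin class.
PLUGINS: Dict[str, Type[QueryPlugin]] = {}


def register(name: str):
    """Class decorator that records a plugin class under the given name."""
    def decorator(cls: Type[QueryPlugin]) -> Type[QueryPlugin]:
        PLUGINS[name] = cls
        return cls
    return decorator


@register("static")
class StaticQuery(QueryPlugin):
    """Stand-in for a real query engine; ignores the statement."""

    def run(self, statement: str) -> List[Row]:
        return [{"app": "meipai", "uid": 1}, {"app": "meipai", "uid": 2}]


def execute(plugin_name: str, statement: str, sink: List[Row]) -> int:
    """Instantiate the named plugin, run the query, append rows to the sink."""
    if plugin_name not in PLUGINS:
        raise KeyError(f"unknown query plugin: {plugin_name}")
    rows = PLUGINS[plugin_name]().run(statement)
    sink.extend(rows)  # the sink list stands in for a storage layer
    return len(rows)
```

Adding a new query engine then means registering one more class, without touching `execute`.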

Metadata Model

Each task description includes data source, computation operator, storage medium, filters, dimensions, and dependencies, enabling a uniform processing pipeline: query → filter → aggregate → store.
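
To make the pipeline concrete, here is a minimal sketch of a task description driving the four stages over in-memory rows. The field names on `TaskSpec`, the `run_task` function, and the two supported operators ("count" and "sum") are illustrative assumptions; the source does not publish the actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class TaskSpec:
    source: Callable[[], List[Row]]   # data source: here, a function returning rows
    operator: str                     # computation operator: "count" or "sum"
    measure: str = ""                 # column to sum; unused for "count"
    filters: Dict[str, object] = field(default_factory=dict)
    dimensions: Tuple[str, ...] = ()  # group-by columns
    depends_on: Tuple[str, ...] = ()  # upstream task names (not used in this sketch)


def run_task(spec: TaskSpec, store: Dict[tuple, int]) -> Dict[tuple, int]:
    """Run query -> filter -> aggregate -> store for one task description."""
    if spec.operator not in ("count", "sum"):
        raise ValueError(f"unsupported operator: {spec.operator}")
    rows = spec.source()  # query
    # filter: keep rows that match every key/value pair in spec.filters
    kept = [r for r in rows if all(r.get(k) == v for k, v in spec.filters.items())]
    totals: Dict[tuple, int] = defaultdict(int)
    for r in kept:  # aggregate per combination of dimension values
        key = tuple(r[d] for d in spec.dimensions)
        totals[key] += 1 if spec.operator == "count" else r[spec.measure]
    store.update(totals)  # store: the dict stands in for a storage medium
    return dict(totals)
```

Because every task is just data, new metrics can be added by writing a new description rather than new aggregation code, which is the repetition the platform set out to remove.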

Extended Capabilities

Ad‑hoc querying via a SQL editor with HQL (Hive Query Language) syntax validation.

Data ingestion from MySQL using a Sqoop‑based plugin.

Bitmap‑based deduplication and retention calculations.

Support for multiple storage backends: MongoDB, HDFS, CSV, MySQL.
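
The bitmap idea from the list above can be illustrated with plain Python integers used as bit sets. This is a sketch under assumptions: user IDs are small non-negative integers, and the function names are hypothetical. The source does not say which bitmap implementation Meitu uses; at production scale a compressed bitmap library would normally replace the raw integers.

```python
def to_bitmap(user_ids):
    """Set bit i for every user ID i; duplicate IDs collapse automatically."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def cardinality(bitmap):
    """Number of distinct users equals the number of set bits."""
    return bin(bitmap).count("1")


def retention(day0, day_n):
    """Share of day-0 users who also appear on day N."""
    base = cardinality(day0)
    # Bitwise AND keeps only users present on both days.
    return cardinality(day0 & day_n) / base if base else 0.0
```

Deduplication becomes a bitwise OR and retention a bitwise AND, so both avoid sorting or hashing the raw event log a second time.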

Data Visualization

A unified API abstracts underlying storage, allowing front‑end visualization tools to access data without learning each storage technology, while a dedicated visualization platform lets users select data sources and build custom dashboards.

Security and Access Control

A central authentication service (CA) issues tokens; the generic API validates tokens before querying storage, ensuring that, for example, Meipai’s backend can only access Meipai data.
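
The check described here can be sketched with an HMAC-signed token. The token format, the shared `SECRET`, and the function names below are assumptions made for illustration only; the source does not describe how the CA actually encodes or signs its tokens.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # hypothetical key shared by the CA and the API


def issue_token(app: str) -> str:
    """CA side: sign the app name so the API can verify the caller's scope."""
    sig = hmac.new(SECRET, app.encode(), hashlib.sha256).hexdigest()
    return f"{app}.{sig}"


def authorized_app(token: str) -> str:
    """API side: return the app a token is scoped to, or raise PermissionError."""
    app, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, app.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    if not app or not hmac.compare_digest(sig.encode(), expected.encode()):
        raise PermissionError("invalid token")
    return app


def query(token: str, app: str, storage: dict) -> list:
    """Validate the token first, then read only the data of the token's app."""
    if authorized_app(token) != app:
        raise PermissionError(f"token is not scoped to {app}")
    return storage.get(app, [])
```

With this shape, a token issued for Meipai passes validation but is rejected as soon as it is used to request another app's data.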

Future Roadmap

Develop a distributed scheduling system to replace the current single‑node scheduler and support resource isolation.

Enhance data visualization capabilities for personalized dashboards.

Build an OLAP service (e.g., based on Kylin) for faster analytical queries.

Integrate real‑time statistics to complement batch processing.

Overall Architecture

A unified scheduler triggers jobs; JobExecutor selects plugins for querying and aggregation; results are stored in a DB accessed via a secure API. The platform continuously evolves to meet growing data volume and analytical needs.
