How Meitu Scaled Its Billion-User Data Analytics: Architecture Evolution and Lessons
This article explains how Meitu built and evolved a large‑scale data statistics platform to handle billions of users, detailing the challenges of growing data volume, the architectural shifts from simple scripts to Hadoop, and the design of modular components for job management, scheduling, execution, and future expansion.
Business and Technology Collision
Meitu serves billions of users across many apps, generating massive volumes of data that product, operations, and marketing teams rely on for feature optimization, behavior analysis, and performance tracking. This rapid growth drove ever-increasing demand for data statistics and analysis.
Architecture Evolution
Initially, each app logged data locally and used rsync to copy the logs to a central node, where simple shell or PHP scripts run from crontab aggregated the data into MySQL for reporting. This approach quickly ran into storage capacity limits, computation bottlenecks, and high maintenance costs.
Transition to a Big Data Stack
Meitu introduced a data‑collection system that streamed logs to HDFS, deployed a Hadoop cluster for distributed storage and computation, and leveraged Hive to replace custom aggregation scripts, dramatically improving scalability and developer productivity.
Current Pain Points
High cost of understanding business requirements and data sources.
Remaining repetitive code for query, aggregation, and storage.
Operational overhead of packaging and deploying jobs.
Limited personal growth for engineers stuck in repetitive tasks.
Platform Design
To address these issues, Meitu built a platform with three core modules:
JobManager
Manages metadata for statistical tasks, provides a UI for business users to configure required metrics, and integrates data‑warehouse information.
Scheduler
Acts as a centralized scheduler, supporting priority‑based, time‑based, and workflow‑based dispatch of tasks.
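The three dispatch modes can be illustrated with a small in-memory sketch. This is not Meitu's scheduler; the `Task` fields and the `dispatch_order` function are hypothetical names chosen for illustration, and it assumes that a lower priority number means "more urgent".

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # Ordering uses priority first, then scheduled time (assumption of this sketch).
    priority: int
    run_at: int
    name: str = field(compare=False)
    depends_on: tuple = field(default=(), compare=False)


def dispatch_order(tasks, now):
    """Return the names of runnable tasks in the order they would be dispatched."""
    # Time-based: only tasks whose scheduled time has arrived are candidates.
    pending = [t for t in tasks if t.run_at <= now]
    heapq.heapify(pending)  # priority-based: the heap pops the most urgent task
    done, order, deferred = set(), [], []
    while pending:
        task = heapq.heappop(pending)
        # Workflow-based: a task runs only after all of its dependencies.
        if all(dep in done for dep in task.depends_on):
            done.add(task.name)
            order.append(task.name)
            # A finished task may unblock tasks that were deferred earlier.
            for waiting in deferred:
                heapq.heappush(pending, waiting)
            deferred.clear()
        else:
            deferred.append(task)
    return order


tasks = [
    Task(priority=1, run_at=0, name="report", depends_on=("dau",)),
    Task(priority=2, run_at=0, name="dau"),
    Task(priority=3, run_at=0, name="retention"),
    Task(priority=0, run_at=100, name="later"),
]
print(dispatch_order(tasks, now=10))
```

Tasks whose dependencies never complete simply stay undispatched in this sketch; a real scheduler would also need retries, timeouts, and persistence.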
JobExecutor
Instantiates query plugins (e.g., Hive), executes queries, performs filtering and dimensional aggregation, and writes results to the appropriate storage layer.
Metadata Model
Each task description includes data source, computation operator, storage medium, filters, dimensions, and dependencies, enabling a uniform processing pipeline: query → filter → aggregate → store.
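A minimal sketch of how such a task description could drive the query → filter → aggregate → store pipeline follows. The `TaskSpec` fields mirror the six attributes listed above, but the field names, the `metric` field, the two supported operators, and the `run_task` function are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    source: str                                        # data source (hypothetical name)
    operator: str                                      # "count" or "sum" in this sketch
    storage: str                                       # storage target (hypothetical key)
    filters: list = field(default_factory=list)        # predicates applied to each row
    dimensions: list = field(default_factory=list)     # columns to group by
    dependencies: list = field(default_factory=list)   # upstream task names
    metric: str = ""                                   # column summed when operator is "sum"


def run_task(spec, query, store):
    """Run one task: query -> filter -> aggregate -> store."""
    rows = query(spec.source)                                     # query
    rows = [r for r in rows if all(f(r) for f in spec.filters)]   # filter
    totals = defaultdict(int)                                     # aggregate
    for row in rows:
        key = tuple(row[d] for d in spec.dimensions)
        totals[key] += 1 if spec.operator == "count" else row[spec.metric]
    result = dict(totals)
    store(spec.storage, result)                                   # store
    return result


rows = [
    {"app": "meipai", "os": "ios", "n": 2},
    {"app": "meipai", "os": "android", "n": 3},
    {"app": "other", "os": "ios", "n": 5},
]
spec = TaskSpec(
    source="events",
    operator="sum",
    storage="out",
    filters=[lambda r: r["app"] == "meipai"],
    dimensions=["os"],
    metric="n",
)
sink = {}
result = run_task(spec, lambda source: rows, sink.__setitem__)
```

Because the query and store steps are passed in as callables, the same pipeline could be pointed at a different query plugin or storage layer without changing `run_task`.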
Extended Capabilities
Ad‑hoc querying via a SQL editor with HQL (Hive Query Language) syntax validation.
Data ingestion from MySQL using a Sqoop‑based plugin.
Bitmap‑based deduplication and retention calculations.
Support for multiple storage backends: MongoDB, HDFS, CSV, MySQL.
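The bitmap-based deduplication and retention item in the list above can be sketched with plain Python integers used as bitmaps. This assumes user IDs have already been mapped to small non-negative integers; the function names are hypothetical, and a production system would likely use a compressed bitmap structure rather than a raw integer.

```python
def to_bitmap(user_ids):
    """Set bit i for each user id i; repeated ids collapse onto the same bit."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap


def count(bitmap):
    """Number of distinct users in the bitmap (its population count)."""
    return bin(bitmap).count("1")


def retention(day0, day_n):
    """Share of day-0 users who are active again on day n."""
    base = count(day0)
    # Bitwise AND keeps only the users present on both days.
    return count(day0 & day_n) / base if base else 0.0


day0 = to_bitmap([1, 2, 2, 3, 7])   # user 2 appears twice but is counted once
day1 = to_bitmap([2, 3, 9])
print(count(day0), retention(day0, day1))
```

Deduplication becomes a bitwise OR and retention a bitwise AND, which is why bitmaps suit both calculations.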
Data Visualization
A unified API abstracts underlying storage, allowing front‑end visualization tools to access data without learning each storage technology, while a dedicated visualization platform lets users select data sources and build custom dashboards.
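One way to picture such an abstraction is a thin routing layer in front of interchangeable backends. The sketch below is an assumption about the shape of that layer, not the actual API; `StorageBackend`, `InMemoryBackend`, and `UnifiedApi` are hypothetical names, and the in-memory backend stands in for MongoDB, MySQL, or HDFS adapters.

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Common interface that every storage adapter would implement."""

    @abstractmethod
    def fetch(self, metric, filters):
        ...


class InMemoryBackend(StorageBackend):
    """Stand-in for a real adapter; keeps rows in a Python list."""

    def __init__(self, rows):
        self.rows = rows

    def fetch(self, metric, filters):
        return [
            row for row in self.rows
            if row["metric"] == metric
            and all(row.get(key) == value for key, value in filters.items())
        ]


class UnifiedApi:
    """Routes each metric to the backend that stores it."""

    def __init__(self):
        self._routes = {}

    def register(self, metric, backend):
        self._routes[metric] = backend

    def query(self, metric, **filters):
        backend = self._routes.get(metric)
        if backend is None:
            raise KeyError(f"no backend registered for metric {metric!r}")
        return backend.fetch(metric, filters)


api = UnifiedApi()
api.register("dau", InMemoryBackend([
    {"metric": "dau", "app": "meipai", "value": 10},
    {"metric": "dau", "app": "other", "value": 5},
]))
print(api.query("dau", app="meipai"))
```

The caller only names a metric and filters; which storage technology answers the query is decided by the registration step.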
Security and Access Control
A central authentication service (CA) issues tokens; the generic API validates tokens before querying storage, ensuring that, for example, Meipai’s backend can only access Meipai data.
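The token check can be sketched with an HMAC-signed token that carries the app name. The source does not describe the token format, so everything here is an assumption: the `app.signature` layout, the shared `SECRET`, and the function names are hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-secret"  # assumption: a key shared by the CA and the API


def _sign(app):
    return hmac.new(SECRET, app.encode(), hashlib.sha256).hexdigest()


def issue_token(app):
    """What the CA would hand out: the app name plus a signature over it."""
    return f"{app}.{_sign(app)}"


def app_for_token(token):
    """Return the app a token was issued for, or None if the token is invalid."""
    app, _, signature = token.rpartition(".")
    # compare_digest avoids leaking information through timing differences.
    if hmac.compare_digest(signature.encode(), _sign(app).encode()):
        return app
    return None


def query(token, requested_app, data):
    """Serve data only when the token belongs to the requested app."""
    app = app_for_token(token)
    if app is None or app != requested_app:
        raise PermissionError("token does not grant access to this app's data")
    return data[requested_app]


data = {"meipai": [1, 2, 3], "other_app": [9]}
token = issue_token("meipai")
print(query(token, "meipai", data))
```

With this shape, a Meipai token passed along with a request for another app's data fails the check before any storage is touched.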
Future Roadmap
Develop a distributed scheduling system to replace the current single‑node scheduler and support resource isolation.
Enhance data visualization capabilities for personalized dashboards.
Build an OLAP service (e.g., based on Kylin) for faster analytical queries.
Integrate real‑time statistics to complement batch processing.
Overall Architecture
A unified scheduler triggers jobs; JobExecutor selects plugins for querying and aggregation; results are stored in a DB accessed via a secure API. The platform continuously evolves to meet growing data volume and analytical needs.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
