The Evolution of iQIYI's Big Data Analytics Platform
This article chronicles iQIYI’s journey from a simple Hive‑based data pipeline to the multi‑engine “Tongtian Tower” platform, covering the Magic Mirror system, the Gear and BabelBD workflow managers, the Monet visual analytics tool, and the integrated BI ecosystem that now serves hundreds of millions of daily users.
iQIYI’s data platform serves a massive user base with daily active users approaching 300 million, over 30 billion devices, and more than 300 TB of user behavior logs processed each day, imposing stringent requirements on data operations and development.
1. The Beginning Era
Initially, logs were transferred via RSYNC into Hive, processed by shell‑driven Hive SQL, and results were imported into MySQL for reporting, with Java handling the reporting layer. This manual pipeline caused long data‑delivery cycles and heavy developer workload.
2. The Magic Mirror Era
The Magic Mirror system introduced the Accio Log collector to upload logs from Pingback servers to HDFS, and the Transfiguration framework to parse and split logs for storage. Users could self‑service data extraction without waiting for development schedules. However, rapid business growth led to massive log volumes that overloaded Hadoop clusters, and script‑based development became unsustainable.
3. The Tongtian Tower Era
The Tongtian Tower unified all data, compute resources, and service frameworks across iQIYI. Offline processing relies on Hive and Spark; streaming uses Spark Streaming and Flink; OLAP queries run on Impala and Kylin. Storage includes HDFS, HBase, and Kudu (real‑time), while operational databases are MySQL and MongoDB. A dedicated development platform manages workflows, data lineage, cross‑DC synchronization, and data‑warehouse components such as ingestion management, metric‑dimension management, and model management.
4. Workflow Management and Development Evolution
Workflow orchestration progressed from simple Crontab scripts to a custom Shell framework, then to LinkedIn’s Azkaban (single‑node), followed by the internally built Gear system, and finally BabelBD, which offers a drag‑and‑drop interface that abstracts away configuration complexity, allowing developers to focus on core SQL logic.
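What every tool in this progression has in common is running tasks in dependency order. The following is a minimal sketch of that idea, not the Gear or BabelBD implementation, which the article does not describe. The task names and the `run_workflow` helper are hypothetical, and each task is a plain Python callable standing in for a SQL or shell step.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished.

    tasks: dict mapping task name -> zero-argument callable
    dependencies: dict mapping task name -> set of upstream task names
    """
    # Include tasks with no declared dependencies so nothing is skipped.
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    order = list(TopologicalSorter(graph).static_order())
    results = {name: tasks[name]() for name in order}
    return order, results


# Hypothetical four-step daily job (names are illustrative only).
tasks = {
    "ingest_logs": lambda: "raw",
    "build_detail": lambda: "detail",
    "build_aggregate": lambda: "aggregate",
    "export_report": lambda: "report",
}
dependencies = {
    "build_detail": {"ingest_logs"},
    "build_aggregate": {"build_detail"},
    "export_report": {"build_aggregate"},
}
order, results = run_workflow(tasks, dependencies)
print(order)
```

A Crontab setup encodes this ordering implicitly through start times. Declaring the dependencies explicitly lets a scheduler start each task as soon as its inputs are ready.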
5. iQIYI BI Platform
The BI platform evolved from a Java‑Web MVC reporting system to a configurable reporting platform (Longyuan 2.0) and finally to a large‑scale BI system that abstracts report construction, supports self‑service analysis, and enforces business‑line and permission segregation.
6. Data Management and Done Service
To guarantee data availability, a Done‑file mechanism was introduced, later replaced by a dedicated Done service that avoids HDFS small‑file overload and provides reliable dependency checks for downstream jobs.
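The idea behind a Done service can be sketched as a small registry of completion markers that downstream jobs query before they start. This is an in-memory illustration under assumed names (`DoneService`, `mark_done`, `missing`); the article does not describe the real service's API or storage.

```python
class DoneService:
    """Tracks which (dataset, partition) pairs have finished producing data.

    Markers live in a set here; a real service would keep them in a
    database, so no marker files need to be written to HDFS.
    """

    def __init__(self):
        self._done = set()

    def mark_done(self, dataset, partition):
        """Called by an upstream job once its output is complete."""
        self._done.add((dataset, partition))

    def is_done(self, dataset, partition):
        return (dataset, partition) in self._done

    def missing(self, required):
        """Return the required (dataset, partition) pairs not yet done."""
        return [item for item in required if item not in self._done]


# Hypothetical datasets and partitions, for illustration only.
service = DoneService()
service.mark_done("play_log", "2020-01-01")

required = [("play_log", "2020-01-01"), ("user_dim", "2020-01-01")]
blocked_on = service.missing(required)
print(blocked_on)  # the downstream job must wait for these inputs
```

A downstream job would start only when `missing` returns an empty list, which replaces polling HDFS for one marker file per partition.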
7. Data Warehouse Evolution
Initially, analytics consumed raw log tables directly, then moved to wide tables for convenience, and finally adopted a layered modeling approach (log, detail, aggregate, application layers) with hot/cold partitioning and HBase/Kylin storage to support high‑performance queries.
8. Magic Mirror and Butcher’s Knife (BabelBD)
Magic Mirror provides a UI for self‑service SQL generation, while Butcher’s Knife offers a full‑featured SQL editor. Both route queries to the appropriate execution engine (Impala, Spark, etc.) and perform smart down‑shifting when the primary engine cannot satisfy the request.
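The down-shifting behavior can be sketched as trying engines in priority order and falling back when one fails. The engines below are hypothetical stand-in functions, not real Impala or Spark clients, and the failure condition (an `EngineError`) is an assumption; the article does not say how the platform decides that an engine cannot serve a query.

```python
class EngineError(Exception):
    """Raised when an engine cannot satisfy a query."""


def run_with_fallback(sql, engines):
    """Try each (name, execute) pair in order; return the first success.

    Returns (engine_name, result). Raises EngineError if every engine fails.
    """
    errors = []
    for name, execute in engines:
        try:
            return name, execute(sql)
        except EngineError as exc:
            errors.append(f"{name}: {exc}")
    raise EngineError("all engines failed: " + "; ".join(errors))


# Stand-in engines: the fast one fails, the slower one succeeds.
def fake_impala(sql):
    raise EngineError("memory limit exceeded")


def fake_spark(sql):
    return [("2020-01-01", 42)]


engine_used, rows = run_with_fallback(
    "SELECT dt, COUNT(*) FROM some_table GROUP BY dt",
    [("impala", fake_impala), ("spark", fake_spark)],
)
print(engine_used, rows)
```

Ordering the list from fastest to most robust gives interactive latency when possible while still completing queries that are too heavy for the first engine.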
9. Monet Visual Analytics System
Monet enables drag‑and‑drop visual analysis, allowing users to build scenes by selecting dimensions and metrics, generate reports, and export data. It integrates with the BI layer and supports multi‑scene composition, automatically generating queries based on user selections.
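Generating a query from selected dimensions and metrics can be sketched as simple SQL assembly. The `build_query` helper, the table name, and the column names below are hypothetical; Monet's actual query generation is not described in the article.

```python
def build_query(table, dimensions, metrics, filters=None):
    """Assemble a GROUP BY query from a user's selections.

    dimensions: list of column names to group by
    metrics: dict mapping output alias -> aggregate expression
    filters: optional list of WHERE conditions, combined with AND

    Inputs are assumed to come from a trusted catalog of dimensions and
    metrics; free-form user text must not be interpolated like this.
    """
    select_parts = list(dimensions)
    select_parts += [f"{expr} AS {alias}" for alias, expr in metrics.items()]
    sql = f"SELECT {', '.join(select_parts)} FROM {table}"
    if filters:
        sql += " WHERE " + " AND ".join(filters)
    if dimensions:
        sql += " GROUP BY " + ", ".join(dimensions)
    return sql


# Hypothetical selection: plays per day and platform.
query = build_query(
    table="video_play_detail",
    dimensions=["dt", "platform"],
    metrics={"plays": "COUNT(*)"},
    filters=["dt = '2020-01-01'"],
)
print(query)
```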
10. Overall iQIYI Big Data Analysis System
The ecosystem consists of BI reports, Monet analysis, Magic Mirror & Butcher’s Knife for offline data extraction, and various analysis tools (retention, funnel, path, profiling). All components are built on a micro‑service architecture within the enterprise cloud, ensuring scalability and reliability.
In addition to the technical overview, the article includes author information, recruitment details for big‑data engineering roles, and community resources for further learning.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
