Evolution of iQIYI Data Warehouse from 1.0 to 2.0: Architecture, Modeling, Metadata, and Data Lineage
The talk traces iQIYI's move from a fragmented five-layer Data Warehouse 1.0 to a unified 2.0 architecture built on a central Dimension Layer, with a Unified Warehouse, business-focused data marts, and subject-oriented warehouses. It also covers the platform services, metadata management, and lineage tracking behind the new design, and closes with future goals of intelligent, automated, service-oriented, model-driven data governance.
The presentation outlines iQIYI's overall business landscape and the design of its Data Warehouse 1.0, highlighting the shortcomings that prompted the evolution to Data Warehouse 2.0.
Data Warehouse 1.0 is described as a five-layer architecture consisting of a Raw Data Layer, Detail Layer, Aggregation Layer, Application Layer, and a shared Dimension Layer. The Raw Data Layer collects data from Pingback events, business databases, and third-party sources. The Detail Layer restores business processes at the finest granularity, the Aggregation Layer stores lightly and heavily aggregated data, and the Application Layer delivers customized results to downstream systems. Because the designs were siloed and business-centric, the architecture suffered from duplicated effort, inconsistent metric definitions, ambiguous data, low efficiency, and a lack of tooling support.
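As a rough illustration of how the Detail and Aggregation layers relate, the sketch below rolls finest-grain records up into a light and a heavy aggregate. The row fields (`dt`, `channel`, `play_seconds`) and the reading of "light" as keeping more dimensions than "heavy" are assumptions made for this example, not details from the talk.

```python
from collections import defaultdict

# Hypothetical Detail Layer rows: one record per playback event.
detail_rows = [
    {"dt": "2024-01-01", "channel": "movie", "user_id": "u1", "play_seconds": 120},
    {"dt": "2024-01-01", "channel": "movie", "user_id": "u2", "play_seconds": 300},
    {"dt": "2024-01-01", "channel": "drama", "user_id": "u1", "play_seconds": 60},
]


def aggregate(rows, keys, measure):
    """Sum `measure` over rows, grouped by the tuple of `keys`."""
    totals = defaultdict(int)
    for row in rows:
        group = tuple(row[k] for k in keys)
        totals[group] += row[measure]
    return dict(totals)


# Light aggregate: keeps both date and channel (assumed interpretation).
light = aggregate(detail_rows, ("dt", "channel"), "play_seconds")
# Heavy aggregate: collapses everything except the date.
heavy = aggregate(detail_rows, ("dt",), "play_seconds")

print(light)
print(heavy)
```

In a real warehouse this step would typically be a Hive or Spark SQL `GROUP BY` job; plain Python is used here only to keep the example self-contained.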
Data Warehouse 2.0 addresses these issues by redefining the layered structure into a Unified Warehouse, Business Data Marts, and Subject‑oriented Warehouses, all built upon a unified Dimension Layer. The Unified Warehouse provides comprehensive raw and aggregated data, serving as the foundation for downstream marts. Business Data Marts are constructed per business need, emphasizing low coupling and high cohesion to simplify maintenance during organizational changes. Subject Warehouses focus on cross‑business analytical domains such as traffic, content, and user behavior.
The talk then details the construction of a Data Warehouse platform, which includes foundational services (Hive, MySQL, Kafka, ClickHouse), auxiliary functions (ticket, permission, resource management), and core modules (warehouse management, data model management). The platform offers unified APIs for dimensions and metrics, integrates with a metadata center, and pushes model information to a metadata hub for data discovery.
Metadata management is emphasized: three types of dimensions (regular, enumeration, virtual) are defined, along with a rigorous process for dimension and metric definition, ensuring consistency and global uniqueness. The metadata center, built on Apache Atlas with JanusGraph and Elasticsearch, captures both technical and business metadata and constructs data lineage through hooks in Hive, Spark, and a custom integration platform (BabelX).
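One way to picture the global-uniqueness rule for dimensions is a small in-memory registry that rejects duplicate names. This is a minimal sketch under stated assumptions: the class names, the case-insensitive name check, and the rule that an enumeration dimension must list its values are all hypothetical, and the talk's summary does not say how the three dimension types are implemented.

```python
from dataclasses import dataclass
from enum import Enum


class DimensionKind(Enum):
    # The three kinds named in the talk; their exact semantics are not
    # given in the summary, so they are treated here as plain tags.
    REGULAR = "regular"
    ENUMERATION = "enumeration"
    VIRTUAL = "virtual"


@dataclass(frozen=True)
class Dimension:
    name: str
    kind: DimensionKind
    description: str
    values: tuple = ()  # assumed: only meaningful for enumeration dimensions


class DimensionRegistry:
    """Hypothetical registry enforcing one definition per dimension name."""

    def __init__(self):
        self._by_name = {}

    def register(self, dim):
        key = dim.name.lower()  # assumed: uniqueness is case-insensitive
        if key in self._by_name:
            raise ValueError(f"dimension {dim.name!r} is already defined")
        if dim.kind is DimensionKind.ENUMERATION and not dim.values:
            raise ValueError(f"enumeration dimension {dim.name!r} needs values")
        self._by_name[key] = dim
        return dim

    def get(self, name):
        return self._by_name[name.lower()]


registry = DimensionRegistry()
registry.register(Dimension("platform", DimensionKind.ENUMERATION,
                            "Client platform", ("ios", "android", "web")))
registry.register(Dimension("channel", DimensionKind.REGULAR, "Content channel"))

try:
    registry.register(Dimension("Platform", DimensionKind.REGULAR, "Duplicate"))
except ValueError as err:
    print(err)
```

Rejecting the duplicate at definition time is what keeps two teams from publishing the same dimension name with different meanings.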
Data lineage enables impact analysis, fault isolation, and chain tracing, supporting asset classification and automated governance. The platform also provides a Data Graph service that visualizes assets, their relationships, and lineage, reducing the cost of data discovery and understanding.
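Impact analysis and chain tracing both reduce to graph traversal over lineage edges. The sketch below uses a breadth-first search over a hand-written adjacency map; the table names and the dictionary representation are illustrative assumptions, whereas the platform described in the talk stores lineage in Apache Atlas.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables derived directly from it.
lineage = {
    "raw.pingback": ["detail.play_event"],
    "detail.play_event": ["agg.play_daily", "mart.member_play"],
    "agg.play_daily": ["app.dashboard_play"],
    "mart.member_play": [],
    "app.dashboard_play": [],
}


def downstream(graph, start):
    """Impact analysis: every table reachable from `start`, in BFS order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(graph, target):
    """Chain tracing: every table that `target` depends on, nearest first."""
    reverse = {}
    for parent, children in graph.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(reverse, target)


print(downstream(lineage, "detail.play_event"))
print(upstream(lineage, "app.dashboard_play"))
```

A change to `detail.play_event` would flag all three downstream tables for review, while a fault in `app.dashboard_play` can be traced back through its upstream chain to the raw source.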
Future directions focus on four pillars: Intelligent (automated data quality prediction with dynamic thresholds), Automated (standardized, code‑generated modeling workflows), Service‑oriented (exposing data via APIs to unify access), and Model‑driven (abstracting physical tables behind reusable data models for end‑users).
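To make the idea of a dynamic threshold concrete, the sketch below derives bounds from recent history instead of using a fixed limit. It uses a simple mean plus or minus k standard deviations rule, which is a stand-in chosen for this example; the summary does not say which prediction method iQIYI uses, and the function names and sample counts are hypothetical.

```python
from statistics import mean, stdev


def dynamic_bounds(history, k=3.0):
    """Return (low, high) bounds as mean +/- k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, history, k=3.0):
    """Flag `value` if it falls outside the bounds derived from `history`."""
    low, high = dynamic_bounds(history, k)
    return not (low <= value <= high)


# Hypothetical daily row counts for one table over the past five days.
recent_counts = [1000, 1020, 980, 1010, 990]

print(dynamic_bounds(recent_counts))
print(is_anomalous(1005, recent_counts))  # within the usual range
print(is_anomalous(400, recent_counts))   # sharp drop
```

Because the bounds are recomputed from each table's own history, a table with naturally noisy volumes gets a wider tolerance than a stable one without any manual tuning.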
