Evolution of iQIYI Data Warehouse from 1.0 to 2.0: Architecture, Modeling Practices, and Future Directions
This article details iQIYI's transition from a fragmented Data Warehouse 1.0 to a unified, standardized Data Warehouse 2.0, covering layered architecture, dimension and metric design, modeling workflows, metadata management, data lineage, and upcoming intelligent and automated data platform initiatives.
Introduction
The article introduces iQIYI's product matrix and the need for a unified, standardized data warehouse to solve cross‑business data challenges.
Data Warehouse 1.0
The 1.0 architecture consists of five layers: the raw data, detail, aggregation, and application layers, plus a dimension layer that holds the consistent dimensions.
Raw data layer stores data from pingback collection, business databases, and third‑party sources.
Detail layer restores business processes at the finest granularity.
Aggregation layer holds lightly and heavily aggregated data using dimensional modeling.
Application layer provides customized results for reports and downstream systems.
Problems identified include siloed warehouses, inconsistent metrics, data duplication, and lack of tooling.
Data Warehouse 2.0
To address 1.0 shortcomings, iQIYI evolved to a 2.0 architecture with three major parts: Unified Warehouse, Business Marts, and Subject Warehouses, all sharing a common dimension layer.
Unified Warehouse: The core raw and detail layers ingest all source data and normalize formats; on top of them, a unified aggregation layer provides consistent metrics and device libraries.
Business Marts: Business‑specific data sets built on the unified warehouse and isolated from one another to avoid cross‑dependency while still supporting flexible reporting.
Subject Warehouses: Company‑wide thematic domains (e.g., traffic, content, user) built on shared dimensions for cross‑business analysis.
The application layer consumes data from both business marts and subject warehouses.
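The dependency rules implied by this layout can be sketched as a small validation check. This is only an illustration, not iQIYI's tooling: the `Zone` and `Table` names are hypothetical, and two rules are assumptions rather than statements from the article, namely that subject warehouses read from the unified warehouse and that a mart may read only from marts of the same business.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Zone(Enum):
    UNIFIED = "unified_warehouse"
    MART = "business_mart"
    SUBJECT = "subject_warehouse"
    APPLICATION = "application"
    DIMENSION = "dimension"


# Upstream zones each zone may read from (partly assumed, see lead-in).
ALLOWED_UPSTREAM = {
    Zone.UNIFIED: {Zone.UNIFIED, Zone.DIMENSION},
    Zone.MART: {Zone.UNIFIED, Zone.MART, Zone.DIMENSION},
    Zone.SUBJECT: {Zone.UNIFIED, Zone.DIMENSION},
    Zone.APPLICATION: {Zone.MART, Zone.SUBJECT, Zone.DIMENSION},
    Zone.DIMENSION: set(),
}


@dataclass(frozen=True)
class Table:
    name: str
    zone: Zone
    business: Optional[str] = None  # only meaningful for business marts


def dependency_allowed(downstream: Table, upstream: Table) -> bool:
    """Return True if `downstream` may read from `upstream`."""
    if upstream.zone not in ALLOWED_UPSTREAM[downstream.zone]:
        return False
    # Marts are isolated: a mart may only read marts of its own business.
    if downstream.zone is Zone.MART and upstream.zone is Zone.MART:
        return downstream.business == upstream.business
    return True


# Hypothetical tables used only for illustration.
unified_play = Table("dwd_play_detail", Zone.UNIFIED)
video_mart = Table("video_play_daily", Zone.MART, business="video")
reading_mart = Table("reading_daily", Zone.MART, business="reading")
report = Table("play_report", Zone.APPLICATION)

print(dependency_allowed(video_mart, unified_play))   # mart on unified warehouse
print(dependency_allowed(video_mart, reading_mart))   # cross-business mart read
print(dependency_allowed(report, unified_play))       # application skipping marts
```

A check like this could run when a new table is registered, so that cross-mart dependencies are rejected before they reach production.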
Data Platform Overview
The platform includes foundational services (Hive, MySQL, Kafka, ClickHouse), auxiliary services (ticket, permission, resource management), and core modules for warehouse management and data modeling.
Modeling is divided into three stages: business modeling, data modeling, and physical modeling.
Dimension & Metric System
Dimensions are classified as ordinary, enumeration, or virtual. Each dimension includes attributes such as English/Chinese names, data type, and description, with tags indicating business or common usage.
The metric system covers atomic metric metadata, composite metric metadata, modifiers, time periods, and statistical indicators, which can themselves be atomic or composite.
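As a rough sketch of how such metadata might be represented, the dimension attributes and metric building blocks listed above map naturally onto small record types. All class and field names here are hypothetical, and the rule that a statistical indicator is named from its modifiers, atomic metric, and time period is an assumption made for the example, not something the article specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class DimensionKind(Enum):
    ORDINARY = "ordinary"
    ENUMERATION = "enumeration"
    VIRTUAL = "virtual"


@dataclass(frozen=True)
class Dimension:
    name_en: str
    name_zh: str
    data_type: str
    description: str
    kind: DimensionKind
    tag: str  # "business" or "common"


@dataclass(frozen=True)
class AtomicMetric:
    name_en: str
    name_zh: str
    expression: str  # hypothetical field: the aggregate expression


@dataclass(frozen=True)
class StatisticalIndicator:
    atomic: AtomicMetric
    time_period: str                 # e.g. "1d" for the last day
    modifiers: Tuple[str, ...] = ()  # e.g. ("app",) to restrict the scope

    @property
    def name(self) -> str:
        # Assumed naming rule: modifiers, then atomic metric, then period.
        return "_".join([*self.modifiers, self.atomic.name_en, self.time_period])


platform = Dimension(
    name_en="platform",
    name_zh="平台",
    data_type="string",
    description="Client platform that produced the event",
    kind=DimensionKind.ENUMERATION,
    tag="common",
)
play_cnt = AtomicMetric("play_cnt", "播放次数", "COUNT(1)")
indicator = StatisticalIndicator(play_cnt, time_period="1d", modifiers=("app",))
print(indicator.name)  # app_play_cnt_1d
```

Deriving indicator names from registered parts, rather than typing them by hand, is one way to keep metric definitions consistent across teams.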
Modeling Process
Business Modeling: Identify business domains, processes, events, and entities, and construct a business bus matrix.
Data Modeling: Refine the bus matrix into star (or snowflake) schemas by confirming the business scope, processes, dimensions, measures, and degenerate dimension attributes.
Physical Modeling: Materialize models into physical tables or views (e.g., Hive) with proper naming, descriptions, partitioning, and lifecycle settings, and register their metadata in the metadata center.
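The three stages above can be illustrated with a toy bus matrix and a helper that renders a Hive table definition from a logical model. This is a minimal sketch under stated assumptions: the business processes, table name, columns, the `dt` partition column, and the `lifecycle_days` table property are all hypothetical and do not come from the article.

```python
from typing import Dict, List, Set, Tuple

# Business modeling: a bus matrix maps each business process to its dimensions.
BUS_MATRIX: Dict[str, Set[str]] = {
    "play": {"date", "user", "device", "content"},
    "search": {"date", "user", "device"},
    "order": {"date", "user", "product"},
}


def conformed_dimensions(matrix: Dict[str, Set[str]]) -> List[str]:
    """Dimensions shared by at least two business processes."""
    counts: Dict[str, int] = {}
    for dims in matrix.values():
        for dim in dims:
            counts[dim] = counts.get(dim, 0) + 1
    return sorted(dim for dim, n in counts.items() if n >= 2)


def hive_ddl(
    table: str,
    columns: List[Tuple[str, str, str]],  # (name, type, description)
    comment: str,
    lifecycle_days: int,
) -> str:
    """Physical modeling: render a Hive CREATE TABLE statement."""
    column_lines = ",\n".join(
        f"  `{name}` {dtype} COMMENT '{desc}'" for name, dtype, desc in columns
    )
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n{column_lines}\n)\n"
        f"COMMENT '{comment}'\n"
        "PARTITIONED BY (`dt` STRING COMMENT 'partition date')\n"
        "STORED AS ORC\n"
        # 'lifecycle_days' is a made-up property key for retention tooling.
        f"TBLPROPERTIES ('lifecycle_days' = '{lifecycle_days}')"
    )


shared = conformed_dimensions(BUS_MATRIX)
ddl = hive_ddl(
    "dwd.play_detail",
    [
        ("user_id", "STRING", "viewer identifier"),
        ("content_id", "STRING", "played content"),
        ("play_seconds", "BIGINT", "play duration in seconds"),
    ],
    comment="Play events at the finest granularity",
    lifecycle_days=90,
)
print(shared)
print(ddl)
```

Generating the DDL from the model, instead of writing it by hand, keeps naming, comments, and partitioning uniform, and the same model record could be what gets registered in the metadata center.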
Metadata Center & Data Lineage
The metadata center is built on Apache Atlas, with JanusGraph storing lineage and Elasticsearch powering search. It captures technical and business metadata, collects lineage automatically via Hive and Spark hooks, and integrates with internal data integration tools.
It provides searchable catalogs and graph visualizations, and supports “find data” and “use data” workflows.
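To make the lineage idea concrete, the sketch below stores table-level lineage as a directed graph and answers upstream and downstream questions with a breadth-first walk. It is a plain in-memory illustration, not the Apache Atlas or JanusGraph API, and the table names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Set


class LineageGraph:
    """Table-level lineage: an edge means `source` feeds `target`."""

    def __init__(self) -> None:
        self._downstream: Dict[str, Set[str]] = defaultdict(set)
        self._upstream: Dict[str, Set[str]] = defaultdict(set)

    def add_edge(self, source: str, target: str) -> None:
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start: str, edges: Dict[str, Set[str]]) -> Set[str]:
        # Breadth-first traversal collecting every reachable table.
        seen: Set[str] = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def downstream(self, table: str) -> Set[str]:
        """All tables affected if `table` changes."""
        return self._walk(table, self._downstream)

    def upstream(self, table: str) -> Set[str]:
        """All tables that `table` is derived from."""
        return self._walk(table, self._upstream)


graph = LineageGraph()
graph.add_edge("ods.pingback", "dwd.play_detail")
graph.add_edge("dwd.play_detail", "dws.play_daily")
graph.add_edge("dws.play_daily", "ads.play_report")

print(sorted(graph.downstream("dwd.play_detail")))
print(sorted(graph.upstream("ads.play_report")))
```

Downstream traversal supports impact analysis before a table is changed, while upstream traversal helps a user trace where a report's numbers come from.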
Future Directions
Intelligent: Apply machine learning to data quality monitoring so that alert thresholds adjust dynamically.
Automation: Standardize modeling workflows so that code can be generated automatically.
Service‑Oriented: Expose data via unified APIs to decouple applications from underlying storage.
Model‑Centric: Shift user interaction from physical tables to logical models, enabling federated queries and automatic routing.
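For the data quality item in the list above, the idea of a dynamic threshold can be shown with a very simple statistical stand-in. The article mentions machine learning without naming a method, so this sketch is not the planned approach: it merely derives bounds from recent history (mean plus or minus `k` standard deviations) instead of using a fixed limit. The function names, the sample values, and the default `k` are all assumptions.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Lower and upper bounds derived from recent values of a metric."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    center = mean(history)
    spread = stdev(history)  # sample standard deviation
    return center - k * spread, center + k * spread


def is_anomalous(history: Sequence[float], value: float, k: float = 3.0) -> bool:
    """True if `value` falls outside the bounds implied by `history`."""
    low, high = dynamic_bounds(history, k)
    return not (low <= value <= high)


# Hypothetical daily row counts for one table partition.
recent_row_counts = [1000, 1020, 980, 1010, 990]

print(is_anomalous(recent_row_counts, 1005))  # within the usual range
print(is_anomalous(recent_row_counts, 400))   # sharp drop
```

Because the bounds are recomputed from each table's own history, a table with naturally noisy volumes gets wider limits than a stable one, which is the behavior a fixed threshold cannot provide.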
Conclusion
The article summarizes the current state and outlines upcoming work to make the data platform smarter, more automated, service‑driven, and model‑focused.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
