iQIYI Data Middle Platform: Architecture, Data Governance Practices, and Future Plans
This article describes iQIYI's data middle platform architecture and its data governance practices. It covers the platform overview, data flow, unified standards, metadata management, production quality assurance, and planned AI-driven enhancements, and shows how centralized data services improve reliability, efficiency, and security.
Guest: Du Yifan, Senior Manager at iQIYI; Editor: Gan Yuxin, Shanghai University of Finance and Economics; Platform: DataFunTalk.
01 iQIYI Data Platform Introduction
iQIYI has grown from a single video site into an ecosystem spanning content, peripheral services, and a growing number of business lines, which makes data increasingly critical for operations and decision-making.
Key challenges that motivated the construction of a data middle platform include high integration cost, steep usage thresholds, low tracking data quality, unreliable data affecting downstream analytics, difficulty in cross‑business data fusion, inconsistent metric definitions, ambiguous data assets, and excessive resource consumption.
The data flow diagram shows how raw pingback events are collected, stored in Hadoop or Kafka, processed in real‑time or batch, and finally served to downstream applications via unified data services.
Before the platform, each team handled its own data pipelines, leading to inconsistent data quality and high downstream integration costs. Centralizing data processing and governance eliminates siloed development and standardizes data definitions.
02 Data Governance
Data governance is achieved through unified standards, metadata management, data lineage, change detection, and automatic metadata collection.
Unified Tracking Specification
The Pingback tracking specification defines event categories (launch, exit, play, exposure, reading, QoS) and a unified field dictionary, reducing custom implementations and improving data consistency.
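A unified field dictionary lends itself to automated checks. The following is a minimal sketch of such a check, not iQIYI's implementation: the event categories come from the list above, while the field names (`event`, `device_id`, `timestamp_ms`) and the `validate_event` helper are hypothetical and chosen only for illustration.

```python
# Event categories as listed in the tracking specification.
EVENT_CATEGORIES = {"launch", "exit", "play", "exposure", "reading", "qos"}

# Hypothetical field dictionary: field name -> expected Python type.
FIELD_DICTIONARY = {
    "event": str,
    "device_id": str,
    "timestamp_ms": int,
}


def validate_event(event: dict) -> list[str]:
    """Return a list of violations; an empty list means the event conforms."""
    errors = []
    # Every dictionary field must be present with the expected type.
    for name, expected_type in FIELD_DICTIONARY.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif not isinstance(event[name], expected_type):
            errors.append(f"wrong type for {name}")
    # Fields outside the dictionary are custom additions and get flagged.
    for name in sorted(set(event) - set(FIELD_DICTIONARY)):
        errors.append(f"field not in dictionary: {name}")
    # The event category must be one of the defined categories.
    category = event.get("event")
    if isinstance(category, str) and category not in EVENT_CATEGORIES:
        errors.append(f"unknown event category: {category}")
    return errors
```

Returning a list of violations rather than raising on the first one lets a tracking platform show every problem in a single test run.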
Pingback SDKs for Android, iOS, and PC encapsulate tracking logic, while a centralized tracking platform validates data before it reaches Hive, providing testing and gray‑release monitoring.
Data Warehouse Specification
A new warehouse specification addresses duplicate construction, inconsistent dimensions/metrics, and fragmented processing logic by defining a layered architecture: ODS (raw), DWD (detail), MID (aggregated), and business data marts.
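One way to enforce a layered architecture is to check table dependencies against the layer order. The sketch below assumes two things that the talk does not state: that table names carry a layer prefix (`ods_`, `dwd_`, `mid_`, `mart_`), and that a table may read only from its own layer or a lower one. Both the naming convention and the rule are illustrative.

```python
# Layers from lowest (raw) to highest (business data marts).
LAYER_ORDER = ["ods", "dwd", "mid", "mart"]


def layer_of(table: str) -> str:
    """Extract the layer from a hypothetical '<layer>_<name>' table name."""
    prefix = table.split("_", 1)[0].lower()
    if prefix not in LAYER_ORDER:
        raise ValueError(f"table {table!r} has no recognized layer prefix")
    return prefix


def dependency_allowed(downstream: str, upstream: str) -> bool:
    """Assumed rule: a table may read from its own layer or any lower layer."""
    return LAYER_ORDER.index(layer_of(upstream)) <= LAYER_ORDER.index(layer_of(downstream))
```

For example, `dependency_allowed("dwd_play_detail", "ods_pingback")` passes, while a raw ODS table reading from an aggregated MID table is rejected.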
The warehouse management platform provides unified dimension/metric management, metadata services, and data modeling tools, enabling automatic lineage tracking and facilitating natural‑language query generation.
Metadata Management
Metadata describes data assets, their lineage, and change history. A centralized metadata center stores asset information, supports business‑oriented graphs, and enables fast discovery of required datasets.
Data lineage captures upstream/downstream relationships from Hive, Spark, Kafka, and workflow systems, storing the information in Elasticsearch for real‑time queries.
Change detection uses lineage to automatically notify downstream consumers when schemas or processing logic change, ensuring timely remediation.
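Finding whom to notify after a change amounts to a traversal of the lineage graph. The sketch below uses a breadth-first search over an in-memory adjacency map; the table names and the `downstream_of` helper are hypothetical, and a production system would query the lineage store rather than a Python dictionary.

```python
from collections import deque

# Hypothetical lineage edges: upstream table -> tables that read from it.
LINEAGE = {
    "ods_pingback": ["dwd_play_detail"],
    "dwd_play_detail": ["mid_play_daily", "mart_ads_report"],
    "mid_play_daily": ["mart_content_dashboard"],
}


def downstream_of(changed: str, lineage: dict[str, list[str]]) -> list[str]:
    """Return every table that directly or transitively depends on `changed`."""
    seen = {changed}          # guards against cycles and duplicate visits
    affected = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

The result is ordered by distance from the changed table, so the nearest consumers, which are usually affected first, appear at the front of the notification list.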
Automatic metadata collection integrates hooks from Hive, Spark, and Kafka, converting technical and business metadata into a unified graph stored in Elasticsearch or JanusGraph.
03 Production Governance
Production governance ensures data quality, timeliness, and correctness through multi‑stage validation, monitoring, and automated recovery.
Data Quality Assurance
During testing, data is validated against predefined rules; successful validation allows release. In production, monitoring detects anomalies, triggers alerts, and blocks faulty data from downstream consumption.
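The alert-or-block behavior can be sketched as a small rule evaluator. Everything named here (`Rule`, `evaluate`, the two example rules, and the statistics keys) is hypothetical; the only behavior taken from the talk is that some failures block downstream consumption while others only raise alerts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]  # returns True when the batch statistics pass
    blocking: bool                 # True: a failure stops downstream consumption


def evaluate(stats: dict, rules: list[Rule]) -> tuple[bool, list[str]]:
    """Return (release_allowed, names of failed rules to alert on)."""
    alerts = []
    release_allowed = True
    for rule in rules:
        if not rule.check(stats):
            alerts.append(rule.name)
            if rule.blocking:
                release_allowed = False
    return release_allowed, alerts


# Illustrative rules: an empty batch blocks, a noisy minor field only alerts.
RULES = [
    Rule("row_count_positive", lambda s: s["row_count"] > 0, blocking=True),
    Rule("city_null_ratio_below_5pct", lambda s: s["city_null_ratio"] < 0.05, blocking=False),
]
```

Keeping the blocking flag on the rule, rather than in the evaluator, lets each dataset owner decide which checks are severe enough to hold back a release.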
Tracking Data Quality Assurance
Tracking definitions are reviewed for completeness and relevance; SDKs collect data, automated tests verify compliance, and gray‑release monitoring ensures real‑time quality checks before full rollout.
Data Production Chain Assurance
The unified development platform orchestrates workflows, runs pre-deployment tests for reliability, and provides monitoring and alerting for data quality rules. High availability is achieved by running critical jobs on dual clusters.
Data Quality Monitoring Platform
The platform applies configured rules to monitor growth rates, field distributions, and absolute values, generating alerts for deviations. Prophet‑based time‑series forecasting is used for core metrics to reduce false positives.
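Two of these rule types can be sketched in a few lines. The first function flags a day-over-day growth rate beyond a fixed threshold; the second flags a value outside a forecast interval. The threshold value and function names are assumptions, and the Prophet model itself is not reproduced here: the band check accepts lower and upper bounds from any forecaster.

```python
def growth_rate(previous: float, current: float) -> float:
    """Relative change from the previous period to the current one."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous


def growth_alert(previous: float, current: float, max_abs_rate: float = 0.3) -> bool:
    """Fixed-threshold rule: alert when the change exceeds max_abs_rate in either direction."""
    return abs(growth_rate(previous, current)) > max_abs_rate


def outside_forecast_band(value: float, lower: float, upper: float) -> bool:
    """Forecast-based rule: alert only when the value leaves the predicted interval."""
    return value < lower or value > upper
```

A fixed threshold fires on every expected weekend or holiday swing, whereas a band derived from a seasonal forecast widens and shifts with the expected pattern, which is how forecasting reduces false positives.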
04 Future Planning
Upcoming work focuses on deeper AI integration for intelligent resource allocation, automated anomaly analysis, and enhanced metadata services to simplify data consumption.
Value‑driven management will assess storage and compute costs against usage, consolidating low‑usage tables and promoting high‑value assets.
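A cost-versus-usage assessment could start as simply as the sketch below. The `TableUsage` record, the read-count threshold, and the ranking by cost are all assumptions for illustration; the talk does not specify how value is scored.

```python
from dataclasses import dataclass


@dataclass
class TableUsage:
    name: str
    monthly_cost: float   # storage plus compute, in an arbitrary cost unit
    monthly_reads: int    # number of downstream reads in the same period


def consolidation_candidates(tables: list[TableUsage], min_reads: int = 10) -> list[str]:
    """Return low-usage table names, most expensive first."""
    low_usage = [t for t in tables if t.monthly_reads < min_reads]
    ranked = sorted(low_usage, key=lambda t: t.monthly_cost, reverse=True)
    return [t.name for t in ranked]
```

Ranking by cost puts the tables whose consolidation would save the most at the top of the review list.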
05 Q&A Highlights
Q: How are unstructured video and text data managed? A: They are stored in a metadata knowledge graph, providing structured dimensions for the warehouse.
Q: How is real‑time metadata monitoring performed? A: Hooks on Hive and Spark capture lineage events, which are streamed via message queues to the metadata graph.
Q: How is blocking on data anomalies balanced against the impact on downstream consumers? A: Anomalies in core metrics block downstream consumption; anomalies in less critical fields raise alerts without blocking.
Q: Are all time series suitable for Prophet forecasting? A: No. Prophet is applied only to core metrics with clear periodicity; other business-specific series may not be a good fit.
