
Understanding Data Lakes and Their Application at iQIYI: Concepts, Scenarios, and Iceberg Implementation

This article explains the two common definitions of data lakes (public‑cloud and non‑public‑cloud) and outlines their key characteristics. It then presents three typical business scenarios: real‑time event analysis, change‑data analysis, and stream‑batch integration. Finally, it summarizes the product features these scenarios require, evaluates open‑source lake formats, and details iQIYI's adoption of Apache Iceberg across multiple services to achieve low‑latency, large‑scale, cost‑effective analytics.

DataFunTalk

The term "data lake" was first introduced in 2010 and has since evolved into two main definitions. On public clouds, providers such as AWS, Google Cloud, Alibaba Cloud, and Tencent Cloud treat their object storage services (e.g., S3, OSS) as the lake itself; outside public clouds, the term refers to table‑format layers such as Hudi, Iceberg, and Delta Lake built on Hadoop or object storage.

Key characteristics of a data lake include unified storage that supports structured, semi‑structured, and unstructured data; a common data abstraction layer; support for batch, stream, and machine‑learning workloads; and centralized metadata, lifecycle, and governance to avoid data silos.

Three typical business scenarios illustrate why a data lake is needed:

Real‑time event stream analysis requires near‑real‑time visibility (1‑5 minutes) with high scalability, low cost, and the ability to share data across engines such as Spark, Trino, and Flink.

Change‑data analysis benefits from incremental ingestion of row‑level updates from sources like MySQL or MongoDB, avoiding costly full‑partition rewrites.

Stream‑batch integration replaces Lambda architectures by providing a single codebase that supports both near‑real‑time and batch queries, reducing development effort and data inconsistency.
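The change‑data scenario above can be sketched in plain Python. This is a hypothetical illustration, not iQIYI's implementation: `apply_changelog`, the `orders` table, and the changelog entries are invented names showing how row‑level upserts and deletes from a CDC source (e.g., MySQL binlog) are merged into a table snapshot without rewriting whole partitions.

```python
# Hypothetical sketch: applying a row-level changelog (CDC-style upserts and
# deletes) to a table snapshot, instead of rewriting the whole partition.

def apply_changelog(snapshot, changelog):
    """Merge row-level changes into a table snapshot keyed by primary key.

    snapshot:  {pk: row_dict} -- current table state
    changelog: list of ("upsert" | "delete", row_dict) entries from a CDC source
    """
    state = dict(snapshot)  # copy so the original snapshot stays immutable
    for op, row in changelog:
        if op == "upsert":
            state[row["id"]] = row        # insert new row or overwrite old one
        elif op == "delete":
            state.pop(row["id"], None)    # drop the row if present
    return state

orders = {
    1: {"id": 1, "status": "created"},
    2: {"id": 2, "status": "paid"},
}
changes = [
    ("upsert", {"id": 1, "status": "paid"}),     # source UPDATE
    ("delete", {"id": 2}),                       # source DELETE
    ("upsert", {"id": 3, "status": "created"}),  # source INSERT
]
new_orders = apply_changelog(orders, changes)
```

A table format like Iceberg provides the same effect declaratively (e.g., via `MERGE INTO` in Spark SQL), committing only the changed files rather than full partitions.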

Based on a comparison of Hudi, Iceberg, and Delta Lake, iQIYI selected Apache Iceberg as the core table format for its snapshot isolation, fast planning, file‑level metadata, and support for row‑level updates (Merge‑On‑Read).
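Snapshot isolation, the first property listed above, can be sketched with a toy table whose commits atomically swap a single "current snapshot" pointer, which is how Iceberg commits work at a high level (an atomic metadata‑file swap in the catalog). The `Table` class and its method names here are invented for illustration.

```python
# Hypothetical sketch of snapshot-isolation commits: every write produces a new
# immutable snapshot, and committing atomically advances one current-snapshot
# pointer, so readers never observe a half-finished write.

import threading

class Table:
    def __init__(self):
        self._lock = threading.Lock()
        self._snapshots = [()]   # snapshot 0: empty tuple of data files
        self._current = 0

    def read(self):
        # Readers always see one consistent, fully committed snapshot.
        return self._snapshots[self._current]

    def commit(self, parent_id, new_files):
        with self._lock:
            if parent_id != self._current:
                # Another writer committed first; caller must re-plan and retry.
                raise RuntimeError("conflict: table changed since planning")
            snapshot = self._snapshots[parent_id] + tuple(new_files)
            self._snapshots.append(snapshot)
            self._current = len(self._snapshots) - 1
            return self._current

t = Table()
s1 = t.commit(0, ["data-001.parquet"])
s2 = t.commit(s1, ["data-002.parquet"])
```

Keeping old snapshots around is also what enables time travel and incremental reads between two snapshot IDs.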

Iceberg differs from Hive in that it stores metadata at the file level, enabling fast query planning, file‑level pruning, and atomic snapshot commits. Iceberg is not a storage engine (it works on HDFS or S3), not a file format (it uses Parquet), and not a query engine (it can be queried via Spark, Flink, Trino, Hive, etc.).
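The file‑level pruning mentioned above can be illustrated with a small sketch. Iceberg's manifests record per‑column min/max statistics for every data file; the planner compares those ranges against the query predicate and skips files that cannot match. The file list and the `prune` helper below are hypothetical.

```python
# Hypothetical sketch: file-level pruning using per-file min/max column stats,
# as recorded in Iceberg manifest entries.

files = [
    {"path": "f1.parquet", "min_ts": 100, "max_ts": 199},
    {"path": "f2.parquet", "min_ts": 200, "max_ts": 299},
    {"path": "f3.parquet", "min_ts": 300, "max_ts": 399},
]

def prune(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range overlaps [lo, hi]."""
    return [f["path"] for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi]

# Query: SELECT ... WHERE ts BETWEEN 250 AND 320
plan = prune(files, 250, 320)   # f1 is skipped without being opened
```

Because this happens against metadata files rather than by listing directories (as Hive does), planning stays fast even on tables with millions of data files.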

Row‑level updates are achieved by introducing DeleteFile objects and merging them with DataFiles during reads, allowing accurate results while supporting incremental change capture.
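The Merge‑On‑Read path can be sketched with positional deletes, one of the delete‑file forms in the Iceberg v2 spec: data files stay immutable, a delete file records (data file, row position) pairs, and the reader filters those positions out while scanning. The data below is invented for illustration.

```python
# Hypothetical sketch of Merge-On-Read: deletes are written as separate
# positional delete files and merged with the data files at read time.

data_files = {
    "data-001.parquet": [("u1", "created"), ("u2", "paid"), ("u3", "paid")],
}
# Positional delete file: (data file path, row position) pairs marking dead rows.
delete_files = [("data-001.parquet", 1)]   # row 1 ("u2") was deleted

def read_with_deletes(data_files, delete_files):
    deleted = set(delete_files)
    rows = []
    for path, file_rows in data_files.items():
        for pos, row in enumerate(file_rows):
            if (path, pos) not in deleted:   # the merge step: skip dead rows
                rows.append(row)
    return rows

result = read_with_deletes(data_files, delete_files)
```

Writes stay cheap because only small delete files are produced; the read‑time merge cost is later reclaimed by background compaction that rewrites data files with the deletes applied.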

iQIYI applied Iceberg in several production systems:

Venus log collection platform: Replaced Elasticsearch with Iceberg on HDFS, achieving lower cost, higher write bandwidth, and an 80% reduction in operational incidents.

Audit data pipeline: Migrated from MongoDB + Elasticsearch + MySQL to Iceberg, enabling low‑latency (≈5 minute) updates, efficient column‑level queries, and PB‑scale storage.

Pingback stream‑batch integration: Built a near‑real‑time pipeline using Flink to ingest Kafka data into Iceberg ODS/DWD tables, reducing latency to minutes, eliminating the need for separate Lambda pipelines, and cutting costs.

Member order analytics: Replaced MySQL → Hive and CDC → Kudu approaches with Iceberg, delivering sub‑minute latency, fast SparkSQL queries, and lower infrastructure overhead.

Overall, the adoption of Iceberg has enabled iQIYI to achieve large‑scale, low‑cost, near‑real‑time analytics across multiple business domains, while simplifying architecture and improving data quality.

Tags: Big Data · real-time analytics · Streaming · data lake · Iceberg · Table Format
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
