
Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of the Apache Iceberg data lake, covering its OLAP architecture, the motivation for moving to a data lake, the advantages of the Iceberg table format over Hive tables, platform construction, extensive performance optimizations, and real-world business use cases such as ad-traffic batch-stream unification, log analysis, auditing, and CDC pipelines.

DataFunTalk

The presentation introduces iQIYI's OLAP system and explains why a data lake is needed, highlighting the limitations of traditional Hive tables and the benefits of the Iceberg open‑source table format, which provides file‑level metadata, snapshot isolation, and support for multiple compute engines.
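To make the contrast with Hive concrete, the sketch below models Iceberg-style metadata in miniature: each commit produces an immutable snapshot that lists data files together with per-file column statistics, so a scan can prune individual files and old snapshots remain readable. The class and field names here are illustrative assumptions, not Iceberg's actual internals.

```python
from dataclasses import dataclass

# Hypothetical, simplified model of Iceberg-style snapshot metadata:
# each snapshot points to an immutable set of data files carrying
# per-file min/max statistics, enabling file-level pruning instead of
# the partition-level directory listing that Hive tables rely on.

@dataclass(frozen=True)
class DataFile:
    path: str
    min_ts: int  # min value of a timestamp column within this file
    max_ts: int  # max value of that column within this file

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: tuple  # immutable -> concurrent readers see a consistent view

class Table:
    def __init__(self):
        self.snapshots = []

    def commit(self, files):
        # a commit appends a new snapshot; earlier snapshots stay readable
        sid = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(sid, tuple(files)))
        return sid

    def plan_scan(self, snapshot_id, ts_lo, ts_hi):
        # file-level pruning via per-file min/max statistics
        snap = self.snapshots[snapshot_id - 1]
        return [f.path for f in snap.files
                if f.max_ts >= ts_lo and f.min_ts <= ts_hi]

t = Table()
s1 = t.commit([DataFile("a.parquet", 0, 9), DataFile("b.parquet", 10, 19)])
s2 = t.commit([DataFile("a.parquet", 0, 9), DataFile("b.parquet", 10, 19),
               DataFile("c.parquet", 20, 29)])
# snapshot isolation: a scan pinned to s1 never sees c.parquet
print(t.plan_scan(s1, 15, 25))  # ['b.parquet']
print(t.plan_scan(s2, 15, 25))  # ['b.parquet', 'c.parquet']
```

The same mechanism is what lets multiple compute engines share one table safely: every reader plans against one immutable snapshot while writers commit new ones.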

iQIYI built a unified lake platform in which data from Kafka, batch HDFS loads, and near-real-time Iceberg streams is ingested and queried via SparkSQL and Trino, with Pilot handling query routing. The architecture comprises a storage layer (HDFS/object store), the query engines, and metadata management.
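The talk does not publish Pilot's routing rules, but a routing layer of this kind can be sketched as a simple dispatcher: heavy writes and batch DML go to SparkSQL, interactive reads go to Trino. The rules below are purely illustrative assumptions, not iQIYI's actual logic.

```python
# Hypothetical sketch of an engine-routing decision, in the spirit of
# the Pilot layer described above; the rules are assumptions made for
# illustration only.

def route_query(sql: str, interactive: bool) -> str:
    s = sql.strip().lower()
    # DML and large batch writes run on SparkSQL; interactive
    # analytical reads run on Trino for lower latency.
    if s.startswith(("insert", "merge", "delete", "update")):
        return "sparksql"
    return "trino" if interactive else "sparksql"

assert route_query("SELECT * FROM ads.events LIMIT 10", True) == "trino"
assert route_query("INSERT INTO lake.orders SELECT * FROM stg", True) == "sparksql"
assert route_query("SELECT count(*) FROM logs", False) == "sparksql"
```

A real router would also weigh cluster load, query cost estimates, and user quotas; the point is only that routing sits in front of both engines over the same Iceberg tables.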

Key performance optimizations include lifecycle-based small-file cleanup, intelligent merging triggered when the file count deviates from what the data volume warrants, a Spark Procedure for efficient snapshot expiration, and Spark's persistent (long-running) mode to reduce job-startup overhead. Additional techniques are also described, including Bloom filters for point-query acceleration, Alluxio caching to mitigate HDFS latency, and improvements to Trino metadata reads.
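A file-count-based compaction trigger of the kind described can be sketched as follows: compare the partition's actual file count against the count its total size would need at a target file size, and rewrite only when the gap is large. The 128 MB target and the 3x ratio threshold are assumptions for demonstration, not iQIYI's tuned values.

```python
# Illustrative sketch of an "intelligent merge" trigger: compact a
# partition only when it holds far more files than its data volume
# justifies. Threshold values are assumptions for demonstration.

TARGET_FILE_SIZE = 128 * 1024 * 1024  # assumed 128 MB target per data file

def should_compact(file_sizes, ratio_threshold=3.0):
    """Return True when the actual file count exceeds the expected
    count (total size / target file size) by the given ratio."""
    total = sum(file_sizes)
    expected = max(1, -(-total // TARGET_FILE_SIZE))  # ceiling division
    return len(file_sizes) > ratio_threshold * expected

# 40 small 4 MB files: 160 MB total would fit in ~2 target-size files,
# so this partition is badly fragmented and worth rewriting.
assert should_compact([4 * 1024 * 1024] * 40) is True
# Two well-sized files: nothing to do.
assert should_compact([128 * 1024 * 1024, 120 * 1024 * 1024]) is False
```

Gating compaction this way avoids rewriting healthy partitions, which matters because each rewrite itself produces a new snapshot that must later be expired.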

Business impact is demonstrated through four case studies: (1) ad‑flow batch‑stream integration reducing end‑to‑end latency from 35 minutes to 7‑10 minutes; (2) Venus log lake replacing costly Elasticsearch with Iceberg, cutting storage cost and improving stability; (3) audit scenario enabling row‑level updates within Iceberg; and (4) CDC order ingestion using Flink CDC, achieving minute‑level latency, lower cost, and reduced operational burden.
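The CDC case can be sketched abstractly: a Flink CDC source emits insert/update/delete events keyed by primary key, and the sink folds them into the table's current view, which is what Iceberg's row-level update support makes possible at minute-level latency. The event shape and field names below are assumptions for illustration, not the actual pipeline schema.

```python
# Minimal sketch of applying CDC change events to keyed table state,
# as a Flink CDC -> Iceberg pipeline conceptually does; event format
# and field names are illustrative assumptions.

def apply_cdc(state: dict, events: list) -> dict:
    """Fold insert/update/delete events into a primary-key-indexed view."""
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            state[key] = ev["row"]    # upsert by primary key
        elif op == "delete":
            state.pop(key, None)      # row-level delete
    return state

orders = {}
apply_cdc(orders, [
    {"op": "insert", "key": 1, "row": {"status": "created"}},
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "insert", "key": 2, "row": {"status": "created"}},
    {"op": "delete", "key": 2},
])
assert orders == {1: {"status": "paid"}}
```

In the real pipeline this fold happens via Iceberg commits rather than an in-memory dict, which is what replaces the costlier, higher-latency alternatives the talk compares against.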

Future plans involve extending the lake to feature production, leveraging Iceberg’s Puffin statistics for query acceleration, and exploring branch/tag capabilities for internal use cases.

Tags: Performance Optimization, Big Data, Flink, SparkSQL, data lake, Iceberg, Trino
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
