Big Data 21 min read

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of Iceberg for its data lake, covering the OLAP architecture, reasons for a data lake, Iceberg's table format advantages over Hive, platform construction, streaming ingestion, query and performance optimizations, real‑world business deployments, and future plans.

Big Data Technology & Architecture

Jun 13, 2023

Iceberg Data Lake Implementation and Optimization at iQIYI

1. iQIYI OLAP Overview

iQIYI's OLAP supports three storage types: offline HDFS for batch analysis, real‑time Kafka for streaming, and near‑real‑time Iceberg with minute‑level latency.

Query engines include SparkSQL for ETL, Trino for ad‑hoc queries, and ClickHouse for accelerated queries, all accessed via the Pilot unified query service.

2. Why a Data Lake

Data lakes accelerate data flow by providing a unified storage layer that balances cost, capacity, and latency, addressing the limitations of pure batch (high latency) and pure streaming (high cost) solutions.

1. Data Lake Accelerates Data Circulation

Pingback data, for example, historically used a Lambda architecture with separate offline (HDFS) and real‑time (Kafka) paths, each with trade‑offs in cost and latency.

2. Iceberg – A New Table Format

Iceberg is an open‑source table format that abstracts storage (HDFS or object storage) and provides file‑level metadata, supporting multiple compute engines (Hive, Flink, Spark) and query engines.

Compared to Hive’s partition‑level metadata, Iceberg offers file‑level metadata, enabling incremental processing and near‑real‑time updates.

3. Data Lake Platform Construction

The platform integrates data sources (Pingback, MySQL binlog, logs) into real‑time, offline, and Iceberg channels, with RCP and Babel handling streaming and batch ingestion, and Trino/SparkSQL for queries.

1. Platform Overview

2. Streaming Ingestion

Ingestion from Kafka involves three steps: selecting the start offset, applying transformations (filtering, renaming), and defining the target Iceberg table.

3. Out‑Lake Query

Iceberg V1 supports append‑only data; V2 adds row‑level updates. Trino queries V1, SparkSQL queries V2, with the Pilot engine automatically selecting the appropriate engine.

4. Performance Optimization

Key optimizations include handling small files, smart merging, Bloom filter integration, Alluxio caching, and Trino metadata tuning.

1. Small Files

Lifecycle policies automatically expire old data, and Spark procedures drop partitions, expire snapshots, delete orphan files, and rewrite metadata to clean up.

Additional improvements include using a resident Spark mode, direct directory deletion for daily partitions, and a recycle‑bin mechanism for accidental deletions.

2. Smart Merge

Smart merge automatically selects partitions to merge based on file count variance and table weight, removing the need for manual configuration.

3. Merge Performance

Issues such as lingering Delete files after rewrite have been fixed, and bucket partitioning reduces merge failures on large tables.

4. Write Parameter Control

Configuring Flink’s Distribution‑mode (Hash vs None) and combining with bucket partitioning controls the number of output files, aiming for ~100 MB per file to avoid small‑file proliferation.

5. Query Optimization

Point queries suffer from full scans due to ineffective MinMax filtering; integrating Bloom filters (supported in Parquet 1.12) dramatically reduces scan time.

6. Bloom Filter Integration

Bloom filters are built per data file for columns like order_id, enabling fast exclusion of non‑matching files; tests show query time drops from ~1000 s to ~10 s, with ~3 % extra storage overhead.

7. Alluxio Caching

Alluxio leverages idle SSDs to cache hot data, smoothing HDFS performance spikes; P90 latency for Venus logs dropped from 18 s to 1 s.

8. Trino Metadata Issue

Excessive byte‑by‑byte reads caused metadata fetches to be slow; fixing the read implementation reduced metadata load time from 3 s to 0.5 s.

5. Business Deployments

1. Ad Flow Integration

Iceberg unifies real‑time and batch ad data, eliminating progress management and enabling end‑to‑end SQL development, reducing ad bidding latency from 35 min to 7‑10 min.

2. Venus Log Lake

Switching from Elasticsearch to Iceberg cut storage costs dramatically and improved stability, reducing incident rates by over 80 %.

3. Audit Scenario

Iceberg’s row‑level updates replace a MongoDB+Elasticsearch pipeline, enabling real‑time alerts and simplifying reporting.

4. CDC Order Lake

Flink CDC streams MySQL changes into Iceberg with minute‑level latency, offering lower cost and less operational overhead than Kudu.

6. Future Plans

iQIYI aims to expand Iceberg usage to feature production, leverage Iceberg’s Puffin statistics for query acceleration, and explore Branch/Tag capabilities for internal scenarios.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization Big Data Flink Query Optimization OLAP Data Lake Spark Iceberg

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.