Iceberg Data Lake Implementation and Optimization at iQIYI
This article details iQIYI's adoption of Iceberg for its data lake, covering the OLAP architecture, reasons for a data lake, Iceberg's table format advantages over Hive, platform construction, streaming ingestion, query and performance optimizations, real‑world business deployments, and future plans.
1. iQIYI OLAP Overview
iQIYI's OLAP supports three storage types: offline HDFS for batch analysis, real‑time Kafka for streaming, and near‑real‑time Iceberg with minute‑level latency.
Query engines include SparkSQL for ETL, Trino for ad‑hoc queries, and ClickHouse for accelerated queries, all accessed via the Pilot unified query service.
2. Why a Data Lake
Data lakes accelerate data flow by providing a unified storage layer that balances cost, capacity, and latency, addressing the limitations of pure batch (high latency) and pure streaming (high cost) solutions.
1. Data Lake Accelerates Data Circulation
Pingback data, for example, historically used a Lambda architecture with separate offline (HDFS) and real‑time (Kafka) paths, each with trade‑offs in cost and latency.
2. Iceberg – A New Table Format
Iceberg is an open‑source table format that abstracts storage (HDFS or object storage) and provides file‑level metadata, supporting multiple compute engines (Hive, Flink, Spark) and query engines.
Compared to Hive’s partition‑level metadata, Iceberg offers file‑level metadata, enabling incremental processing and near‑real‑time updates.
3. Data Lake Platform Construction
The platform integrates data sources (Pingback, MySQL binlog, logs) into real‑time, offline, and Iceberg channels, with RCP and Babel handling streaming and batch ingestion, and Trino/SparkSQL for queries.
1. Platform Overview
2. Streaming Ingestion
Ingestion from Kafka involves three steps: selecting the start offset, applying transformations (filtering, renaming), and defining the target Iceberg table.
3. Out‑Lake Query
Iceberg V1 supports append‑only data; V2 adds row‑level updates. Trino queries V1, SparkSQL queries V2, with the Pilot engine automatically selecting the appropriate engine.
4. Performance Optimization
Key optimizations include handling small files, smart merging, Bloom filter integration, Alluxio caching, and Trino metadata tuning.
1. Small Files
Lifecycle policies automatically expire old data, and Spark procedures drop partitions, expire snapshots, delete orphan files, and rewrite metadata to clean up.
Additional improvements include using a resident Spark mode, direct directory deletion for daily partitions, and a recycle‑bin mechanism for accidental deletions.
2. Smart Merge
Smart merge automatically selects partitions to merge based on file count variance and table weight, removing the need for manual configuration.
3. Merge Performance
Issues such as lingering Delete files after rewrite have been fixed, and bucket partitioning reduces merge failures on large tables.
4. Write Parameter Control
Configuring Flink’s Distribution‑mode (Hash vs None) and combining with bucket partitioning controls the number of output files, aiming for ~100 MB per file to avoid small‑file proliferation.
5. Query Optimization
Point queries suffer from full scans due to ineffective MinMax filtering; integrating Bloom filters (supported in Parquet 1.12) dramatically reduces scan time.
6. Bloom Filter Integration
Bloom filters are built per data file for columns like order_id, enabling fast exclusion of non‑matching files; tests show query time drops from ~1000 s to ~10 s, with ~3 % extra storage overhead.
7. Alluxio Caching
Alluxio leverages idle SSDs to cache hot data, smoothing HDFS performance spikes; P90 latency for Venus logs dropped from 18 s to 1 s.
8. Trino Metadata Issue
Excessive byte‑by‑byte reads caused metadata fetches to be slow; fixing the read implementation reduced metadata load time from 3 s to 0.5 s.
5. Business Deployments
1. Ad Flow Integration
Iceberg unifies real‑time and batch ad data, eliminating progress management and enabling end‑to‑end SQL development, reducing ad bidding latency from 35 min to 7‑10 min.
2. Venus Log Lake
Switching from Elasticsearch to Iceberg cut storage costs dramatically and improved stability, reducing incident rates by over 80 %.
3. Audit Scenario
Iceberg’s row‑level updates replace a MongoDB+Elasticsearch pipeline, enabling real‑time alerts and simplifying reporting.
4. CDC Order Lake
Flink CDC streams MySQL changes into Iceberg with minute‑level latency, offering lower cost and less operational overhead than Kudu.
6. Future Plans
iQIYI aims to expand Iceberg usage to feature production, leverage Iceberg’s Puffin statistics for query acceleration, and explore Branch/Tag capabilities for internal scenarios.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
