Why Data Lakes Are Transforming Big Data: Concepts, Benefits, and Iceberg in Practice
This article explains the evolution of data lakes, compares public‑cloud and private‑cloud implementations, outlines key technical features, presents three real‑world scenarios, details the selection and inner workings of Apache Iceberg versus Hive, and showcases multiple production use cases at iQIYI.
What Is a Data Lake
A data lake provides virtually unlimited, centralized storage for structured, semi‑structured, and unstructured data. It evolved from on‑premises databases, which stored only processed, structured data and discarded the raw information. Modern data lakes ingest diverse data types into a unified repository built on Hadoop or cloud object storage (e.g., AWS S3, Google Cloud Storage, Alibaba OSS).
Why Data Lakes Are Needed
Real‑time event stream analysis: Workloads need near‑real‑time visibility (1‑5 minutes), large scale, low cost, and fast interactive queries, which batch‑only Hive processing does not deliver.
Change data capture (CDC): Row‑level and column‑level updates can be ingested with low latency, avoiding costly full‑export jobs.
Stream‑batch integration: A unified pipeline eliminates duplicated code, reduces inconsistency between the two paths, and lowers infrastructure cost.
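To make the CDC point concrete, the following is a minimal sketch of applying a change stream to keyed state instead of re-exporting the whole table. The event shape (`op`, `id`, `row`) and the function name `apply_changes` are hypothetical and chosen only for illustration; they are not the format of any particular CDC tool.

```python
def apply_changes(table, events):
    """Apply CDC events, in order, to a dict keyed by primary key.

    The event layout ({"op", "id", "row"}) is an assumption made for this sketch.
    """
    for ev in events:
        op, key = ev["op"], ev["id"]
        if op == "insert":
            table[key] = dict(ev["row"])
        elif op == "update":
            # Column-level update: only the changed columns are overwritten.
            table[key] = {**table.get(key, {}), **ev["row"]}
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return table


orders = {1: {"status": "new", "amount": 10}}
events = [
    {"op": "insert", "id": 2, "row": {"status": "new", "amount": 25}},
    {"op": "update", "id": 1, "row": {"status": "paid"}},
    {"op": "delete", "id": 2},
]
result = apply_changes(orders, events)
print(result)
```

Only three small events are processed here, whereas a full export would re-read every row of the source table regardless of how few changed.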
Data Lake Selection and Core Principles
iQIYI evaluated three open‑source table formats—Hudi, Iceberg, and Delta Lake—and selected Apache Iceberg as the core engine. Iceberg is a table format (not a storage or query engine) that stores data files in Parquet on HDFS or object storage and can be queried via Spark, Flink, Trino, or Hive.
Key Differences Between Hive and Iceberg
Hive metadata is partition‑level only; Iceberg tracks file‑level metadata.
Iceberg supports snapshot isolation, parallel writes that rely on optimistic concurrency control instead of locks, and fast planning by reading metadata files directly.
Iceberg enables file‑level filtering (min/max statistics, Bloom filters) to prune irrelevant files.
Iceberg supports row‑level updates via DeleteFile and Merge‑On‑Read, enabling incremental pipelines.
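File‑level filtering with min/max statistics can be illustrated with a small sketch. The `FileStats` class, the `ts` column, and the file paths are hypothetical; the code shows only the general range-overlap idea, not Iceberg's actual metadata layout or planner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file bounds for one column, as file-level metadata might record them."""
    path: str
    min_ts: int  # smallest value of the filter column in this file
    max_ts: int  # largest value of the filter column in this file


def prune_files(files, lo, hi):
    """Keep only files whose [min_ts, max_ts] range can overlap [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]


files = [
    FileStats("data/part-0.parquet", 0, 99),
    FileStats("data/part-1.parquet", 100, 199),
    FileStats("data/part-2.parquet", 200, 299),
]

# Equivalent of: WHERE ts BETWEEN 150 AND 220
selected = prune_files(files, 150, 220)
print([f.path for f in selected])
```

The first file is skipped without being opened, because its recorded range cannot contain a matching row. With partition-level metadata alone, an engine would have to list and read every file in the matching partitions.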
Advantages of Iceberg
Read/write isolation via separate snapshots.
Parallel, lock‑free writes.
Faster query planning and execution.
Efficient handling of small files and incremental pulls.
Row‑level updates (V2 format) with merge‑on‑read semantics.
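The merge‑on‑read semantics mentioned above can be sketched as follows. This is a simplified model, assuming that a delete applies only to data files committed before it (tracked here with a hypothetical `seq` field); the class names and row layout are illustrative and do not mirror Iceberg's real file structures.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    seq: int    # commit order of this file (assumption of the sketch)
    rows: list  # rows stored as dicts


@dataclass
class DeleteFile:
    seq: int    # commit order of this delete
    keys: set   # primary-key values to remove


def merge_on_read(data_files, delete_files, key="id"):
    """Return live rows: each data file minus keys deleted by later commits."""
    live = []
    for data in data_files:
        dead = set()
        for d in delete_files:
            # A delete only affects data written before it.
            if d.seq > data.seq:
                dead |= d.keys
        live.extend(r for r in data.rows if r[key] not in dead)
    return live


# An update of id=1 is expressed as "delete old row" + "add new row" in commit 2.
data_files = [
    DataFile(seq=1, rows=[{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]),
    DataFile(seq=2, rows=[{"id": 1, "status": "paid"}]),
]
delete_files = [DeleteFile(seq=2, keys={1})]

print(merge_on_read(data_files, delete_files))
```

Writes stay cheap because existing data files are never rewritten at update time; the cost moves to readers, which is why accumulated delete files are usually compacted away later.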
Iceberg Table Format Details
Iceberg is not a storage engine; it uses HDFS or S3 as the underlying storage. It is not a file format; data files are Parquet. It is not a query engine; it can be accessed by Spark, Flink, Trino, Hive, etc.
Iceberg records a pointer to the table’s current metadata, and therefore to its current snapshot, in a catalog such as the Hive Metastore. Each snapshot references a set of data files; a new snapshot is created atomically and stays invisible to readers until its commit swaps that pointer. This enables:
Snapshot isolation: Readers see a consistent view while writers create new snapshots.
Optimistic concurrency: A writer's commit succeeds only if no conflicting snapshot was committed in the meantime; otherwise the writer retries against the new table state.
Fast planning: Metadata files contain file‑level statistics, allowing engines to prune files without scanning directories.
Incremental pull: Differences between two snapshots can be enumerated to stream only changed data.
Row‑level updates: Implemented by adding delete files (DeleteFile, listing the records to remove) alongside new data files; Merge‑On‑Read combines them at query time. Row‑level deletes belong to the V2 format, and periodic compaction rewrites delete files into new data files to keep the file count under control.
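The snapshot mechanics listed above (isolation, optimistic commits, incremental pull) can be modeled in a few lines. This is a toy in-memory sketch: the `Table` class and its methods are hypothetical, and it rejects any commit made against a stale base, whereas a real implementation would check whether the changes actually conflict before retrying or failing.

```python
class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the current one."""


class Table:
    def __init__(self):
        # Snapshot 0 is the empty table; each snapshot is an immutable file set.
        self.snapshots = [frozenset()]

    def current_id(self):
        return len(self.snapshots) - 1

    def files(self, snapshot_id):
        return self.snapshots[snapshot_id]

    def commit(self, base_id, added):
        """Optimistic commit: succeed only if base_id is still current."""
        if base_id != self.current_id():
            raise CommitConflict(f"base {base_id} is stale")
        self.snapshots.append(self.snapshots[base_id] | frozenset(added))
        return self.current_id()

    def added_between(self, start_id, end_id):
        """Incremental pull: files present in end_id but not in start_id."""
        return sorted(self.snapshots[end_id] - self.snapshots[start_id])


table = Table()
reader_view = table.current_id()           # a reader pins snapshot 0
s1 = table.commit(0, ["a.parquet"])        # a writer publishes snapshot 1

# Snapshot isolation: the pinned reader still sees the old, empty table.
print(table.files(reader_view))

try:
    table.commit(0, ["b.parquet"])         # second writer used a stale base
except CommitConflict:
    s2 = table.commit(table.current_id(), ["b.parquet"])  # retry on new state

print(table.added_between(s1, s2))
```

No lock is held while a writer prepares its files; the only serialized step is the final compare-and-append, which is what allows writers to work in parallel.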
Business Deployments Using Iceberg
Venus Log Collection Platform
The original architecture used Elasticsearch, which suffered from high write cost, limited replication, and frequent write failures. Replacing Elasticsearch with Iceberg on HDFS reduced storage cost, eliminated write failures, kept interactive query latency within seconds, and scaled to petabytes.
Audit Data Platform
Legacy stack (MongoDB + Elasticsearch + MySQL + Hive) caused high development overhead, query bottlenecks, and data‑quality issues. Migrating audit data to Iceberg enabled row‑level updates, fast multi‑column filtering, and PB‑scale storage, reducing operational effort and improving data freshness.
Pingback Stream‑Batch Integration
Legacy Lambda architecture split real‑time (Kafka + Flink) and batch (HDFS + Hive) paths, causing duplication and latency. The new Iceberg‑based pipeline consumes Kafka events with Flink, writes ODS tables, builds DWD tables, and serves both real‑time and batch analytics via SparkSQL. Latency is ~5 minutes, code base is unified, and infrastructure cost is lower.
Member Order Analytics
Traditional MySQL → Hive export incurred daily latency and high MySQL load; CDC → Kudu introduced storage pressure and operational complexity. Using Iceberg with Flink CDC ingestion provides minute‑level latency, fast SparkSQL queries comparable to Kudu, lower cost (no dedicated Kudu cluster), and minimal impact on MySQL.
All cases demonstrate consistent benefits: cost reduction, write stability, operational simplification, and high‑quality near‑real‑time data.
dbaplus Community
