
Why Data Lakes Are Transforming Big Data: Concepts, Benefits, and Iceberg in Practice

This article explains the evolution of data lakes, compares public‑cloud and private‑cloud implementations, outlines key technical features, presents three real‑world scenarios, details the selection and inner workings of Apache Iceberg versus Hive, and showcases multiple production use cases at iQIYI.

dbaplus Community

What Is a Data Lake

A data lake provides virtually unlimited, centralized storage for structured, semi‑structured, and unstructured data. It evolved from on‑premises databases that stored only processed structured data and therefore lost the raw information. Modern data lakes ingest diverse data types into a unified repository built on Hadoop or cloud object storage (e.g., AWS S3, Google Cloud Storage, Alibaba OSS).

Why Data Lakes Are Needed

Real‑time event stream analysis : Event streams need near‑real‑time visibility (1‑5 min), large scale, low cost, and fast interactive queries, a combination that batch‑only Hive processing cannot deliver.

Change data capture (CDC) : Row‑level and column‑level updates can be ingested with low latency, avoiding costly full‑export jobs.

Stream‑batch integration : A unified pipeline eliminates duplicated code, reduces inconsistency, and lowers infrastructure cost.
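To make the CDC point concrete, here is a minimal sketch of applying row-level change events to a keyed table instead of re-exporting the whole table. The `ChangeEvent` shape and the `apply_changes` helper are hypothetical names chosen for illustration; they are not the API of Flink CDC or Iceberg.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change captured from a source database (hypothetical shape)."""
    op: str                      # "insert", "update", or "delete"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # changed columns for insert/update, None for delete


def apply_changes(table: Dict[int, dict], events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply change events in order and return the updated table, keyed by primary key."""
    result = {k: dict(v) for k, v in table.items()}  # copy so the input is not mutated
    for ev in events:
        if ev.op == "delete":
            result.pop(ev.key, None)
        elif ev.op in ("insert", "update"):
            # Column-level update: only the columns present in ev.row are overwritten.
            merged = dict(result.get(ev.key, {}))
            merged.update(ev.row or {})
            result[ev.key] = merged
        else:
            raise ValueError(f"unknown op: {ev.op}")
    return result


orders = {1: {"status": "paid", "amount": 30}, 2: {"status": "new", "amount": 15}}
events = [
    ChangeEvent("update", 2, {"status": "paid"}),
    ChangeEvent("delete", 1),
    ChangeEvent("insert", 3, {"status": "new", "amount": 99}),
]
updated = apply_changes(orders, events)
print(updated)
```

Only three small events are processed here, whereas a full export would rewrite every row of the source table on each run.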

Data Lake Selection and Core Principles

iQIYI evaluated three open‑source table formats (Hudi, Iceberg, and Delta Lake) and selected Apache Iceberg as its core table format. Iceberg is a table format, not a storage or query engine: it keeps data files (typically Parquet) on HDFS or object storage, and its tables can be queried via Spark, Flink, Trino, or Hive.

Key Differences Between Hive and Iceberg

Hive metadata is partition‑level only; Iceberg tracks file‑level metadata.

Iceberg supports snapshot isolation, parallel writes that rely on optimistic concurrency control rather than locks, and fast planning by reading metadata files directly.

Iceberg enables file‑level filtering (min/max statistics, Bloom filters) to prune irrelevant files.

Iceberg supports row‑level updates via DeleteFile and Merge‑On‑Read, enabling incremental pipelines.
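The file-level filtering mentioned above can be sketched as follows. This is a toy model, assuming each data file carries min/max statistics for a single `ts` column; `DataFileMeta` and `prune_files` are hypothetical names, not Iceberg classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DataFileMeta:
    """Per-file statistics a table format could keep in its metadata (hypothetical shape)."""
    path: str
    min_ts: int   # smallest value of the `ts` column in the file
    max_ts: int   # largest value of the `ts` column in the file


def prune_files(files: List[DataFileMeta], lo: int, hi: int) -> List[str]:
    """Return paths of files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_ts >= lo and f.min_ts <= hi]


manifest = [
    DataFileMeta("data/a.parquet", min_ts=0, max_ts=99),
    DataFileMeta("data/b.parquet", min_ts=100, max_ts=199),
    DataFileMeta("data/c.parquet", min_ts=200, max_ts=299),
]

# Query: WHERE ts BETWEEN 150 AND 220
to_scan = prune_files(manifest, lo=150, hi=220)
print(to_scan)  # ['data/b.parquet', 'data/c.parquet']
```

With partition-level metadata only, an engine would have to list and open every file in the matching partitions; with per-file statistics, `data/a.parquet` is skipped without being read.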

Advantages of Iceberg

Read/write isolation via separate snapshots.

Parallel, lock‑free writes.

Faster query planning and execution.

Efficient handling of small files and incremental pulls.

Row‑level updates (V2 format) with merge‑on‑read semantics.
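Several of these advantages follow from one idea: a table is a sequence of immutable snapshots, and a writer publishes a new snapshot only if the one it started from is still current. The sketch below is a single-process toy model under that assumption. `ToyTable`, `commit_append`, and `added_between` are hypothetical names, and a real catalog performs the check-and-swap atomically, which this model does not attempt.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: FrozenSet[str]   # complete set of data files visible in this snapshot


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = [Snapshot(0, frozenset())]

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base: Snapshot, new_files: List[str]) -> Snapshot:
        """Publish a new snapshot only if `base` is still the current one."""
        if base.snapshot_id != self.current().snapshot_id:
            raise CommitConflict("table changed since base snapshot; retry on the new base")
        snap = Snapshot(base.snapshot_id + 1, base.files | frozenset(new_files))
        self.snapshots.append(snap)
        return snap


def added_between(older: Snapshot, newer: Snapshot) -> FrozenSet[str]:
    """Incremental pull: files present in `newer` but not in `older`."""
    return newer.files - older.files


table = ToyTable()
reader_view = table.current()   # a reader pins snapshot 0
base_a = table.current()        # writer A starts from snapshot 0
base_b = table.current()        # writer B also starts from snapshot 0

s1 = table.commit_append(base_a, ["f1.parquet"])   # A commits first
try:
    table.commit_append(base_b, ["f2.parquet"])    # B's base is now stale
    conflict = False
except CommitConflict:
    conflict = True
    s2 = table.commit_append(table.current(), ["f2.parquet"])  # B retries on the new base

print(conflict)                        # True
print(sorted(reader_view.files))       # [] (the pinned reader is unaffected)
print(sorted(added_between(s1, s2)))   # ['f2.parquet']
```

No writer ever holds a lock: the losing writer simply detects the stale base and retries, while readers keep using whichever snapshot they pinned.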

Iceberg Table Format Details

Iceberg is not a storage engine; it uses HDFS or S3 as the underlying storage. It is not a file format; data files are typically Parquet. It is not a query engine; it can be accessed by Spark, Flink, Trino, Hive, etc.

Iceberg stores a pointer to each table’s current metadata, and through it the current snapshot, in a catalog such as the Hive Metastore. Each snapshot references a set of data files; a new snapshot is committed atomically and stays invisible to readers until the commit succeeds. This enables:

Snapshot isolation : Readers see a consistent view while writers create new snapshots.

Optimistic concurrency : Writers commit only if no conflicting snapshot exists.

Fast planning : Metadata files contain file‑level statistics, allowing engines to prune files without scanning directories.

Incremental pull : Differences between two snapshots can be enumerated to stream only changed data.

Row‑level updates : In the V2 format, implemented via DeleteFile (records to delete) and DataFile additions; Merge‑On‑Read combines them at query time. Periodic compaction rewrites the data with its deletes applied into new data files, which keeps the file count under control.
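The row-level update mechanism can be illustrated with a small in-memory model. It assumes deletes are recorded as (file, row position) pairs, which is one of the delete kinds the Iceberg spec describes; the type aliases and the `read_merged` and `compact` helpers are hypothetical and stand in for what an engine does against real files.

```python
from typing import Dict, List, Set, Tuple

Row = dict
DataFiles = Dict[str, List[Row]]          # data file path -> rows in file order
PositionDeletes = Set[Tuple[str, int]]    # (data file path, row position) marked deleted


def read_merged(data: DataFiles, deletes: PositionDeletes) -> List[Row]:
    """Merge-on-read: skip rows listed in the deletes while scanning data files."""
    return [
        row
        for path in sorted(data)
        for pos, row in enumerate(data[path])
        if (path, pos) not in deletes
    ]


def compact(data: DataFiles, deletes: PositionDeletes,
            out_path: str) -> Tuple[DataFiles, PositionDeletes]:
    """Rewrite live rows into one new data file and drop the applied deletes."""
    return {out_path: read_merged(data, deletes)}, set()


data: DataFiles = {"f1": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]}
deletes: PositionDeletes = set()

# Update id=2 without rewriting f1: mark the old row deleted, append the new row.
deletes.add(("f1", 1))
data["f2"] = [{"id": 2, "v": "B"}]

before = read_merged(data, deletes)
data2, deletes2 = compact(data, deletes, "f3")
after = read_merged(data2, deletes2)

print(before)                       # [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'B'}]
print(after == before)              # True
print(len(data2), len(deletes2))    # 1 0
```

The update itself is cheap because `f1` is never rewritten; the cost moves to read time until compaction folds the deletes back into a single data file.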

Business Deployments Using Iceberg

Venus Log Collection Platform

Original architecture used Elasticsearch, which suffered high write cost, limited replication, and frequent write failures. Replacing Elasticsearch with Iceberg on HDFS reduced storage cost, eliminated write failures, achieved interactive query latency (seconds), and scaled to petabytes.

Audit Data Platform

Legacy stack (MongoDB + Elasticsearch + MySQL + Hive) caused high development overhead, query bottlenecks, and data‑quality issues. Migrating audit data to Iceberg enabled row‑level updates, fast multi‑column filtering, and PB‑scale storage, reducing operational effort and improving data freshness.

Pingback Stream‑Batch Integration

Legacy Lambda architecture split real‑time (Kafka + Flink) and batch (HDFS + Hive) paths, causing duplication and latency. The new Iceberg‑based pipeline consumes Kafka events with Flink, writes ODS tables, builds DWD tables, and serves both real‑time and batch analytics via SparkSQL. Latency is ~5 minutes, code base is unified, and infrastructure cost is lower.
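The benefit of a unified code base can be shown with a small sketch: one cleaning rule and one aggregation serve both a full batch pass and incremental micro-batches, so the two paths cannot drift apart. The event fields and the `to_dwd` and `page_views` helpers are hypothetical; the actual pipeline uses Flink and SparkSQL rather than plain Python.

```python
from collections import Counter
from typing import Iterable, Optional


def to_dwd(event: dict) -> Optional[dict]:
    """ODS -> DWD cleaning rule (hypothetical): drop events without a user, normalize the page."""
    if not event.get("user"):
        return None
    return {"user": event["user"], "page": event["page"].strip().lower()}


def page_views(events: Iterable[dict]) -> Counter:
    """Count views per page over any iterable of raw events."""
    counts: Counter = Counter()
    for ev in events:
        row = to_dwd(ev)
        if row is not None:
            counts[row["page"]] += 1
    return counts


raw = [
    {"user": "u1", "page": " Home "},
    {"user": "", "page": "home"},      # dropped: no user
    {"user": "u2", "page": "HOME"},
    {"user": "u3", "page": "play"},
]

# "Batch" path: one pass over everything.
batch_counts = page_views(raw)

# "Streaming" path: the same function over micro-batches, merged incrementally.
stream_counts: Counter = Counter()
for micro_batch in (raw[:2], raw[2:]):
    stream_counts += page_views(micro_batch)

print(batch_counts == stream_counts)  # True
```

In a Lambda architecture the cleaning rule would be written twice, once per path, and any difference between the two copies would surface as inconsistent numbers.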

Member Order Analytics

Traditional MySQL → Hive export incurred daily latency and high MySQL load; CDC → Kudu introduced storage pressure and operational complexity. Using Iceberg with Flink CDC ingestion provides minute‑level latency, fast SparkSQL queries comparable to Kudu, lower cost (no dedicated Kudu cluster), and minimal impact on MySQL.

All cases demonstrate consistent benefits: cost reduction, write stability, operational simplification, and high‑quality near‑real‑time data.


Figure captions from the original article:

OLAP engine comparison

Row‑level change capture options

Lambda vs. data‑lake stream‑batch architecture

Iceberg product comparison

Iceberg table format

Iceberg row‑level update example

Venus migration to Iceberg

Audit data migration

Pingback new architecture

Pingback near‑real‑time pipeline

Member order sync to OLAP