
An Introduction to Apache Iceberg: Features, Spark & Flink Integration, and Real‑World Use Cases

This article provides a comprehensive overview of Apache Iceberg, covering its origins, key features, practical Spark and Flink code examples, notable deployments at Alibaba and Tencent, and its future role as a universal table format for big‑data analytics.

Big Data Technology & Architecture

Feb 2, 2021


The article explains why table‑format solutions like Apache Iceberg emerged to address data‑format incompatibilities across analytics engines, describing Iceberg as an open table format created by Netflix and now an Apache top‑level project.

Key features of Iceberg include:

Schema evolution without side effects

Hidden partitioning to avoid user errors

Partition layout evolution

Snapshot isolation for repeatable queries

Version rollback for quick issue correction

Fast scan planning without needing a distributed SQL engine

Data‑pruning using partition and column statistics

Broad compatibility with cloud storage and HDFS

Transactional guarantees with atomic table changes

High‑concurrency writes using optimistic concurrency
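Two of the features listed above, hidden partitioning and statistics-based pruning, can be pictured with a small standard-library sketch. This is a toy model under stated assumptions, not Iceberg's API: the class PruningSketch, the nested DataFile type, and the methods dayPartition and planScan are hypothetical names chosen for illustration. The point is that the partition value is derived from a source column, and that a scan planner can discard files from metadata alone.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.List;

// Toy model only: these names are illustrative and are NOT Iceberg classes.
final class PruningSketch {

    /** Metadata kept per data file: its partition value plus min/max stats for the id column. */
    static final class DataFile {
        final String path;
        final LocalDate dayPartition; // derived from the event timestamp, never supplied by the user
        final long minId;
        final long maxId;

        DataFile(String path, LocalDate dayPartition, long minId, long maxId) {
            this.path = path;
            this.dayPartition = dayPartition;
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /** Hidden partitioning: the partition value is computed from a source column (UTC day here). */
    static LocalDate dayPartition(Instant eventTime) {
        return eventTime.atZone(ZoneOffset.UTC).toLocalDate();
    }

    /** Scan planning: keep only files whose partition and id range could contain a match. */
    static List<String> planScan(List<DataFile> files, Instant eventTimeFilter, long idFilter) {
        // The query filters on the timestamp itself; the planner maps it to a partition value.
        LocalDate wantedDay = dayPartition(eventTimeFilter);
        List<String> selected = new ArrayList<>();
        for (DataFile f : files) {
            boolean partitionMatches = f.dayPartition.equals(wantedDay);
            boolean statsMayMatch = f.minId <= idFilter && idFilter <= f.maxId;
            if (partitionMatches && statsMayMatch) {
                selected.add(f.path);
            }
        }
        return selected;
    }
}
```

In this sketch no data file is opened during planning: files are selected or skipped purely from the metadata kept alongside them, which is the idea behind pruning with partition and column statistics.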

These capabilities directly address common pain points such as ACID compliance, multi‑version support, batch/stream read‑write, and support for multiple analytics engines.
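To make the ACID and multi-version points more concrete, here is a minimal sketch of how immutable snapshots, an atomic pointer swap, and optimistic retries fit together. It is a simplified in-memory model, not Iceberg's implementation: SnapshotTableSketch, Snapshot, append, and rollbackTo are hypothetical names, and a real table keeps this state in metadata files rather than in memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Toy model only: these names are illustrative and are NOT Iceberg classes.
final class SnapshotTableSketch {

    /** An immutable snapshot: the complete list of data files visible at one version. */
    static final class Snapshot {
        final long id;
        final List<String> files;

        Snapshot(long id, List<String> files) {
            this.id = id;
            this.files = Collections.unmodifiableList(new ArrayList<>(files));
        }
    }

    private final AtomicLong nextId = new AtomicLong(1);
    private final Map<Long, Snapshot> history = new ConcurrentHashMap<>();
    // The single mutable pointer to the current snapshot; swapping it is the atomic commit.
    private final AtomicReference<Snapshot> current;

    SnapshotTableSketch() {
        Snapshot empty = new Snapshot(0, Collections.<String>emptyList());
        history.put(empty.id, empty);
        current = new AtomicReference<>(empty);
    }

    /** Readers pin one snapshot and keep seeing it, whatever writers do afterwards. */
    Snapshot currentSnapshot() {
        return current.get();
    }

    /** Optimistic append: build on the snapshot that was read, retry if another writer committed first. */
    Snapshot append(String file) {
        while (true) {
            Snapshot base = current.get();
            List<String> files = new ArrayList<>(base.files);
            files.add(file);
            Snapshot next = new Snapshot(nextId.getAndIncrement(), files);
            if (current.compareAndSet(base, next)) {
                history.put(next.id, next);
                return next;
            }
            // Lost the race: loop, re-read the new current snapshot, and try again.
        }
    }

    /** Rollback: repoint the table at an earlier snapshot; no data files are rewritten. */
    void rollbackTo(long snapshotId) {
        Snapshot target = history.get(snapshotId);
        if (target == null) {
            throw new IllegalArgumentException("Unknown snapshot: " + snapshotId);
        }
        current.set(target);
    }
}
```

A reader that called currentSnapshot() before a later append still sees the old file list, which is the snapshot-isolation behavior described above, and rollbackTo only moves the pointer back to an earlier version.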

Sample Spark integration code:

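import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.spark.SparkSchemaUtil
import spark.implicits._ // provides toDF when running outside spark-shell

// Catalog backed by the Hive Metastore named in the Hadoop configuration
val catalog = new HiveCatalog(spark.sparkContext.hadoopConfiguration)
val data = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "data")
// Convert the Spark schema to an Iceberg schema and create an unpartitioned table
val schema = SparkSchemaUtil.convert(data.schema)
val name = TableIdentifier.of("default", "test_table")
val table = catalog.createTable(name, schema)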

// write
data.write.format("iceberg").mode("append").save("default.test_table")
// read
spark.read.format("iceberg").load("default.test_table")

Equivalent SQL usage:

// Register the Iceberg table as a temporary view, then query it with Spark SQL
spark.read.format("iceberg").load("default.test_table").createOrReplaceTempView("test_table")
spark.sql("""SELECT count(1) FROM test_table""").show()

Flink sink example (simplified):

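// Flink and Iceberg imports for the classes used below. IcebergSinkAppender,
// IcebergConnectorConstant and MapAvroSerializer belong to the sink connector
// used in the source; their imports are not shown there.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

// Configure Hive catalog
org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();
hadoopConf.set(org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS.varname, META_STORE_URIS);

Catalog icebergCatalog = new HiveCatalog(hadoopConf);

// Create Iceberg table (schema fields and partition spec are elided in the source)
Schema schema = new Schema(...);
PartitionSpec partitionSpec = PartitionSpec.builderFor(schema)...;
TableIdentifier tableIdentifier = TableIdentifier.of(DATABASE_NAME, TABLE_NAME);
icebergCatalog.createTable(tableIdentifier, schema, partitionSpec);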

// Flink execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(...);

DataStream<Map<String, Object>> dataStream = env.addSource(source, typeInformation);

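// Sink settings, held in a Flink Configuration (org.apache.flink.configuration.Configuration)
Configuration conf = new Configuration();
// NOTE: the source stores META_STORE_URIS under the warehouse key here, while the catalog
// above uses METASTOREURIS; check which key and value your connector expects.
conf.setString(org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREWAREHOUSE.varname, META_STORE_URIS);
conf.setString(IcebergConnectorConstant.DATABASE, DATABASE_NAME);
conf.setString(IcebergConnectorConstant.TABLE, TABLE_NAME);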

IcebergSinkAppender<Map<String, Object>> appender = new IcebergSinkAppender<>(conf, "test")
    .withSerializer(MapAvroSerializer.getInstance())
    .withWriterParallelism(1);
appender.append(dataStream);

env.execute("Sink Test");

The article highlights two real-world deployments: Alibaba's Lambda architecture, which combines full-load Iceberg tables with incremental Kafka streams, and Tencent's near-real-time data warehouse, which replaces Kafka with Iceberg snapshots for streaming reads and is reported to reduce latency dramatically.
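The snapshot-driven streaming read in the Tencent deployment can be pictured as a reader that remembers the last snapshot it consumed, much like a consumer offset, and on each poll picks up only the files added by newer snapshots. The following is a rough sketch of that idea under stated assumptions, not Tencent's code or the Iceberg or Flink API: IncrementalReadSketch, Table, Reader, commit, and poll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: these names are illustrative and are NOT Iceberg or Flink classes.
final class IncrementalReadSketch {

    /** One committed snapshot and the data files it added to the table. */
    static final class Snapshot {
        final long id;
        final List<String> addedFiles;

        Snapshot(long id, List<String> addedFiles) {
            this.id = id;
            this.addedFiles = Collections.unmodifiableList(new ArrayList<>(addedFiles));
        }
    }

    /** Append-only commit log standing in for the table's snapshot history. */
    static final class Table {
        private final List<Snapshot> snapshots = new ArrayList<>();

        /** Commits a batch of new files as one snapshot and returns its id (1, 2, 3, ...). */
        synchronized long commit(List<String> addedFiles) {
            long id = snapshots.size() + 1;
            snapshots.add(new Snapshot(id, addedFiles));
            return id;
        }

        /** Returns the snapshots committed after the given id, oldest first. */
        synchronized List<Snapshot> snapshotsAfter(long snapshotId) {
            List<Snapshot> result = new ArrayList<>();
            for (Snapshot s : snapshots) {
                if (s.id > snapshotId) {
                    result.add(s);
                }
            }
            return result;
        }
    }

    /** Streaming reader: tracks the last consumed snapshot id instead of a Kafka offset. */
    static final class Reader {
        private final Table table;
        private long lastConsumedSnapshotId = 0;

        Reader(Table table) {
            this.table = table;
        }

        /** Returns only the files added since the previous poll. */
        List<String> poll() {
            List<String> newFiles = new ArrayList<>();
            for (Snapshot s : table.snapshotsAfter(lastConsumedSnapshotId)) {
                newFiles.addAll(s.addedFiles);
                lastConsumedSnapshotId = s.id;
            }
            return newFiles;
        }
    }
}
```

In a model like this, how fresh the downstream data is depends on how often the writer commits a new snapshot, since the reader can only see data that has been committed.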

Looking ahead, Iceberg is positioned to become a universal table-format layer that decouples storage from compute, which fits naturally with Flink's unified stream and batch processing model.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
