An Introduction to Apache Iceberg: Features, Spark & Flink Integration, and Real‑World Use Cases
This article provides a comprehensive overview of Apache Iceberg, covering its origins, key features, practical Spark and Flink code examples, notable deployments at Alibaba and Tencent, and its future role as a universal table format for big‑data analytics.
The article explains why table‑format solutions like Apache Iceberg emerged to address data‑format incompatibilities across analytics engines, describing Iceberg as an open table format created by Netflix and now an Apache top‑level project.
Key features of Iceberg include:
Schema evolution without side effects
Hidden partitioning to avoid user errors
Partition layout evolution
Snapshot isolation for repeatable queries
Version rollback for quick issue correction
Fast scan planning without a distributed SQL engine
Data‑pruning using partition and column statistics
Broad compatibility with cloud storage and HDFS
Transactional guarantees with atomic table changes
High‑concurrency writes using optimistic concurrency
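Two of the features listed above, hidden partitioning and pruning with partition and column statistics, can be pictured with a small model. The following is a minimal sketch in plain Java, not Iceberg code: `PruningSketch`, `DataFile`, `dayOf`, and `planScan` are hypothetical names, and the single `id` column with min/max bounds is an assumption made for illustration. The idea is that the partition value is derived from the timestamp by a transform, so users filter on the timestamp itself, and per-file statistics let the planner skip files without opening them.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: hypothetical names, not Iceberg classes.
final class PruningSketch {
    /** Per-file metadata: a derived (hidden) partition value plus min/max stats for an "id" column. */
    static final class DataFile {
        final String path;
        final long eventDay; // days since 1970-01-01, derived from the event timestamp
        final int minId;
        final int maxId;

        DataFile(String path, Instant eventTs, int minId, int maxId) {
            this.path = path;
            this.eventDay = dayOf(eventTs);
            this.minId = minId;
            this.maxId = maxId;
        }
    }

    /** The "day" transform: writers and readers derive the partition from the timestamp. */
    static long dayOf(Instant ts) {
        return Math.floorDiv(ts.getEpochSecond(), 86_400L);
    }

    /** Keeps only files whose partition and column stats could match the filter. */
    static List<String> planScan(List<DataFile> files, Instant from, Instant to, int id) {
        long firstDay = dayOf(from);
        long lastDay = dayOf(to);
        List<String> selected = new ArrayList<>();
        for (DataFile f : files) {
            boolean partitionMatches = f.eventDay >= firstDay && f.eventDay <= lastDay;
            boolean statsMatch = id >= f.minId && id <= f.maxId;
            if (partitionMatches && statsMatch) {
                selected.add(f.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<DataFile> files = Arrays.asList(
                new DataFile("day1.parquet", Instant.parse("2024-01-01T10:00:00Z"), 1, 100),
                new DataFile("day2-a.parquet", Instant.parse("2024-01-02T08:00:00Z"), 101, 200),
                new DataFile("day2-b.parquet", Instant.parse("2024-01-02T20:00:00Z"), 1, 50));
        // The filter mentions only the timestamp and id, never a partition column.
        System.out.println(planScan(files,
                Instant.parse("2024-01-02T00:00:00Z"),
                Instant.parse("2024-01-02T23:59:59Z"),
                150)); // prints [day2-a.parquet]
    }
}
```

In this model the first file is skipped by its partition value and the third by its column bounds, so only one file would need to be read.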
These capabilities directly address common pain points such as ACID compliance, multi‑version support, batch/stream read‑write, and support for multiple analytics engines.
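The ACID and multi-version points above rest on one idea: a table is a pointer to an immutable snapshot, and a commit is an atomic swap of that pointer. The following is a minimal sketch of that idea in plain Java. It is not the Iceberg API: `TableSketch`, `Snapshot`, `tryCommitAppend`, and `rollbackTo` are hypothetical names, and an in-memory `AtomicReference` stands in for whatever atomic operation the catalog provides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Conceptual sketch only: hypothetical names, not Iceberg classes.
final class TableSketch {
    /** Immutable view of the table: an id plus every data file visible at that version. */
    static final class Snapshot {
        final long id;
        final List<String> files;

        Snapshot(long id, List<String> files) {
            this.id = id;
            this.files = Collections.unmodifiableList(new ArrayList<>(files));
        }
    }

    private final AtomicLong nextId = new AtomicLong(1);
    // The only mutable state: a pointer to the current snapshot, swapped atomically.
    private final AtomicReference<Snapshot> current =
            new AtomicReference<>(new Snapshot(0, Collections.<String>emptyList()));

    Snapshot currentSnapshot() {
        return current.get();
    }

    /** One optimistic attempt: succeeds only if nobody committed since base was read. */
    boolean tryCommitAppend(Snapshot base, String newFile) {
        List<String> files = new ArrayList<>(base.files);
        files.add(newFile);
        Snapshot next = new Snapshot(nextId.getAndIncrement(), files);
        return current.compareAndSet(base, next);
    }

    /** Retry loop: on conflict, re-read the latest snapshot and re-apply the change. */
    void append(String newFile) {
        while (!tryCommitAppend(current.get(), newFile)) {
            // Another writer won the race; loop and try again on the newer snapshot.
        }
    }

    /** Rollback is just pointing the table back at an older snapshot. */
    void rollbackTo(Snapshot older) {
        current.set(older);
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch();
        table.append("a.parquet");
        Snapshot reader = table.currentSnapshot(); // a query pins this version
        table.append("b.parquet");

        System.out.println(reader.files);                           // prints [a.parquet]
        System.out.println(table.currentSnapshot().files);          // prints [a.parquet, b.parquet]
        System.out.println(table.tryCommitAppend(reader, "c.parquet")); // prints false (stale base)

        table.rollbackTo(reader);
        System.out.println(table.currentSnapshot().files);          // prints [a.parquet]
    }
}
```

A reader that holds a snapshot keeps seeing the same files regardless of later commits, which is the repeatable-query behavior the feature list describes, and a writer that started from a stale snapshot fails its swap and must retry.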
Sample Spark integration code:
import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog
import org.apache.iceberg.spark.SparkSchemaUtil
val catalog = new HiveCatalog(spark.sparkContext.hadoopConfiguration)
val data = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "data")
val schema = SparkSchemaUtil.convert(data.schema)
val name = TableIdentifier.of("default", "test_table")
val table = catalog.createTable(name, schema)
// write
data.write.format("iceberg").mode("append").save("default.test_table")
// read
spark.read.format("iceberg").load("default.test_table")
Equivalent SQL usage:
spark.read.format("iceberg").load("default.test_table").createOrReplaceTempView("test_table")
spark.sql("""SELECT count(1) FROM test_table""")
Flink sink example (simplified):
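import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;
// IcebergSinkAppender, IcebergConnectorConstant and MapAvroSerializer come from the
// Flink Iceberg connector used in the source; their package is not given there.
// Configure Hive catalog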
org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();
hadoopConf.set(org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS.varname, META_STORE_URIS);
Catalog icebergCatalog = new HiveCatalog(hadoopConf);
// Create Iceberg table
Schema schema = new Schema(...);
PartitionSpec partitionSpec = PartitionSpec.builderFor(schema)...;
TableIdentifier tableIdentifier = TableIdentifier.of(DATABASE_NAME, TABLE_NAME);
icebergCatalog.createTable(tableIdentifier, schema, partitionSpec);
// Flink execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(...);
DataStream<Map<String, Object>> dataStream = env.addSource(source, typeInformation);
Configuration conf = new Configuration();
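// The value is the metastore URI, so it is set under the metastore URI key (the source used the warehouse key here)
conf.setString(org.apache.hadoop.hive.conf.HiveConf.ConfVars.METASTOREURIS.varname, META_STORE_URIS);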
conf.setString(IcebergConnectorConstant.DATABASE, DATABASE_NAME);
conf.setString(IcebergConnectorConstant.TABLE, TABLE_NAME);
IcebergSinkAppender<Map<String, Object>> appender = new IcebergSinkAppender<>(conf, "test")
.withSerializer(MapAvroSerializer.getInstance())
.withWriterParallelism(1);
appender.append(dataStream);
env.execute("Sink Test");
Real‑world deployments are highlighted, including Alibaba's Lambda architecture, which combines full‑load Iceberg tables with Kafka incremental streams, and Tencent's near‑real‑time data warehouse, which replaces Kafka with Iceberg snapshots for streaming reads to reduce latency.
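The snapshot-based streaming read mentioned for the Tencent deployment can be pictured as a consumer that remembers the last snapshot it processed and, on each poll, reads only the files added by later snapshots. The following is a minimal sketch of that pattern in plain Java, not the Iceberg or Flink API: `IncrementalReadSketch`, `Snapshot`, and `filesAfter` are hypothetical names, and the assumption that snapshot ids increase with commit order is a simplification made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: hypothetical names, not the Iceberg or Flink API.
final class IncrementalReadSketch {
    /** A committed snapshot and the data files it added on top of its predecessor. */
    static final class Snapshot {
        final long id; // assumed to increase with commit order (a simplification)
        final List<String> addedFiles;

        Snapshot(long id, String... addedFiles) {
            this.id = id;
            this.addedFiles = Arrays.asList(addedFiles);
        }
    }

    /** Files added by snapshots committed after lastConsumedId, in commit order. */
    static List<String> filesAfter(List<Snapshot> log, long lastConsumedId) {
        List<String> newFiles = new ArrayList<>();
        for (Snapshot s : log) {
            if (s.id > lastConsumedId) {
                newFiles.addAll(s.addedFiles);
            }
        }
        return newFiles;
    }

    public static void main(String[] args) {
        List<Snapshot> log = Arrays.asList(
                new Snapshot(1, "f1.parquet"),
                new Snapshot(2, "f2.parquet", "f3.parquet"),
                new Snapshot(3, "f4.parquet"));

        long checkpoint = 1; // the consumer has already processed snapshot 1
        System.out.println(filesAfter(log, checkpoint)); // prints [f2.parquet, f3.parquet, f4.parquet]

        checkpoint = 3; // after processing, the consumer advances its checkpoint
        System.out.println(filesAfter(log, checkpoint)); // prints []
    }
}
```

In this model the achievable latency is bounded by how often writers commit new snapshots, since a consumer only sees data once a snapshot containing it has been committed.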
Looking ahead, Iceberg is positioned to become a universal table‑format layer that decouples storage from compute, which fits Flink’s unified stream‑batch processing model.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
