Step-by-Step Guide to Installing and Using Flink with Iceberg for Real-Time Data Lake
This article provides a comprehensive tutorial on setting up Flink 1.11 with Iceberg 0.11.1, creating Hive catalogs, building databases and tables, inserting data, and exploring Iceberg components, file structures, partitioned tables, execution plans, and programmatic access via Scala.
Overview
Flink is a real‑time computation engine that unifies stream and batch processing, reducing the need for separate platforms; combined with a data‑lake solution such as Iceberg it can satisfy both real‑time and offline warehousing requirements.
Installation
The environment is built on Flink 1.11.x and Iceberg 0.11.1, which are compatible with Hadoop and Hive.
1. Install Flink
Download the Flink binary, extract it, set the Hadoop classpath, and start the cluster from the extracted directory.
wget https://downloads.apache.org/flink/flink-1.11.1/flink-1.11.1-bin-scala_2.12.tgz
tar xzvf flink-1.11.1-bin-scala_2.12.tgz
cd flink-1.11.1
export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
./bin/start-cluster.sh
2. Download Iceberg jars
Obtain iceberg-flink-runtime-0.11.1.jar and flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar from the Maven repository.
3. Start Flink‑SQL
./bin/sql-client.sh embedded \
    -j /iceberg-flink-runtime-xxx.jar \
    -j /flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar \
    shell
4. Create Hive Catalog
CREATE CATALOG hive_catalog WITH (
'type'='iceberg',
'catalog-type'='hive',
'uri'='thrift://server1:9083',
'clients'='5',
'property-version'='1',
'warehouse'='hdfs://server1/user/hive/warehouse'
);
5. Create Database and Table
USE CATALOG hive_catalog;
CREATE DATABASE iceberg_db;
USE iceberg_db;
CREATE TABLE test (
id BIGINT COMMENT 'unique id',
busi_date STRING
);
6. Insert Data and Observe Job
Execute INSERT statements via Flink‑SQL; the corresponding Flink job and checkpoints can be viewed in the UI.
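As a minimal sketch of this step, the statements below insert a few rows into the test table and read them back from the SQL client. The row values are made up for illustration, and the catalog and database names are assumed to be the ones created in the earlier steps.

```sql
-- Switch to the catalog and database created in the earlier steps.
USE CATALOG hive_catalog;
USE iceberg_db;

-- Hypothetical sample rows; each INSERT is submitted as its own Flink job.
INSERT INTO test VALUES (1, '2021-03-01');
INSERT INTO test VALUES (2, '2021-03-02'), (3, '2021-03-02');

-- Read the rows back to check that the commits are visible.
SELECT * FROM test;
```

Depending on how the SQL client handles comments, it may be necessary to enter the statements one at a time without the comment lines.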
7. Iceberg Components
IcebergStreamWriter writes records to Avro/Parquet/ORC files, while IcebergFilesCommitter collects DataFiles at each checkpoint and commits a transaction.
8. Iceberg File Structure
Within the table's directory, Iceberg keeps data files under the data directory and metadata files under the metadata directory; the latter includes snapshot, manifest list, and manifest files.
9. Partitioned Table Example
CREATE TABLE t_partition (
id BIGINT COMMENT 'unique id',
busi_date STRING
) PARTITIONED BY (busi_date);
10. Execution Plan and Programmatic Access
A Scala program demonstrates how to create the Hive catalog, list databases and tables, and query the test table using the Flink Table API.
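The same operations can also be sketched from the SQL client. The statements below list databases and tables, query the test table, and print the execution plan for that query. This is only an illustrative counterpart, assuming the catalog, database, and table names used in the earlier steps; a Table API program would issue equivalent statements through a TableEnvironment.

```sql
-- Switch to the Iceberg-backed Hive catalog and list what it contains.
USE CATALOG hive_catalog;
SHOW DATABASES;
USE iceberg_db;
SHOW TABLES;

-- Query the table, then ask the planner for the plan of the same query.
SELECT * FROM test;
EXPLAIN PLAN FOR SELECT * FROM test;
```

The exact EXPLAIN syntax accepted by the SQL client can differ between Flink versions, so treat the last statement as an assumption to verify against the release in use.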
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
