Step-by-Step Guide to Installing and Using Flink with Iceberg for Real-Time Data Lake

This article provides a comprehensive tutorial on setting up Flink 1.11 with Iceberg 0.11.1, creating Hive catalogs, building databases and tables, inserting data, and exploring Iceberg components, file structures, partitioned tables, execution plans, and programmatic access via Scala.

Big Data Technology & Architecture

Jul 27, 2022

Overview

Flink is a real‑time computation engine that unifies stream and batch processing, reducing the need for separate platforms; combined with a data‑lake solution such as Iceberg it can satisfy both real‑time and offline warehousing requirements.

Installation

The environment is built on Flink 1.11.x and Iceberg 0.11.1, which are compatible with Hadoop and Hive.

1. Install Flink

Download the Flink binary, extract it, set Hadoop classpath, and start the cluster.

wget https://downloads.apache.org/flink/flink-1.11.1/flink-1.11.1-bin-scala_2.12.tgz
tar xzvf flink-1.11.1-bin-scala_2.12.tgz
cd flink-1.11.1
# Expose the Hadoop jars to Flink; $(...) runs the command and captures its output
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)
./bin/start-cluster.sh

2. Download Iceberg jars

Obtain iceberg-flink-runtime-0.11.1.jar and flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar from the Maven repository.

3. Start Flink‑SQL

# Adjust both jar paths to wherever the jars from step 2 were saved
./bin/sql-client.sh embedded \
    -j /iceberg-flink-runtime-xxx.jar \
    -j /flink-sql-connector-hive-2.3.6_2.11-1.11.0.jar \
    shell

4. Create Hive Catalog

CREATE CATALOG hive_catalog WITH (
  'type'='iceberg',
  'catalog-type'='hive',
  'uri'='thrift://server1:9083',
  'clients'='5',
  'property-version'='1',
  'warehouse'='hdfs://server1/user/hive/warehouse'
);

5. Create Database and Table

USE CATALOG hive_catalog;
CREATE DATABASE iceberg_db;
USE iceberg_db;
CREATE TABLE test (
    id BIGINT COMMENT 'unique id',
    busi_date STRING
);

6. Insert Data and Observe Job

Execute INSERT statements via Flink‑SQL; the corresponding Flink job and checkpoints can be viewed in the UI.
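
As a minimal sketch of this step, the statements below could be entered in the SQL client once the test table from step 5 exists. The row values are made up for illustration, and the SET line assumes a bounded (batch) read of the table is wanted.

```sql
-- Illustrative rows only; the values are not from the original article.
INSERT INTO test VALUES (1, '2022-07-01');
INSERT INTO test VALUES (2, '2022-07-02');

-- Read the table back as a bounded (batch) query.
SET execution.type = batch;
SELECT * FROM test;
```

Each INSERT is submitted as its own Flink job, so two jobs should appear in the UI. The comments are for the reader; depending on the client version they may need to be left out when typing statements interactively.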

7. Iceberg Components

IcebergStreamWriter writes records to Avro/Parquet/ORC files, while IcebergFilesCommitter collects DataFiles at each checkpoint and commits a transaction.

8. Iceberg File Structure

Iceberg stores data files under the table's data directory and metadata under its metadata directory, which holds the snapshot, manifest list, and manifest files.

9. Partitioned Table Example

CREATE TABLE t_partition (
    id BIGINT COMMENT 'unique id',
    busi_date STRING
) PARTITIONED BY (busi_date);
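
A short sketch of how the partitioned table might be exercised, assuming t_partition was created as above. The row values are hypothetical.

```sql
-- Illustrative rows only; each distinct busi_date value maps to its own partition.
INSERT INTO t_partition VALUES (1, '2022-07-01');
INSERT INTO t_partition VALUES (2, '2022-07-02');

-- Filtering on the partition column gives Iceberg the chance to skip other partitions.
SELECT * FROM t_partition WHERE busi_date = '2022-07-01';
```

After the inserts, the table's data directory would be expected to contain one subdirectory per busi_date value.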

10. Execution Plan and Programmatic Access

A Scala program demonstrates how to create the Hive catalog, list databases and tables, and query the test table using the Flink Table API.
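
The execution plan can also be inspected from the SQL client without writing a program. The sketch below assumes the EXPLAIN PLAN FOR form accepted by the Flink 1.11 SQL client and the test table from step 5. The filter is illustrative.

```sql
-- Print the plan for a query on the test table (the filter value is illustrative).
EXPLAIN PLAN FOR SELECT id, busi_date FROM test WHERE id > 0;
```

The output lists the syntax tree and the optimized plan, which helps confirm that the query reads from the Iceberg table source as intended.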

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
