Integrating Apache Flink 1.12.2 with Apache Hudi: Batch and Streaming Modes

This article walks through downloading the required Flink and Hudi components, building Hudi for Scala 2.12, and creating, populating, querying, and updating Hudi tables step by step in both batch and streaming modes using Flink SQL, with code snippets along the way.

Big Data Technology & Architecture

Apr 24, 2021

The Hudi community recently released its integration with Flink. The author tried it by following the official documentation and found the experience satisfactory, although it has not yet been used in production. This guide shares the practical steps for integrating Flink with Hudi.

1. Component Download

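1.1 Flink 1.12.2 binary package (the link below is the Scala 2.11 build; the Hudi bundle built in step 1.2 must use the same Scala version as the Flink distribution): https://mirrors.tuna.tsinghua.edu.cn/apache/flink/flink-1.12.2/flink-1.12.2-bin-scala_2.11.tgz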

1.2 Hudi source code: https://github.com/apache/hudi

git clone https://github.com/apache/hudi.git && cd hudi
# The default build targets Scala 2.11
mvn clean package -DskipTests
# If your Flink distribution is a Scala 2.12 build, build with Scala 2.12 instead
mvn clean package -DskipTests -Dscala-2.12
# With Scala 2.12, the bundle jar is at packaging/hudi-flink-bundle/target/hudi-flink-bundle_2.12-*-SNAPSHOT.jar

1.3 The remaining steps follow the official quick‑start guide: https://hudi.apache.org/docs/flink-quick-start-guide.html

The official guide uses Flink 1.11.x, which the author found incompatible (the error screenshots are not included here).

1.4 The author confirmed that Flink 1.12.2 works correctly with Hudi 0.9.0 (built from master).

2. Batch Mode Implementation

2.1 Place the compiled hudi-flink-bundle_2.12-0.9.0-SNAPSHOT.jar (or the Scala‑2.11 version, matching your Flink distribution) into $FLINK_HOME/lib, then start the Flink SQL client:

export HADOOP_CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath`
./bin/sql-client.sh embedded

2.2 Create the Hudi table:

CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://localhost:9000/hudi/t1',
  'table.type' = 'MERGE_ON_READ'
);
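
The update in step 2.6 relies on a record key, yet the DDL above does not declare one. As a hedged sketch, the same table could spell out the key and ordering fields explicitly. The option names `hoodie.datasource.write.recordkey.field` and `write.precombine.field`, and their defaults (`uuid` and `ts`), are assumptions based on Hudi's Flink options and should be checked against the version you build; the table name `t1_explicit_key` is hypothetical.

```sql
-- Sketch only: same schema and path as t1, with the record key and
-- precombine (ordering) fields written out instead of left to defaults.
-- The two extra option names are assumptions to verify against your Hudi build.
CREATE TABLE t1_explicit_key(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://localhost:9000/hudi/t1',
  'table.type' = 'MERGE_ON_READ',
  'hoodie.datasource.write.recordkey.field' = 'uuid',  -- rows sharing a uuid are treated as the same record
  'write.precombine.field' = 'ts'                      -- used to pick the latest among duplicate keys
);
```

Leaving both options out, as the original DDL does, should behave the same way if the defaults are as assumed here.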

2.3 Insert sample data:

INSERT INTO t1 VALUES
  ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
  ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
  ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
  ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
  ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
  ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
  ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
  ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');

2.4 Result screenshot: [image not included]
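
The query behind this result is not shown in the text. A minimal sketch, assuming the table t1 from step 2.2 and an ordinary full scan, would be:

```sql
-- Batch read of the Hudi table; reads back the rows inserted in step 2.3.
SELECT * FROM t1;
```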

2.5 Set the result mode to tableau to view results directly in the CLI:

set execution.result-mode=tableau;

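2.6 Update a record by its key (age changes from 23 to 24). The DDL above declares no primary key; Hudi's Flink connector uses the uuid field as the record key by default, so inserting a row with an existing uuid overwrites the earlier row: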

INSERT INTO t1 VALUES ('id1','Danny',24,TIMESTAMP '1970-01-01 00:00:01','par1');

Result screenshot: [image not included]
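
To check the update without scanning the whole table, a filtered query can be used. This is a sketch rather than the author's exact statement:

```sql
-- Look up the updated record by its key; per step 2.6 the age should now read 24.
SELECT uuid, name, age, ts FROM t1 WHERE uuid = 'id1';
```

If the upsert took effect, id1 appears once with the new age rather than as two separate rows.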

3. Streaming Read Mode

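3.1 Create a table with streaming read enabled, pointing at the same path. Because t1 is already registered in the session used for the batch steps, run this in a new SQL client session (or use a different table name):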

CREATE TABLE t1(
  uuid VARCHAR(20),
  name VARCHAR(10),
  age INT,
  ts TIMESTAMP(3),
  `partition` VARCHAR(20)
) PARTITIONED BY (`partition`)
WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://localhost:9000/hudi/t1',
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.streaming.start-commit' = '20210401134557',
  'read.streaming.check-interval' = '4'
);
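-- read.streaming.enabled = 'true' turns on streaming reads.
-- read.streaming.start-commit is the commit instant (format yyyyMMddHHmmss) from which reading starts.
-- read.streaming.check-interval sets how often new commits are checked, here every 4 seconds.
-- Only MERGE_ON_READ tables support streaming reads.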

3.2 Query the table in streaming mode. The rows shown first are the batch data inserted earlier.
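
The streaming query itself is not spelled out in the text. A sketch, assuming the table definition from step 3.1 and the tableau result mode from step 2.5:

```sql
-- Print results directly in the CLI (same setting as step 2.5).
SET execution.result-mode=tableau;

-- With 'read.streaming.enabled' = 'true' this query keeps running and
-- emits rows from new commits as they are detected.
SELECT * FROM t1;
```

Because the query is unbounded, it stays open until cancelled, so the insert in step 3.3 needs to be issued from a separate SQL client session.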

3.3 Insert a new row in batch mode:

INSERT INTO t1 VALUES ('id9','test',27,TIMESTAMP '1970-01-01 00:00:01','par5');

3.4 After a few seconds, the new row becomes visible in streaming mode.

References

1. https://hudi.apache.org/docs/flink-quick-start-guide.html

2. https://github.com/MyLanPangzi/flink-demo/blob/main/docs/增量型数据库探索：Flink + Hudi.md

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
