Big Data 11 min read

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

Apache Flink is an emerging big‑data framework that differs from Spark by using stream processing to simulate batch jobs, providing sub‑second latency and exactly‑once semantics. One common use case is building a real‑time data pipeline that moves and transforms data between storage systems.

The tutorial demonstrates how to create a Flink project with Maven, add the Kafka connector, and write a program that consumes JSON events from a Kafka topic, extracts the timestamp, and writes the records to HDFS using StreamingFileSink with a custom BucketAssigner that creates daily partitions based on event time.

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.7.0

Kafka source configuration:

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
    "flink_test", new SimpleStringSchema(), props);
DataStream<String> stream = env.addSource(consumer);

Custom bucket assigner (event‑time partitioning):

public class EventTimeBucketAssigner implements BucketAssigner<String, String> {
  @Override
  public String getBucketId(String element, Context context) {
    JsonNode node = mapper.readTree(element);
    long date = (long) (node.path("timestamp").floatValue() * 1000);
    String partitionValue = new SimpleDateFormat("yyyyMMdd").format(new Date(date));
    return "dt=" + partitionValue;
  }
}

Sink definition using the custom assigner:

StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path("/tmp/kafka-loader"), new SimpleStringEncoder<>())
    .withBucketAssigner(new EventTimeBucketAssigner())
    .build();
stream.addSink(sink);

To achieve exactly‑once semantics, checkpointing is enabled and the state backend is switched to a filesystem (or RocksDB) backend:

env.enableCheckpointing(60_000);
env.setStateBackend(new FsStateBackend("/tmp/flink/checkpoints"));
env.getCheckpointConfig().enableExternalizedCheckpoints(
    ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

Savepoints can be used to stop, upgrade, and resume jobs without data loss. Example commands for creating a savepoint and restarting from it are provided, as well as instructions for submitting the job to a YARN cluster.

The article also explains how Flink guarantees exactly‑once processing: checkpoints are based on the Chandy‑Lamport algorithm, barriers align streams, and state (including Kafka offsets and in‑progress file handles) is persisted. Upon failure, Flink restores state, re‑plays source data from the last completed checkpoint, and truncates partially written files to ensure no duplicate records.

In conclusion, Flink’s design, strong checkpointing model, and seamless integration with the Hadoop ecosystem make it a competitive choice for real‑time big‑data workloads, and its ecosystem continues to grow with Table API, streaming SQL, and machine‑learning extensions.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataApache FlinkStreamingKafkareal-time ETLHDFSExactly-Once
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.