Big Data 11 min read

Building a Real‑Time ETL Pipeline with Apache Flink and Ensuring Exactly‑once Semantics

This article demonstrates how to develop a real‑time ETL job using Apache Flink, covering project setup, Kafka as a source, custom bucket assigners for HDFS, checkpointing, savepoints, and deployment on YARN to achieve exactly‑once processing guarantees.

Big Data Technology & Architecture

Sep 19, 2019

Building a Real‑Time ETL Pipeline with Apache Flink and Ensuring Exactly‑once Semantics

Apache Flink is a streaming‑first framework that can simulate batch processing, providing sub‑second latency and exactly‑once semantics. The article shows how to build a real‑time ETL pipeline that reads events from Kafka, parses timestamps, and writes them to HDFS partitioned by date.

Project creation

The Flink application is written in Java 8 and can be generated with the Maven quick‑start archetype:

mvn archetype:generate \
  -DarchetypeGroupId=org.apache.flink \
  -DarchetypeArtifactId=flink-quickstart-java \
  -DarchetypeVersion=1.7.0

The generated StreamingJob class is the entry point for further development.

Kafka data source

Add the Kafka connector dependency to the pom.xml and configure the consumer:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.10_${scala.binary.version}</artifactId>
  <version>${flink.version}</version>
</dependency>

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
    "flink_test", new SimpleStringSchema(), props);
DataStream<String> stream = env.addSource(consumer);

Streaming file sink with custom bucket assigner

To write records to HDFS by event date, a custom BucketAssigner parses the JSON timestamp and returns a dt=yyyyMMdd directory name:

public class EventTimeBucketAssigner implements BucketAssigner<String, String> {
  @Override
  public String getBucketId(String element, Context context) {
    JsonNode node = mapper.readTree(element);
    long date = (long) (node.path("timestamp").floatValue() * 1000);
    String partitionValue = new SimpleDateFormat("yyyyMMdd").format(new Date(date));
    return "dt=" + partitionValue;
  }
}

The sink is then created:

StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path("/tmp/kafka-loader"), new SimpleStringEncoder<>())
    .withBucketAssigner(new EventTimeBucketAssigner())
    .build();
stream.addSink(sink);

Enabling exactly‑once with checkpoints

By default Flink provides at‑least‑once semantics. To achieve exactly‑once, enable periodic checkpoints and store state in a reliable backend:

env.enableCheckpointing(60_000);
env.setStateBackend(new FsStateBackend("/tmp/flink/checkpoints"));
env.getCheckpointConfig().enableExternalizedCheckpoints(
    ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

Checkpoints use the Chandy‑Lamport algorithm, inserting barriers to align streams and persisting operator state. Kafka sources implement CheckpointedFunction to store offsets, allowing replay from the last completed checkpoint.

Savepoints and job management

Savepoints are manually triggered snapshots used for upgrades or restarts. Example command:

$ bin/flink cancel -s /tmp/flink/savepoints 1253cc85e5c702dbe963dd7d8d279038

Jobs can be submitted to a local cluster or to YARN. When running on YARN, prepend HDFS prefixes to file paths and use the flink run command with the -m yarn-cluster option.

How Flink guarantees exactly‑once

Flink’s checkpointing records the state of all operators and the offsets of source partitions. Upon failure, the job restores from the latest completed checkpoint, re‑opens in‑progress files, truncates them to the saved length, and continues processing without duplicate records. The StreamingFileSink therefore writes only once‑consistent output.

Conclusion

Apache Flink’s design emphasizes durable state handling and seamless integration with the Hadoop ecosystem, making it a competitive choice for real‑time big‑data processing. Its evolving features such as Table API, streaming SQL, and machine‑learning extensions further broaden its applicability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Apache Flink kafka real-time ETL Exactly-once checkpointing StreamingFileSink

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.