Building a Real‑Time ETL Pipeline with Apache Flink and Ensuring Exactly‑once Semantics
This article demonstrates how to develop a real‑time ETL job using Apache Flink, covering project setup, Kafka as a source, custom bucket assigners for HDFS, checkpointing, savepoints, and deployment on YARN to achieve exactly‑once processing guarantees.
Apache Flink is a streaming‑first framework that can simulate batch processing, providing sub‑second latency and exactly‑once semantics. The article shows how to build a real‑time ETL pipeline that reads events from Kafka, parses timestamps, and writes them to HDFS partitioned by date.
Project creation
The Flink application is written in Java 8 and can be generated with the Maven quick‑start archetype:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.7.0
The generated StreamingJob class is the entry point for further development.
Kafka data source
Add the Kafka connector dependency to pom.xml:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka-0.10_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
Then configure the consumer:
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
    "flink_test", new SimpleStringSchema(), props);
DataStream<String> stream = env.addSource(consumer);
Streaming file sink with custom bucket assigner
To write records to HDFS by event date, a custom BucketAssigner parses the JSON timestamp and returns a dt=yyyyMMdd directory name:
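public class EventTimeBucketAssigner implements BucketAssigner<String, String> {
    private final ObjectMapper mapper = new ObjectMapper(); // Jackson; reused for every record
    @Override
    public String getBucketId(String element, Context context) {
        try {
            JsonNode node = mapper.readTree(element);
            // Epoch seconds to milliseconds; double avoids the precision loss of floatValue().
            long date = (long) (node.path("timestamp").doubleValue() * 1000);
            return "dt=" + new SimpleDateFormat("yyyyMMdd").format(new Date(date));
        } catch (IOException e) {
            throw new UncheckedIOException("Cannot parse record as JSON", e);
        }
    }
    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE; // bucket IDs are plain strings
    }
}
The sink is then created: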
StreamingFileSink<String> sink = StreamingFileSink
    // Explicit <String>: the diamond cannot be inferred through the builder chain.
    .forRowFormat(new Path("/tmp/kafka-loader"), new SimpleStringEncoder<String>())
    .withBucketAssigner(new EventTimeBucketAssigner())
    .build();
stream.addSink(sink);
Enabling exactly-once with checkpoints
Checkpointing is disabled by default, so a failed job has no consistent state or source offsets to restore from. To achieve exactly-once, enable periodic checkpoints and store state in a reliable backend:
env.enableCheckpointing(60_000);
env.setStateBackend(new FsStateBackend("/tmp/flink/checkpoints"));
env.getCheckpointConfig().enableExternalizedCheckpoints(
    ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);
Checkpoints use a variant of the Chandy-Lamport algorithm: barriers are injected into the streams, operators align on them, and each operator's state is persisted. The Kafka source implements CheckpointedFunction to store its offsets, so records can be replayed from the last completed checkpoint.
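The interplay of stored offsets and operator state can be illustrated without Flink. The following is a minimal sketch in plain Java, assuming a list stands in for the replayable Kafka log and a running sum is the only operator state. CheckpointReplaySketch, Snapshot, and run are hypothetical names for illustration, not Flink APIs.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of checkpoint-and-replay; illustrative only, not Flink code. */
final class CheckpointReplaySketch {

    /** Source offset and operator state captured at the same logical point. */
    static final class Snapshot {
        final int offset; // index of the next record the source would read
        final long sum;   // operator state after all records before that offset
        Snapshot(int offset, long sum) { this.offset = offset; this.sum = sum; }
    }

    /**
     * Sums the log, checkpointing after every `interval` records and simulating
     * one crash just before the record at index `failAt` is processed.
     * The source always rewinds to the checkpointed offset; the operator state
     * is rolled back only if `restoreState` is true.
     */
    static long run(List<Integer> log, int interval, int failAt, boolean restoreState) {
        Snapshot last = new Snapshot(0, 0L); // nothing processed yet
        int offset = 0;
        long sum = 0L;
        boolean crashed = false;
        while (offset < log.size()) {
            if (!crashed && offset == failAt) {
                crashed = true;
                offset = last.offset;      // replay from the last completed checkpoint
                if (restoreState) {
                    sum = last.sum;        // roll the operator back to the same point
                }
                continue;
            }
            sum += log.get(offset);
            offset++;
            if (offset % interval == 0) {
                last = new Snapshot(offset, sum); // completed checkpoint
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> log = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(run(log, 3, 5, true));  // prints 55
        System.out.println(run(log, 3, 5, false)); // prints 64
    }
}
```

With restoreState set to false the source rewinds but the sum does not, so records 4 and 5 are counted twice. Snapshotting offsets and state together is what removes that duplicate effect.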
Savepoints and job management
Savepoints are manually triggered snapshots used for upgrades or restarts. The following command takes a savepoint under /tmp/flink/savepoints and then cancels the job with the given ID:
$ bin/flink cancel -s /tmp/flink/savepoints 1253cc85e5c702dbe963dd7d8d279038
Jobs can be submitted to a local cluster or to YARN. When running on YARN, give the file paths an HDFS prefix and use the flink run command with the -m yarn-cluster option.
How Flink guarantees exactly‑once
Flink's checkpointing records the state of all operators together with the offsets of the source partitions. Upon failure, the job restores from the latest completed checkpoint: the sources rewind to the saved offsets, and the sink re-opens its in-progress files and truncates them to the length recorded in that checkpoint. Records written after the checkpoint are discarded and then replayed, so the output of StreamingFileSink contains each record exactly once.
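The truncate-and-replay step can be sketched with ordinary file I/O. This is a toy model in plain Java, assuming a local temp file stands in for an in-progress part file and a list index stands in for the Kafka offset. TruncateOnRestoreSketch and its methods are hypothetical names, and Flink's real sink goes through its own file system abstraction rather than java.nio.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

/** Toy model of truncate-on-restore for an in-progress file; not Flink code. */
final class TruncateOnRestoreSketch {

    private static void append(Path file, String record) throws IOException {
        Files.write(file, (record + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /**
     * Writes records one per line, takes a single checkpoint after
     * `checkpointAfter` records, crashes after `crashAfter` records,
     * then recovers and finishes. Returns the final file content.
     */
    static List<String> run(List<String> records, int checkpointAfter, int crashAfter) {
        try {
            Path file = Files.createTempFile("part-0-0", ".inprogress");
            try {
                long savedLength = 0L; // file length stored in the checkpoint
                int savedOffset = 0;   // source offset stored in the same checkpoint
                for (int i = 0; i < crashAfter; i++) {
                    append(file, records.get(i));
                    if (i + 1 == checkpointAfter) {
                        savedLength = Files.size(file);
                        savedOffset = i + 1;
                    }
                }
                // Crash and restore: bytes past savedLength belong to no checkpoint.
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                    channel.truncate(savedLength);
                }
                // Replay from the checkpointed offset.
                for (int i = savedOffset; i < records.size(); i++) {
                    append(file, records.get(i));
                }
                return Files.readAllLines(file, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Records c and d are written before the crash, dropped by truncate, then replayed.
        System.out.println(run(Arrays.asList("a", "b", "c", "d", "e"), 2, 4)); // prints [a, b, c, d, e]
    }
}
```

Skipping the truncate call would leave c and d in the file twice, which is the at-least-once outcome that the saved length prevents.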
Conclusion
Apache Flink’s design emphasizes durable state handling and seamless integration with the Hadoop ecosystem, making it a competitive choice for real‑time big‑data processing. Its evolving features such as Table API, streaming SQL, and machine‑learning extensions further broaden its applicability.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
