Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees
This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.
Apache Flink is an emerging big‑data framework that differs from Spark by using stream processing to simulate batch jobs, providing sub‑second latency and exactly‑once semantics. One common use case is building a real‑time data pipeline that moves and transforms data between storage systems.
The tutorial demonstrates how to create a Flink project with Maven, add the Kafka connector, and write a program that consumes JSON events from a Kafka topic, extracts the timestamp, and writes the records to HDFS using StreamingFileSink with a custom BucketAssigner that creates daily partitions based on event time.
mvn archetype:generate \
-DarchetypeGroupId=org.apache.flink \
-DarchetypeArtifactId=flink-quickstart-java \
-DarchetypeVersion=1.7.0Kafka source configuration:
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
"flink_test", new SimpleStringSchema(), props);
DataStream<String> stream = env.addSource(consumer);Custom bucket assigner (event‑time partitioning):
public class EventTimeBucketAssigner implements BucketAssigner<String, String> {
@Override
public String getBucketId(String element, Context context) {
JsonNode node = mapper.readTree(element);
long date = (long) (node.path("timestamp").floatValue() * 1000);
String partitionValue = new SimpleDateFormat("yyyyMMdd").format(new Date(date));
return "dt=" + partitionValue;
}
}Sink definition using the custom assigner:
StreamingFileSink<String> sink = StreamingFileSink
.forRowFormat(new Path("/tmp/kafka-loader"), new SimpleStringEncoder<>())
.withBucketAssigner(new EventTimeBucketAssigner())
.build();
stream.addSink(sink);To achieve exactly‑once semantics, checkpointing is enabled and the state backend is switched to a filesystem (or RocksDB) backend:
env.enableCheckpointing(60_000);
env.setStateBackend(new FsStateBackend("/tmp/flink/checkpoints"));
env.getCheckpointConfig().enableExternalizedCheckpoints(
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);Savepoints can be used to stop, upgrade, and resume jobs without data loss. Example commands for creating a savepoint and restarting from it are provided, as well as instructions for submitting the job to a YARN cluster.
The article also explains how Flink guarantees exactly‑once processing: checkpoints are based on the Chandy‑Lamport algorithm, barriers align streams, and state (including Kafka offsets and in‑progress file handles) is persisted. Upon failure, Flink restores state, re‑plays source data from the last completed checkpoint, and truncates partially written files to ensure no duplicate records.
In conclusion, Flink’s design, strong checkpointing model, and seamless integration with the Hadoop ecosystem make it a competitive choice for real‑time big‑data workloads, and its ecosystem continues to grow with Table API, streaming SQL, and machine‑learning extensions.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
