Comprehensive Overview of Apache Flink Streaming Computation and Architecture
The article systematically introduces Apache Flink’s streaming computation model, contrasting batch and real‑time processing, detailing its unified architecture, managed and raw state with key groups, checkpointing and savepoints for fault tolerance, data exchange mechanisms, time semantics, windowing, side‑outputs, and a complete Java Kafka‑based example.
This article provides a systematic introduction to stream processing in the big‑data ecosystem, focusing on the concepts, architecture, and core mechanisms of Apache Flink.
1. Batch vs. Stream Processing – Batch (offline) processing handles bounded static datasets using frameworks such as Hadoop MapReduce and Hive, while stream (real‑time) processing deals with unbounded data streams using systems like Storm, Spark Streaming, and Flink.
2. Stream vs. Batch Characteristics – Batch jobs operate on finite, persistent, and massive data sets, whereas stream jobs process data that arrives continuously, may be non‑persistent, and require low‑latency results.
3. Flink Overview – Flink is a distributed stream processing engine written in Java/Scala. It can execute both batch and streaming jobs, supports parallel execution, and provides a unified runtime.
4. State Management – Flink distinguishes between Managed State (handled by Flink) and Raw State (user‑managed). Managed state is further divided into Keyed State (per key) and Operator State (per operator instance). The article explains how keyed state is partitioned by keys and how operator state is used for sources/sinks.
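As a concrete illustration of managed keyed state, a function along these lines (the `Event` type and its fields are hypothetical, introduced only for the sketch) keeps one counter per key; Flink scopes the ValueState to the current key and includes it in checkpoints:

```java
public class CountPerKey extends KeyedProcessFunction<String, Event, Tuple2<String, Long>> {

    // Managed keyed state: Flink maintains one value per key and checkpoints it.
    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Tuple2<String, Long>> out)
            throws Exception {
        // value() and update() transparently address the state of the current key
        Long current = countState.value();
        long updated = (current == null ? 0L : current) + 1;
        countState.update(updated);
        out.collect(Tuple2.of(ctx.getCurrentKey(), updated));
    }
}
```

The same descriptor pattern applies to the other managed state primitives (ListState, MapState, ReducingState); operator state for sources and sinks is typically obtained through the CheckpointedFunction interface instead.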
5. State Scaling and Key Groups – To support rescaling, Flink introduces Key Groups as the atomic unit for redistributing keyed state. Each key is hashed twice (a murmur hash is applied on top of the key's hashCode) and assigned to a key group in the range [0, maxParallelism − 1]. Each sub‑task owns one contiguous range of key groups.
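The key-group assignment arithmetic can be sketched in plain Java. This is a simplification for illustration: Flink additionally applies a murmur hash on top of the key's hashCode, which is replaced here by a plain modulo:

```java
// Simplified sketch of Flink's key-group assignment arithmetic.
class KeyGroupSketch {

    // Key group for a key hash; Flink applies a murmur hash first (omitted here).
    static int computeKeyGroup(int keyHash, int maxParallelism) {
        return Math.floorMod(keyHash, maxParallelism);
    }

    // Sub-task (operator index) that owns a given key group.
    static int operatorForKeyGroup(int keyGroup, int parallelism, int maxParallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Contiguous key-group range [start, end] assigned to one sub-task.
    static int[] rangeForOperator(int operatorIndex, int parallelism, int maxParallelism) {
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] { start, end };
    }
}
```

With maxParallelism = 128 and parallelism = 3, the three sub-tasks own the contiguous ranges [0, 42], [43, 85], and [86, 127]; on rescaling, state moves between sub-tasks a whole key group at a time rather than key by key.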
6. Fault Tolerance – Checkpoint & Savepoint – Flink's checkpoint coordinator periodically injects Checkpoint Barriers at the sources; the barriers flow downstream with the data stream, and each operator snapshots its state when the barriers arrive. The article details the barrier alignment process, asynchronous snapshotting, and the difference between automatic checkpoints (fault tolerance) and user‑triggered savepoints (planned state migration and upgrades).
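Enabling checkpointing is a matter of job configuration; a typical setup looks like the fragment below (the interval, pause, and timeout values are illustrative, not recommendations):

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Inject a checkpoint barrier every 60 s
env.enableCheckpointing(60_000);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
env.getCheckpointConfig().setCheckpointTimeout(120_000);

// Retain checkpoints on cancellation so they can serve as a recovery point
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
```

Savepoints, by contrast, are triggered explicitly by the user, e.g. `flink savepoint <jobId>` from the CLI, and are the intended mechanism for upgrades and state migration.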
7. Data Exchange – Flink’s data exchange consists of a control flow (receiver‑initiated) and a data flow realized via IntermediateResult abstractions. JobManager stores the execution graph, while TaskManagers exchange data through TCP connections, using ResultPartition and ResultSubpartition structures.
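The receiver-initiated control flow can be modelled with a toy queue: the sender ships a buffer from a result subpartition only after the downstream task has announced credit, i.e. free buffer space. This is a deliberately simplified model of Flink's credit-based flow control, not its actual implementation:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of a ResultSubpartition under credit-based (receiver-initiated) flow control.
class SubpartitionSketch {
    private final Queue<String> buffers = new ArrayDeque<>();
    private int credits = 0; // credits announced by the downstream receiver

    // Producer enqueues a finished buffer.
    void add(String buffer) { buffers.offer(buffer); }

    // Receiver grants credit, signalling it has buffer space available.
    void addCredit(int n) { credits += n; }

    // A buffer is shipped only when data is queued AND the receiver granted credit.
    String poll() {
        if (credits == 0 || buffers.isEmpty()) return null;
        credits--;
        return buffers.poll();
    }
}
```

The point of the model: backpressure falls out naturally, because a slow receiver simply stops granting credit and the sender's buffers queue up.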
8. Time Semantics – Flink supports three notions of time: Processing Time, Event Time, and Ingestion Time. Event‑time processing relies on watermarks to advance event time and trigger window computations.
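The bounded-out-of-orderness strategy used in the example job can be simulated in a few lines: the watermark trails the highest event timestamp seen so far by a fixed delay, and it never moves backwards even when late events arrive. A minimal sketch of that logic:

```java
// Minimal simulation of a bounded-out-of-orderness watermark generator.
class BoundedOutOfOrdernessSketch {
    private final long maxOutOfOrderness;
    private long currentMax = Long.MIN_VALUE;

    BoundedOutOfOrdernessSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrderness = maxOutOfOrdernessMillis;
    }

    // Record an event timestamp; only a new maximum can advance the watermark.
    void onEvent(long timestamp) {
        currentMax = Math.max(currentMax, timestamp);
    }

    // Watermark asserts: "no more events at or before this timestamp are expected".
    long currentWatermark() {
        return currentMax - maxOutOfOrderness;
    }
}
```

A window fires once the watermark passes the window's end, so the delay trades latency against tolerance for late data.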
9. Windows – Various window types are described: Tumbling (fixed‑size, non‑overlapping), Sliding (overlapping), Session (gap‑based), and other advanced window types. The article shows how windows slice unbounded streams into bounded chunks for aggregation.
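The slicing itself is simple arithmetic. The sketch below mirrors the assignment logic of sliding event-time windows (offset 0, all values in milliseconds): each event lands in size/slide overlapping windows, and a tumbling window is just the special case slide == size:

```java
import java.util.ArrayList;
import java.util.List;

// Start times of every sliding window [start, start + size) containing `timestamp`.
class SlidingWindowSketch {
    static List<Long> assignWindowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // Start of the latest window that contains the timestamp
        long lastStart = timestamp - Math.floorMod(timestamp, slide);
        // Walk backwards one slide at a time while the window still covers the event
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }
}
```

For the example job's parameters (size 10 s, slide 5 s), an event at t = 12.3 s belongs to the two windows [10 s, 20 s) and [5 s, 15 s).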
10. Side Outputs (side‑output streams) – Side outputs allow an operator to emit records to a separate logical stream identified by an OutputTag. Example code demonstrates defining an OutputTag, emitting to it via context.output(), and retrieving the side stream with getSideOutput().
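The side-output pattern looks roughly like the sketch below (the tag name, element types, and routing condition are hypothetical). Note that the OutputTag must be created as an anonymous subclass so its element type survives erasure:

```java
// Tag identifying the side stream; {} makes it an anonymous subclass
final OutputTag<String> rejectedTag = new OutputTag<String>("rejected") {};

SingleOutputStreamOperator<Integer> mainStream = input
        .process(new ProcessFunction<Integer, Integer>() {
            @Override
            public void processElement(Integer value, Context ctx, Collector<Integer> out) {
                if (value >= 0) {
                    out.collect(value);                            // main stream
                } else {
                    ctx.output(rejectedTag, "rejected: " + value); // side stream
                }
            }
        });

// Retrieve the side stream from the operator's result
DataStream<String> rejectedStream = mainStream.getSideOutput(rejectedTag);
```

Because the side stream has its own type, this is also the idiomatic replacement for the older split/select API.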
11. Example Flink Job – A complete Java example is provided: it reads from Kafka, extracts event timestamps, applies a sliding window of 10 s with a 5 s slide, counts occurrences per key, and writes results back to Kafka. The code includes Kafka source/sink configuration, timestamp extraction, window aggregation, and the main execution environment setup.
public class FlinkJobDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        env.getConfig().setAutoWatermarkInterval(1000);

        // Kafka source
        FlinkKafkaConsumer011<String> kafkaSource =
                getKafkaSource("broker:9092", "group", Collections.singletonList("topic"));
        DataStream<String> dataStream = env.addSource(kafkaSource);

        // Deserialize the raw JSON payload into the entity type
        DataStream<T> entityDataStream = dataStream.map(data -> {
            KafkaMessage msg = JSON.parseObject(data, KafkaMessage.class);
            return JSON.parseObject(msg.getData(), T.class);
        });

        // Assign event timestamps & watermarks (5 s of allowed out-of-orderness),
        // then map each entity to a (key, 1) pair
        DataStream<Tuple2<String, Integer>> keyedDataStream = entityDataStream
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<T>(Time.of(5, TimeUnit.SECONDS)) {
                            @Override
                            public long extractTimestamp(T t) {
                                return t.getActionTs();
                            }
                        })
                .flatMap((T entity, Collector<Tuple2<String, Integer>> out) ->
                        out.collect(Tuple2.of(entity.getKey(), 1)))
                // lambdas lose their generic types to erasure, so declare the output type explicitly
                .returns(Types.TUPLE(Types.STRING, Types.INT));

        // Sliding window aggregation: 10 s window, 5 s slide, summing the counts
        DataStream<Tuple2<String, Integer>> windowStream = keyedDataStream
                .keyBy(0)
                .timeWindow(Time.seconds(10), Time.seconds(5))
                .sum(1);

        // Sink the per-key counts back to Kafka as JSON
        windowStream.map(data -> {
            Map<String, Object> map = new HashMap<>();
            map.put("key", data.f0);
            map.put("count", data.f1);
            return JSON.toJSONString(map);
        }).addSink(getKafkaSink("broker:9092", "outTopic"));

        env.execute("FlinkJobDemo");
    }

    // Kafka source/sink helper methods omitted for brevity

    @Data
    public static class T {
        private long actionTs;
        private String key;
    }
}
The article concludes with a reminder that deeper mastery of Flink requires hands‑on practice and further reading.
Tencent Cloud Developer