Comprehensive 2021 Flink Interview Questions and Answers
This article presents a detailed collection of 2021 Flink interview questions covering checkpoint mechanisms, watermarks, state backends, join types, fault tolerance, resource configuration, and recent Flink 1.10 features, providing concise explanations and code examples for each topic.
1. How does Flink guarantee exactly‑once consumption?
Flink relies on two mechanisms: the Checkpoint mechanism, which inserts a barrier into the data stream to snapshot operator state, and the two‑phase commit mechanism, implemented via CheckpointedFunction and CheckpointListener, to ensure atomic state updates.
2. Differences between Flink and Spark
Both support batch and stream processing, but Flink treats streams as first‑class citizens with true event‑time processing and lower latency, while Spark’s streaming is micro‑batch based. Flink also often outperforms Spark in iterative batch workloads.
3. What can Flink state be used for?
Flink state is used for checkpoint‑based recovery and for implementing custom logical computations.
4. Watermark mechanism
Watermarks handle out‑of‑order events by defining a delay; for example, with a 5‑second window and a 2‑second watermark, computation triggers when event time exceeds window end plus watermark.
5. Time semantics in Flink
Flink distinguishes Event Time (when the event occurred), Ingestion Time (when the event entered Flink), and Processing Time (when the event is processed by an operator).
6. Flink window join types
Includes window join, co‑group (left/right join), and interval join which requires event time, watermarks, and a defined offset.
7. Common window functions
Tumbling, Sliding, Session, and Count windows.
8. How does keyedProcessFunction work with event time?
It checks whether the current watermark exceeds the trigger time; if so, the function processes the event, otherwise it waits.
9. Handling offline data in Flink
Techniques include async I/O, broadcast state, cache‑augmented async I/O, and periodic refresh in the source’s open method.
10. Supported data types
DataSet API, DataStream API, and Table API.
11. Detecting data skew
Inspect Flink Web UI for uneven subtask data volumes; skew often originates from hot keys in Kafka or imbalanced aggregation operators.
12. Dimension table joins (broadcast join)
Broadcast state distributes dimension data to all downstream tasks, enabling joins via a connected stream.
13. Checkpoint timeout troubleshooting
Check network issues, barrier problems, and data skew via the Web UI.
14. Real‑time Top‑N vs offline Top‑N
Real‑time Top‑N must continuously maintain an in‑memory structure that updates with each incoming record.
15. Checkpoint differences between Spark Streaming and Flink
Spark Streaming checkpoints can cause duplicate consumption, whereas Flink’s checkpoints provide exactly‑once semantics with incremental snapshots stored in memory, RocksDB, or HDFS.
16. Overview of CEP (Complex Event Processing)
CEP detects patterns over event streams, allowing complex events to be emitted when simple events match defined rules.
23. Backpressure in Flink
Backpressure is handled by a distributed blocking queue; when the queue fills, upstream operators are naturally throttled, and monitoring is done via the Flink Web UI.
24. Cost‑Based Optimizer (CBO)
Flink uses a CBO similar to databases to generate optimal logical and physical execution plans based on query cost.
30. Interval join example
DataStream<T> keyed1 = ds1.keyBy(o -> o.getString("key")
DataStream<T> keyed2 = ds2.keyBy(o -> o.getString("key")
// right timestamp -5s <= left timestamp <= right timestamp +5s
keyed1.intervalJoin(keyed2).between(Time.milliseconds(-5), Time.milliseconds(5))31. Determining parallelism and resources
Parallelism is typically aligned with Kafka topic partitions; each parallel instance may require around 3 GB of resources.
32. Broadcast join principle
Broadcast state distributes a dimension stream to all task managers, allowing it to be connected with the main event stream for joins.
36. State backend storage options
Flink supports Memory, RocksDB, and HDFS state backends.
37. Serialization differences between Spark and Flink
Spark uses Java serialization (or Kryo), while Flink implements its own TypeInformation‑based serialization.
38. Handling late data
Watermarks define allowed lateness; in production, strict lateness handling may involve discarding late events or adjusting watermark thresholds.
43. State backend mechanisms
State backends persist snapshots during checkpoints to Memory, HDFS, or RocksDB.
45. Flink vs Spark for batch processing
Flink aims to unify batch and stream processing, offering Hive integration and SQL support, though Spark still leads in raw batch performance.
58. New features in Flink 1.10
Improvements include enhanced memory management, managed memory extensions, simplified RocksDB configuration, unified job submission logic, native Kubernetes integration, and advanced Hive Table API/SQL support with partition pruning and push‑down optimizations.
59. Restart strategies
Flink provides fixed‑delay, failure‑rate, no‑restart, and fallback strategies to handle job failures.
60. When to use aggregate() vs process()
Use aggregate() for incremental aggregation and process() for full‑window computations such as sorting.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
