Understanding Stream‑Batch Integration in Modern Data Engineering
This article explains the rise of stream‑batch integration, a concept originally popularized by the Flink community, along with its challenges and practical approaches. It covers why the idea struggles at large scale, how companies instead adopt Kappa‑style real‑time pipelines or unified storage‑compute engines, and why it still comes up in technical interviews.
In 2021 and 2022, “stream‑batch integration” was widely discussed in the data development community, and nearly everyone had an opinion on it, whatever their level of expertise.
Although the topic has faded from mainstream conversation, it still appears in many interview questions, prompting candidates to wonder how to answer.
The core idea is to have a single codebase that can run both batch and streaming workloads.
The concept was first promoted by the Flink community: the same Flink code would run in batch mode for offline computation and in streaming mode for real‑time computation, so that both paths produce consistent data and aligned metrics.
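The "single codebase" idea can be illustrated with a minimal plain-Python sketch. It does not use Flink; the event shape and the function names (`running_totals`, `run_batch`, `run_streaming`) are hypothetical and chosen only for illustration. The point is that one piece of business logic serves both modes, so the final batch result and the last streaming update per key agree.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (user_id, amount): a hypothetical event shape


def running_totals(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Shared business logic: emit the updated total after each event."""
    totals: Dict[str, int] = defaultdict(int)
    for user, amount in events:
        totals[user] += amount
        yield user, totals[user]


def run_batch(events: Iterable[Event]) -> Dict[str, int]:
    """Batch mode: consume a bounded input and keep only the final totals."""
    final: Dict[str, int] = {}
    for user, total in running_totals(events):
        final[user] = total
    return final


def run_streaming(events: Iterable[Event]) -> Iterator[Tuple[str, int]]:
    """Streaming mode: forward every incremental update as it is produced."""
    return running_totals(events)


events = [("alice", 10), ("bob", 5), ("alice", 7)]
batch_result = run_batch(events)                    # bounded input
stream_updates = list(run_streaming(iter(events)))  # same logic, fed incrementally
```

Because both entry points call the same `running_totals` function, the metric definition cannot drift between the offline and real-time paths, which is the consistency benefit described above.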
While the idea sounds attractive, it assumes that Flink can efficiently and reliably support both offline and real‑time pipelines.
At small data volumes or limited business scale, the approach works without major problems.
However, when data volume and business scale grow large, the approach becomes impractical, and only a handful of companies have successfully implemented it in production.
The concept is not dead; instead, practitioners can shift focus from unifying the compute engine to unifying the data side, specifically the data output.
Two main implementation patterns have emerged:
First pattern – integrating with Kappa architecture and unifying data output on the real‑time side
Some leading companies adopt a Kappa‑style design where real‑time processing is the primary path, and data is synchronized to offline storage via a mature Kafka → HDFS pipeline, providing stable, low‑latency data for downstream analytics.
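A rough sketch of this first pattern, using only in-memory Python structures: the list `log` stands in for a Kafka topic and the dictionary `offline_store` stands in for hour-partitioned directories on HDFS. These names, the record shape, and the partition layout are assumptions made for illustration, not a description of any specific company's pipeline.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Stand-in for a Kafka topic: ordered (epoch_seconds, page) records.
log = [
    (1_700_000_000, "home"),
    (1_700_000_100, "cart"),
    (1_700_003_700, "home"),
]


def partition_key(ts: int) -> str:
    """Derive an hourly partition name (hypothetical layout) from event time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("dt=%Y-%m-%d/hour=%H")


realtime_counts = defaultdict(int)  # metric served by the real-time path
offline_store = defaultdict(list)   # stand-in for partitioned offline storage

# A single pass over the log feeds both the real-time metric and the archive,
# so offline consumers read exactly the records the real-time side processed.
for ts, page in log:
    realtime_counts[page] += 1
    offline_store[partition_key(ts)].append((ts, page))

# An offline job recomputes the same metric from the archived partitions.
offline_counts = defaultdict(int)
for records in offline_store.values():
    for _, page in records:
        offline_counts[page] += 1
```

Since the archive is derived from the same log as the real-time metric, the two sets of counts match, which is the sense in which the data output is unified on the real-time side.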
Second pattern – unifying storage and compute engines to support both stream and batch
A few companies develop custom storage engines that can serve both streaming reads (e.g., Flink SQL) and batch reads (e.g., Spark SQL), so that all data originates from a single source. However, the two modes differ in semantics, especially in how state is handled, which limits the approach to specific scenarios.
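The second pattern can be sketched as a toy append-only table that offers both a snapshot read and an incremental read. The class `UnifiedTable` and its methods are hypothetical and only illustrate the idea; real engines are far more involved. The sketch also hints at the state difference: the streaming reader has to carry an offset and a running total between reads, while the batch reader recomputes from a snapshot.

```python
from typing import Iterator, List, Tuple

Record = Tuple[str, int]  # (key, value): a hypothetical row shape


class UnifiedTable:
    """Toy append-only table serving both batch and streaming reads."""

    def __init__(self) -> None:
        self._rows: List[Record] = []

    def append(self, row: Record) -> None:
        self._rows.append(row)

    def latest_offset(self) -> int:
        return len(self._rows)

    def read_snapshot(self, upto: int) -> List[Record]:
        """Batch read: every row visible at the given offset."""
        return list(self._rows[:upto])

    def read_incremental(self, start: int) -> Iterator[Tuple[int, Record]]:
        """Streaming read: rows from `start` on, each with the offset to resume from."""
        for i in range(start, len(self._rows)):
            yield i + 1, self._rows[i]


table = UnifiedTable()
table.append(("a", 1))
table.append(("b", 2))

# Streaming reader: keeps state (resume offset and running total) across reads.
stream_total, resume_at = 0, 0
for next_offset, (_, value) in table.read_incremental(resume_at):
    stream_total += value
    resume_at = next_offset

table.append(("a", 4))  # new data arrives after the first streaming read

for next_offset, (_, value) in table.read_incremental(resume_at):
    stream_total += value
    resume_at = next_offset

# Batch reader: stateless, recomputes from a snapshot of the same table.
batch_total = sum(value for _, value in table.read_snapshot(table.latest_offset()))
```

Both readers arrive at the same total because they read from one source, but only the streaming reader depends on state that must survive between reads.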
Despite limited adoption, the approach continues to evolve quietly within large enterprises, offering valuable production practices for those who can apply it.
Consequently, interviewers still ask candidates about real‑world implementations of stream‑batch integration, reflecting its ongoing relevance in the data engineering field.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
