From BI to Kappa: How Data Architecture Evolved in the Big Data Era
This article traces the evolution of data architecture from early BI systems through traditional big‑data stacks, streaming, Lambda and Kappa designs, and explains how a unified stream‑batch model simplifies development while keeping logic consistent across data‑analysis and pipeline applications.
1. Pre‑Big Data Era
Before the big‑data era, BI (Business Intelligence) systems were the primary tools for data analysis. Early BI combined data cleaning, analysis, mining, and reporting, and was often built on relational databases using cube models queried with MDX. Its main limitations were the constraints of the underlying RDBMS and the lack of support for unstructured data.
2. Evolution of Big Data Architecture
Traditional Big Data Architecture
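To handle massive data volumes, companies adopted distributed storage and compute. The best‑known designs are the Google File System and MapReduce, which Google described in published papers; Hadoop later provided open‑source implementations of both ideas (HDFS and Hadoop MapReduce). A typical pipeline reads from sources (e.g., MySQL, flat files), passes the data through ODS (operational data store), DWD (data warehouse detail), and ADS (application data service) layers, and finally serves consumers.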
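As a rough illustration of the layered pipeline described above, the following sketch runs a tiny offline job in plain Python. The row format, the function names `build_dwd` and `build_ads`, and the per‑user aggregation are all assumptions made for this example; a real warehouse would express these steps as distributed jobs.

```python
from collections import defaultdict

# Hypothetical raw rows, as they might land in the ODS layer from MySQL or flat files.
ods_orders = [
    {"order_id": "1", "user": " alice ", "amount": "30.5"},
    {"order_id": "2", "user": "bob", "amount": "12"},
    {"order_id": "3", "user": "alice", "amount": "bad"},  # malformed amount
    {"order_id": "4", "user": "bob", "amount": "8"},
]


def build_dwd(ods_rows):
    """ODS -> DWD: clean and type the raw rows, dropping rows that fail to parse."""
    cleaned = []
    for row in ods_rows:
        try:
            cleaned.append({
                "order_id": int(row["order_id"]),
                "user": row["user"].strip(),
                "amount": float(row["amount"]),
            })
        except ValueError:
            # A production job would usually route bad rows to a quarantine table.
            continue
    return cleaned


def build_ads(dwd_rows):
    """DWD -> ADS: aggregate detail rows into a per-user total for consumers."""
    totals = defaultdict(float)
    for row in dwd_rows:
        totals[row["user"]] += row["amount"]
    return dict(totals)


dwd_orders = build_dwd(ods_orders)
ads_spend_by_user = build_ads(dwd_orders)
```

Each layer only reads the output of the layer before it, which is why results are available only after the whole chain has finished.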
Streaming Architecture
Streaming architecture discards the offline ETL chain, letting a stream engine consume incremental data directly from business databases and produce real‑time results. Early implementations struggled to guarantee both low latency and high accuracy, limiting their use to scenarios where precision was less critical.
Lambda Architecture
Lambda combines a real‑time stream layer with a batch layer that processes daily snapshots, merging their outputs to achieve low latency and eventual consistency. However, it requires maintaining two separate codebases and systems, leading to high development and operational costs and potential inconsistencies.
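The merge step can be sketched in a few lines of Python. This is a minimal illustration, assuming page‑view events as dictionaries; the function names `batch_view`, `speed_view`, and `serve` are hypothetical and do not come from any particular framework.

```python
from collections import Counter


def batch_view(snapshot_events):
    """Batch layer: recompute counts from the full daily snapshot."""
    return Counter(e["page"] for e in snapshot_events)


def speed_view(recent_events):
    """Speed layer: count events that arrived after the snapshot, one at a time.

    Note that this is a second implementation of the same counting logic,
    which is exactly the duplication that Lambda is criticized for.
    """
    counts = Counter()
    for e in recent_events:
        counts[e["page"]] += 1
    return counts


def serve(batch, speed):
    """Serving layer: merge both views to answer a query."""
    return dict(batch + speed)


snapshot = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
recent = [{"page": "home"}, {"page": "checkout"}]
merged = serve(batch_view(snapshot), speed_view(recent))
```

If the two counting functions ever drift apart, the merged result becomes inconsistent, which is the maintenance risk described above.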
Kappa Architecture
Kappa removes the batch layer, relying solely on a distributed message queue (e.g., Kafka) and a single stream processing engine. When data errors or logic changes occur, the system replays the queue to recompute results. While this simplifies the stack, its applicability is limited by how long the queue retains data: records that have already expired from the queue cannot be replayed.
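The replay idea, and its dependence on retention, can be shown with an in‑memory list standing in for the message queue. This is a sketch under stated assumptions: the record fields, the `replay` helper, and the two processing functions are invented for illustration and are not Kafka APIs.

```python
def replay(log, process, from_offset=0):
    """Rebuild state from scratch by re-reading the log from a given offset."""
    state = {}
    for record in log[from_offset:]:
        process(state, record)
    return state


def sum_v1(state, record):
    """Original logic: sums every record, including test traffic (a bug)."""
    state[record["user"]] = state.get(record["user"], 0) + record["amount"]


def sum_v2(state, record):
    """Corrected logic: skips records flagged as test traffic."""
    if record["test"]:
        return
    state[record["user"]] = state.get(record["user"], 0) + record["amount"]


log = [
    {"user": "alice", "amount": 10, "test": False},
    {"user": "alice", "amount": 5, "test": True},
    {"user": "bob", "amount": 7, "test": False},
]

old_result = replay(log, sum_v1)
fixed_result = replay(log, sum_v2)  # same engine, new logic, full replay

# If the oldest record has already expired from the queue, replay cannot see it.
retained_log = log[1:]
partial_result = replay(retained_log, sum_v2)
```

The last replay loses alice's purchase entirely, which illustrates why retention limits restrict where Kappa can be applied.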
3. Stream‑Batch Unified Model and Data Architecture
The stream‑batch unified model lets a single codebase (Java or SQL) be executed either as a streaming job (incremental) or a batch job (full snapshot) based on configuration or automatic detection, producing identical results. This capability is valuable for both data‑analysis and data‑pipeline applications.
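To make the "one codebase, two execution modes" idea concrete, here is a minimal Python sketch. The `mode` switch and the names `apply_event`, `run_batch`, `run_streaming`, and `run` are assumptions for this example; a real unified engine would handle the mode selection and state management itself.

```python
def apply_event(state, event):
    """The single piece of business logic, shared by both execution modes."""
    state[event["user"]] = state.get(event["user"], 0) + event["amount"]
    return state


def run_batch(events):
    """Batch mode: process the full snapshot and return the final result once."""
    state = {}
    for event in events:
        apply_event(state, event)
    return state


def run_streaming(events):
    """Streaming mode: emit an updated result after every incoming event."""
    state = {}
    for event in events:
        apply_event(state, event)
        yield dict(state)


def run(events, mode):
    """Choose the execution mode from configuration; the logic is unchanged."""
    if mode == "batch":
        return run_batch(events)
    if mode == "streaming":
        last = {}
        for last in run_streaming(events):
            pass  # a real job would publish each intermediate result
        return last
    raise ValueError(f"unknown mode: {mode}")


events = [
    {"user": "alice", "amount": 3},
    {"user": "bob", "amount": 4},
    {"user": "alice", "amount": 5},
]
batch_result = run(events, "batch")
stream_result = run(events, "streaming")
```

Because both modes call the same `apply_event`, the streaming result converges to the batch result once all events have been seen.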
Data‑analysis Applications
Most data‑analysis workloads benefit from combining real‑time and batch processing, effectively implementing a refined Lambda architecture where the same logic runs in both modes, reducing duplication and ensuring consistent semantics.
Data‑pipeline Applications
For data synchronization tasks, the unified engine can first perform a batch load and then switch to a streaming mode to capture ongoing changes, using connectors (e.g., Flink CDC) to keep source and target in sync without separate tools.
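The batch‑then‑stream handover can be sketched as follows. This is plain Python for illustration only and does not use the Flink CDC API; the change‑event shape (`op` plus `row`) and the function names are assumptions.

```python
def initial_load(source_rows):
    """Batch phase: copy a full snapshot of the source, keyed by primary key."""
    return {row["id"]: dict(row) for row in source_rows}


def apply_change(target, change):
    """Streaming phase: apply one captured change event to the target."""
    op = change["op"]
    row = change["row"]
    if op in ("insert", "update"):
        target[row["id"]] = dict(row)
    elif op == "delete":
        target.pop(row["id"], None)
    else:
        raise ValueError(f"unknown operation: {op}")


def sync(source_rows, change_stream):
    """Run the snapshot load, then switch to consuming ongoing changes."""
    target = initial_load(source_rows)
    for change in change_stream:
        apply_change(target, change)
    return target


source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
changes = [
    {"op": "update", "row": {"id": 1, "name": "a2"}},
    {"op": "insert", "row": {"id": 3, "name": "c"}},
    {"op": "delete", "row": {"id": 2}},
]
target = sync(source, changes)
```

A real connector also has to make sure no change is lost or applied twice at the moment of the switch; that coordination is omitted here.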
4. Summary
Early BI → Traditional big‑data architecture (handles large data volumes, but with high latency)
Traditional → Streaming (low latency, lower accuracy)
Streaming → Lambda (adds batch for accuracy, but complex)
Lambda → Kappa (removes redundancy, but limited by message‑queue lifespan)
Adopting a stream‑batch unified engine on top of a Lambda‑style design reduces system complexity and keeps computation logic consistent across real‑time and batch workloads.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
