Big Data 15 min read

Choosing Between Lambda and Kappa: Real‑Time Data Warehouse Strategies

The article uses an acorn‑moving analogy to highlight latency and traceability challenges in enterprise data warehouses, then explains offline versus real‑time approaches, compares Lambda and Kappa architectures, discusses Iceberg integration, and shares a detailed e‑commerce real‑time warehouse case study with optimization tips.

JavaEdge
JavaEdge
JavaEdge
Choosing Between Lambda and Kappa: Real‑Time Data Warehouse Strategies

Motivation

Batch‑oriented pipelines cause two main problems:

Data is only available after a large batch is collected, preventing timely access to specific records.

Even with real‑time ingestion, locating previously processed items is difficult without proper indexing.

Enterprise Real‑Time Data Warehouse Architecture

Enterprises typically ingest heterogeneous data sources (logs, business data) and initially adopt a siloed "pipeline" approach, which leads to high code coupling, duplicate development, resource waste, and monitoring challenges. A layered architecture mitigates these issues by separating data into hierarchical layers built bottom‑up.

Data sources : log data and structured/unstructured business data.

Warehouse types : offline (batch) and real‑time.

Technology stack :

Ingestion: Sqoop, Flume, CDC

Storage: Hive, HBase, MySQL, Kafka, data lake

Processing: Hive, Spark, Flink

OLAP: Kylin, ClickHouse, Elasticsearch, DorisDB

Offline Warehouse (Hive)

Typical workflow for a daily PV/UV report:

Store raw data in Hive ODS layer.

Clean and enrich data in DWD layer.

Perform dimensional modeling and aggregation in DWS layer.

Schedule T+1 jobs with DolphinScheduler.

Query via an OLAP engine and generate reports.

Advantages: simple stack, high Hive storage performance, suitable for interactive queries.

Drawback: Hive’s batch execution incurs latency.

Accelerating Offline Processing with Spark

Replacing Hive with Spark for the ODS→DWD→DWS pipeline reduces processing time because Spark operates in‑memory. Memory must be sized appropriately to avoid OOM.

Real‑Time Warehouse Architectures

Two common patterns are Lambda and Kappa.

Lambda Architecture

Maintains parallel batch and stream pipelines, producing both real‑time and offline warehouses.

Core stack: Flink, Kafka, Hive.

Stream flow: Kafka ODS → Flink clean → Kafka DWD → Flink aggregate → Kafka DWS.

Batch flow mirrors the offline Hive pipeline.

Realtime OLAP queries are served from the stream side.

Pros : guarantees timeliness and historical completeness; provides data redundancy and traceability.

Cons : duplicate pipelines increase cost, cause data redundancy, and may lead to inconsistencies.

Improvement: periodically refresh each real‑time layer’s results into the offline warehouse, eliminating duplicate source reads and reducing runtime.

Kappa Architecture

Uses a single stream pipeline, removing the batch side.

Core stack: Flink, Kafka.

Stream flow identical to Lambda’s real‑time part.

No offline warehouse is built.

Realtime OLAP queries require connectors because Kafka is not OLAP‑friendly.

Pros : lower maintenance cost, strong real‑time performance, data processed only once.

Cons : historical back‑track is difficult (Kafka may discard old data); OLAP queries on Kafka need additional tooling.

Data Lake Integration with Apache Iceberg

Kafka is unsuitable for long‑term storage and analytical queries. Replacing Kafka with a table‑format lake such as Apache Iceberg (or Hudi) satisfies the following requirements:

Support for data back‑track and updates.

Unified batch‑stream read/write with real‑time ingestion.

Iceberg provides snapshot‑based read/write separation, ACID semantics, schema and partition evolution, and full compatibility with Flink.

Upgrading Kappa to Iceberg

Workflow after the upgrade:

Store raw data in an Iceberg ODS table instead of Kafka.

Flink reads source events and writes to Iceberg ODS.

Continue the ODS→DWD→DWS processing chain, persisting each layer in Iceberg.

Iceberg enables unified stream‑batch queries and serves OLAP workloads.

E‑Commerce Real‑Time Warehouse Case Study

Architecture Overview

The system combines Flink, Spark, and Kafka to provide unified data services for e‑commerce.

Data ingestion uses Flink CDC to capture changes from business systems and third‑party event tracking.

{
  "data": [
    {
      "id": "13",
      "order_id": "6BB4837EB74E4568DDA7DC67ED2CA2AD9",
      "order_code": "order_x001",
      "price": "135.00"
    }
  ]
}

CREATE TABLE order_detail_table (
  id BIGINT,
  order_id STRING,
  order_code STRING,
  price DECIMAL(10, 2)
) WITH (
  'connector' = 'kafka',
  'topic' = 'order_binlog',
  'properties.bootstrap.servers' = 'localhost:9092',
  'properties.group.id' = 'group001',
  'canal-json.ignore-parse-errors' = 'true'
);

Data is transformed into three real‑time models corresponding to DWD (detail), DWS (light aggregation), and ADS (heavy aggregation) layers.

Initially Spark Streaming + Kafka was used; later the stack switched to Flink + Kafka to achieve sub‑second latency.

Processed results are persisted to downstream stores such as Elasticsearch, Redis, MySQL, and Kafka, and are exposed via APIs for downstream applications (recommendation, user profiling, data mining, marketing dashboards).

val esServices = new EsHandler[BaseHandler](dataFlows)
val kafkaServices = new KafkaHandler[BaseHandler](dataFlows)
val redisServices = new RedisHandler[BaseHandler](dataFlows)
val jdbcServices = new JDBCHandler[BaseHandler](dataFlows)

esServices.handle(args)
kafkaServices.handle(args)
redisServices.handle(args)
jdbcServices.handle(args)

Data Flow

End‑to‑end flow: collection → processing (Flink) → aggregation (DWD/DWS/ADS) → storage (ES, Redis, MySQL, Kafka) → API consumption.

Operational Practices

Handling Data Loss

If source offsets are lost, re‑consume from the correct offset. For delayed window data, increase the window latency, use side‑output streams, or persist delayed records to a durable store for later processing.

Deduplication Strategies

In‑memory state or bitmap for low‑volume streams.

Bloom filter or HyperLogLog for larger streams.

External key‑value stores (Redis, HBase) for persistent deduplication.

Joining Multiple Real‑Time Streams

Flink provides join operators: regular join, window join, interval join, and connect.

Job Scheduling

Tag YARN jobs to separate real‑time and batch workloads, and tune YARN container allocation parameters accordingly.

Architecture Recommendation

For small‑to‑medium projects that require full historical data, Lambda architecture is recommended due to its robustness and completeness. Kappa is suitable only when strict real‑time performance outweighs the need for historical back‑track.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataFlinkreal-time data warehouseIcebergLambda architectureKappa architecture
JavaEdge
Written by

JavaEdge

First‑line development experience at multiple leading tech firms; now a software architect at a Shanghai state‑owned enterprise and founder of Programming Yanxuan. Nearly 300k followers online; expertise in distributed system design, AIGC application development, and quantitative finance investing.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.