From Lambda to Lakehouse: Evolution of Real‑Time Data Warehouses with Hologres & Flink

This article traces three generations of real‑time data warehouses, from the Lambda architecture to a lakehouse approach, and details how Hologres, Flink, and Dynamic Table enable unified storage, multi‑mode computing, serverless execution, and high‑performance analytics in modern big‑data environments.

Alibaba Cloud Big Data AI Platform

Abstract: This article, based on Jiang Weihua’s talk at the 2024 Flink Forward Asia (FFA) forum, outlines the development of real‑time data warehouses in four parts: the evolution of real‑time warehouses, the shift to real‑time lakehouses, a summary, and future architectural considerations.

1. First Generation: Lambda Architecture

Initially, real‑time warehouses used a Lambda architecture with separate offline (Hive/MaxCompute) and real‑time (Kafka → Flink → serving stores such as MySQL, HBase, or Redis) pipelines. Offline processing corrected the real‑time results the next day. This design suffered from duplicated data sources, complex logic, high operational costs, and a siloed "chimney" structure.
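The next-day correction can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the event tuples, the `arrived_late` flag, and all function names are hypothetical, and in-memory counters stand in for the serving store.

```python
from collections import Counter

# Hypothetical events: (day, user, arrived_late)
EVENTS = [
    ("2024-11-01", "u1", False),
    ("2024-11-01", "u2", True),   # arrives after the streaming job has emitted
    ("2024-11-02", "u1", False),
]

def realtime_page_views(events):
    """Streaming path: counts only the events that arrived on time."""
    return Counter(day for day, _user, late in events if not late)

def offline_page_views(events, closed_days):
    """Batch path: recounts every event for days that are already closed."""
    return Counter(day for day, _user, _late in events if day in closed_days)

def serving_view(realtime, offline):
    """Offline results overwrite real-time results for the days they cover."""
    merged = dict(realtime)
    merged.update(offline)
    return merged

view = serving_view(
    realtime_page_views(EVENTS),
    offline_page_views(EVENTS, closed_days={"2024-11-01"}),
)
```

The same metric is implemented twice, once per path, which is the duplicated logic the paragraph above describes.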

2. Second Generation: Kafka‑Based Layered Real‑Time Warehouse + OLAP

The second generation introduced layered real‑time processing (DWD, DWS, ADS) with Kafka as the intermediate bus: Flink consumes each layer and writes its output to the next Kafka topic. This made the intermediate layers reusable, but Kafka topics are append‑only logs that are read sequentially, so intermediate data is hard to query and hard to correct. Adding an OLAP engine (Hologres) provided richer query capabilities and easier back‑tracking.
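A toy model of this layering, assuming nothing about the real Kafka or Flink APIs: the `Topic` class, the two job functions, and the record fields below are hypothetical stand-ins, with Python lists playing the role of topics.

```python
from collections import defaultdict

class Topic:
    """Append-only log standing in for a Kafka topic (hypothetical model)."""
    def __init__(self):
        self.log = []

    def produce(self, record):
        self.log.append(record)

    def consume(self, offset):
        """Return the records from `offset` onward and the next offset."""
        return self.log[offset:], len(self.log)

def dwd_job(ods, dwd, offset):
    """Clean raw records: drop rows without a user, normalise the action."""
    records, next_offset = ods.consume(offset)
    for r in records:
        if r.get("user"):
            dwd.produce({"user": r["user"], "action": r["action"].lower()})
    return next_offset

def dws_job(dwd, dws, offset, state):
    """Count actions per user and emit the updated total for each record."""
    records, next_offset = dwd.consume(offset)
    for r in records:
        state[r["user"]] += 1
        dws.produce({"user": r["user"], "actions": state[r["user"]]})
    return next_offset

ods, dwd, dws = Topic(), Topic(), Topic()
for raw in [
    {"user": "a", "action": "CLICK"},
    {"user": None, "action": "VIEW"},
    {"user": "a", "action": "View"},
]:
    ods.produce(raw)

state = defaultdict(int)
dwd_offset = dwd_job(ods, dwd, 0)
dws_offset = dws_job(dwd, dws, 0, state)
```

In this model, finding the current total for user `a` means scanning `dws.log` to its last matching record, and fixing a bad DWD record means producing a new one rather than updating the old one in place.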

3. Third Generation: Hologres‑Centric Layered Warehouse + Integrated Analytics Service

To overcome Kafka’s limitations, the third generation replaced Kafka with Hologres. All layers (DWD, DWS, ADS) are stored in Hologres, with Flink consuming Hologres Binlog for continuous processing. This unified storage simplifies queries, updates, and data lineage.

The resulting "real‑time warehouse sandwich" architecture streams data through Flink into the Hologres ODS layer, then generates the DWD and DWS layers via Binlog‑driven processing, achieving end‑to‑end latency on the order of seconds and consistent layering.
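The idea of Binlog‑driven layering can be sketched as follows. This is a simplified model under stated assumptions, not Hologres' actual Binlog format or Flink's connector API: the `Table` class, the `(key, old_row, new_row)` entries, and `aggregate_job` are all hypothetical.

```python
class Table:
    """Keyed table that records every change in a binlog (hypothetical model)."""
    def __init__(self):
        self.rows = {}
        self.binlog = []  # entries: (key, old_row_or_None, new_row)

    def upsert(self, key, row):
        self.binlog.append((key, self.rows.get(key), row))
        self.rows[key] = row

def aggregate_job(source, target, offset):
    """Consume new binlog entries and keep per-user order totals in `target`."""
    for _key, old, new in source.binlog[offset:]:
        if old is not None:
            # An update: retract the old amount before applying the new one.
            user = old["user"]
            total = target.rows[user]["total"] - old["amount"]
            target.upsert(user, {"total": total})
        user = new["user"]
        current = target.rows.get(user, {"total": 0})["total"]
        target.upsert(user, {"total": current + new["amount"]})
    return len(source.binlog)

dwd, dws = Table(), Table()
dwd.upsert("order-1", {"user": "a", "amount": 10})
dwd.upsert("order-2", {"user": "a", "amount": 5})
offset = aggregate_job(dwd, dws, 0)

# A correction to an upstream row flows downstream through the binlog.
dwd.upsert("order-1", {"user": "a", "amount": 7})
offset = aggregate_job(dwd, dws, offset)
```

Unlike the append‑only topic model, every layer here can be read by key (`dws.rows["a"]`) and corrected in place, while the change log still drives the next layer.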

Future Evolution: From Real‑Time Warehouse to Real‑Time Lakehouse

In lakehouse scenarios, the key challenge is unifying real‑time and offline data access. The goal is a single storage layer accessed by the same SQL, eliminating Lambda inconsistencies. Recent work combines Hologres, Flink, and Paimon to achieve this.

Lakehouse Metadata Management

Hologres 3.0 introduces the External Database concept, mapping a PostgreSQL‑compatible database to a lake catalog (e.g., Paimon) or MaxCompute project, allowing BI tools to query lake metadata without manual import.
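As a rough mental model only, the mapping can be pictured as a database name that forwards metadata lookups to the catalog at query time. The classes and method names below are hypothetical and do not reflect Hologres syntax or any Paimon API.

```python
class LakeCatalog:
    """Hypothetical stand-in for a lake catalog."""
    def __init__(self):
        self._tables = {}

    def register(self, schema, table, columns):
        self._tables[(schema, table)] = list(columns)

    def columns(self, schema, table):
        return self._tables[(schema, table)]

class ExternalDatabase:
    """Database name that resolves table metadata from the catalog on demand."""
    def __init__(self, name, catalog):
        self.name = name
        self._catalog = catalog

    def describe(self, qualified_name):
        schema, table = qualified_name.split(".", 1)
        return self._catalog.columns(schema, table)

catalog = LakeCatalog()
catalog.register("github", "events", ["id", "type", "created_at"])
lake_db = ExternalDatabase("lake_db", catalog)

# A table created in the lake later is visible without any import step.
catalog.register("github", "daily_stats", ["day", "events"])
new_table_columns = lake_db.describe("github.daily_stats")
```

Because nothing is copied into the database object, there is no imported metadata to go stale in this model.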

High‑Performance Lake Queries

Hologres delivers high‑performance queries on lake data, especially for Paimon and MySQL tables, outperforming Trino by roughly 6× to 10×, depending on data placement.

Real‑Time Lakehouse Layered Architecture

Dynamic Table, a new Hologres 3.0 feature, supports batch, incremental, and streaming computation through one abstraction. Users write a single CTAS‑style SQL statement and choose how it is refreshed (full, incremental, or streaming) with minimal syntax differences, so the same logic can be reused across the paths that a Lambda architecture implements separately.
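The value of one definition with several refresh modes can be illustrated in plain Python. This is a conceptual sketch, not Dynamic Table syntax: `define_query`, the `repo` field, and the sample rows are hypothetical.

```python
from collections import Counter

def define_query(key):
    """One logical definition: count rows grouped by key(row)."""
    def refresh(rows, previous=None):
        # With no previous result this is a full refresh; otherwise only
        # the newly arrived rows are folded into the earlier result.
        result = Counter(previous or {})
        result.update(key(row) for row in rows)
        return result
    return refresh

events_per_repo = define_query(lambda row: row["repo"])

batch_1 = [{"repo": "apache/flink"}, {"repo": "apache/paimon"}]
batch_2 = [{"repo": "apache/flink"}]

full = events_per_repo(batch_1 + batch_2)
incremental = events_per_repo(batch_2, previous=events_per_repo(batch_1))
```

Both modes come from the same definition and produce the same result, which is the property that removes the need to maintain separate batch and streaming implementations.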

Dynamic Table also integrates Serverless execution, allowing individual queries to run on demand without consuming dedicated instance resources, reducing cost and improving isolation.

Demo: Hologres + Flink + Paimon Real‑Time Lakehouse

A demo shows how GitHub Events data is ingested via Flink into a Paimon lake table, then processed with Dynamic Table to build layered analytics, and optionally written back to the lake or Hologres.
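The shape of such a layered pipeline can be sketched with Python's built‑in `sqlite3` module standing in for the lake table and the query engine. The table names, columns, and sample rows are hypothetical, and plain `CREATE TABLE ... AS SELECT` is used here in place of Dynamic Table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ODS: raw events as ingested (hypothetical schema and sample rows).
conn.execute(
    "CREATE TABLE ods_events (id INTEGER, type TEXT, repo TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO ods_events VALUES (?, ?, ?, ?)",
    [
        (1, "PushEvent", "apache/flink", "2024-11-01T10:00:00"),
        (2, "WatchEvent", "apache/flink", "2024-11-01T11:00:00"),
        (3, "PushEvent", "apache/paimon", "2024-11-02T09:30:00"),
        (4, None, "apache/paimon", "2024-11-02T09:45:00"),
    ],
)

# DWD: cleaned detail layer, dropping rows without an event type.
conn.execute(
    """
    CREATE TABLE dwd_events AS
    SELECT id, type, repo, substr(created_at, 1, 10) AS day
    FROM ods_events
    WHERE type IS NOT NULL
    """
)

# DWS: daily event counts per repository.
conn.execute(
    """
    CREATE TABLE dws_repo_daily AS
    SELECT repo, day, COUNT(*) AS events
    FROM dwd_events
    GROUP BY repo, day
    """
)

rows = conn.execute(
    "SELECT repo, day, events FROM dws_repo_daily ORDER BY repo, day"
).fetchall()
```

Each layer is defined by a query over the layer beneath it, so every intermediate table stays directly queryable.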

Summary

Flink + Hologres enables layered real‑time processing, but the overall design is still a Lambda architecture as long as real‑time and offline data are handled separately. With Dynamic Table and serverless capabilities, users can choose full‑warehouse, lake‑warehouse, or pure lake solutions based on cost, performance, latency, and sharing requirements. Hologres continues to evolve from a real‑time warehouse to an integrated real‑time lakehouse, supporting diverse workloads across cloud and on‑prem environments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
