Real-Time & Offline Data Warehouse Integration: New Capabilities Explained

This article provides an overview of real-time and offline integrated data warehousing, tracing its evolution from early offline warehouses to modern cloud-native solutions, and details the latest capabilities—including multi-engine computation, data sharing between MaxCompute and Hologres, progressive computing, materialized views, and practical use cases such as telecom analytics and connected‑car scenarios.

Alibaba Cloud Big Data AI Platform

Jul 19, 2022

Real-Time & Offline Integrated Overview

Before discussing real‑time/offline integration, it helps to recall two earlier talks on the evolution from offline warehouses to lakehouse architectures. The first traced the progression from first‑generation offline warehouses to second‑generation real‑time warehouses and finally to third‑generation integrated warehouses. This article focuses on the new capabilities of the real‑time/offline integrated warehouse.

The modern big‑data warehouse has evolved from a complex, fragmented architecture to a simplified real‑time/offline integrated design. Its core combines a streaming engine with MaxCompute and Hologres for both offline and real‑time workloads, enabling data governance, batch analysis, real‑time analysis, data marts, multi‑model analytics, and online machine‑learning models, thereby delivering a one‑stop analytics platform.

This solution fits scenarios that require both real‑time and batch analysis of massive data, such as high‑throughput analytics with strict latency requirements, multi‑source streaming plus business data services, and use cases like online alerts or predictions that rely on both streams and batch data. For purely batch needs, MaxCompute alone suffices; for high‑speed real‑time needs, Flink + Hologres provides a low‑latency real‑time warehouse.

Advantages of Real-Time & Offline Integration

Data ingestion supports batch, streaming, and real‑time channels. MaxCompute offers high‑QPS writes with immediate queryability, and Hologres provides high‑performance, low‑latency writes with instant query capabilities. Together, the two cover all ingestion patterns.

Computation is multi‑engine: MaxCompute handles petabyte‑scale workloads with Spark, MapReduce, and SQL. It supports both streaming (via Spark) and batch SQL, achieving second‑ to millisecond‑level latency and million‑level throughput per job.

Data sharing is achieved through direct read/write integration between MaxCompute and Hologres, allowing a single dataset to be processed by both offline and real‑time engines without data movement.

Analytical services benefit from MaxCompute’s accelerated query engine for second‑level interactive queries, while Hologres delivers sub‑second, even millisecond‑level, interactive analysis on petabyte‑scale data.

New Product Capabilities

The integrated architecture maps to the full data‑warehouse development lifecycle: source → ingestion → cleansing → business aggregation → analysis & services → AI & reporting. Supported connectors include Kafka, Logstash, and Flink. Ingestion channels cover batch, streaming, and real‑time paths, with upcoming dedicated resources and upsert capabilities for real‑time updates from relational sources.

During cleansing, MaxCompute now supports UPDATE and DELETE, materialized view creation, and progressive computation. Query acceleration (MCQA) offers free daily quotas for sub‑10 GB queries, with prepaid dedicated resources for higher‑performance needs, and optional Hologres integration for ultra‑low‑latency interactive queries.

Hologres integration enables external table reads and direct storage reads from MaxCompute, with future plans for unified metadata and direct read capabilities. Ecosystem integrations include BI tools and AI services that consume MaxCompute data.

Progressive Computing

Progressive computing processes incremental data while maintaining intermediate state, bridging traditional stream processing and batch jobs. For example, it can aggregate hourly transaction amounts and counts and store them as lightweight summaries, so that queries are answered quickly without scanning raw detail tables.
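The idea of folding increments into maintained state can be sketched in plain Python. This is only an illustration under stated assumptions, not MaxCompute's implementation: the class HourlyAggregator, its method names, and the sample transactions are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


class HourlyAggregator:
    """Keeps a running (amount, count) summary per hour; illustrative only."""

    def __init__(self):
        # Intermediate state: hour bucket -> [total_amount, transaction_count]
        self.state = defaultdict(lambda: [0.0, 0])

    def apply_increment(self, transactions):
        """Fold a new batch of (timestamp, amount) pairs into the state."""
        for ts, amount in transactions:
            bucket = ts.replace(minute=0, second=0, microsecond=0)
            summary = self.state[bucket]
            summary[0] += amount
            summary[1] += 1

    def query(self, hour):
        """Answer from the summary without rescanning raw detail rows."""
        total, count = self.state.get(hour, (0.0, 0))
        return total, count


agg = HourlyAggregator()
# First increment arrives.
agg.apply_increment([
    (datetime(2022, 7, 19, 10, 5), 20.0),
    (datetime(2022, 7, 19, 10, 40), 30.0),
])
# A later increment only touches the state; earlier rows are not reprocessed.
agg.apply_increment([
    (datetime(2022, 7, 19, 10, 55), 50.0),
    (datetime(2022, 7, 19, 11, 2), 5.0),
])
print(agg.query(datetime(2022, 7, 19, 10)))  # (100.0, 3)
```

Each increment costs work proportional to its own size, and a query reads one small summary row per hour instead of the detail table.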

This approach saves compute resources and improves query speed, as demonstrated by pipelines that write streaming data to MaxCompute for progressive aggregation or to Hologres for real‑time serving.

Materialized Views

Materialized views store query results as local copies, enabling pre‑computed summaries that refresh as often as every five minutes. They simplify queries, improve performance, and are transparent to users, supporting near‑real‑time analytics when combined with Hologres.
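As a rough sketch of the mechanism, the snippet below emulates a materialized view with Python's built-in sqlite3 module, which has no native materialized views. The table names, the MaterializedSummary class, and the 300-second interval (mirroring the five-minute refresh mentioned above) are assumptions for illustration, not MaxCompute syntax.

```python
import sqlite3


class MaterializedSummary:
    """Stores a query result as a table and rebuilds it when it is stale."""

    def __init__(self, conn, refresh_seconds=300):
        self.conn = conn
        self.refresh_seconds = refresh_seconds
        self.last_refresh = None

    def refresh(self, now):
        # Recompute the summary from the detail table and replace the stored copy.
        self.conn.execute("DROP TABLE IF EXISTS mv_order_summary")
        self.conn.execute(
            "CREATE TABLE mv_order_summary AS "
            "SELECT merchant, COUNT(*) AS orders, SUM(amount) AS total "
            "FROM orders GROUP BY merchant"
        )
        self.last_refresh = now

    def read(self, now):
        # Refresh only if the stored copy is older than the interval.
        if self.last_refresh is None or now - self.last_refresh >= self.refresh_seconds:
            self.refresh(now)
        return self.conn.execute(
            "SELECT merchant, orders, total FROM mv_order_summary ORDER BY merchant"
        ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (merchant TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 10.0), ("a", 5.0), ("b", 7.0)],
)

mv = MaterializedSummary(conn)
first = mv.read(now=0)       # builds the stored copy
conn.execute("INSERT INTO orders VALUES ('b', 3.0)")
stale = mv.read(now=60)      # within the interval: served from the stored copy
fresh = mv.read(now=300)     # interval elapsed: rebuilt, new row is visible
```

The caller passes the clock value explicitly so the example runs without sleeping. In a real warehouse the refresh would be scheduled by the engine rather than triggered on read.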

Integrated Architecture

The simplified architecture routes data from DataHub to MaxCompute for batch processing and to Flink for real‑time consumption, which writes to Hologres. Two paths exist: a low‑latency real‑time path for immediate analytics, and a batch path for deeper offline analysis. Both can interoperate, and the lakehouse layer remains accessible.

Three service dimensions are provided:

Real‑time link: Flink consumes DataHub streams and writes to Hologres for instant analysis.

Low‑latency/manual trigger: Flink or DataHub writes to MaxCompute, where materialized views and query acceleration deliver fast batch‑derived insights.

Batch processing: MaxCompute handles large‑scale data ingestion and computation.
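The two‑path layout described above can be mimicked with in‑memory stand‑ins. Everything here is hypothetical (RealtimeStore, BatchStore, dispatch, and the event fields) and no DataHub, Flink, Hologres, or MaxCompute API is used; the sketch only shows one stream feeding a serving store and a partitioned detail store.

```python
from collections import defaultdict


class RealtimeStore:
    """Stand-in for the serving store on the real-time path (hypothetical)."""

    def __init__(self):
        self.latest_by_key = {}

    def upsert(self, event):
        # Keep only the newest value per key so reads are immediate.
        self.latest_by_key[event["key"]] = event["value"]


class BatchStore:
    """Stand-in for the partitioned offline store on the batch path (hypothetical)."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def append(self, event):
        # Keep full detail, grouped by day, for later offline analysis.
        self.partitions[event["day"]].append(event)


def dispatch(stream, realtime, batch):
    """Send every event down both paths."""
    for event in stream:
        realtime.upsert(event)
        batch.append(event)


events = [
    {"key": "car1", "value": 55, "day": "2022-07-19"},
    {"key": "car2", "value": 40, "day": "2022-07-19"},
    {"key": "car1", "value": 62, "day": "2022-07-19"},
]
realtime, batch = RealtimeStore(), BatchStore()
dispatch(events, realtime, batch)
```

After the run, the real‑time store holds one current value per key, while the batch store retains all three detail rows in the day's partition.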

Data Modeling

Typical use cases, such as telecom traffic analysis, employ a snowflake schema to model real‑time traffic tables linked to dimension tables, forming an ODS → DWD → DWS layered warehouse.

Layered Warehouse

The ODS layer captures raw traffic and rule data; the DWD layer stores cleaned detail tables; aggregated tables can be materialized in Hologres or partitioned in MaxCompute for further summarization.
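A toy version of the ODS → DWD → DWS flow for traffic data might look like the following. The field names, the cell‑to‑region dimension table, and the cleaning rule are assumptions chosen for illustration; a real pipeline would express these steps as warehouse jobs rather than Python functions.

```python
from collections import defaultdict

# ODS: raw traffic records as they arrive (strings, possibly malformed).
ods_traffic = [
    {"user_id": "u1", "cell_id": "c1", "bytes": "1200"},
    {"user_id": "u2", "cell_id": "c2", "bytes": "800"},
    {"user_id": "u3", "cell_id": "c1", "bytes": "n/a"},  # malformed row
    {"user_id": "u4", "cell_id": "c2", "bytes": "500"},
]

# Dimension table: cell -> region.
dim_cell = {"c1": "north", "c2": "south"}


def build_dwd(raw_rows, cell_dim):
    """DWD: drop malformed rows, cast types, and attach the region dimension."""
    cleaned = []
    for row in raw_rows:
        if not row["bytes"].isdigit():
            continue
        cleaned.append({
            "user_id": row["user_id"],
            "region": cell_dim.get(row["cell_id"], "unknown"),
            "bytes": int(row["bytes"]),
        })
    return cleaned


def build_dws(detail_rows):
    """DWS: aggregate cleaned detail into per-region traffic totals."""
    totals = defaultdict(int)
    for row in detail_rows:
        totals[row["region"]] += row["bytes"]
    return dict(totals)


dwd_traffic = build_dwd(ods_traffic, dim_cell)
dws_traffic = build_dws(dwd_traffic)
print(dws_traffic)  # {'north': 1200, 'south': 1300}
```

Keeping the layers as separate steps means the aggregated table can be rebuilt from the cleaned detail without touching the raw data again.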

Case Scenarios

Merchant Order Count: Combine historical batch data with today’s real‑time data in partitioned Hologres tables to provide up‑to‑date order totals.
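The merge of historical and current‑day data can be sketched as below. The partition layout (one dictionary of per‑merchant counts per day) and the function up_to_date_totals are hypothetical simplifications; in the scenario above the same union would be a query over partitioned tables.

```python
from collections import Counter
from datetime import date


def up_to_date_totals(historical_partitions, today_realtime):
    """Sum order counts from closed daily partitions plus today's live counts."""
    totals = Counter()
    for daily_counts in historical_partitions.values():
        totals.update(daily_counts)   # batch-loaded days
    totals.update(today_realtime)     # continuously updated current day
    return dict(totals)


# Closed partitions loaded by the offline pipeline.
historical = {
    date(2022, 7, 17): {"m1": 10, "m2": 4},
    date(2022, 7, 18): {"m1": 7},
}
# Today's partition, fed by the real-time path.
today = {"m1": 2, "m3": 1}

order_totals = up_to_date_totals(historical, today)
print(order_totals)  # {'m1': 19, 'm2': 4, 'm3': 1}
```

Because only today's partition changes during the day, the historical part could be pre‑aggregated once and reused across queries.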

Connected‑Car Analytics: Ingest vehicle, CAN bus, and user behavior streams via DataHub, process with Flink into both Hologres (real‑time) and MaxCompute (offline). Real‑time data powers live dashboards, while offline pipelines feed AI/ML models for autonomous driving.
