
Ant Group Real-Time Data Warehouse: Architecture, Solutions, and Data Lake Outlook

This article summarizes Ant Group's recent work on real-time data warehousing. It covers the warehouse architecture, data quality assurance, stream-batch integration, and the planned move to a data lake, with Flink, ODPS, and Paimon as the main technologies for low-latency analytics.

DataFunSummit

Aug 7, 2024


The presentation outlines Ant Group's real-time data warehouse architecture, which comprises six core modules: compute engine, development platform, compute resources, real-time assets, development tools, and data quality. Together, these modules address challenges such as asset management, resource control, and platform robustness.

The real-time data solution covers data source ingestion (logs, database logs, and real-time messages), processing engines (Flink and ODPS), and storage layers (SLS, Explorer, and HBase). Low-code development and stream-batch unified processing are integrated into the solution to improve development efficiency and consistency.

Data quality assurance is handled in two stages: pre-deployment testing and runtime monitoring. The monitoring covers task health (latency, failover, and checkpoints) and data integrity checks (zero-value, variance, and threshold alerts). A full-link baseline adds end-to-end timeliness monitoring.
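
The three data integrity checks named above can be illustrated with a minimal Python sketch. This is not Ant Group's implementation: the function name `check_metric`, its parameters, and the alert labels are hypothetical, and the "variance" check is interpreted here as a deviation of more than a few standard deviations from recent history, which is an assumption.

```python
from statistics import mean, pstdev


def check_metric(current, history, *, min_value=None, max_value=None, max_sigma=3.0):
    """Return the names of the alerts raised for one metric value.

    All names and defaults are illustrative assumptions, not part of any
    real monitoring platform.
    """
    alerts = []

    # Zero-value check: a metric that suddenly reads 0 often means lost input.
    if current == 0:
        alerts.append("zero_value")

    # Threshold check: fixed lower and upper bounds set by the metric owner.
    if min_value is not None and current < min_value:
        alerts.append("below_threshold")
    if max_value is not None and current > max_value:
        alerts.append("above_threshold")

    # Variance check (assumed meaning): compare against recent history and
    # alert when the value is more than max_sigma standard deviations away.
    if len(history) >= 2:
        mu = mean(history)
        sigma = pstdev(history)
        if sigma > 0 and abs(current - mu) > max_sigma * sigma:
            alerts.append("variance")

    return alerts
```

For example, `check_metric(0, [100, 102, 98, 101], min_value=50)` raises all three alert types, while a value of 100 against the same history raises none. A production system would typically evaluate such rules per time window and per metric.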

The stream-batch unified approach first addresses field alignment between streaming and batch tables, using virtual columns and hybrid meta-tables to handle mismatched schemas. With the schemas aligned, a single task can compute real-time cumulative metrics by combining daily aggregates with historical offline results.
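
The two ideas in this paragraph can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the platform's actual mechanism: `align_row` and `cumulative_metric` are hypothetical helpers, a "virtual column" is modeled as a column that is absent from one source and filled with a default value, and the field names `amount` and `channel` are made up for the example.

```python
def align_row(row, unified_columns, virtual_defaults=None):
    """Project a row onto the unified schema.

    Columns the source table lacks are treated as virtual columns and
    filled from virtual_defaults (None when no default is given).
    """
    virtual_defaults = virtual_defaults or {}
    return {col: row.get(col, virtual_defaults.get(col)) for col in unified_columns}


def cumulative_metric(historical_total, todays_events, value_key="amount"):
    """Combine the offline historical total with today's streaming aggregate."""
    todays_total = sum(event[value_key] for event in todays_events)
    return historical_total + todays_total


# Hypothetical unified schema shared by the stream and batch tables.
UNIFIED = ["user_id", "amount", "channel"]

# The streaming row has no "channel" field, so it becomes a virtual column.
stream_row = align_row({"user_id": 1, "amount": 5}, UNIFIED, {"channel": "unknown"})

# Historical total from the offline warehouse plus today's real-time events.
total = cumulative_metric(1000, [{"amount": 5}, {"amount": 7}])
```

Here `stream_row` carries all three unified columns and `total` is 1012. In a real deployment the same logic would be expressed once in the unified task and executed by the stream and batch engines, rather than in application code.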

Looking ahead, Ant Group plans to consolidate real-time and offline data into a data lake built on Paimon. The goal is to simplify the pipeline with a unified compute engine, unified storage, and unified asset management, and to support multi-level scheduling (daily, hourly, and real-time) so that one pipeline can serve the full range of data needs.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.


