Low-Code Real-Time Data Warehouse Construction System Using Flink
This article describes a low‑code, Flink‑based real‑time data‑warehouse construction system that abstracts the warehouse building process into ODS, DWD, DWS, and ADS layers, leverages a domain‑specific language and plugin engine to reduce development effort, and details its architecture, DSL design, plugin extensibility, dimension‑table completion, stream merging, aggregation, and storage strategies.
Introduction
The paper introduces a low-code real-time data-warehouse construction system built on Flink, abstracting the warehouse building process into standardized layers so that developers can focus on the data itself rather than on Flink-specific implementation details.
Background
With ever-growing data-driven business demands, traditional warehouse development requires extensive Flink knowledge for filtering, transformation, and aggregation, which leads to high learning and development costs; a low-code approach aims to alleviate this.
Overall Architecture
The system follows the classic ODS → DWD → DWS → ADS layering: ODS stores raw logs, DWD cleans and enriches data into detailed tables, DWS performs lightweight aggregations, and ADS provides real-time storage for dashboards, ad-hoc queries, and similar consumers.
Zero-Code Design
The system uses a custom domain-specific language (DSL) to express each data flow as a directed graph of Source → Transform → Aggregation → Sink (STAS). Users configure the DSL without writing code, and the engine parses the graph to drive processing.
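To make the STAS graph concrete, a pipeline definition might look like the sketch below. The article does not publish the actual DSL syntax, so every key, plugin name, and value here is a hypothetical illustration of the Source → Transform → Aggregation → Sink shape:

```yaml
# Hypothetical STAS pipeline definition (illustrative only).
source:
  type: kafka
  topic: ods_click_log          # ODS-layer raw log stream
transform:
  - plugin: "$P_filter"         # functional plugin: drop malformed records
    expr: "#not_null(user_id)"  # syntactic plugin invocation
  - plugin: "$P_dim_join"       # enrich with a dimension table
    table: dim_user
aggregation:
  window: tumbling_1m
  metrics: [count, distinct(user_id)]
sink:
  type: clickhouse
  table: dws_click_1m           # DWS-layer result table
```

A configuration like this is all a user would edit; the engine parses it into the directed graph and wires the corresponding plugins together.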
Custom Rule Engine
Instead of a generic engine such as Drools, a tailored rule engine handles the fixed domain of data-warehouse construction, reducing syntax complexity and improving execution efficiency. The engine focuses on parsing DSL graphs and executing transformation rules, while inputs and outputs are limited to known data sources and sinks.
Plugin-Based Extensibility
The DSL is implemented via plugins written in Java. Plugins are classified as functional (prefixed with $P) or syntactic (prefixed with #). The system ships with core plugins, and additional JARs can be downloaded, loaded, matched, and executed at runtime, giving the platform high extensibility.
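A minimal sketch of how such a prefix-based plugin registry could look in Java is shown below. The $P/# prefix convention comes from the article, but the class and method names (`PluginRegistry`, `register`, `invoke`) are assumptions, and real plugin JARs would be loaded via a class loader rather than registered inline:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry dispatching plugins by the article's prefix
// convention: "$P" marks functional plugins, "#" marks syntactic ones.
public class PluginRegistry {
    private final Map<String, Function<String, String>> plugins = new HashMap<>();

    public void register(String name, Function<String, String> fn) {
        plugins.put(name, fn);
    }

    public String invoke(String name, String input) {
        if (!name.startsWith("$P") && !name.startsWith("#")) {
            throw new IllegalArgumentException("unknown plugin kind: " + name);
        }
        Function<String, String> fn = plugins.get(name);
        if (fn == null) {
            throw new IllegalStateException("plugin not loaded: " + name);
        }
        return fn.apply(input);
    }
}
```

In a production setting, the `register` step would be driven by scanning downloaded JARs (e.g., with a `URLClassLoader`), so new plugins can be added without redeploying the engine.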
Dimension Table Completion
Two approaches are provided: (1) cache-sync, which loads external stores (MySQL, HBase, etc.) into memory at startup for low-frequency, small-volume lookups; and (2) real-time queries against external databases or services for high-frequency, low-volume needs, with custom plugins available for complex lookup logic.
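The two strategies can be sketched together as a cache-first lookup. This is an illustrative simplification, not the system's actual code: the class name `DimLookup` is hypothetical, the preload stands in for a startup sync from MySQL/HBase, and the remote query is stubbed with a function where a real job would call the external store:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of dimension-table completion (names are hypothetical).
public class DimLookup {
    // Strategy 1: cache-sync — the dimension table is loaded into
    // memory at job startup.
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Strategy 2: real-time — a miss queries the external store
    // (stubbed here as a function; in practice MySQL, HBase, or an RPC).
    private final Function<String, String> remoteQuery;

    public DimLookup(Map<String, String> preload, Function<String, String> remoteQuery) {
        this.cache.putAll(preload);
        this.remoteQuery = remoteQuery;
    }

    public String complete(String key) {
        String hit = cache.get(key);                        // fast in-memory path
        return hit != null ? hit : remoteQuery.apply(key);  // fallback to external query
    }
}
```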
Multi-Stream Merging
The system supports window-based joins (sliding, tumbling, session) as well as interval joins for non-windowed merging, addressing cases where time gaps between streams prevent simple window joins.
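The interval-join semantics can be illustrated in plain Java: a left event at time t matches right events of the same key whose timestamps fall in [t + lower, t + upper]. Flink's `KeyedStream.intervalJoin` provides this on real streams; the standalone sketch below (hypothetical names, batch lists instead of streams) only demonstrates the matching rule:

```java
import java.util.ArrayList;
import java.util.List;

// Batch illustration of interval-join matching (not Flink code).
public class IntervalJoinDemo {
    record Event(String key, long ts, String payload) {}

    static List<String> intervalJoin(List<Event> left, List<Event> right,
                                     long lower, long upper) {
        List<String> out = new ArrayList<>();
        for (Event l : left) {
            for (Event r : right) {
                // Same key, and the right timestamp lies inside the
                // interval anchored at the left event's timestamp.
                if (l.key().equals(r.key())
                        && r.ts() >= l.ts() + lower && r.ts() <= l.ts() + upper) {
                    out.add(l.payload() + "+" + r.payload());
                }
            }
        }
        return out;
    }
}
```

Because the interval is anchored per event rather than on a fixed window grid, two streams whose related records arrive seconds or minutes apart can still be merged.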
Aggregation Operations
For the DWS layer, both simple Flink-based aggregation plugins and more complex Flink SQL aggregation plugins are offered to support reporting, dashboards, and other analytical workloads.
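The simplest DWS-style aggregation is a tumbling-window count. The sketch below shows only the bucketing rule (window_start = ts − ts mod size) over a batch of timestamps; Flink's window API computes the same thing incrementally on unbounded streams, and the class name here is hypothetical:

```java
import java.util.Map;
import java.util.TreeMap;

// Standalone illustration of tumbling-window counting (not Flink code).
public class TumblingCount {
    static Map<Long, Long> countPerWindow(long[] timestamps, long windowSizeMs) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long ts : timestamps) {
            long windowStart = ts - (ts % windowSizeMs); // align to window boundary
            counts.merge(windowStart, 1L, Long::sum);    // increment window count
        }
        return counts;
    }
}
```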
Data Warehouse Construction
The system can sink results to HDFS (hourly partitions derived from event time) and to ClickHouse (via Kafka ingestion), supporting both real-time and offline analytics.
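Deriving an hourly HDFS partition from event time might look like the following. The `dt=/hour=` path layout is a common Hive-style convention assumed for illustration, not necessarily the layout this system uses:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: map an event-time timestamp to an hourly partition path.
public class HourlyPartition {
    private static final DateTimeFormatter FMT = DateTimeFormatter
            .ofPattern("'dt='yyyyMMdd'/hour='HH")
            .withZone(ZoneOffset.UTC);

    static String partitionPath(String basePath, long eventTimeMs) {
        // Partitioning by event time (not processing time) keeps late
        // records in the hour they actually occurred.
        return basePath + "/" + FMT.format(Instant.ofEpochMilli(eventTimeMs));
    }
}
```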
Progress and Outlook
The system currently processes over 30 billion records daily, reducing warehouse construction effort from 2–3 person-days to a matter of hours; future work includes extending the platform to real-time monitoring, feature engineering, and model training.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.
