
Optimizing Workflow in Data Warehouse Construction

This article analyzes workflow scenarios in data warehouse construction, proposes an optimization scheme that abstracts workflow nodes into task and instance layers, and demonstrates how task attributes and generation rules can improve configurability, dependency management, and collaborative development for large‑scale data warehouse projects.

Big Data Technology Architecture

Jun 11, 2020


As the IT era gives way to the DT (data technology) era, extracting value from data has become increasingly important. Data warehouse systems have long been a core component of enterprise IT architecture, and they are now being integrated with big‑data technologies to support intelligent, data‑driven enterprises.

The article first introduces fundamental concepts of data warehouses, including the definition by William Inmon, the role of OLAP, multidimensional models, and dimensional modeling (Kimball). It then outlines the six typical layers of a data warehouse architecture—STG, ODS, DWD, DWS, ADS, and DIM—explaining the purpose of each layer and the benefits of layered data processing such as avoiding siloed development, simplifying complex problems, enabling data lineage tracking, and clarifying data responsibilities.

Next, the article examines workflow applications in data warehouse construction. Workflow, a concept that originated in production and office automation, is defined by the Workflow Management Coalition (WfMC) as the automation of a business process, in whole or in part, during which documents, information, or tasks are passed from one participant to another according to a set of procedural rules. The article represents data‑warehouse tasks as a directed acyclic graph (DAG): each workflow node is a data‑processing task, and each edge is a data dependency between two tasks.
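The DAG view of warehouse tasks can be sketched in a few lines of Python. This is only an illustration: the task names are hypothetical, loosely following the layer prefixes mentioned above, and the standard-library `graphlib` module (Python 3.9+) stands in for a real scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks, named after the warehouse layer they write to.
# Each key maps to the set of upstream tasks whose output it reads,
# so every (upstream, key) pair is one edge of the DAG.
upstream = {
    "ods_orders": {"stg_orders"},
    "dwd_order_detail": {"ods_orders", "dim_user"},
    "dws_user_daily": {"dwd_order_detail"},
    "ads_sales_report": {"dws_user_daily"},
}

# static_order() yields every task after all of its upstream tasks.
# It raises graphlib.CycleError if the dependencies contain a cycle.
order = list(TopologicalSorter(upstream).static_order())

for position, task in enumerate(order, start=1):
    print(position, task)
```

Any order produced this way is a valid execution order; tasks with no path between them, such as `stg_orders` and `dim_user` here, may appear in either relative position.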

Current workflow management systems used in data‑warehouse scenarios (Azkaban, Oozie, Airflow) are reviewed, highlighting three main issues: limited support for multi‑developer collaboration, difficulty handling complex inter‑task scheduling dependencies, and challenges with historical data repair when tasks need to be re‑executed.

To address these problems, the article proposes an optimized workflow management approach that abstracts workflow nodes into two layers. The task layer holds the static definition of a data‑processing job: its code (for example Shell, Hive SQL, or Spark) and attributes such as period and dependencies. The instance layer holds the concrete execution units that are generated from tasks according to their period and dependency attributes. This separation enables automatic generation of instance DAGs, simplifies configuration, and supports collaborative development.
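One way to picture the two layers is as a pair of small data classes. The sketch below is an assumption about how such a model might look, not the article's actual schema: the field names (`code`, `period`, `parents`, `self_dependent`, `scheduled_at`) and the `instance_id` format are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Task:
    """Task layer: the static definition of a data-processing job."""
    name: str
    code: str                       # e.g. a Hive SQL statement or a shell command
    period: str                     # scheduling cycle, "hour" or "day" in this sketch
    parents: tuple[str, ...] = ()   # names of upstream tasks
    self_dependent: bool = False    # wait for this task's own previous run?


@dataclass(frozen=True)
class Instance:
    """Instance layer: one concrete run of a task at a scheduled time."""
    task: Task
    scheduled_at: datetime

    @property
    def instance_id(self) -> str:
        # Hypothetical identifier format: task name plus scheduled time.
        return f"{self.task.name}@{self.scheduled_at:%Y-%m-%dT%H:%M}"


task = Task(
    name="dwd_order_detail",
    code="INSERT OVERWRITE TABLE dwd_order_detail SELECT * FROM ods_orders",
    period="day",
    parents=("ods_orders",),
)
inst = Instance(task=task, scheduled_at=datetime(2020, 6, 11))
print(inst.instance_id)  # dwd_order_detail@2020-06-11T00:00
```

Developers edit only `Task` objects; `Instance` objects are derived, which is what lets several people work on one warehouse without hand-editing a shared workflow graph.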

The task attributes are detailed as follows:

Period attribute: specifies the scheduling cycle (day, hour, week, month) and the exact execution times.

Dependency attribute: includes inter‑task dependencies (parent tasks) and self‑dependencies (previous executions of the same task).

Instance generation follows two steps: first, instances are created according to the period attribute; second, dependencies between instances are established based on the dependency attribute, with examples covering hour‑to‑hour, hour‑to‑day, and day‑to‑hour scenarios. The article includes several illustrative diagrams (provided as images) to clarify these rules.
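The two steps can be sketched as follows. Because the original diagrams are not reproduced here, the linking rules in this example are assumptions rather than the article's exact rules: an hourly child waits for the same-hour instance of an hourly parent, a daily child waits for all 24 hourly instances of that day, and an hourly child waits for the day's single daily instance. The task names and the dictionary layout are hypothetical, and self-dependencies across day boundaries are left out for brevity.

```python
from datetime import date, datetime, timedelta

# Hypothetical task layer: period plus dependency attributes only.
TASKS = {
    "dim_user_daily": {"period": "day", "parents": [], "self_dep": False},
    "ods_log_hourly": {"period": "hour", "parents": [], "self_dep": True},
    "dwd_log_hourly": {"period": "hour",
                       "parents": ["ods_log_hourly", "dim_user_daily"],
                       "self_dep": False},
    "dws_log_daily": {"period": "day", "parents": ["dwd_log_hourly"], "self_dep": False},
}
STEP = {"hour": timedelta(hours=1), "day": timedelta(days=1)}


def generate_instances(tasks, day):
    """Step 1: create the scheduled times of every task for one day."""
    start = datetime.combine(day, datetime.min.time())
    return {
        name: [start + i * STEP[spec["period"]]
               for i in range(24 if spec["period"] == "hour" else 1)]
        for name, spec in tasks.items()
    }


def link_instances(tasks, instances):
    """Step 2: map each (task, time) instance to its upstream instances."""
    edges = {}
    for name, spec in tasks.items():
        for ts in instances[name]:
            ups = set()
            for parent in spec["parents"]:
                if tasks[parent]["period"] == spec["period"]:
                    ups.add((parent, ts))                 # same period: same slot
                elif spec["period"] == "day":
                    ups.update((parent, t) for t in instances[parent])  # all hours
                else:
                    ups.add((parent, instances[parent][0]))  # the day's daily run
            prev = ts - STEP[spec["period"]]
            if spec["self_dep"] and prev in instances[name]:
                ups.add((name, prev))                     # previous run of itself
            edges[(name, ts)] = ups
    return edges


instances = generate_instances(TASKS, date(2020, 6, 11))
edges = link_instances(TASKS, instances)
print(len(edges), "instances generated")
```

The resulting `edges` mapping is the instance DAG for the day; nobody configures it by hand, which is the point of deriving it from task attributes.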

Optimization effects of the proposed scheme include:

Enhanced configurability: developers change only the period and dependency attributes of a single task, without editing the rest of the workflow.

Simpler dependency configuration: instance‑level dependencies are derived automatically from task attributes rather than configured by hand.

Sub‑workflow construction: a sub‑workflow can be built with any task as its root, which facilitates historical data repair and other advanced use cases.
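The sub-workflow idea can be illustrated with a short traversal: starting from the task whose data needs repair, collect every downstream task and keep only the dependencies among them. This is a minimal sketch under assumed names; `sub_workflow` and the task names are hypothetical, and a real system would apply the same idea at the instance level for the affected dates.

```python
from collections import deque

# Hypothetical task-level dependencies: task -> list of parent tasks.
parents = {
    "ods_orders": ["stg_orders"],
    "dwd_order_detail": ["ods_orders", "dim_user"],
    "dws_user_daily": ["dwd_order_detail"],
    "ads_sales_report": ["dws_user_daily"],
    "ads_user_profile": ["dim_user"],
}


def sub_workflow(parents, root):
    """Return the DAG of `root` and all of its downstream tasks."""
    # Invert the parent lists so we can walk from a task to its children.
    children = {}
    for child, ups in parents.items():
        for up in ups:
            children.setdefault(up, []).append(child)

    # Breadth-first search over the downstream direction.
    selected, queue = {root}, deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in selected:
                selected.add(child)
                queue.append(child)

    # Keep only dependencies between tasks that will actually be re-run.
    return {t: [p for p in parents.get(t, []) if p in selected] for t in selected}


rerun = sub_workflow(parents, "ods_orders")
for task, deps in sorted(rerun.items()):
    print(task, "<-", deps)
```

Dependencies on tasks outside the selection, such as `dim_user` here, are dropped because their existing output is assumed to be valid and does not need to be recomputed.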

In conclusion, abstracting workflow nodes into task and instance layers and leveraging period and dependency attributes significantly improve the usability and scalability of workflow management in data‑warehouse projects, while also laying the groundwork for further integration of data quality monitoring, task tracing, and process optimization.
