Optimizing Workflow in Data Warehouse Construction: A Layered Task‑Instance Approach

The article analyzes data‑warehouse workflow scenarios, explains core concepts such as OLAP, multidimensional modeling and layer architecture, reviews existing workflow engines like Azkaban, Oozie and Airflow, and proposes a task‑and‑instance layered optimization that simplifies dependency configuration, improves collaboration, and supports complex scheduling in modern big‑data environments.

Sohu Tech Products

Jul 8, 2020

As the information technology (IT) era gives way to the data technology (DT) era, extracting value from data has become increasingly important. Data‑warehouse systems have long been a crucial part of enterprise IT architecture and are now converging with big‑data technologies to support intelligent, data‑driven enterprises.

This article focuses on workflow use cases in data‑warehouse construction, analyzes the characteristics of data‑warehouse workflows, and proposes an optimized workflow solution tailored to data‑warehouse needs.

1. Data‑Warehouse Basics

The data warehouse was defined by William (Bill) Inmon in 1991 as a subject‑oriented, integrated, non‑volatile, and time‑variant collection of data used to support analysis and management decision‑making, not merely a storage system.

OLAP (On‑line Analytical Processing) is the most common analytical technique in data‑warehouses, enabling users to quickly and interactively explore data from multiple perspectives.

The multidimensional model is the core of OLAP, supporting operations such as drill‑down, roll‑up, slice, dice, and pivot.

Dimensional modeling, introduced by Kimball, maps the multidimensional model to relational tables, separating dimension tables (describing attributes) from fact tables (storing measures).
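
To make the split between fact and dimension tables concrete, here is a minimal sketch in plain Python. The table contents and column names (`product_id`, `category`, `amount`, `day`) are invented for illustration and do not come from the original article. It joins a fact table to a dimension table, rolls the measure up from product to category, and slices by one day.

```python
from collections import defaultdict

# Hypothetical dimension table: describes the attributes of each product.
dim_product = {
    1: {"name": "laptop", "category": "electronics"},
    2: {"name": "phone", "category": "electronics"},
    3: {"name": "desk", "category": "furniture"},
}

# Hypothetical fact table: one row per sale, holding keys and a measure.
fact_sales = [
    {"product_id": 1, "day": "2020-07-01", "amount": 1200},
    {"product_id": 2, "day": "2020-07-01", "amount": 800},
    {"product_id": 3, "day": "2020-07-02", "amount": 300},
    {"product_id": 1, "day": "2020-07-02", "amount": 1100},
]


def roll_up_by_category(facts, dim):
    """Roll the `amount` measure up from product level to category level."""
    totals = defaultdict(int)
    for row in facts:
        # Join the fact row to its dimension row through the product key.
        category = dim[row["product_id"]]["category"]
        totals[category] += row["amount"]
    return dict(totals)


def slice_by_day(facts, day):
    """Slice: keep only the fact rows for one value of the day dimension."""
    return [row for row in facts if row["day"] == day]
```

In a real warehouse the same operations would normally be expressed as SQL joins and GROUP BY clauses over the fact and dimension tables; the Python version only shows the shape of the computation.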

2. Main Work of Data‑Warehouse Construction

The core components of a data‑warehouse architecture are three layers: the raw data layer, the warehouse layer, and the front‑end application layer.

Data flows from raw sources (e.g., business data, server logs) into the warehouse, where it is processed and then served to front‑end applications such as BI, search, and recommendation.

Layered processing brings several benefits:

Prevents siloed development and reduces duplicate work.

Simplifies complex problems by breaking tasks into manageable steps.

Enables data lineage tracking for quick issue localization.

Provides clear responsibilities for each layer, improving usability.

The warehouse is typically divided into six logical layers: STG (staging, raw data), ODS (operational data store), DWD (detail data), DWS (service data), ADS (application data), and DIM (dimension data). Data moves upward through these layers, being cleaned, aggregated, and reshaped to meet analytical needs.

3. Workflow in Data‑Warehouse Construction

The concept of workflow originated in manufacturing and office automation: a business process is broken into well‑defined tasks or roles, which are then executed and monitored according to rules in order to improve efficiency and control.

The Workflow Management Coalition (WfMC) defines a workflow as the automation of a business process, in whole or in part, during which documents, information, or tasks are passed from one participant to another according to a set of procedural rules.

In the context of data‑warehouse construction, workflow can be visualized as a directed acyclic graph (DAG) where nodes represent data‑processing tasks and edges represent data dependencies.
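
The DAG view can be sketched with Python's standard `graphlib` module (Python 3.9+). The task names below are hypothetical and only loosely follow the layer names used earlier. Each key maps to the set of tasks it depends on, and a topological order gives one valid execution sequence.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "ods_orders": {"stg_orders"},
    "dwd_orders": {"ods_orders"},
    "dws_daily_sales": {"dwd_orders", "dim_product"},
    "ads_sales_report": {"dws_daily_sales"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its parents."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependency graph contains no cycle."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError:
        return False
    return True
```

Checking for cycles before scheduling is worthwhile because a cyclic dependency would leave every task in the cycle waiting forever.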

3.1 Application Scenarios

Existing workflow engines for data‑warehouse projects include Azkaban, Oozie, and Airflow. They share common limitations:

They treat the entire workflow as a single unit, making collaborative development difficult when multiple engineers work on different task nodes.

Configuring dependencies across different schedules (e.g., hourly vs. daily) often requires extensive custom code.

Handling historical data repairs for specific metrics is cumbersome when the workflow is scheduled as a whole.

3.2 Proposed Optimization

The proposed system abstracts workflow nodes into two levels: tasks (static definitions) and instances (runtime executions). Engineers define only the period and dependency attributes of a task; the system automatically generates instance‑level DAGs.

Task Layer

Each task includes processing logic (e.g., Shell, Hive SQL, Spark) and attributes such as period (day, hour, etc.) and dependencies (both inter‑task and self‑dependency).
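
A task definition of this kind might look like the following sketch. The class, field names, and sample commands are assumptions made for illustration; the article does not publish the system's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class Period(Enum):
    HOUR = "hour"
    DAY = "day"
    WEEK = "week"


@dataclass(frozen=True)
class Task:
    """Static task definition; all field names here are illustrative."""
    name: str
    command: str                  # processing logic, e.g. a Hive SQL or shell call
    period: Period                # how often an instance is generated
    depends_on: tuple = ()        # names of upstream tasks
    self_dependent: bool = False  # current run waits for the previous run


def unknown_dependencies(tasks):
    """Return dependency names that do not match any registered task."""
    names = {t.name for t in tasks}
    return sorted({d for t in tasks for d in t.depends_on if d not in names})


# Hypothetical task registry; script names are placeholders.
tasks = [
    Task("ods_orders", "hive -f ods_orders.sql", Period.HOUR),
    Task("dwd_orders", "hive -f dwd_orders.sql", Period.HOUR, ("ods_orders",)),
    Task("dws_daily_sales", "spark-submit daily_sales.py", Period.DAY,
         ("dwd_orders",), self_dependent=True),
]
```

Because each task names only its own upstream tasks, an engineer can add or change one entry without editing a shared workflow file.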

Instance Layer

Based on task attributes, the system creates concrete instances for each scheduled execution. An instance can be scheduled only when all its parent instances have completed and its scheduled time has arrived.
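
The scheduling condition can be written as a small predicate. This is a sketch under assumed names (`Instance`, `is_ready`), not the system's real code: an instance is ready only when its scheduled time has arrived and every parent instance has finished.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Instance:
    """One runtime execution of a task (illustrative structure)."""
    task: str
    scheduled_at: datetime
    parents: list = field(default_factory=list)  # upstream Instance objects
    done: bool = False


def is_ready(instance, now):
    """True when the instance may be dispatched at time `now`."""
    if instance.done:
        return False  # already executed, nothing to schedule
    time_reached = now >= instance.scheduled_at
    parents_done = all(p.done for p in instance.parents)
    return time_reached and parents_done
```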

Generation Rules

1. Generate instances according to the period attribute (daily, hourly, weekly, etc.).

2. Build instance dependencies based on the defined task dependencies, handling cases where the number of child instances equals or differs from the number of parent instances.
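
Rule 1 can be illustrated by expanding a period attribute into the scheduled times for one calendar day. The function name, the `every_hours` parameter, and the choice of Monday for weekly tasks are assumptions of this sketch, not details given in the article.

```python
from datetime import date, datetime, timedelta


def schedule_times(period, day, every_hours=1):
    """Return the scheduled times of a task's instances on `day`.

    period: "hour", "day", or "week" (weekly tasks are assumed to run on Monday).
    every_hours: spacing of hourly instances, e.g. 2 gives 12 instances per day.
    """
    start = datetime.combine(day, datetime.min.time())  # midnight of `day`
    if period == "day":
        return [start]
    if period == "hour":
        return [start + timedelta(hours=h) for h in range(0, 24, every_hours)]
    if period == "week":
        return [start] if day.weekday() == 0 else []
    raise ValueError(f"unsupported period: {period}")
```

Two hourly tasks with different `every_hours` values produce different instance counts per day, which is the mismatched case that rule 2 has to handle.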

Examples include:

Hour‑to‑hour dependency with matching instance counts (A1 → B1, A2 → B2, …).

Hour‑to‑hour dependency with mismatched counts (each child instance depends on the latest parent instance not later than its own schedule).

Hour‑to‑day and day‑to‑hour dependencies, with options for depending on all parent instances or the nearest preceding instance.

Self‑dependency where a task’s current instance depends on its previous execution.
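
The linking rules in these examples can be sketched as follows. The function names and the `mode` parameter are hypothetical, and the sketch assumes that parent and child instances are generated for the same calendar day. In "nearest" mode each child instance depends on the latest parent instance scheduled no later than itself; in "all" mode it depends on every parent instance.

```python
from datetime import datetime


def latest_not_after(parent_times, child_time):
    """Latest parent schedule that is not later than the child's schedule."""
    candidates = [t for t in parent_times if t <= child_time]
    return [max(candidates)] if candidates else []


def link(parent_times, child_times, mode="nearest"):
    """Map each child schedule to the parent schedules it depends on."""
    if mode == "all":
        return {c: list(parent_times) for c in child_times}
    return {c: latest_not_after(parent_times, c) for c in child_times}


def self_links(times):
    """Self-dependency: each instance depends on the previous one."""
    ordered = sorted(times)
    return {cur: [prev] for prev, cur in zip(ordered, ordered[1:])}
```

When parent and child share the same schedule, "nearest" mode reduces to the one-to-one case (A1 → B1, A2 → B2, …), so a single rule covers both matching and mismatched instance counts.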

3.3 Optimization Benefits

Task configuration becomes independent; adding or modifying a task does not require changing the whole workflow.

Complex instance dependencies are derived automatically from simple period and dependency attributes, reducing configuration complexity.

Sub‑workflows can be constructed from any root task, facilitating historical data repair and targeted re‑processing.
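
Building a sub‑workflow from a root task amounts to collecting that task and everything downstream of it. The sketch below uses hypothetical task names and a breadth‑first search over the inverted dependency map; it illustrates the idea and is not the system's actual implementation.

```python
from collections import deque


def sub_workflow(root, depends_on):
    """Return the root task plus every task that depends on it, directly or not.

    depends_on maps each task name to the names of its upstream tasks.
    """
    # Invert the map so we can walk from a parent to its children.
    children = {}
    for task, parents in depends_on.items():
        for parent in parents:
            children.setdefault(parent, []).append(task)

    seen = {root}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical dependency map.
depends_on = {
    "ods_orders": [],
    "ods_users": [],
    "dwd_orders": ["ods_orders"],
    "dws_daily_sales": ["dwd_orders"],
    "dws_user_profile": ["ods_users"],
    "ads_sales_report": ["dws_daily_sales"],
}
```

To repair historical data for one metric, only the instances of the tasks in the returned set need to be regenerated for the affected dates; unrelated branches such as `dws_user_profile` are left alone.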

4. Conclusion

By abstracting workflow nodes into task and instance layers and deriving instance dependencies from period and dependency attributes, the proposed approach simplifies workflow management for data‑warehouse projects, improves collaborative development, and lays a foundation for further capabilities such as data‑quality monitoring and process optimization.

