Big Data 15 min read

How to Optimize DAG Task Scheduling to Cut 30 Minutes from Critical Path

This article explains how to analyze and automatically optimize complex DAG‑based data platform task chains, identify bottlenecks, adjust upstream task timings, and reduce critical‑path execution time by up to 30 minutes while preventing resource contention and peak overloads.

Huolala Tech

Jul 6, 2023

How to Optimize DAG Task Scheduling to Cut 30 Minutes from Critical Path

Background

The Huolala data platform contains thousands of tasks forming a directed acyclic graph (DAG). Visualizing the entire dependency graph is impractical, so specific methods are needed to answer questions such as estimating terminal task completion time, locating delay sources, understanding task‑cluster resource relationships, and pinpointing which tasks cause resource peaks.

Task Actual Runtime

Tasks have a scheduled execution time. Even if upstream dependencies are satisfied, a task will not run until its scheduled time. Task states include "ready", "running", "success", and "failed". A task remains in the "ready" state when either (1) upstream tasks have not produced output, or (2) the scheduled time has not arrived.

Actual start time ≈ max(last upstream output time, scheduled time)

Because of scheduled times, downstream tasks may wait unnecessarily after upstream tasks finish, leading to avoidable delays.

Chain Complexity Assurance

“If only one task could run right after another, it would be ideal.”

Simple Example

Original Chain

The original chain contains ten tasks, with task 10 as the terminal task. To reduce its completion time, three adjustment actions were performed, moving task 10 forward by 30 minutes.

Adjusted Chain

For large‑scale reports with hundreds of tasks, manual adjustments are infeasible, prompting the need for automated optimization.

Optimization Algorithm

We first estimate how much earlier the terminal task could finish in an ideal state, then propagate adjustments upstream. Two assumptions are made:

Waiting less than one minute between tasks is considered negligible. If the calculated upstream adjustment is negative, no adjustment is made.

The longest chain adds up to 70 minutes, so the terminal task can be advanced to 1:10 am, saving 30 minutes.

Optimization Steps:

Adjust task 10 forward by 30 minutes.

Optimize upstream of task 10 (gap = 30).

Optimize upstream of task 8 (gap = 10).

Optimize upstream of task 9 (gap = 30).

Continue similarly for tasks 1, 2, 3, etc., stopping when adjustments become non‑positive.

The tasks highlighted in red (in the original diagram) indicate the required forward adjustments.

The overall algorithm flowchart is shown below:

Algorithm Implementation

import networkx as nx
import pandas as pd

# Generate the example DAG
edges = pd.DataFrame({
    "source": [1, 2, 2, 3, 4, 7, 6, 9, 5, 8],
    "target": [4, 4, 7, 6, 9, 9, 9, 10, 8, 10],
    "weight": [5, 10, 10, 40, 45, 20, 20, 10, 25, 5],
    "wait":   [40, 10, 20, 0.5, 0.5, 15, 30, 0.5, 0.5, 20],
    "adjust": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
})
G_test = nx.from_pandas_edgelist(edges, create_using=nx.DiGraph, edge_attr=True)
G_test.add_edge(10, 100000, weight=0.5, adjust=-30, wait=0.5)

# Optimization function
def opt_layer(G, pre_nodes):
    while pre_nodes:
        print('Start optimizing upstream of', pre_nodes)
        next_opt_nodes = []
        for i in pre_nodes:
            adjust_compare = pd.DataFrame(dict(G[i])).loc['adjust', :].min()
            pre = list(G.predecessors(i))
            next_opt_nodes = list(set(next_opt_nodes + pre))
            for j in pre:
                if G[j][i]['wait'] < 1:
                    G[j][i]['adjust'] = adjust_compare
                else:
                    new_adjust = int(G[j][i]['wait']) + adjust_compare
                    G[j][i]['adjust'] = min(new_adjust, G[j][i]['adjust'], 0)
        pre_nodes = next_opt_nodes
        print('---')
    return G

# Run optimization
G_opt = opt_layer(G_test, [10])

Running the script produces a DataFrame where the adjust column shows how many minutes each task should be moved forward.

Chain Decomposition

Task Composition

After identifying the longest chain, we can analyze recurring critical nodes such as order‑wide tables and driver‑wide tables. Their completion times help locate delay sources and assess overall health.

Overall Chain Duration

Chain duration = task waiting time + task execution time.

Waiting time can be reduced by adjusting scheduling to eliminate idle gaps; execution time can be improved by tuning container counts, addressing data skew, optimizing file merges, and refining computation logic.

Node Degree Analysis

High‑degree nodes often align with resource‑peak periods. Analyzing node degree helps identify tasks that cause downstream resource contention.

Resource Peaks

Two main reasons for resource peaks:

When a high‑degree node finishes, many downstream tasks arrive simultaneously, causing contention.

When a task processes a large data table, even few downstream tasks can consume significant resources.

Special Cases

Root nodes may depend on external tasks that cannot be rescheduled.

Some extraction tasks have fixed scheduling windows and cannot be moved.

Resource contention caused by task crowding can be unpredictable, making it hard to pinpoint a single culprit.

Analyzing only a single day may miss anomalies; it is better to examine longest chains over a longer period.

Future Considerations

Compare current longest chains with historical data to detect emerging delays.

Prioritize resources for tasks that frequently appear on the longest chain.

Stagger high‑impact tasks (e.g., wide tables, LBS, logs) to avoid simultaneous resource spikes.

Design wide tables to reduce redundant computation while scheduling less‑critical downstream tasks during off‑peak hours.

References & Resources

https://networkx.org/documentation/stable/index.html

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data DAG Task scheduling resource optimization workflow analysis

Written by

Huolala Tech

Technology reshapes logistics

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.