
Why DP Switched from Airflow to DolphinScheduler: A Deep Dive into Scaling the Data Platform

The article examines DP's rapid growth in daily scheduled tasks, outlines the limitations of its Airflow‑based scheduler, compares Airflow with DolphinScheduler, and details the architectural redesign, migration steps, and future plans for a more scalable, reliable big‑data workflow system.

Youzan Coder

Overview

In 2017 DP (Data Platform) built its scheduling system with Apache Airflow 1.7, handling over 7,000 daily tasks. As business expanded, the daily task count surged to more than 60,000, exposing scalability and reliability issues.

Current DP Scheduling Architecture

The original design combined Airflow, Celery, Redis, and MySQL. Redis served as the queue, while Celery enabled horizontal scaling of workers. A custom Airflow Scheduler Failover Controller added a standby scheduler for high availability, and task types were split into CPU‑intensive and memory‑intensive queues.
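The CPU-intensive/memory-intensive queue split described above can be sketched as a small routing helper. The queue names and the task-profile field are illustrative, not DP's actual configuration; in Airflow, the chosen name would be passed as an operator's `queue` argument so that only Celery workers subscribed to that queue pick the task up.

```python
# Illustrative sketch of routing tasks to dedicated Celery queues by
# resource profile. Queue names and profiles are assumptions, not DP's
# real configuration.

CPU_QUEUE = "cpu_intensive"
MEM_QUEUE = "mem_intensive"


def pick_queue(task_profile: str) -> str:
    """Map a task's declared resource profile to a Celery queue name."""
    if task_profile == "cpu":
        return CPU_QUEUE
    if task_profile == "memory":
        return MEM_QUEUE
    raise ValueError(f"unknown task profile: {task_profile}")
```

The returned name would then be set on the operator, e.g. `BashOperator(..., queue=pick_queue("cpu"))`, while workers are started with `airflow celery worker -q cpu_intensive` so each worker pool only consumes its own queue.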

Pain Points of Airflow 1.x

Heavy customizations detached the system from the upstream community, making upgrades to newer versions prohibitively costly.

The Python‑centric stack conflicted with DP's Java‑dominant development environment, increasing iteration and operational overhead.

Performance bottlenecks: the single‑node scheduler parsed all DAG files, causing long delays when DAG count grew.

Stability concerns: the failover controller’s master‑slave model could misjudge node health, leading to deadlocks and scheduling failures.

Upgrade Options: Airflow vs. DolphinScheduler

DP evaluated upgrading to Airflow 2.0 but rejected it due to the high migration cost. An alternative open‑source scheduler, Apache DolphinScheduler (DS), was benchmarked against Airflow on stability, ease of use, functionality, and extensibility.

Performance: DS 1.3.8 achieved roughly twice the throughput of Airflow 1.7 under identical conditions.

Deployment: DS, built on a Java stack, integrates with DP's OPS deployment pipeline, supports Kubernetes and Docker, and reduces operational effort.

Features: DS offers a more user-friendly UI, worker grouping for resource isolation, linear scalability with cluster size, and plugin-based task/alert components (DS 2.0).

Reliability: DS's multi-master, multi-worker architecture provides high availability and dynamic service registration.

Community: The DolphinScheduler community in China is active, with frequent releases and detailed documentation.

After comprehensive evaluation, DP decided to adopt DolphinScheduler.

Integration Architecture Design

Retain the existing DP web UI and service layer.

Refactor the scheduling UI, removing the embedded Airflow interface.

Interact with DS via its REST API for task lifecycle and scheduling management.

Leverage DS projects to isolate test and production workflow configurations.
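The REST interaction in the design above might look like the thin client sketch below. The endpoint paths and the `token` header follow DolphinScheduler 1.3-style conventions but should be treated as assumptions and verified against the API documentation of the deployed DS version; the HTTP transport is injected as a callable so the sketch can be exercised without a live server.

```python
class DsClient:
    """Minimal sketch of a DolphinScheduler REST client for task
    lifecycle management. Endpoint paths and the `token` auth header
    are assumptions modeled on DS 1.3-style APIs."""

    def __init__(self, base_url, token, send):
        self.base_url = base_url.rstrip("/")
        self.token = token
        # Injected transport: (method, url, headers, body) -> parsed response
        self.send = send

    def create_workflow(self, project, definition):
        """Create (save) a workflow definition under a DS project."""
        url = f"{self.base_url}/projects/{project}/process/save"
        return self.send("POST", url, {"token": self.token}, definition)

    def start_workflow(self, project, workflow_id):
        """Trigger one execution of a published workflow definition."""
        url = f"{self.base_url}/projects/{project}/executors/start-process-instance"
        body = {"processDefinitionId": workflow_id}
        return self.send("POST", url, {"token": self.token}, body)
```

Isolating test and production configurations then maps naturally onto passing different DS project names to the same client.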

Migration Steps

Workflow Definition Alignment

DP’s unified workflow and schedule states had to be mapped to DS’s separate workflow‑definition and schedule states, requiring adjustments in the task testing and publishing pipelines.
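Because DP keeps one unified state per workflow while DS separates the workflow-definition release state from the schedule state, the alignment step reduces to a mapping like the sketch below. The state names on both sides are illustrative, not DP's or DS's exact enums.

```python
# Illustrative mapping from a DP unified workflow state to the pair of
# DS states (definition release state, schedule state). State names are
# assumptions for the sketch.

def to_ds_states(dp_state: str) -> tuple:
    """Return (definition_state, schedule_state) for a DP unified state."""
    mapping = {
        "published_and_scheduled": ("ONLINE", "ONLINE"),
        "published_unscheduled": ("ONLINE", "OFFLINE"),
        "draft": ("OFFLINE", "OFFLINE"),
    }
    return mapping[dp_state]
```

The testing and publishing pipelines then drive each DS state separately instead of flipping a single flag.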

Task Execution Refactor

Previously, DP generated DAG files on the master, synced them to workers, and invoked airflow test. After migration, DP creates DS workflow definitions via API, triggers execution, and polls DS logs for real‑time feedback.
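The "poll DS logs for real-time feedback" step can be sketched as an offset-based polling loop. The log-fetching call is injected as a callable so the actual DS log API stays outside the sketch; the offset/finished protocol is an assumption.

```python
import time


def poll_logs(fetch_chunk, interval_s=1.0, max_polls=100):
    """Accumulate task log output until the task reports completion.

    `fetch_chunk` is an injected callable taking a byte/char offset and
    returning (text, finished); it stands in for the real DS log endpoint.
    """
    offset = 0
    out = []
    for _ in range(max_polls):
        text, finished = fetch_chunk(offset)
        if text:
            out.append(text)
            offset += len(text)  # request only new output next time
        if finished:
            break
        time.sleep(interval_s)
    return "".join(out)
```

DP would stream each returned chunk back to the user's task-test console as it arrives.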

Workflow Publishing Refactor

Instead of syncing DAG files for the scheduler to scan, DP now pushes workflow definitions and schedule configurations directly to DS, which handles dynamic service registration and execution.
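The schedule configuration pushed to DS at publish time might be assembled as below. The field names mirror DS 1.3-style schedule APIs but are assumptions; the exact schema should be checked against the docs of the DS version in use.

```python
def build_schedule_request(workflow_id, cron, timezone="Asia/Shanghai"):
    """Assemble the schedule payload DP pushes to DS when publishing a
    workflow. Field names are illustrative, modeled on DS 1.3-style APIs."""
    return {
        "processDefinitionId": workflow_id,
        "schedule": {
            "crontab": cron,        # DS uses Quartz-style cron expressions
            "timezoneId": timezone,
        },
    }
```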

Capability Enhancements

Catch-up Mechanism: DS will adopt Airflow-style catch-up to automatically backfill missed runs when the scheduler recovers from failures.

Cross-DAG Global Backfill: By extending Airflow's clear function, DP can clear downstream instances, apply pruning rules, and rely on catch-up to re-execute tasks in the correct order.
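Both enhancements reduce to two small pieces of logic: enumerating the missed schedule instants to backfill, and walking the cross-DAG dependency graph downstream with pruning. In the sketch below a fixed `timedelta` stands in for a real cron schedule and a plain adjacency map stands in for real DAG metadata; all names are illustrative.

```python
from datetime import datetime, timedelta


def missed_runs(last_run, now, interval):
    """Airflow-style catch-up: list the schedule instants strictly after
    the last successful run up to now, oldest first, so they can be
    backfilled in order."""
    runs = []
    t = last_run + interval
    while t <= now:
        runs.append(t)
        t += interval
    return runs


def downstream_to_clear(deps, start, prune=lambda task: False):
    """Breadth-first walk of the cross-DAG dependency graph from `start`,
    skipping any subtree for which `prune` returns True. `deps` maps a
    task to its direct downstream tasks."""
    seen, queue, order = set(), [start], []
    while queue:
        task = queue.pop(0)
        if task in seen or prune(task):
            continue
        seen.add(task)
        order.append(task)
        queue.extend(deps.get(task, []))
    return order
```

A global backfill would first clear the instances returned by `downstream_to_clear`, then let catch-up re-execute the instants returned by `missed_runs` in dependency order.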

Current Status & Future Plan

DP has deployed DS services in a test environment, migrated all workflows, and achieved dual‑run in QA. The next steps include staged performance and stress testing, a gray‑release in production by December, full migration in January, and completion of production migration by March.

Tags: big data, workflow, data platform, scheduling, Airflow, DolphinScheduler
Written by Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.