Why DP Switched from Airflow to DolphinScheduler: A Deep Dive into Scaling the Data Platform
The article examines DP's rapid growth in daily scheduled tasks, outlines the limitations of its Airflow‑based scheduler, compares Airflow with DolphinScheduler, and details the architectural redesign, migration steps, and future plans for a more scalable, reliable big‑data workflow system.
Overview
In 2017, DP (Data Platform) built its scheduling system on Apache Airflow 1.7, handling over 7,000 daily tasks. As the business expanded, the daily task count surged to more than 60,000, exposing scalability and reliability issues.
Current DP Scheduling Architecture
The original design combined Airflow, Celery, Redis, and MySQL. Redis served as the queue, while Celery enabled horizontal scaling of workers. A custom Airflow Scheduler Failover Controller added a standby scheduler for high availability, and task types were split into CPU‑intensive and memory‑intensive queues.
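The queue split described above can be sketched as a small routing table. This is only an illustration: the queue names and the task-type classification are hypothetical, since the article does not list the real ones. In Airflow, the chosen name would typically be passed as an operator's queue argument, with Celery workers subscribed to the matching queue.

```python
from enum import Enum


class TaskProfile(Enum):
    CPU_INTENSIVE = "cpu"
    MEMORY_INTENSIVE = "memory"


# Hypothetical queue names; the article does not give the real ones.
QUEUE_BY_PROFILE = {
    TaskProfile.CPU_INTENSIVE: "cpu_queue",
    TaskProfile.MEMORY_INTENSIVE: "memory_queue",
}

# Hypothetical classification of task types, for illustration only.
PROFILE_BY_TASK_TYPE = {
    "shell": TaskProfile.CPU_INTENSIVE,
    "spark_sql": TaskProfile.MEMORY_INTENSIVE,
}


def queue_for(task_type: str) -> str:
    """Return the worker queue for a task type, defaulting to the CPU queue."""
    profile = PROFILE_BY_TASK_TYPE.get(task_type, TaskProfile.CPU_INTENSIVE)
    return QUEUE_BY_PROFILE[profile]
```

Keeping the mapping in one table means a new task type only needs a single entry, and workers for each queue can be scaled independently.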
Pain Points of Airflow 1.x
Heavy customizations detached the system from the upstream community, making upgrades to newer versions prohibitively costly.
The Python‑centric stack conflicted with DP's Java‑dominant development environment, increasing iteration and operational overhead.
Performance bottlenecks: the single‑node scheduler parsed every DAG file, so scheduling delays grew along with the DAG count.
Stability concerns: the failover controller’s master‑slave model could misjudge node health, leading to deadlocks and scheduling failures.
Upgrade Options: Airflow vs. DolphinScheduler
DP evaluated upgrading to Airflow 2.0 but rejected it due to the high migration cost. An alternative open‑source scheduler, Apache DolphinScheduler (DS), was benchmarked against Airflow on stability, ease of use, functionality, and extensibility.
Performance: DS 1.3.8 achieved roughly twice the throughput of Airflow 1.7 under identical conditions.
Deployment: DS, built on a Java stack, integrates with DP's OPS deployment pipeline, supports K8s and Docker, and reduces operational effort.
Features: DS offers a more user‑friendly UI, worker grouping for resource isolation, linear scalability with cluster size, and plugin‑based task/alert components (in DS 2.0).
Reliability: DS’s multi‑master, multi‑worker architecture provides high availability and dynamic service registration.
Community: The DolphinScheduler community in China is active, with frequent releases and detailed documentation.
After comprehensive evaluation, DP decided to adopt DolphinScheduler.
Integration Architecture Design
Retain the existing DP web UI and service layer.
Refactor the scheduling UI, removing the embedded Airflow interface.
Interact with DS via its REST API for task lifecycle and scheduling management.
Leverage DS projects to isolate test and production workflow configurations.
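One way to picture the project-based isolation in the list above is to scope every API call to an environment-specific DS project. The sketch below is a guess at the shape, not DP's actual code: the host, the project names dp_test and dp_prod, and the /projects/<name>/<resource> path layout are all placeholders and should be checked against the REST API of the DS version in use.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class DsTarget:
    base_url: str  # root of the DS API server (placeholder host below)
    project: str   # DS project that holds this environment's workflows


# Hypothetical environments; real hosts and project names will differ.
TARGETS = {
    "test": DsTarget("http://ds.example.internal/dolphinscheduler", "dp_test"),
    "prod": DsTarget("http://ds.example.internal/dolphinscheduler", "dp_prod"),
}


def project_url(env: str, resource: str) -> str:
    """Build a URL scoped to the DS project of the given environment.

    The path layout is an assumption for illustration only.
    """
    target = TARGETS[env]
    project = quote(target.project, safe="")
    return f"{target.base_url}/projects/{project}/{resource.lstrip('/')}"
```

Because the environment is resolved in a single place, the DP service layer cannot accidentally publish a test workflow into the production project.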
Migration Steps
Workflow Definition Alignment
DP’s unified workflow and schedule states had to be mapped to DS’s separate workflow‑definition and schedule states, requiring adjustments in the task testing and publishing pipelines.
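The state alignment can be illustrated with a small mapping from one DP state to a pair of DS states. The DP state names here are hypothetical, and the sketch assumes that DS tracks an online/offline flag for the workflow definition and another for its schedule, with a schedule allowed online only when its definition is online.

```python
from enum import Enum
from typing import NamedTuple


class DpState(Enum):
    """Hypothetical unified DP states; the article does not name them."""
    DRAFT = "draft"          # edited, not released
    PAUSED = "paused"        # released, scheduling switched off
    PUBLISHED = "published"  # released and scheduled


class ReleaseState(Enum):
    OFFLINE = "OFFLINE"
    ONLINE = "ONLINE"


class DsStates(NamedTuple):
    definition: ReleaseState  # state of the workflow definition
    schedule: ReleaseState    # state of the attached schedule


_MAPPING = {
    DpState.DRAFT: DsStates(ReleaseState.OFFLINE, ReleaseState.OFFLINE),
    DpState.PAUSED: DsStates(ReleaseState.ONLINE, ReleaseState.OFFLINE),
    DpState.PUBLISHED: DsStates(ReleaseState.ONLINE, ReleaseState.ONLINE),
}


def to_ds_states(state: DpState) -> DsStates:
    """Translate one DP state into the pair of DS states."""
    ds = _MAPPING[state]
    # Assumed invariant: a schedule is never online under an offline definition.
    if ds.schedule is ReleaseState.ONLINE and ds.definition is not ReleaseState.ONLINE:
        raise ValueError(f"invalid mapping for {state}")
    return ds
```

Making the pair explicit shows why the publishing pipeline needs two steps on the DS side: bring the definition online first, then the schedule.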
Task Execution Refactor
Previously, DP generated DAG files on the master, synced them to the workers, and ran the airflow test command. After the migration, DP creates a DS workflow definition through the API, triggers its execution, and polls the DS logs for real‑time feedback.
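The create, trigger, and poll flow might look roughly like the sketch below. SchedulerClient is a hypothetical interface standing in for whatever wrapper DP uses around the DS REST API; none of its method names come from DS itself. FakeClient is an in-memory stand-in so the example runs without a scheduler.

```python
import time
from typing import Callable, List, Protocol, Tuple


class SchedulerClient(Protocol):
    """Hypothetical wrapper interface; not the real DS API."""

    def create_workflow(self, definition: dict) -> str: ...
    def start(self, workflow_id: str) -> str: ...
    def read_log(self, instance_id: str, offset: int) -> Tuple[List[str], bool]: ...


def run_test(client: SchedulerClient, definition: dict,
             on_line: Callable[[str], None],
             poll_seconds: float = 1.0, max_polls: int = 600) -> bool:
    """Create a workflow, start it, and stream its log until it finishes.

    Returns True if the run finished within max_polls, False otherwise.
    """
    workflow_id = client.create_workflow(definition)
    instance_id = client.start(workflow_id)
    offset = 0
    for _ in range(max_polls):
        lines, finished = client.read_log(instance_id, offset)
        for line in lines:
            on_line(line)          # forward each new line to the caller
        offset += len(lines)       # only ask for unseen lines next time
        if finished:
            return True
        time.sleep(poll_seconds)
    return False


class FakeClient:
    """In-memory stand-in that serves a fixed log two lines per poll."""

    def __init__(self, log_lines: List[str]) -> None:
        self._log = list(log_lines)

    def create_workflow(self, definition: dict) -> str:
        return "wf-1"

    def start(self, workflow_id: str) -> str:
        return "inst-1"

    def read_log(self, instance_id: str, offset: int) -> Tuple[List[str], bool]:
        chunk = self._log[offset:offset + 2]
        return chunk, offset + len(chunk) >= len(self._log)
```

Tracking an offset keeps each poll cheap, and the max_polls bound prevents a stuck task from blocking the test page indefinitely.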
Workflow Publishing Refactor
Instead of syncing DAG files for the scheduler to scan, DP now pushes workflow definitions and schedule configurations directly to DS, which handles dynamic service registration and execution.
Capability Enhancements
Catch‑up Mechanism: DS will adopt Airflow‑style catch‑up to automatically backfill missed runs when the scheduler recovers from failures.
Cross‑DAG Global Backfill: By extending Airflow’s clear function, DP can clear downstream instances, apply pruning rules, and rely on catch‑up to re‑execute tasks in correct order.
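Both enhancements can be sketched in a few lines. The first function lists the schedule times a recovered scheduler would need to catch up on, assuming a fixed interval. The second collects a task's downstream instances across DAGs and returns them in dependency order. Function names, the "dag.task" keys, and the pruning semantics (a pruned task also cuts off anything reachable only through it) are assumptions for illustration, not DP's or DS's implementation.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Callable, Dict, List


def missed_runs(last_run: datetime, now: datetime,
                interval: timedelta) -> List[datetime]:
    """Schedule times after last_run and up to now (inclusive)."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    runs = []
    t = last_run + interval
    while t <= now:
        runs.append(t)
        t += interval
    return runs


def downstream_to_clear(start: str, children: Dict[str, List[str]],
                        keep: Callable[[str], bool] = lambda task: True
                        ) -> List[str]:
    """Return start plus its kept downstream tasks in dependency order."""
    # 1. Breadth-first walk that does not descend past pruned tasks.
    selected = {start}
    queue = deque([start])
    while queue:
        task = queue.popleft()
        for child in children.get(task, []):
            if child not in selected and keep(child):
                selected.add(child)
                queue.append(child)

    # 2. Kahn's algorithm on the selected subgraph.
    indegree = {task: 0 for task in selected}
    for task in selected:
        for child in children.get(task, []):
            if child in selected:
                indegree[child] += 1
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(children.get(task, [])):
            if child in selected:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    if len(order) != len(selected):
        raise ValueError("dependency cycle detected")
    return order
```

For example, with children = {"etl.load": ["report.build", "ml.train"], "report.build": ["mail.send"], "ml.train": ["mail.send"]}, clearing from "etl.load" returns the load task first and "mail.send" last, so re-execution never starts a task before its upstream instances.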
Current Status & Future Plan
DP has deployed DS services in a test environment, migrated all workflows, and is running scheduled tasks on both systems in parallel in QA. The next steps include staged performance and stress testing, a gray‑release in production by December, full migration in January, and completion of production migration by March.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
