How We Scaled Our Data Platform by Migrating to Apache DolphinScheduler
Facing growing task volumes and diverse workload types, we upgraded our data development platform's scheduling engine to Apache DolphinScheduler. This post details the migration process, architectural enhancements, stability and observability improvements, multi‑tenant support, and the resulting performance gains and future roadmap.
Background
As the number and variety of tasks increased, our data development platform required a more robust scheduling engine. We migrated to Apache DolphinScheduler and share our practical experience and reflections here.
Platform Architecture
The data computation layer handles all company‑wide data development needs, running various metric‑calculation tasks. Batch tasks run on the UDA data development platform, which supports end‑to‑end development scenarios (development, debugging, environment isolation, operation, monitoring). These capabilities rely heavily on a stable underlying scheduler.
The original scheduler, built in‑house around 2015, showed problems as task types and volumes grew:
Stability: frequent MySQL connection leaks and lock timeouts, causing scheduling bottlenecks.
Maintainability: core scheduler written in PHP with modules in Go, Java, Python; high maintenance cost and single points of failure.
Scalability: unable to keep up with fast‑growing business demands.
Observability: tasks launched via nohup often became orphaned (“flew away”), providing almost no metrics.
Core Requirements
Functionally, the scheduler must support multiple dependency forms, rich task types, custom extensions, online control, version rollback, and task lineage. System‑wise, high availability, tenant isolation, linear scalability, and observability are essential.
Migration to DolphinScheduler
After evaluating Airflow and DolphinScheduler, we migrated most tasks to DolphinScheduler over the past year.
Current task landscape on DolphinScheduler:
Task types: HiveSQL, SparkSQL, DorisSQL, PrestoSQL, and some shell tasks (remaining shell tasks still on the old scheduler).
Task volume: daily scheduling of tens of thousands of workflow instances, hundreds of thousands of task instances, with peaks of over 4,000 concurrent workflow instances. Migration is expected to double workflow instance count.
Improvements Made
Stability: leveraged DolphinScheduler’s well‑designed architecture to add features without major redesign.
SQL handling: enabled multi‑SQL submission per task by splitting each script into individual statements, executing them sequentially over JDBC, and managing connection pools.
Data source enhancements: added richer attributes (different HiveServer2, Kyuubi, Presto coordinators) and permission‑controlled resource queues.
Stateless workers: uploaded resource files, DQL results, and logs to Tencent COS for true statelessness.
Load balancing, multi‑tenant isolation, and database optimizations.
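To illustrate the multi‑SQL handling above, here is a minimal sketch of splitting a script into statements before sequential execution; this is an illustrative simplification, not the platform's actual parser (a production splitter would also handle `/* */` block comments and dialect‑specific escapes):

```python
def split_sql(script: str) -> list[str]:
    """Split a multi-statement SQL script on top-level semicolons.

    Semicolons inside quoted strings or -- line comments are kept
    as part of the statement; the resulting statements can then be
    executed one by one over a single JDBC connection.
    """
    statements, buf = [], []
    in_quote = None            # current quote char, or None
    in_line_comment = False
    i = 0
    while i < len(script):
        ch = script[i]
        if in_line_comment:
            buf.append(ch)
            if ch == "\n":
                in_line_comment = False
        elif in_quote:
            buf.append(ch)
            if ch == in_quote:
                in_quote = None
        elif ch in ("'", '"'):
            buf.append(ch)
            in_quote = ch
        elif ch == "-" and script[i:i + 2] == "--":
            buf.append(ch)
            in_line_comment = True
        elif ch == ";":
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
        else:
            buf.append(ch)
        i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements
```

Running each returned statement on the same connection preserves session‑level settings (e.g. a leading `SET` statement) across the whole script.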
Smooth Large‑Scale Migration
Three factors made a seamless migration essential:
Long‑standing user habits around the old scheduler’s features and field names.
Over 20,000 workflows to migrate, affecting many critical data streams.
Broad business coverage (platform, live courses, hardware, books).
We achieved near‑zero user impact by bridging the old and new schedulers and using a DIFF mechanism.
Bridging Old and New Schedulers
During migration, tasks ran partly on the new scheduler and partly on the old one. We unified task instance status in the original scheduler’s database, preserving existing query APIs and ensuring that updates to migrated tasks also updated DolphinScheduler definitions.
We modified DolphinScheduler's DependentTaskProcessor to query both systems, allowing tasks to depend on instances from either scheduler without user awareness.
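The dual‑system dependency check can be sketched as follows; the lookup callables and state names are hypothetical stand‑ins for the real instance queries, shown only to convey the fallback order:

```python
from enum import Enum


class DepState(Enum):
    SUCCESS = "success"
    RUNNING = "running"
    NOT_FOUND = "not_found"


def check_dependency(task_code, schedule_date, ds_lookup, legacy_lookup):
    """Resolve a dependency by consulting DolphinScheduler first and
    falling back to the legacy scheduler when no instance is found.

    ds_lookup / legacy_lookup are hypothetical callables that return a
    DepState for (task_code, schedule_date).
    """
    state = ds_lookup(task_code, schedule_date)
    if state is not DepState.NOT_FOUND:
        return state
    return legacy_lookup(task_code, schedule_date)
```

With this ordering, a task migrated mid‑way keeps satisfying downstream dependencies whichever scheduler produced the instance.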
DIFF Validation
We created “mirror” tasks in DolphinScheduler that do not execute but allow us to compare:
Scheduling times for compatibility of dependencies and cron settings.
SQL content after variable substitution, queue configuration, and masking to ensure exact matches.
After confirming no significant DIFF, we switched production tasks to the new engine.
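The SQL side of the DIFF check can be sketched as a normalize‑then‑compare step; the masking rules below (dates only) are illustrative, the real mechanism masks whatever environment‑specific values differ between the two engines:

```python
import re


def normalize_sql(sql: str) -> str:
    """Canonicalize rendered SQL for DIFF comparison: strip line
    comments, mask date-like literals so per-run values do not cause
    spurious diffs, and collapse whitespace."""
    s = re.sub(r"--[^\n]*", "", sql)                    # strip -- comments
    s = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "<DATE>", s)   # mask dates
    s = re.sub(r"\s+", " ", s).strip()
    return s


def sql_matches(old_sql: str, new_sql: str) -> bool:
    """True when the two rendered scripts agree after normalization."""
    return normalize_sql(old_sql) == normalize_sql(new_sql)
```

A mirror task's rendered SQL is compared against the old scheduler's output this way for every scheduled run before the switch.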
Observability Enhancements
DolphinScheduler now exports Prometheus‑compatible metrics, which we transformed into Falcon format for internal monitoring. We added high‑priority metrics and integrated phone/DingTalk alerts.
Improved observability reduced troubleshooting effort; for example, a zero‑value curve outside work hours indicates a user‑triggered anomaly.
Post‑migration, we closely monitor misfire rates, worker thread‑pool saturation, connection‑pool usage, I/O utilization, and overload indicators.
Migration Benefits
Database QPS reduced from >10,000 to ~500.
Load dropped from 4.0 to 1.0.
Overall resource usage decreased by 65%.
We now support SparkSQL, DorisSQL, and newer PrestoSQL tasks with minimal development effort.
Future Plans
Fully migrate routine tasks and debugging capabilities to DolphinScheduler, codify operational procedures (SOP).
Leverage community containerization progress to deploy modules on Kubernetes; API module already in production, Worker and Master components in progress.
Implement one‑click full‑link data back‑trace.
Integrate offline and real‑time platforms.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang