How We Scaled Our Data Platform by Migrating to Apache DolphinScheduler
Facing growing task volumes and diverse workload types, we upgraded our data development platform's scheduling engine to Apache DolphinScheduler. This post details the migration process, architectural enhancements, stability and observability improvements, multi‑tenant support, and the resulting performance gains and future roadmap.
Background
As the number and variety of tasks increased, our data development platform required a more robust scheduling engine. We migrated to Apache DolphinScheduler and share our practical experience and reflections here.
Platform Architecture
The data computation layer handles all company‑wide data development needs, running various metric‑calculation tasks. Batch tasks run on the UDA data development platform, which supports end‑to‑end development scenarios (development, debugging, environment isolation, operation, monitoring). These capabilities rely heavily on a stable underlying scheduler.
The original scheduler, built in‑house around 2015, showed problems as task types and volumes grew:
Stability: frequent MySQL connection leaks and lock timeouts, causing scheduling bottlenecks.
Maintainability: core scheduler written in PHP with modules in Go, Java, Python; high maintenance cost and single points of failure.
Scalability: unable to keep up with fast‑growing business demands.
Observability: tasks launched via nohup often became orphaned (“flew away”), providing almost no metrics.
Core Requirements
Functionally, the scheduler must support multiple dependency forms, rich task types, custom extensions, online control, version rollback, and task lineage. System‑wise, high availability, tenant isolation, linear scalability, and observability are essential.
Migration to DolphinScheduler
After evaluating Airflow and DolphinScheduler, we migrated most tasks to DolphinScheduler over the past year.
Current task landscape on DolphinScheduler:
Task types: HiveSQL, SparkSQL, DorisSQL, PrestoSQL, and some shell tasks (remaining shell tasks still on the old scheduler).
Task volume: daily scheduling of tens of thousands of workflow instances, hundreds of thousands of task instances, with peaks of over 4,000 concurrent workflow instances. Migration is expected to double workflow instance count.
Improvements Made
Stability: leveraged DolphinScheduler’s well‑designed architecture to add features without major redesign.
SQL handling: enabled multi‑SQL submission per task by splitting each script into individual statements, executing them sequentially over JDBC, and managing connection pools.
Data source enhancements: added richer attributes (different HiveServer2, Kyuubi, Presto coordinators) and permission‑controlled resource queues.
Stateless workers: uploaded resource files, DQL results, and logs to Tencent COS for true statelessness.
Load balancing, multi‑tenant isolation, and database optimizations.
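To illustrate the multi‑SQL handling above, here is a minimal sketch of splitting a script into statements before sequential execution; this is an illustrative simplification, not the platform's actual parser (a production splitter would also handle `/* */` block comments and dialect‑specific escapes):

```python
def split_sql(script: str) -> list[str]:
    """Split a multi-statement SQL script on top-level semicolons.

    Semicolons inside quoted strings or -- line comments are kept
    as part of the statement; the resulting statements can then be
    executed one by one over a single JDBC connection.
    """
    statements, buf = [], []
    in_quote = None            # current quote char, or None
    in_line_comment = False
    i = 0
    while i < len(script):
        ch = script[i]
        if in_line_comment:
            buf.append(ch)
            if ch == "\n":
                in_line_comment = False
        elif in_quote:
            buf.append(ch)
            if ch == in_quote:
                in_quote = None
        elif ch in ("'", '"'):
            buf.append(ch)
            in_quote = ch
        elif ch == "-" and script[i:i + 2] == "--":
            buf.append(ch)
            in_line_comment = True
        elif ch == ";":
            stmt = "".join(buf).strip()
            if stmt:
                statements.append(stmt)
            buf = []
        else:
            buf.append(ch)
        i += 1
    tail = "".join(buf).strip()
    if tail:
        statements.append(tail)
    return statements
```

Running each returned statement on the same connection preserves session‑level settings (e.g. a leading `SET` statement) across the whole script.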
Smooth Large‑Scale Migration
Three factors made a seamless migration essential:
Long‑standing user habits around the old scheduler’s features and field names.
Over 20,000 workflows to migrate, affecting many critical data streams.
Broad business coverage (platform, live courses, hardware, books).
We achieved near‑zero user impact by bridging the old and new schedulers and using a DIFF mechanism.
Bridging Old and New Schedulers
During migration, tasks ran partly on the new scheduler and partly on the old one. We unified task instance status in the original scheduler’s database, preserving existing query APIs and ensuring that updates to migrated tasks also updated DolphinScheduler definitions.
We modified DolphinScheduler's DependentTaskProcessor to query both systems, allowing tasks to depend on instances from either scheduler without user awareness.
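The dual‑system dependency check can be sketched as follows; the lookup callables and state names are hypothetical stand‑ins for the real instance queries, shown only to convey the fallback order:

```python
from enum import Enum


class DepState(Enum):
    SUCCESS = "success"
    RUNNING = "running"
    NOT_FOUND = "not_found"


def check_dependency(task_code, schedule_date, ds_lookup, legacy_lookup):
    """Resolve a dependency by consulting DolphinScheduler first and
    falling back to the legacy scheduler when no instance is found.

    ds_lookup / legacy_lookup are hypothetical callables that return a
    DepState for (task_code, schedule_date).
    """
    state = ds_lookup(task_code, schedule_date)
    if state is not DepState.NOT_FOUND:
        return state
    return legacy_lookup(task_code, schedule_date)
```

With this ordering, a task migrated mid‑way keeps satisfying downstream dependencies whichever scheduler produced the instance.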
DIFF Validation
We created “mirror” tasks in DolphinScheduler that do not execute but allow us to compare:
Scheduling times for compatibility of dependencies and cron settings.
SQL content after variable substitution, queue configuration, and masking to ensure exact matches.
After confirming no significant DIFF, we switched production tasks to the new engine.
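The SQL side of the DIFF check can be sketched as a normalize‑then‑compare step; the masking rules below (dates only) are illustrative, the real mechanism masks whatever environment‑specific values differ between the two engines:

```python
import re


def normalize_sql(sql: str) -> str:
    """Canonicalize rendered SQL for DIFF comparison: strip line
    comments, mask date-like literals so per-run values do not cause
    spurious diffs, and collapse whitespace."""
    s = re.sub(r"--[^\n]*", "", sql)                    # strip -- comments
    s = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "<DATE>", s)   # mask dates
    s = re.sub(r"\s+", " ", s).strip()
    return s


def sql_matches(old_sql: str, new_sql: str) -> bool:
    """True when the two rendered scripts agree after normalization."""
    return normalize_sql(old_sql) == normalize_sql(new_sql)
```

A mirror task's rendered SQL is compared against the old scheduler's output this way for every scheduled run before the switch.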
Observability Enhancements
DolphinScheduler now exports Prometheus‑compatible metrics, which we transformed into Falcon format for internal monitoring. We added high‑priority metrics and integrated phone/DingTalk alerts.
Improved observability reduced troubleshooting effort; for example, a zero‑value curve outside work hours indicates a user‑triggered anomaly.
Post‑migration, we closely monitor misfire rates, worker thread‑pool saturation, connection‑pool usage, I/O utilization, and overload indicators.
Migration Benefits
Database QPS reduced from >10,000 to ~500.
Load dropped from 4.0 to 1.0.
Overall resource usage decreased by 65%.
We now support SparkSQL, DorisSQL, and newer PrestoSQL tasks with minimal development effort.
Future Plans
Fully migrate routine tasks and debugging capabilities to DolphinScheduler, codify operational procedures (SOP).
Leverage community containerization progress to deploy modules on Kubernetes; API module already in production, Worker and Master components in progress.
Implement one‑click full‑link data back‑trace.
Integrate offline and real‑time platforms.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang