
Youzan Data Platform and DP Data Development Platform: Architecture, Core Modules, and Scheduling System Upgrade

This article presents an in‑depth overview of Youzan's data platform, introduces the DP data development platform with its key features and workflow, details the core module architecture—including service, scheduling, and component layers—and explains the migration from Airflow to DolphinScheduler to improve performance, stability, and scalability.

DataFunTalk

Overview – Youzan, a retail technology SaaS provider, has built a data middle‑platform to support over 60,000 daily scheduled tasks, offering product, service, and platform layers for internal operations, developers, and data integration.

DP Data Development Platform – The DP platform provides a unified environment for offline data tasks, shielding Hadoop complexity, ensuring data security, and supporting data synchronization, task scheduling, large‑scale offline computation, instant SQL queries, monitoring & alerts, and standardized development processes.

Data Development Workflow – A typical offline data development process involves importing data, transforming it, and exporting results, with each step representing a state transition in a data asset DAG; the platform abstracts these transitions to manage dependencies and lineage.
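The import → transform → export dependencies described above form a directed acyclic graph, and the scheduler must execute tasks in an order that respects it. A minimal sketch of that idea, using Python's standard `graphlib` and hypothetical task names (the real platform's task model is richer):

```python
from graphlib import TopologicalSorter

# Hypothetical offline jobs; each is a node in the data-asset DAG, and the
# sets list the upstream tasks it depends on (import -> transform -> export).
dag = {
    "import_orders": set(),
    "import_users": set(),
    "transform_join": {"import_orders", "import_users"},
    "export_report": {"transform_join"},
}

def schedulable_order(graph):
    """Return one valid execution order that respects all dependencies."""
    return list(TopologicalSorter(graph).static_order())
```

Here `schedulable_order(dag)` always places both imports before the transform and the export last, which is the guarantee a dependency-aware scheduler provides.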

Core Module Architecture

Service layer: job creation, testing, publishing, and operation management with HA master nodes.

Scheduling layer: originally built on Airflow and later upgraded to DolphinScheduler, providing failover, load balancing, and global priority handling.

Task component & base component layers: middleware and big‑data runtime environments.

Monitoring layer: resource, log, and intelligent monitoring with alerts.
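The "global priority handling" mentioned for the scheduling layer can be sketched with a simple priority queue: when cluster slots are scarce, higher-priority tasks are dispatched first, with ties broken first-in-first-out. This is an illustrative toy, not DolphinScheduler's actual implementation:

```python
import heapq
import itertools

class PriorityTaskQueue:
    """Toy global-priority queue: lower number = higher priority;
    a monotonically increasing counter breaks ties FIFO."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def submit(self, task_name, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), task_name))

    def next_task(self):
        """Pop the highest-priority task, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A real scheduler layers fairness, per-tenant quotas, and resource matching on top of this, but the dispatch order itself reduces to the same heap discipline.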

Scheduling System Upgrade – The original Airflow‑based scheduler faced performance, stability, and integration issues; DolphinScheduler was chosen for its distributed, multi‑master architecture, higher throughput, Java stack compatibility, richer features, and better HA.
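The failover behavior that motivated the multi-master design can be sketched as heartbeat-based detection plus task reassignment: a master that stops seeing a worker's heartbeats re-queues that worker's tasks onto healthy nodes. All names and thresholds below are illustrative assumptions, not DolphinScheduler internals:

```python
HEARTBEAT_TIMEOUT = 30  # seconds; illustrative threshold

def detect_failed_workers(heartbeats, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the workers whose last heartbeat is older than the timeout.

    heartbeats maps worker name -> timestamp of its last heartbeat.
    """
    return {w for w, last in heartbeats.items() if now - last > timeout}

def reassign_tasks(assignments, failed, healthy):
    """Move tasks off failed workers onto healthy ones, round-robin."""
    reassigned = {}
    i = 0
    for task, worker in assignments.items():
        if worker in failed and healthy:
            reassigned[task] = healthy[i % len(healthy)]
            i += 1
        else:
            reassigned[task] = worker
    return reassigned
```

In a multi-master deployment this logic itself must survive master failure, which is why DolphinScheduler coordinates masters through a registry (ZooKeeper) rather than relying on a single scheduler process as Airflow did at the time.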

Future Outlook – The DP platform will continue to improve usability, efficiency, stability, and data accuracy, while the new scheduling system will be gradually rolled out after extensive testing.

Tags: architecture, big data, data platform, scheduling, data development, DolphinScheduler
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
