How Youzan Built a Scalable Big Data Development Platform (DP)
This article details the design, architecture, and operational experience of Youzan's Data Platform (DP), covering its scheduling, data‑sync, service, and monitoring modules, the custom Airflow‑based task scheduler, current production metrics, supported task types, and future improvement plans.
Background and Motivation
As Youzan’s business grew, the demand for offline big‑data applications such as data synchronization (MySQL, Hive, HBase, Elasticsearch), offline computation (Hive, MapReduce, Spark), scheduling, result querying, and failure alerts increased dramatically. Before a unified platform existed, teams faced multiple entry points, high operational costs, steep Hadoop learning curves, duplicated development effort, and frequent cross‑department communication.
System Design Overview
The Data Platform (DP) was created to address these pain points through a visual interface that abstracts underlying big‑data tools.
Architecture
DP consists of four main modules:
Task Scheduling Module : Built on a heavily customized Apache Airflow, it adds multi‑queue priority scheduling, support for various task types (DataX, Datay, email export, ES export, Spark, etc.), global DAG‑based priority calculation, cross‑DAG dependency visualization, and one‑click clearing of downstream dependencies.
Base Module : Provides offline full and incremental data sync, binlog-based incremental sync, and Hive-to-ES and Hive-to-email export. A MySQL-to-HBase sync is still in development.
Service Module : Manages job lifecycle (create, modify, test, release, operations) using a Master/Slave deployment. The Master handles HA, hot‑restart, job management, test task distribution, and resource monitoring. Slave nodes execute commands from the Master and update resources via GitLab.
Monitoring Module : Offers three layers of monitoring – basic resource/alert monitoring, log monitoring via Kafka → Spark Streaming → NoSQL storage, and task‑prediction monitoring that simulates future schedules to anticipate failures or timeouts.
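The task-prediction layer described above can be illustrated with a small simulation. This is a minimal sketch, not DP's actual implementation: the `Task` fields, the minute-based clock, and the sample task names are all hypothetical, and the model assumes unlimited worker slots, so a task starts as soon as its schedule time has arrived and its upstream tasks have finished.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # Python 3.9+


@dataclass
class Task:
    name: str
    scheduled_at: int   # earliest start, in minutes after midnight
    est_minutes: int    # estimated duration, e.g. averaged from recent runs
    deadline: int       # latest acceptable finish, in minutes after midnight
    upstream: list = field(default_factory=list)


def simulate(tasks):
    """Return ({name: (start, end)}, [names predicted to miss their deadline])."""
    by_name = {t.name: t for t in tasks}
    # TopologicalSorter takes {node: predecessors} and yields upstreams first.
    order = TopologicalSorter({t.name: t.upstream for t in tasks}).static_order()
    times, late = {}, []
    for name in order:
        task = by_name[name]
        # Ready once every upstream task has finished (0 if there are none).
        ready = max((times[u][1] for u in task.upstream), default=0)
        start = max(task.scheduled_at, ready)
        end = start + task.est_minutes
        times[name] = (start, end)
        if end > task.deadline:
            late.append(name)
    return times, late


# Hypothetical three-step pipeline: sync, warehouse build, report export.
tasks = [
    Task("sync_orders", scheduled_at=60, est_minutes=45, deadline=120),
    Task("build_dw", scheduled_at=90, est_minutes=50, deadline=150,
         upstream=["sync_orders"]),
    Task("email_report", scheduled_at=120, est_minutes=10, deadline=160,
         upstream=["build_dw"]),
]
times, late = simulate(tasks)
for name in late:
    print(f"timeout warning: {name} predicted to end at minute {times[name][1]}")
```

In this example the sync finishes on time, but its 45-minute run pushes the warehouse build and the report past their deadlines, which is the kind of delay a forward simulation can flag before the schedule actually runs. A more realistic model would also account for queue slot limits.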
Task Scheduling Design
DP’s scheduler must support diverse task types, high concurrency (hundreds of tasks per second), priority for critical data‑warehouse jobs, load balancing across workers, high availability, and user‑friendly status/log displays.
After evaluating Azkaban, Oozie, and Airflow, the team chose Airflow + Celery + Redis + MySQL and applied deep customizations:
Support for Multiple Task Types : Implemented custom Operators for DataX import/export, Binlog‑based Datay tasks, Hive‑to‑Email, Hive‑to‑Elasticsearch, etc.
High Concurrency Management : Utilized Airflow's Pool + Queue + Slot mechanism together with Celery's distributed workers, so that concurrency is bounded by the configured slots and can be scaled out by adding workers.
Priority Scheduling : Calculated a global DAG priority based on upstream/downstream relationships and marked important nodes, ensuring critical jobs run first.
Resource‑Aware Load Balancing : Classified tasks by resource profile (CPU‑intensive Spark, memory‑intensive DataX, etc.), placed each class in its own queue with dedicated slot limits, and assigned multiple queues per worker to keep CPU/memory usage balanced (see Fig. 5).
High Availability : Deployed an active‑standby Scheduler pair; the standby monitors the active node’s health and takes over automatically if the active fails.
User‑Friendly UI : Leveraged Airflow’s built‑in web UI for clear status and log presentation.
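The priority and queueing ideas above can be sketched in plain Python. The rule used here is an assumption made for illustration, not DP's documented algorithm: every task inherits the highest importance found among its downstream tasks, across DAG boundaries, so the upstreams of a critical job are scheduled ahead of unrelated work. The task names, task types, and queue names are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical mapping from task type to a resource-specific queue.
QUEUE_BY_TYPE = {"spark": "cpu_heavy", "datax": "mem_heavy"}


def queue_for(task_type):
    """Pick the queue for a task type, falling back to a shared queue."""
    return QUEUE_BY_TYPE.get(task_type, "default")


def global_priorities(upstream, importance, default=1):
    """Compute one priority per task over a graph that may span several DAGs.

    upstream:   {task: [tasks it depends on]}
    importance: {task: weight} for tasks explicitly marked as important
    """
    order = list(TopologicalSorter(upstream).static_order())
    priority = {t: importance.get(t, default) for t in order}
    # Walk from the leaves back to the roots, so each task's priority is
    # final before it is propagated to its upstream tasks.
    for task in reversed(order):
        for parent in upstream.get(task, ()):
            priority[parent] = max(priority[parent], priority[task])
    return priority


upstream = {
    "dw.orders": ["sync.orders"],
    "bi.daily_report": ["dw.orders"],
    "adhoc.export": ["sync.logs"],
}
prio = global_priorities(upstream, importance={"bi.daily_report": 10})
for task, weight in prio.items():
    print(task, weight)
```

Here `sync.orders` and `dw.orders` both end up with priority 10 because the marked report depends on them, while the ad hoc chain stays at the default of 1. In Airflow, values like these would typically be passed to an operator through its `priority_weight`, `queue`, and `pool` arguments.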
Implemented prediction features : Identify tasks that are likely to fail (for example, because of DB configuration changes or upstream node failures) and send alerts; using recent runtime data, simulate the entire task schedule to compute each task's start and end times and raise timeout warnings.
Planned prediction features : Predict task runtime not only from historical data but also from input data volume, cluster resource utilization, and computational complexity, combining multiple feature dimensions.
Production Status
DP was initiated in January 2017, entered production in June, and has undergone several iteration cycles. The current deployment includes 2 Master nodes and 13 Slave nodes (2 for Scheduler, 11 for Workers), handling over 7,000 daily scheduled tasks across data‑warehouse, BI, e‑commerce, and payment lines.
Supported Task Types
Offline data sync: MySQL↔Hive full/incremental (DataX), MySQL→HBase (in development), Hive→Elasticsearch, MySQL→Hive via Binlog+Nsq/HDFS/MapReduce (Datay), etc.
Hadoop jobs: Hive, MapReduce, Spark, Spark SQL.
Other tasks: Export Hive tables via email (PDF/Excel/Txt), Python/Shell/JAR scripts.
Summary and Outlook
After a year and a half of continuous iteration, DP now reliably schedules 7k+ tasks per day with improved stability and usability, covering most offline big‑data development scenarios. Future work includes expanding task type coverage, deeper integration with other platforms for a one‑stop big‑data experience, providing a user dashboard for centralized management, and enhancing log management for faster error diagnosis.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
