
How Youzan Built a Scalable Big Data Development Platform (DP)

This article details the design, architecture, and operational experience of Youzan's Data Platform (DP), covering its scheduling, data‑sync, service, and monitoring modules, the custom Airflow‑based task scheduler, current production metrics, supported task types, and future improvement plans.

Youzan Coder

Jul 20, 2018


Background and Motivation

As Youzan’s business grew, the demand for offline big‑data applications such as data synchronization (MySQL, Hive, HBase, Elasticsearch), offline computation (Hive, MapReduce, Spark), scheduling, result querying, and failure alerts increased dramatically. Before a unified platform existed, teams faced multiple entry points, high operational costs, steep Hadoop learning curves, duplicated development effort, and frequent cross‑department communication.

System Design Overview

The Data Platform (DP) was created to address these pain points through a visual interface that abstracts underlying big‑data tools.

Architecture

DP consists of four main modules:

Task Scheduling Module: Built on a heavily customized Apache Airflow, it adds multi‑queue priority scheduling, support for various task types (DataX, Datay, email export, ES export, Spark, etc.), global DAG‑based priority calculation, cross‑DAG dependency visualization, and one‑click clearing of downstream dependencies.

Base Module: Provides offline full/incremental data sync, binlog‑based incremental sync, and Hive‑to‑ES/email export. MySQL‑to‑HBase sync is still in development.

Service Module: Manages the job lifecycle (create, modify, test, release, operate) using a Master/Slave deployment. The Master handles HA, hot restart, job management, test‑task distribution, and resource monitoring. Slave nodes execute commands from the Master and update resources via GitLab.

Monitoring Module: Offers three layers of monitoring: basic resource/alert monitoring, log monitoring via Kafka → Spark Streaming → NoSQL storage, and task‑prediction monitoring that simulates future schedules to anticipate failures or timeouts.

Task Scheduling Design

DP’s scheduler must support diverse task types, high concurrency (hundreds of tasks per second), priority for critical data‑warehouse jobs, load balancing across workers, high availability, and user‑friendly status/log displays.

After evaluating Azkaban, Oozie, and Airflow, the team chose Airflow + Celery + Redis + MySQL and applied deep customizations:

Support for Multiple Task Types: Implemented custom Operators for DataX import/export, binlog‑based Datay tasks, Hive‑to‑Email, Hive‑to‑Elasticsearch, etc.

High Concurrency Management: Used Airflow’s Pool + Queue + Slot mechanism together with Celery’s distributed workers, so that parallelism can be scaled out by adding workers instead of being capped by a single node.

Priority Scheduling: Calculated a global priority across DAGs based on upstream/downstream relationships and on nodes marked as important, ensuring critical jobs run first.

Resource‑Aware Load Balancing: Classified tasks by resource profile (CPU‑intensive Spark, memory‑intensive DataX, etc.), placed each class in its own queue with dedicated slot limits, and assigned multiple queues per worker to keep CPU/memory usage balanced (see Fig. 5).

High Availability: Deployed an active‑standby Scheduler pair; the standby monitors the active node’s health and takes over automatically if the active node fails.

User‑Friendly UI: Leveraged Airflow’s built‑in web UI for clear status and log presentation.
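The priority and queue/slot items above can be illustrated with a small standard-library sketch. It is not DP's actual code: the weighting rule (a task's own weight plus the weights of all its transitive downstream tasks, with a boost for nodes marked important), the `boost` value, and every task and queue name are assumptions made for illustration.

```python
from collections import defaultdict


def global_priorities(edges, important, boost=100):
    """Weight of a task = its own weight plus the own weight of every
    transitive downstream task, over one graph spanning all DAGs."""
    downstream = defaultdict(set)
    tasks = set(important)
    for upstream_task, downstream_task in edges:
        downstream[upstream_task].add(downstream_task)
        tasks.update((upstream_task, downstream_task))

    reachable = {}

    def descendants(task):
        # Memoised depth-first walk; assumes the dependency graph is acyclic.
        if task not in reachable:
            found = set()
            for child in downstream[task]:
                found.add(child)
                found |= descendants(child)
            reachable[task] = found
        return reachable[task]

    # Hypothetical rule: important nodes get `boost`, all others weight 1.
    own = {t: (boost if t in important else 1) for t in tasks}
    return {t: own[t] + sum(own[d] for d in descendants(t)) for t in tasks}


def dispatch(ready, weights, queue_of, slots):
    """Pick ready tasks in descending weight; each one uses a slot of its queue."""
    free = dict(slots)
    picked = []
    for task in sorted(ready, key=lambda t: (-weights[t], t)):
        queue = queue_of[task]
        if free.get(queue, 0) > 0:
            free[queue] -= 1
            picked.append(task)
    return picked


# Hypothetical cross-DAG dependency edges: (upstream, downstream).
EDGES = [
    ("ods_orders", "dw_orders"),
    ("dw_orders", "bi_report"),
    ("dw_orders", "finance_core"),
    ("ods_logs", "adhoc_stats"),
]
weights = global_priorities(EDGES, important={"finance_core"})
queue_of = {"ods_orders": "datax", "ods_logs": "datax"}
picked = dispatch(["ods_logs", "ods_orders"], weights, queue_of, {"datax": 1})
print(weights["ods_orders"], weights["ods_logs"], picked)  # 103 2 ['ods_orders']
```

With a single free `datax` slot, `ods_orders` is dispatched first because the important `finance_core` node sits downstream of it. Stock Airflow offers a comparable knob through an operator's `priority_weight` and `weight_rule`, although the article does not say whether DP builds on it.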

* Currently implemented (task‑prediction monitoring): identify tasks that are likely to fail (for example because of database configuration changes or upstream node failures) and send alerts; and, based on recent runtime data, simulate the entire task schedule to compute expected start and end times and raise timeout warnings.

* Future plans: predict task runtime not only from historical run times but also from additional features such as input data volume, cluster resource utilization, and computational complexity.
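The schedule simulation described above can be sketched as follows. This is a minimal illustration, not DP's implementation: it assumes each task has a planned start time, a duration estimated from recent runs, and an optional deadline (all in minutes after midnight), that every dependency is itself defined in the input, and that slot contention can be ignored. The task names and numbers are invented. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter


def simulate(tasks):
    """Predict (start, end) per task and list the tasks expected to miss
    their deadline.

    tasks: {name: {"deps": [...], "scheduled": int, "duration": int,
                   "deadline": int or None}}
    """
    # Map each task to its predecessors so upstream tasks are visited first.
    graph = {name: spec["deps"] for name, spec in tasks.items()}
    timeline, warnings = {}, []
    for name in TopologicalSorter(graph).static_order():
        spec = tasks[name]
        # A task starts at its planned time or when its last upstream ends.
        upstream_ends = [timeline[dep][1] for dep in spec["deps"]]
        start = max([spec["scheduled"]] + upstream_ends)
        end = start + spec["duration"]
        timeline[name] = (start, end)
        deadline = spec.get("deadline")
        if deadline is not None and end > deadline:
            warnings.append(name)
    return timeline, warnings


# Hypothetical tasks; durations would come from recent run history.
TASKS = {
    "sync_orders": {"deps": [], "scheduled": 0, "duration": 40, "deadline": None},
    "dw_orders": {"deps": ["sync_orders"], "scheduled": 30, "duration": 50, "deadline": 120},
    "bi_report": {"deps": ["dw_orders"], "scheduled": 60, "duration": 45, "deadline": 120},
}
timeline, warnings = simulate(TASKS)
print(timeline["bi_report"], warnings)  # (90, 135) ['bi_report']
```

Here `bi_report` is planned for minute 60 but cannot start until `dw_orders` finishes at minute 90, so its predicted end (135) passes the 120-minute deadline and it is flagged before it ever runs.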

Production Status

DP was initiated in January 2017, entered production in June of that year, and has since gone through several iteration cycles. The current deployment includes 2 Master nodes and 13 Slave nodes (2 for Scheduler, 11 for Workers), handling over 7,000 daily scheduled tasks across the data‑warehouse, BI, e‑commerce, and payment business lines.

Supported Task Types

Offline data sync: MySQL↔Hive full/incremental (DataX), MySQL→HBase (in development), Hive→Elasticsearch, MySQL→Hive via Binlog+Nsq/HDFS/MapReduce (Datay), etc.

Hadoop jobs: Hive, MapReduce, Spark, Spark SQL.

Other tasks: Export Hive tables via email (PDF/Excel/Txt), Python/Shell/JAR scripts.

Summary and Outlook

After a year and a half of continuous iteration, DP now reliably schedules 7k+ tasks per day with improved stability and usability, covering most offline big‑data development scenarios. Future work includes expanding task type coverage, deeper integration with other platforms for a one‑stop big‑data experience, providing a user dashboard for centralized management, and enhancing log management for faster error diagnosis.
