
Design and Implementation of a DataX‑Based Data Synchronization Platform at Youzan

Youzan replaced Sqoop with a customized DataX‑based platform that integrates with its offline scheduler to reliably sync MySQL, HBase, Elasticsearch and file sources to Hive, handling schema changes, sharding, rate‑limiting and logging, and has processed billions of rows daily with high stability.

Youzan Coder

In the early stage of Youzan's big-data platform, Sqoop handled MySQL-to-Hive data sync, but growing business requirements soon exceeded its capabilities. The team identified several pain points: frequent sync failures caused by MySQL schema changes, the need for read/write separation and sharding support, sync jobs triggering database load alarms, and demand for additional sources such as HBase, Elasticsearch and plain files.

To address these needs, the team evaluated open‑source tools DataX and Sqoop. A feature comparison showed that DataX offered multi‑threaded single‑process execution, better flow control, built‑in statistics, and a more active community, while Sqoop relied on MapReduce and lacked many of the desired features. Consequently, DataX was selected as the foundation.

The early design focused on integrating DataX with the existing offline task scheduler. The scheduler triggers DataX jobs, while DataX handles only the data movement. This approach reuses existing scheduling capabilities and avoids duplicate development. A diagram (omitted) illustrated DataX workers running on each platform node, with multiple processes managed by the scheduler.
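To make the division of labor concrete, the sketch below shows a scheduler-side helper that assembles a DataX job description and hands it to DataX's stock launcher. The mysqlreader and hdfswriter plugin names are standard DataX plugins; the credentials, defaultFS address, and DataX install path are placeholders, and the exact parameters Youzan used are not given in the article.

```python
import json
import subprocess
import tempfile

def build_job_config(mysql_jdbc_url, table, columns, hdfs_path):
    """Assemble a minimal DataX job description (mysqlreader -> hdfswriter)."""
    return {
        "job": {
            "setting": {"speed": {"channel": 3}},  # parallel channels in one process
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "sync_user",   # placeholder credentials
                        "password": "***",
                        "column": columns,
                        "connection": [{
                            "jdbcUrl": [mysql_jdbc_url],
                            "table": [table],
                        }],
                    },
                },
                "writer": {
                    "name": "hdfswriter",
                    "parameter": {
                        "defaultFS": "hdfs://nameservice1",  # placeholder
                        "path": hdfs_path,
                        "fileName": table,
                        "fileType": "orc",
                        "writeMode": "append",
                        "column": [{"name": c, "type": "string"} for c in columns],
                        "fieldDelimiter": "\u0001",
                    },
                },
            }],
        }
    }

def launch(config, datax_home="/opt/datax"):
    """Write the job file and build the command the scheduler would run
    via DataX's bundled launcher script."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        json.dump(config, f)
        job_file = f.name
    return ["python", f"{datax_home}/bin/datax.py", job_file]
```

The scheduler only produces the job file and supervises the process; everything inside the JSON stays pure data movement, which is exactly the separation the design aims for.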

Key customizations were made to the DataX executor:

Reporting job status (started, running, finished) and progress back to the platform.

Streaming logs to the platform’s log system for real‑time monitoring.

Passing parameters for statistics, validation, and flow‑control modules from the platform and persisting results.

Handling MySQL failover, schema changes, and sharding scenarios gracefully.

The development strategy emphasized keeping DataX focused on pure data transfer. Business‑specific logic (e.g., metadata lookup, Hive DDL generation) was placed outside DataX, and source code modifications were limited to cases where the sync process could not meet requirements.

For Hive integration, the team wrapped Hive read/write functionality in a thin layer that translates Hive configurations into HDFS configurations and performs necessary DDL operations. Hive reads involve constructing HDFS paths from table names, while Hive writes require fetching file format and delimiter information from the metadata system and supporting table/partition creation.
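A minimal sketch of that translation layer, assuming the default Hive warehouse layout (a real platform would ask the Hive metastore for the actual table location rather than hard-coding the root):

```python
def hive_table_to_hdfs_path(db, table, partition=None,
                            warehouse="/user/hive/warehouse"):
    """Map a Hive table (and optional partition) to its HDFS directory.
    Assumes the default warehouse layout; consult the metastore in practice."""
    path = f"{warehouse}/{db}.db/{table}"
    if partition:  # e.g. {"dt": "20190101"}
        path += "/" + "/".join(f"{k}={v}" for k, v in partition.items())
    return path

def add_partition_ddl(db, table, partition):
    """DDL issued before writing, so the new partition is visible in Hive."""
    spec = ", ".join(f"{k}='{v}'" for k, v in partition.items())
    return f"ALTER TABLE {db}.{table} ADD IF NOT EXISTS PARTITION ({spec})"
```

With this in place, DataX itself only ever sees an HDFS path plus a file format and delimiter; all Hive awareness stays in the wrapper.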

MySQL‑to‑Hive compatibility was handled by enforcing Hive schema as the source of truth. Non‑partitioned tables are fully re‑imported; mismatched schemas trigger a Hive table drop‑and‑recreate. For partitioned tables, a conservative policy was adopted: extra MySQL fields are ignored, missing fields cause errors, and field order mismatches are corrected.
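The partitioned-table policy can be sketched as a small reconciliation step run before each sync (the function name and return shape are illustrative, not from the article):

```python
def reconcile_columns(hive_cols, mysql_cols):
    """Partitioned-table policy with Hive as the source of truth:
    columns missing on the MySQL side are a hard error, extra MySQL
    columns are silently ignored, and the read order follows Hive."""
    missing = [c for c in hive_cols if c not in mysql_cols]
    if missing:
        raise ValueError(f"MySQL side is missing Hive columns: {missing}")
    ignored = [c for c in mysql_cols if c not in hive_cols]
    # Read columns in Hive order, regardless of MySQL column order.
    return list(hive_cols), ignored
```

Failing fast on missing columns is the conservative choice: silently writing NULLs into an existing partition scheme would corrupt downstream tables, whereas an error simply pauses the job until the schemas are aligned.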

To work with Youzan's RDS‑managed MySQL clusters, two connection strategies were defined. Direct instance connections provide the best performance for reads, while RDS‑middleware connections are used for writes to avoid replication lag. The platform maintains up‑to‑date primary/replica addresses, performs periodic connectivity checks, and coordinates with DBAs for schema changes.
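The routing rule reduces to a small dispatch on operation type. This is a hedged sketch; the topology structure and endpoint strings are invented for illustration:

```python
import random

def pick_endpoint(op, topology):
    """Route by operation type: direct replica connections for heavy reads,
    the RDS middleware address for writes so they always reach the primary
    and are unaffected by replication lag."""
    if op == "read":
        return random.choice(topology["replicas"])  # list kept fresh by the platform
    if op == "write":
        return topology["middleware"]
    raise ValueError(f"unknown operation: {op}")
```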

Operational safeguards include:

Splitting large full‑table scans into many small primary‑key‑based queries so each query finishes in under 2 seconds, e.g., select ... from table_name where id > ? order by id asc limit ? .

Introducing adaptive sleep intervals after each batch to prevent overwhelming the database.

Dynamic rate‑limiting based on system metrics such as CPU, disk usage, and binlog delay.
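The first two safeguards, keyset pagination plus a pause between batches, can be sketched as follows. SQLite stands in for MySQL here so the example is self-contained; the fixed pause would be the adaptive interval in production:

```python
import sqlite3
import time

def scan_in_chunks(conn, table, batch_size=2, pause=0.0):
    """Replace one long full-table scan with many short primary-key range
    queries (WHERE id > ? ORDER BY id LIMIT ?), sleeping between batches
    so the source database is never saturated."""
    last_id = 0
    while True:
        rows = conn.execute(
            f"SELECT id, payload FROM {table} WHERE id > ? ORDER BY id ASC LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield from rows
        last_id = rows[-1][0]   # resume from the last key seen
        time.sleep(pause)       # adaptive in production; fixed here

# Tiny demo on an in-memory SQLite table standing in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, f"row-{i}") for i in range(1, 6)])
rows = list(scan_in_chunks(conn, "orders", batch_size=2))
```

Because each query is bounded by the primary-key range and the limit, no single statement holds locks or buffers for long, which is what keeps the per-query time under the 2-second budget.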

Exception handling was improved by replacing blanket catch‑Exception blocks with categorized retries for SQL errors, batch failures, and network issues.
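A minimal sketch of categorized retries, with invented error classes standing in for what the JDBC layer would actually throw; the per-category budgets are illustrative:

```python
import time

# Stand-ins for the error categories the sync layer would distinguish.
class TransientNetworkError(Exception): pass
class BatchWriteError(Exception): pass

RETRYABLE = {TransientNetworkError: 5, BatchWriteError: 3}  # attempt budgets

def run_with_retries(fn, backoff=0.0):
    """Retry only the categories known to be transient, each with its own
    attempt budget, instead of one blanket catch-Exception. Anything not
    listed as retryable propagates immediately."""
    attempts = {}
    while True:
        try:
            return fn()
        except tuple(RETRYABLE) as e:
            cat = type(e)
            attempts[cat] = attempts.get(cat, 0) + 1
            if attempts[cat] >= RETRYABLE[cat]:
                raise            # budget exhausted: surface the error
            time.sleep(backoff)
```

The key property is that a genuinely broken job (bad SQL, schema mismatch) fails loudly on the first attempt, while flaky network or batch errors get a bounded number of second chances.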

Testing enhancements include nightly replay of all critical DataX jobs with a test flag that limits input to a single row and redirects output to a sandbox environment, ensuring both job logic and execution environment remain healthy.
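One way such a test flag could be applied is as a pure transformation of the job config before launch. The key names follow common DataX mysqlreader/hdfswriter parameters (querySql overrides table/column when present); the sandbox path and the exact mechanism Youzan used are assumptions:

```python
import copy

def to_test_job(config, sandbox_path="/tmp/datax_sandbox"):
    """Nightly replay mode: limit the read to a single row and redirect
    the write to a sandbox path, so job logic and execution environment
    are exercised without touching production data."""
    cfg = copy.deepcopy(config)   # never mutate the production job
    content = cfg["job"]["content"][0]
    reader = content["reader"]["parameter"]
    table = reader["connection"][0]["table"][0]
    cols = ", ".join(reader["column"])
    # querySql takes precedence over table/column in mysqlreader.
    reader["connection"][0]["querySql"] = [f"SELECT {cols} FROM {table} LIMIT 1"]
    content["writer"]["parameter"]["path"] = sandbox_path
    return cfg
```

Because the transformation is applied at launch time, the replayed job exercises the same config, plugins, and cluster as the production run, only with negligible data volume.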

Production experience shows that since its launch in Q2 2017, DataX has run over 6,000 tasks daily, processing more than 10 billion rows, with only minor bugs (e.g., an ORC reader issue fixed by reading all file splits). The system has been stable for over 20 months.

Future work is limited; most remaining items (e.g., dirty‑data readability, HDFS HA) are low‑priority and will not be actively developed. Incremental and real‑time sync needs are addressed by a separate, self‑developed product not covered in this article.

Tags: Big Data · Hive · MySQL · Data Synchronization · DataX · ETL
Written by Youzan Coder, the official Youzan tech channel, delivering technical insights from the Youzan tech team.
