
How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article traces Zhihu's move from ad‑hoc MySQL‑to‑Hive sync with Oozie + Sqoop to a unified, platform‑based data synchronization service. The platform now handles over a thousand tables and more than 10 TB per day, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with the offline job scheduler.

dbaplus Community

1. Business Scenario and Architecture

Zhihu's online services store data primarily in MySQL, while the offline data warehouse relies on Hive. Direct OLAP queries on the online databases are unsafe, so a reliable data‑sync layer is needed to feed the warehouse without impacting production workloads.

Early sync used Oozie + Sqoop, which met basic needs but suffered from duplicate jobs, lack of load management, and MySQL overload during nightly peaks. To solve visibility, configuration, and load‑balancing problems, Zhihu built a unified data‑sync platform that now supports over a thousand tables and more than 10 TB of data per day.

2. Technology Selection

Two open‑source batch sync tools were evaluated:

Sqoop – MapReduce‑based, good Hive compatibility, automatic schema migration, strong community support; drawbacks: limited source support and no built‑in throttling.

DataX – Rich source/target ecosystem, supports throttling, easy plugin development; drawbacks: requires extra runtime resources and lacks native Hive export.

Considering resource consumption and tighter Hadoop integration, Sqoop was chosen as the primary sync engine.

3. Platform Design and Practices

The platform aims to provide a generic sync service that simplifies onboarding, offers monitoring/alerting, shields MySQL DDL changes, and allows extensible new sources.

Key components (illustrated in the architecture diagram):

API Server – UI and RESTful API.

Data Source Registry – stores source metadata and refreshes it regularly.

Scheduler – plans task execution, protects MySQL from overload.

Workers – distributed executors that run the actual sync jobs.

Important features:

Simplified task onboarding – users describe source, sink, schedule, etc., without needing to understand underlying sync mechanics; tasks are reviewed before activation.

Incremental sync – for large tables that follow either an “append‑only” or a “soft‑delete with timestamp” pattern, only rows added or modified since the previous run are transferred; tables with ≤ 20 million rows are usually synced in full.

Schema change handling – the platform snapshots source schemas, detects DDL, and propagates compatible changes downstream while blocking destructive operations.

Scheduler integration – after a sync finishes, the platform notifies Zhihu's offline job scheduler to trigger dependent ETL tasks, eliminating manual coordination.

Monitoring & alerting – tracks MySQL load, IOPS, network bandwidth, Yarn queue lengths, and job error counts; alerts focus on queue saturation and sync failures.
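Two of the features above, incremental sync and schema change handling, reduce to small decision rules. The sketch below is a minimal illustration under stated assumptions, not the platform's actual code: `TableProfile`, its field names, and both function names are hypothetical, and the split between "compatible" changes (added columns) and "destructive" ones (dropped or retyped columns) is an assumed policy that the article does not spell out. Only the 20 million row threshold comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

FULL_SYNC_ROW_LIMIT = 20_000_000  # threshold quoted in the text above


@dataclass
class TableProfile:
    """Hypothetical per-table metadata; field names are illustrative."""
    row_count: int
    append_only_key: Optional[str] = None       # e.g. an auto-increment id
    modified_time_column: Optional[str] = None  # e.g. updated_at on a soft-delete table


def incremental_predicate(table: TableProfile, last_value: str) -> Optional[str]:
    """Return a WHERE predicate for an incremental pull, or None for a full sync."""
    if table.row_count <= FULL_SYNC_ROW_LIMIT:
        return None  # small tables are simply re-synced in full
    if table.append_only_key:
        return f"{table.append_only_key} > {int(last_value)}"
    if table.modified_time_column:
        return f"{table.modified_time_column} > '{last_value}'"
    return None  # neither pattern holds, so fall back to a full sync


def classify_schema_change(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Split a schema diff into columns to propagate and columns that block the sync.

    Assumed policy: added columns are compatible; dropped or retyped columns are not.
    """
    added = [col for col in new if col not in old]
    dropped = [col for col in old if col not in new]
    retyped = [col for col in new if col in old and new[col] != old[col]]
    return {"propagate": added, "block": dropped + retyped}
```

For a 50 million row append‑only table keyed on `id`, `incremental_predicate(table, "12345")` yields `id > 12345`, which could be passed to a sync tool as a row filter. A production version would need to bind values safely rather than format them into SQL text.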

4. Platform Optimizations and Practices

Resource management – dedicated MySQL replica for sync, centralised resource scheduler (similar to YARN) with persistent state, and per‑instance queues to avoid over‑committing IOPS.
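The per‑instance queue idea can be sketched as follows. This is an in‑memory toy under assumptions: the class name `InstanceQueues`, the single concurrency cap, and the FIFO policy are all hypothetical, and the persistent state the article mentions is deliberately left out.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List


class InstanceQueues:
    """Per-MySQL-instance FIFO queues with a cap on concurrently running tasks."""

    def __init__(self, max_running_per_instance: int) -> None:
        self.max_running = max_running_per_instance
        self.waiting: Dict[str, Deque[str]] = defaultdict(deque)
        self.running: Dict[str, int] = defaultdict(int)

    def submit(self, instance: str, task_id: str) -> None:
        """Queue a sync task behind others that target the same instance."""
        self.waiting[instance].append(task_id)

    def dispatch(self) -> List[str]:
        """Start as many waiting tasks as each instance's cap allows."""
        started: List[str] = []
        for instance, queue in self.waiting.items():
            while queue and self.running[instance] < self.max_running:
                started.append(queue.popleft())
                self.running[instance] += 1
        return started

    def finish(self, instance: str) -> None:
        """Release one slot on the instance when a task completes."""
        self.running[instance] = max(0, self.running[instance] - 1)
```

Because the cap is applied per instance, a backlog on one busy database does not delay tasks that read from another. A real scheduler would likely derive the cap from observed load such as IOPS rather than use a fixed number.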

Storage format choice – Parquet was selected over ORC for compatibility with Impala, despite ORC’s tighter Hive integration.

Dynamic concurrency – Sqoop’s default of 4 mappers is too few for large tables, so the platform sets the mapper count from source metadata.
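The article does not give the sizing rule, so the following is only one plausible heuristic: scale the mapper count with the table's row count and clamp it. The function name, the rows‑per‑mapper target, and the upper bound are assumptions chosen for illustration; the result would be passed to Sqoop through its `--num-mappers` option.

```python
import math

DEFAULT_MAPPERS = 4  # Sqoop's default parallelism


def choose_num_mappers(
    row_count: int,
    rows_per_mapper: int = 5_000_000,  # assumed target, not from the article
    max_mappers: int = 32,             # assumed cap to protect the source database
) -> int:
    """Pick a mapper count that grows with table size, within fixed bounds."""
    wanted = math.ceil(row_count / rows_per_mapper)
    return max(DEFAULT_MAPPERS, min(wanted, max_mappers))
```

The upper bound matters as much as the scaling: each mapper opens its own connection to MySQL, so unbounded parallelism would work against the load protection described elsewhere in the article.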

Distributed Cache optimization – shipping the large set of Hadoop library jars through the distributed cache on each submission put pressure on HDFS; the platform skips that step with --skip-dist-cache to reduce job‑submission latency.

Speculative execution control – speculative execution is disabled so that a slow, skewed map task is not launched a second time in parallel, which would read the same MySQL rows twice and cause unnecessary I/O spikes.
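Put together, the tuning choices in this section could translate into a Sqoop invocation along these lines. This is a sketch under assumptions, not the platform's actual command: the function name, the JDBC URL, and the paths are hypothetical, and credentials are omitted. The flags themselves are standard: `--as-parquetfile`, `--num-mappers`, and `--skip-dist-cache` are Sqoop options (the last is documented for jobs whose dependencies are already provided, for example by Oozie), and `mapreduce.map.speculative` is the Hadoop property that controls map‑side speculative execution.

```python
import shlex
from typing import List


def build_sqoop_import(jdbc_url: str, table: str, target_dir: str, num_mappers: int) -> List[str]:
    """Assemble a Sqoop import command as an argument list."""
    return [
        "sqoop", "import",
        # Generic Hadoop options must come directly after the tool name.
        "-D", "mapreduce.map.speculative=false",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
        "--as-parquetfile",    # Parquet output for Impala compatibility
        "--skip-dist-cache",   # do not copy Sqoop dependencies into the job cache
    ]


if __name__ == "__main__":
    # Hypothetical replica host and warehouse path, for illustration only.
    cmd = build_sqoop_import(
        "jdbc:mysql://replica.example.internal/app_db",
        "answers",
        "/warehouse/staging/answers",
        8,
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string lets a worker hand it to `subprocess.run` without shell quoting issues.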

5. Outlook

Future work includes automated detection and cleanup of obsolete sync tasks, expanding source support to HBase, Elasticsearch, and real‑time streaming sync (e.g., Kafka → Kudu/Elasticsearch) to complement batch pipelines.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
