Big Data 21 min read

How We Migrated a 200‑Node Hadoop Cluster Across Data Centers: Lessons and Strategies

This article presents a comprehensive case study of migrating a 200‑plus node Hadoop offline platform across data centers, covering background, architecture, solution evaluation, detailed implementation steps, consistency checks, operational safeguards, encountered issues, and future recommendations.

Youzan Coder

Jul 29, 2020

How We Migrated a 200‑Node Hadoop Cluster Across Data Centers: Lessons and Strategies

Background

The existing Hadoop offline platform served over 200 physical machines and more than 40,000 daily scheduled tasks. By late 2019 the data center could no longer accommodate growth, so a full migration to a new cloud provider was required.

Technical Architecture

The platform consists of three logical layers:

Hadoop ecosystem components (HDFS, YARN, Spark, Hive, Presto, HBase, Kafka, Kylin, etc.)

Foundational services (Airflow for scheduling, DataX for batch sync, binlog‑based incremental sync, SQL parsing/execution services, monitoring and diagnostics)

Platform‑level products: Data Development Platform (DP), Asset Management Platform, Data Visualization Platform, Algorithm Training Platform

Migration Options

1. Single‑Cluster Approach

Both data centers share a single Hadoop cluster (one active NameNode, DataNodes deployed in both sites). Two variants were studied:

Plan A : Gradually expand DataNodes in the new site while shrinking the old site, then use the native HDFS Balancer to rebalance block replicas and finally switch the active NameNode. Simple but requires large cross‑site bandwidth for shuffle and file reads.

Plan B : Leverage Hadoop Rack Awareness to place block replicas across sites (e.g., 1:2 ratio for three replicas) or develop a custom tool that reads the FSImage, inspects replica locations and rebalances under throttling.

2. Multi‑Cluster Approach

A new Hadoop cluster is built in the target data center. Initial full copy of HDFS data is performed with DistCp, followed by incremental sync until both clusters are identical, then the old cluster is decommissioned.

Plan C : After full data sync, switch all references in DP to the new cluster in one step. Fast but high risk if consistency is not perfect.

Plan D : Migrate layer by layer (ODS → DW → DM) and by business line, reducing risk and allowing partial roll‑backs.

Evaluation and Decision

Although the single‑cluster design offers better user transparency, limited cross‑site bandwidth and the need for infrastructure changes made the multi‑cluster approach more feasible. Plan D (layered multi‑cluster migration) was selected.

Implementation Process

3.1 Full HDFS Data Copy

A new Hadoop cluster was provisioned, performance‑tested, and capacity‑sized. Using DistCp with bandwidth limiting ( -bandwidth) and update mode ( -update), a two‑week full copy was performed while the source cluster ran at low load. The new cluster initially used a single replica to accelerate transfer.

3.2 Offline Task Migration

All offline tasks are managed through the DP platform. Because two Hadoop clusters coexist, the migration focused on keeping DP task definitions and states consistent.

3.2.1 DP Platform Overview

The DP platform supports the following task types:

Full/incremental MySQL → Hive imports

Binlog‑based incremental imports (binlog → Canal → NSQ → Flume → HDFS → Hive)

Exports (Hive → MySQL/Elasticsearch/HBase)

Hive SQL and Spark SQL jobs

Spark JAR and MapReduce jobs

Custom script tasks

3.2.2 Ensuring Task State Consistency

When a user creates or updates a task in the old DP, an event is emitted to Kafka. The new DP subscribes to these events and replays them, keeping both environments synchronized.

3.2.3 DP Migration State Machine

The migration UI presents two execution modes for each workflow:

Dual‑run : Tasks run on both old and new clusters simultaneously.

Full migration : After all downstream workflows have been migrated, tasks run only on the new cluster.

Recommended execution options per task type:

MySQL → Hive imports: dual‑run (both clusters pull from the source MySQL).

Hive / SparkSQL: dual‑run.

MapReduce / Spark JAR: user decides based on idempotence and configuration.

Exports: single‑run on the new cluster after Hive table consistency is verified.

3.3 Business‑Side Migration Order

Migration was driven by task dependency layers:

ODS import tasks: full dual‑run.

DW layer (Hive / SparkSQL): dual‑run after Hive table consistency checks.

DM layer tasks: switched to the new cluster once upstream consistency is confirmed.

After all workflows are fully migrated, ODS and DW tasks are paused in the old environment.

MapReduce, Spark JAR, and script tasks require manual assessment.

Process Guarantees

Tool Stability

Metadata sync bugs are detected by comparing DP metadata between environments and raising alerts on mismatches.

Airflow script inconsistencies after publishing are detected by diffing scripts between the two clusters.

After migration the old DP is paused to prevent dirty export data.

Periodic checks ensure workflow pause status matches the database configuration.

Hive Table Consistency

After each dual‑run execution the system records <Task T, Output Table A> for data‑quality checks. When both old and new tables are ready, a MapReduce job reads the ORC files, uses the user‑defined primary key as the shuffle key, and compares records. Differences are reported to table owners for rapid debugging.

Issues Encountered

Missing -p flag in DistCp caused HDFS file ownership mismatches.

Incorrect DistCp usage overwrote HBase clusterId, breaking cross‑cluster sync.

Hive table statistics (e.g., totalSize) were lost after export/import; resolved by running ANALYZE TABLE … COMPUTE STATISTICS post‑copy.

Heavy MapReduce jobs for Hive consistency checks saturated the cluster; they were moved to off‑peak windows.

Circular workflow dependencies blocked dual‑run → full migration; required refactoring of dependencies.

When upstream tasks were re‑run only in the old DP, downstream tasks in the new DP became stale; safeguards were added to synchronize upstream re‑runs.

MapReduce and Spark JAR tasks lacked automatic dependency detection; users were asked to manually declare input/output Hive tables.

Conclusion

The migration spanned six months (four months preparation, two months execution), moving petabytes of data and 40,000 daily tasks with minimal disruption. Success was achieved through thorough option evaluation, a robust DP‑driven migration framework, staged rollout, and proactive issue resolution. Future work includes tighter platform governance to avoid environment‑specific code coupling, further automation to make migrations transparent to end users, and exploring single‑cluster capabilities for bandwidth‑constrained scenarios.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Migration Big Data workflow Data Consistency Hadoop Distcp DP Platform

Written by

Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.