
How to Migrate HBase and HDFS Clusters Safely Without Downtime

This guide details a step‑by‑step migration plan for HBase and HDFS clusters, covering background, high‑availability architecture, role assignments, expansion and shrinkage of ZooKeeper and JournalNode, NameNode and DataNode migration, rolling restarts, and common upgrade pitfalls.

360 Zhihui Cloud Developer

Background

As the business grows and technology evolves, HBase/Hadoop clusters may need to be migrated for cost control, hardware upgrades, or geographic relocation. Such a migration must preserve business continuity and data consistency while getting many technical details right.

Challenges

To avoid a single point of failure on the HDFS NameNode and stay highly available throughout the migration, HDFS must adopt a multi‑NameNode (MultiNN) architecture with at least three NameNodes, so that two NameNodes keep running while any one of them is being moved.

Roles Involved

ZooKeeper: 5 nodes

HBase Master (HMaster): 1 node

HBase RegionServer: n nodes

HBase Thrift‑Server: n nodes

NameNode (HA): 2 nodes

JournalNode: 5 nodes

DFS‑Router: 2 nodes

ZKFC: 2 nodes

Migration Process

Step 1: Expand ZooKeeper

Prerequisites: ZooKeeper needs a majority quorum. If more than half of the nodes are down, no leader can be elected and HBase stops; as long as a majority of the ensemble survives, a leader can be elected and HBase keeps running.

Expansion: Add 7 new ZooKeeper nodes to the existing 5‑node ensemble (5 old + 7 new = 12), so that when the 5 old nodes are removed later, the remaining 7 of 12 still form a majority and quorum is never lost.
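The expansion boils down to enumerating the enlarged ensemble in every node's zoo.cfg. A minimal sketch, assuming hypothetical host names zk1..zk12 and a scratch output path:

```shell
# Sketch: build the enlarged ensemble list for zoo.cfg
# (hosts zk1..zk12 and the output path are hypothetical; adapt ports to your site).
cfg=/tmp/zoo.cfg.servers
: > "$cfg"
for i in $(seq 1 12); do                             # 5 old + 7 new members
  echo "server.$i=zk$i.example.com:2888:3888" >> "$cfg"
done
cat "$cfg"
```

Each new node additionally needs a myid file containing its id, and the nodes should be restarted one at a time, new followers first.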

Step 2: Expand JournalNode

Modify configuration files on all NameNodes (nn1, nn2) and on each JournalNode (hdfs‑site.xml) to include the new JournalNode information.

Restart each JournalNode in turn, then restart the standby NameNode, perform an active‑standby failover, and finally restart the other NameNode so that both recognize the new JournalNode configuration.
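The key edit is the shared‑edits URI in hdfs-site.xml, which must enumerate the enlarged JournalNode quorum. A sketch that writes the fragment, assuming hypothetical hosts jn1..jn7 and nameservice "mycluster":

```shell
# Generate the hdfs-site.xml fragment listing all 7 JournalNodes
# (hosts jn1..jn7, port 8485, and nameservice "mycluster" are hypothetical).
cat <<'EOF' > /tmp/shared-edits.xml
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485;jn4:8485;jn5:8485;jn6:8485;jn7:8485/mycluster</value>
</property>
EOF
cat /tmp/shared-edits.xml
```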

Step 3: Migrate NameNode Primary Nodes

Prerequisite: the MultiNN layout (one active, multiple standby) must be in place so that at least two NameNodes remain running at every point in the migration.

Update HA configuration to add a standby node in the new cluster.

Copy the old standby’s fsimage and editlog to the new node’s directory.

Refresh configuration on all DataNodes.

Verify that DataNodes report to the new standby and that fsimage updates propagate to the active node.

Repeat for each NameNode until all are migrated.

This keeps the number of running NameNodes ≥ 2 at all times, eliminating single‑point risk.
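The per‑NameNode sequence can be sketched as a dry run (NameNode id nn3 is hypothetical; on recent Hadoop versions, `hdfs namenode -bootstrapStandby` can replace the manual fsimage/editlog copy):

```shell
# Dry run: each command is only echoed; change `run` to `run() { "$@"; }` to execute.
run() { echo "+ $*"; }
run hdfs namenode -bootstrapStandby      # on the new node: pull fsimage from the active NN
run hdfs --daemon start namenode         # start it as a standby
run hdfs haadmin -getServiceState nn3    # expect "standby" (nn3 is a hypothetical id)
```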

Step 4: Migrate DataNode (DN)

Expand the new cluster with additional DN nodes.

Decommission old DN nodes gradually; data automatically rebalances to the new machines.

Decommissioning uses the DataNodes' own replication bandwidth, so it consumes no extra compute resources.
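Decommissioning is driven by an exclude file plus a refresh. A dry‑run sketch, with a hypothetical exclude‑file path and host name:

```shell
run() { echo "+ $*"; }                          # dry run; swap for real execution
echo "old-dn1.example.com" >> /tmp/dfs.exclude  # the file referenced by dfs.hosts.exclude
run hdfs dfsadmin -refreshNodes                 # NN begins draining the listed DNs
run hdfs dfsadmin -report                       # watch "Decommission in progress" -> "Decommissioned"
```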

Step 5: RegionServer Rolling Restart

Add Hadoop client HA configuration on new machines (including router address if applicable).

Update ZooKeeper configuration on new machines.

Start RegionServer services on new machines and verify stability.

Stop old RegionServer services one by one, monitoring for issues and rolling back if necessary.
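The rolling swap can be sketched with the standard HBase daemon scripts (dry run; the old host name is hypothetical):

```shell
run() { echo "+ $*"; }                     # dry run; swap for real execution
run hbase-daemon.sh start regionserver     # on each new machine, once configs are in place
run graceful_stop.sh old-rs1.example.com   # drains regions off the old RS, then stops it
```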

Step 6: Replace HBase Master

Add Hadoop client HA configuration and updated ZooKeeper settings on new machines.

Stop the Master service on old machines.

Start the Master service on new machines.

Monitor for problems and roll back immediately if needed.

Step 7: Shrink ZooKeeper

After migration, retain only the 5 new ZooKeeper nodes and adjust configurations on NameNode, DataNode, HMaster, and RegionServer.

Step 8: Shrink JournalNode

After migration, keep only the 5 new JournalNode nodes and modify the relevant NameNode configurations.

Step 9: RegionServer Rolling Restart (Detailed)

Seamlessly move the regions off a RegionServer to other machines, then move the server into a different RSGroup; the business remains almost unaware: <code>move_servers_rsgroup 'dest',['server1:port']</code>

Once all regions are relocated, restart the machine to apply new configuration.

Return the restarted RegionServer to its original RSGroup; after load balancing, service resumes normally.
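The whole shuffle can be written as a small hbase‑shell script; here it is generated to a file (the group name 'migrating' and server rs1.example.com:16030 are hypothetical):

```shell
# Generate the hbase-shell script for the RSGroup shuffle; run it with:
#   hbase shell /tmp/rsgroup_move.rb
cat <<'EOF' > /tmp/rsgroup_move.rb
move_servers_rsgroup 'migrating', ['rs1.example.com:16030']
# ...restart rs1 with the new configuration here, then move it back:
move_servers_rsgroup 'default', ['rs1.example.com:16030']
balance_rsgroup 'default'
EOF
cat /tmp/rsgroup_move.rb
```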

Step 10: Restart All NameNode Nodes and Switch Active/Standby

Perform a coordinated restart of all NameNode instances and execute an active‑standby role switch.
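A dry‑run sketch of the sequence, assuming hypothetical HA ids nn1/nn2: restart the standby first, fail over, then restart the former active.

```shell
run() { echo "+ $*"; }                 # dry run; swap for real execution
run hdfs --daemon stop namenode        # on the standby (nn2), then:
run hdfs --daemon start namenode
run hdfs haadmin -failover nn1 nn2     # make nn2 active
run hdfs --daemon stop namenode        # now restart the former active (nn1)
run hdfs --daemon start namenode
run hdfs haadmin -getServiceState nn1  # expect "standby"
```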

Upgrade Issues and Optimizations

1. Refresh Node for DataNode

(1) Should refresh node commands target only the new NameService to reduce reporting overhead?

(2) Currently refresh node is asynchronous; would a synchronous, blocking execution be more reliable?

Key log messages indicating command success:

Refresh request received for nameservices

Starting to offer service

Beginning handshake with NN

Successful registration log:

Successfully registered with NN

Errors caused by refresh:

Invalid host name: local host is: (unknown); destination host is …

Namenode for *** remains unresolved for ID null

Initialization failed for Block pool
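Checking for these markers can be scripted by grepping the DataNode log. A self-contained sketch (the log path and contents are simulated; point `log` at your real DataNode log):

```shell
log=/tmp/datanode.log                          # stand-in for the real DataNode log
printf '%s\n' "Beginning handshake with NN" \
              "Successfully registered with NN" > "$log"
if grep -q "Successfully registered with NN" "$log"; then
  echo "refresh OK"
else                                           # surface the known failure signatures
  grep -E "Invalid host name|remains unresolved for ID|Initialization failed for Block pool" "$log"
fi
```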

2. Some Commands Fail

Prerequisite: hdfs://xxxxx resolves to the router’s LVS.

(1) fsck fails because it talks to the NameNode web port (50070), which the router's LVS does not forward; the LVS must add a web‑port mapping.

(2) After upgrade, getReplicatedBlockStats does not work because the method expects the original HDFS domain, not the router domain; the router does not implement this interface.

(3) balance also fails, with UnsupportedOperationException: isUpgradeFinalized.

(4) balance aborts because the URI contains an underscore, which is an illegal character in a hostname.

<code>2024-05-30 02:00:46,832 ERROR org.apache.hadoop.hdfs.server.balancer.Balancer: Exiting balancer due an exception
java.lang.IllegalArgumentException: Illegal character in hostname at index 10: hdfs://xxxxx
    at org.apache.hadoop.hdfs.DFSUtil.createUri(DFSUtil.java:1232)
    at org.apache.hadoop.hdfs.DFSUtil.getNameServiceUris(DFSUtil.java:820)
    at org.apache.hadoop.hdfs.DFSUtil.getInternalNsRpcUris(DFSUtil.java:791)
    at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:820)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:968)</code>

Note: Monitor hadoop‑work‑balancer‑*.out logs for detailed information.

Big Data · High Availability · HBase · HDFS · Cluster Migration
Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
