How to Migrate HBase and HDFS Clusters Safely Without Downtime
This guide details a step‑by‑step migration plan for HBase and HDFS clusters, covering background, high‑availability architecture, role assignments, expansion and shrinkage of ZooKeeper and JournalNode, NameNode and DataNode migration, rolling restarts, and common upgrade pitfalls.
Background
As a business grows and technology evolves, HBase/Hadoop clusters may need to be migrated for cost control, hardware upgrades, or geographic relocation. Such a migration must carefully handle technical details, business continuity, and data consistency.
Challenges
To avoid a single point of failure at the HDFS NameNode and improve availability, HDFS must adopt a multi‑NameNode (MultiNN) architecture with at least three NameNode nodes during the migration.
Roles Involved
ZooKeeper: 5 nodes
HBase Master (HMaster): 1 node
HBase RegionServer: n nodes
HBase Thrift‑Server: n nodes
NameNode (HA): 2 nodes
JournalNode: 5 nodes
DFS‑Router: 2 nodes
ZKFC: 2 nodes
Migration Process
Step 1: Expand ZooKeeper
Prerequisites: A ZooKeeper ensemble needs a strict majority (quorum) of its nodes alive to elect a leader and serve requests; if more than half of the nodes die, no leader can be elected and HBase stops working.
Expansion: Add 7 new ZooKeeper nodes to the existing cluster (5 old + 7 new) to avoid quorum loss during later shrinkage.
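Concretely, during the expansion every ensemble member's zoo.cfg lists all twelve servers. A minimal sketch, assuming placeholder hostnames (zk-old-1…zk-new-7) and the default ZooKeeper quorum and election ports:

```properties
# zoo.cfg fragment shared by all ensemble members (hostnames are placeholders).
# 2888 is the quorum port, 3888 the leader-election port.
server.1=zk-old-1:2888:3888
server.2=zk-old-2:2888:3888
server.3=zk-old-3:2888:3888
server.4=zk-old-4:2888:3888
server.5=zk-old-5:2888:3888
server.6=zk-new-1:2888:3888
server.7=zk-new-2:2888:3888
server.8=zk-new-3:2888:3888
server.9=zk-new-4:2888:3888
server.10=zk-new-5:2888:3888
server.11=zk-new-6:2888:3888
server.12=zk-new-7:2888:3888
```

Roll the new configuration out by restarting followers one at a time and the leader last, so a quorum is held at every point.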
Step 2: Expand JournalNode
Modify configuration files on all NameNodes (nn1, nn2) and on each JournalNode (hdfs‑site.xml) to include the new JournalNode information.
Restart each JournalNode sequentially, then restart the standby NameNode, perform an active‑standby switch, and finally restart the other NameNode so the new JournalNode configuration is recognized.
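The key setting is the shared-edits URI. A hedged hdfs-site.xml sketch, assuming a nameservice named mycluster, placeholder JournalNode hostnames, and the default JournalNode RPC port 8485:

```xml
<!-- hdfs-site.xml fragment on nn1, nn2 and every JournalNode; hostnames are placeholders -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn-old-1:8485;jn-old-2:8485;jn-new-1:8485;jn-new-2:8485;jn-new-3:8485/mycluster</value>
</property>
```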
Step 3: Migrate NameNode Primary Nodes
Prerequisite: MultiNN (one active, multiple standby) must be in place so that at least two NameNodes remain in service throughout the migration.
Update HA configuration to add a standby node in the new cluster.
Copy the old standby’s fsimage and editlog to the new node’s directory.
Refresh configuration on all DataNodes.
Verify that DataNodes report to the new standby and that fsimage updates propagate to the active node.
Repeat for each NameNode until all are migrated.
This keeps at least two NameNodes in service at all times, eliminating single‑point risk.
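Assuming a standard QJM-based HA setup, the steps above map roughly onto the following commands. This is a sketch, not the exact procedure used here; node IDs (nn3), hostnames, and the DataNode IPC port 50020 are placeholders.

```shell
# On the new standby machine: copy fsimage/edit logs from the current active NameNode.
hdfs namenode -bootstrapStandby

# Start the new standby NameNode daemon.
hadoop-daemon.sh start namenode

# Ask a DataNode to re-read its NameNode list without a restart (repeat per DataNode).
hdfs dfsadmin -refreshNamenodes dn-host-1:50020

# Confirm the new node came up as standby.
hdfs haadmin -getServiceState nn3
```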
Step 4: Migrate DataNode (DN)
Expand the new cluster with additional DN nodes.
Decommission old DN nodes gradually; data automatically rebalances to the new machines.
Decommissioning re‑replicates blocks using the DataNodes' own bandwidth, so no extra compute resources are needed.
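Decommissioning is driven by the NameNode's exclude file. A hedged sketch, where the exclude file path and hostname are placeholders (the real path is whatever dfs.hosts.exclude points to):

```shell
# Add the DataNode being retired to the exclude file (path is a placeholder).
echo "old-dn-1" >> /etc/hadoop/conf/dfs.exclude

# Tell the NameNode to re-read its host lists; block re-replication starts.
hdfs dfsadmin -refreshNodes

# Poll until the node reports "Decommissioned"; only then stop its daemon.
hdfs dfsadmin -report
```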
Step 5: RegionServer Rolling Restart
Add Hadoop client HA configuration on new machines (including router address if applicable).
Update ZooKeeper configuration on new machines.
Start RegionServer services on new machines and verify stability.
Stop old RegionServer services one by one, monitoring for issues and rolling back if necessary.
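A hedged sketch of the rolling swap, assuming $HBASE_HOME is set and hostnames are placeholders; graceful_stop.sh drains regions off a server before stopping it:

```shell
# On each new machine: start a RegionServer and let it join the cluster.
$HBASE_HOME/bin/hbase-daemon.sh start regionserver

# On each old machine: move regions off, then stop the RegionServer.
$HBASE_HOME/bin/graceful_stop.sh old-rs-1.example.com
```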
Step 6: Replace HBase Master
Add Hadoop client HA configuration and updated ZooKeeper settings on new machines.
Stop the Master service on old machines.
Start the Master service on new machines.
Monitor for problems and roll back immediately if needed.
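In command form, the swap described above looks roughly like this (hostnames are placeholders):

```shell
# On the old machine: stop the active HMaster.
$HBASE_HOME/bin/hbase-daemon.sh stop master

# On the new machine: start the HMaster; it acquires the active lock in ZooKeeper.
$HBASE_HOME/bin/hbase-daemon.sh start master
```

In practice a backup master can also be started on the new machine before stopping the old one, so failover is immediate.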
Step 7: Shrink ZooKeeper
After migration, retain only the 5 new ZooKeeper nodes and adjust configurations on NameNode, DataNode, HMaster, and RegionServer.
Step 8: Shrink JournalNode
After migration, keep only the 5 new JournalNode nodes and modify the relevant NameNode configurations.
Step 9: RegionServer Rolling Restart (Detailed)
Seamlessly move regions off a RegionServer to other machines, then move the server to a different RSGroup; clients are barely affected. <code>move_servers_rsgroup 'dest',['server1:port']</code>
Once all regions are relocated, restart the machine to apply new configuration.
Return the restarted RegionServer to its original RSGroup; after load balancing, service resumes normally.
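Inside the HBase shell, the whole cycle looks roughly like this (group names, the hostname, and the RegionServer RPC port are placeholders):

```shell
# Run inside the HBase shell.
# 1. Move the server to a temporary group; its regions migrate off transparently.
move_servers_rsgroup 'migrating', ['rs1.example.com:16020']
# 2. Restart the machine with the new configuration, then move it back.
move_servers_rsgroup 'default', ['rs1.example.com:16020']
# 3. Rebalance the original group so regions flow back onto the server.
balance_rsgroup 'default'
```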
Step 10: Restart All NameNode Nodes and Switch Active/Standby
Perform a coordinated restart of all NameNode instances and execute an active‑standby role switch.
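With haadmin, the restart-and-switch sketch looks like this (NameNode IDs nn1/nn2 are placeholders; restart the standby first so an active node is always up):

```shell
# On the standby: restart it with the final configuration.
hadoop-daemon.sh stop namenode && hadoop-daemon.sh start namenode

# Fail over so the freshly restarted node becomes active.
hdfs haadmin -failover nn1 nn2

# On the former active (now standby): restart it, then verify the final states.
hadoop-daemon.sh stop namenode && hadoop-daemon.sh start namenode
hdfs haadmin -getServiceState nn2
```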
Upgrade Issues and Optimizations
1. Refresh Node for DataNode
(1) Should refresh node commands target only the new NameService to reduce reporting overhead?
(2) Currently refresh node is asynchronous; would a synchronous, blocking execution be more reliable?
Key log messages indicating command success:
Refresh request received for nameservices
Starting to offer service
Beginning handshake with NN
Successful registration log:
Successfully registered with NN
Errors caused by refresh:
Invalid host name: local host is: (unknown); destination host is …
Namenode for *** remains unresolved for ID null
Initialization failed for Block pool
2. Some Commands Fail
Prerequisite: hdfs://xxxxx resolves to the router’s LVS.
(1) fsck fails because it accesses the NN web port (50070) which is not routed; the LVS must include web‑port mapping.
(2) After upgrade, getReplicatedBlockStats does not work because the method expects the original HDFS domain, not the router domain; the router does not implement this interface.
(3) balance also fails with UnsupportedOperationException: isUpgradeFinalized.
(4) balance aborts because an illegal underscore character appears in the URI.
<code>2024-05-30 02:00:46,832 ERROR org.apache.hadoop.hdfs.server.balancer.Balancer: Exiting balancer due an exception
java.lang.IllegalArgumentException: Illegal character in hostname at index 10: hdfs://xxxxx
at org.apache.hadoop.hdfs.DFSUtil.createUri(DFSUtil.java:1232)
at org.apache.hadoop.hdfs.DFSUtil.getNameServiceUris(DFSUtil.java:820)
at org.apache.hadoop.hdfs.DFSUtil.getInternalNsRpcUris(DFSUtil.java:791)
at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:820)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:968)</code>
Note: Monitor hadoop‑work‑balancer‑*.out logs for detailed information.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.