Understanding HDFS High Availability: Roles, Metadata Persistence, and Failover
This article explains the core concepts of HDFS High Availability, detailing primary and standby NameNode roles, failover mechanisms, shared storage systems, metadata persistence via EditLog and FsImage, and the processes for merging and synchronizing data across active and standby nodes.
HDFS HA Overview
1.1 Basic Concepts
1.2 Role Introduction
(1) Primary‑Standby NameNode:
Active NameNode provides services to clients; the standby NameNode continuously synchronizes metadata from the active node for failover.
When the active NameNode's state changes, it writes the updates to a shared storage system, allowing the standby to merge them into its memory.
All DataNodes send heartbeat and block reports to both NameNodes simultaneously.
(2) Failover Methods:
Manual failover: executed via commands, useful during HDFS upgrades.
Automatic failover: implemented with ZooKeeper. The ZooKeeper Failover Controller (ZKFC) registers each NameNode, monitors health, and when the active node crashes, ZKFC grants a lock to one standby, which then becomes active.
(3) Shared Storage System:
JournalNode cluster: writes data to all JournalNodes; a majority write success is considered successful. Typically an odd number of JournalNodes (e.g., three) provides fault tolerance, allowing one node to fail.
NFS: a remote file system mounted on each NameNode. Only a single shared edits directory is supported, so its availability becomes a single point of failure. Redundant network paths and storage (disk, network, power) are recommended, preferably using a dedicated NAS device.
2 Metadata Persistence
2.1 Basic Components
EditLog: a transaction log that records every modification to the filesystem metadata.
FsImage: a snapshot of the entire metadata state stored on disk; it is not updated in real time but reflects the filesystem's directory and inode information.
Both EditLog and FsImage are saved on local disks, as illustrated below:
2.2 Characteristics
EditLog offers completeness and low data loss but has slower recovery and can grow large.
FsImage provides fast recovery and size comparable to in‑memory data, but it is not real‑time and may lose recent changes.
NameNode combines both by periodically rolling incremental EditLog entries into FsImage, keeping FsImage up‑to‑date while limiting EditLog size.
2.3 NameNode Startup Loading Process
(1) During HDFS installation, the filesystem is formatted, creating an empty FsImage.
(2) When a NameNode starts, it reads the EditLog and FsImage from disk.
(3) It applies all transactions from the EditLog onto the in‑memory FsImage.
(4) The updated FsImage is written back to local disk.
(5) The old EditLog is deleted because its transactions are now reflected in the new FsImage.
2.4 Merge Process Under HA Architecture
(1) After HA configuration, all client updates are written to the shared JournalNode directory.
(2) Active and standby NameNodes synchronize edits from the JournalNodes.
(3) The StandbyCheckpointer in the standby node periodically checks whether merge conditions are met and, if so, merges FsImage and edits.
(4) After merging, the standby uploads the new FsImage to the active NameNode's directory.
(5) The active NameNode receives the latest FsImage and cleans up old FsImage and edits files.
(6) Both active and standby now hold the latest FsImage and edits, ensuring consistency.
Note: Hadoop 1.x does not support HA; in that version, the Secondary NameNode performs the FsImage‑EditLog merge.
2.5 Directory Contents of Each Role
(1) Active NameNode – ./namenode/current/ contains both edits and FsImage files.
(2) Standby NameNode – ./namenode/current/ contains only FsImage; leftover edits are from previous startups.
(3) JournalNodes – ./journal/current/ contains only edits files, identical to the active NameNode's edits.
3 Related Concepts
Safemode
(1) After startup, the NameNode enters a special Safemode state.
(2) While in Safemode, the NameNode does not replicate data blocks.
(3) The NameNode receives heartbeat and block reports from all DataNodes.
(4) When a block's replica count reaches a configurable minimum, the block is considered safely replicated.
(5) Once a configurable percentage of blocks are safely replicated (plus an additional 30‑second wait), the NameNode exits Safemode.
(6) The NameNode then identifies under‑replicated blocks and initiates replication to other DataNodes.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
