Understanding HDFS EditLog Format and Quorum Journal Manager Recovery Process
This article explains the HDFS EditLog file structure, the design of the Quorum Journal Manager for high‑availability, the write‑path optimizations such as batch flushing and double‑buffering, and the detailed Multi‑Paxos based recovery algorithm including isolation, segment selection, prepare and accept phases, and handling journal node failures.
Background – HDFS NameNode records write operations in an EditLog. In a single‑node setup the log and local image allow recovery after a crash, but in HA mode a single log machine becomes a single point of failure, leading to data inconsistency.
File format – EditLog consists of a header (version and transaction marker) followed by entries: 1‑byte operation type, 4‑byte length, 8‑byte transaction ID, the operation payload, and a 4‑byte checksum, ending with a transaction marker.
Journal Node cluster – QJM (Quorum Journal Manager) uses a three‑node JournalNode cluster based on Paxos; a write is considered successful when a majority (quorum) acknowledges it, enabling asynchronous replication.
Write path – Active NameNode writes logs to Journal Nodes via long‑lived RPC, while Standby NameNode synchronizes finalized logs via HTTP. Optimizations include batch disk flushing (similar to MySQL group commit) and a double‑buffer scheme (bufCurrent for incoming logs, bufReady for flushing) to improve throughput.
Recovery process – After an Active NameNode crash, the Standby must recover by isolating the old leader (newEpoch), selecting a reliable segment from Journal Nodes, and executing a Multi‑Paxos based recovery consisting of PrepareRecovery and AcceptRecovery phases.
Isolation (newEpoch) – The new epoch is computed as the maximum epoch among Journal Nodes plus one; Journal Nodes reject writes from the old leader with a smaller epoch.
Selecting a recovery source – The algorithm prefers existing segments, checks startTxid consistency, favors finalized over in‑progress segments, validates epoch and length, and finally chooses the longest segment when epochs match.
PrepareRecovery (P1) – The active node sends a proposal to all Journal Nodes, gathers responses containing segment existence, status, committedTxnId, lastWriterEpoch, and AcceptedInEpoch. The majority response determines the segment to recover.
AcceptRecovery (P2) – An accept request is sent for the chosen segment. Journal Nodes verify the epoch, ensure the segment contains the required transactions, and check whether the previous Paxos round completed successfully. If the local segment is missing or mismatched, the node synchronizes data from peers.
Finalization – Once a majority of Journal Nodes acknowledge the accept, the recovery is finalized by renaming the segment files so the NameNode can read them.
Journal Node failures – The system tolerates a single Journal Node crash; the node is marked outOfSync and resumes receiving logs after the next startLogSegment RPC when it becomes healthy again.
Author – Peng Rongxin, architect at Shanghai Oudian Cloud Information Technology Co., focuses on distributed storage and concurrency.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
High Availability Architecture
Official account for High Availability Architecture.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
