Implementing Erasure Coding in HDFS: Migration, Testing, and Data Lifecycle Management at JD
This article details JD's end‑to‑end implementation of HDFS erasure coding, covering the migration from replication to EC, the three‑phase upgrade and rollback process, comprehensive automated testing, a custom data‑lifecycle management system for hot‑warm‑cold data, and multi‑layer integrity safeguards to achieve significant storage cost reduction while maintaining reliability.
To reduce storage costs and improve efficiency, JD's HDFS team ported the EC feature into its production clusters, developed a data‑lifecycle management system that automatically handles hot, warm, and cold data, and established a three‑dimensional data‑verification mechanism to ensure the correctness of EC data.
The EC (erasure coding) feature, introduced in HDFS 3.0, can use the RS‑3‑2‑1024k policy, which stripes a file into 1024 KB cells across three data blocks and two parity blocks, tolerating the loss of any two blocks in a stripe. For a 200 MB file, raw storage drops from 600 MB under three‑way replication to about 334 MB, a saving of roughly 44 %.
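The storage arithmetic behind that claim is simple to check. A minimal sketch (function names are illustrative, not from the article):

```python
# Storage cost of 3-replica vs RS-3-2 erasure coding, in MB.
# Numbers follow the article's 200 MB example.

def replicated_size(file_mb: float, replicas: int = 3) -> float:
    """Total raw storage under plain replication."""
    return file_mb * replicas

def ec_size(file_mb: float, data_units: int = 3, parity_units: int = 2) -> float:
    """Total raw storage under an RS(data, parity) scheme:
    every data byte carries (data + parity) / data overhead."""
    return file_mb * (data_units + parity_units) / data_units

rep = replicated_size(200)   # 600 MB under three-way replication
ec = ec_size(200)            # ~333.3 MB under RS-3-2
saving = 1 - ec / rep        # ~0.444, i.e. roughly 44 % saved
print(rep, round(ec, 1), round(saving * 100, 1))
```

The same formula shows why wider schemes such as RS‑6‑3 or RS‑10‑4 save even more: overhead shrinks toward 1× as the data‑to‑parity ratio grows.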
Given the extensive modifications involved, JD chose to port EC into its existing Hadoop 2.7.1 cluster rather than upgrade to 3.x, following clear migration principles: module‑by‑module code transfer, preserving the community code style, and requiring all tests to pass.
Quality assurance combined automated integration tests, extensive functional and performance testing, and a custom Ansible‑driven cluster‑deployment framework, verifying that EC integration does not alter existing interfaces, commands, or cluster operations.
The upgrade and rollback process leveraged HDFS high availability, performing a staged upgrade of NameNode and DataNode instances while handling layout‑version compatibility and ensuring a seamless transition without service interruption.
A bespoke data‑lifecycle management system was built to convert warm/cold data to EC storage in‑cluster, using a FileConvertCommand, a ConvertTaskBalancer, and atomic file swaps to maintain metadata and user transparency.
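The convert‑then‑swap idea can be sketched on a local filesystem. This is a hedged illustration, not JD's implementation: `convert_to_ec` is a hypothetical stand‑in for the real re‑encode step, and `os.replace` plays the role of HDFS's overwrite rename in making the swap atomic.

```python
import os
import shutil

def convert_to_ec(src: str, dst: str) -> None:
    """Hypothetical stand-in for the real EC re-encode step;
    here we just copy the bytes so the sketch is runnable."""
    shutil.copyfile(src, dst)

def convert_with_atomic_swap(path: str) -> None:
    """Write the EC copy to a temporary name, then atomically
    replace the original so readers never observe a partial file."""
    tmp = path + ".__ec_tmp__"
    convert_to_ec(path, tmp)
    # An integrity check on tmp would run here, before the swap.
    os.replace(tmp, path)  # atomic on POSIX: old name now maps to the EC copy
```

The key property is that the user‑visible path never disappears or dangles: at every instant it refers either to the old replicated copy or to the fully written EC copy.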
To safeguard data integrity, JD implemented multi‑level verification: file‑level MD5 checks, block‑level parity reconstruction using CodecUtil, and a real‑time block‑level monitoring system that detects and reports any checksum mismatches.
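The file‑level layer of such a scheme reduces to comparing digests of the source and the converted copy. A minimal sketch, assuming MD5 as in the article (function names are illustrative):

```python
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_conversion(original: str, converted: str) -> bool:
    """File-level check: the EC copy must be byte-for-byte
    identical to the source before the swap is allowed."""
    return md5_of(original) == md5_of(converted)
```

The block‑level and real‑time layers described above operate below this check, catching corruption that a one‑time file digest would miss after conversion.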
Overall, the project delivered a stable EC‑enabled HDFS platform that now stores hundreds of petabytes of tiered data, saving thousands of servers, and the team contributed numerous patches back to the Hadoop community.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies