
Implementing Erasure Coding in HDFS: Migration, Testing, and Data Lifecycle Management at JD

This article details JD's end‑to‑end implementation of HDFS erasure coding, covering the migration from replication to EC, the three‑phase upgrade and rollback process, comprehensive automated testing, a custom data‑lifecycle management system for hot‑warm‑cold data, and multi‑layer integrity safeguards to achieve significant storage cost reduction while maintaining reliability.

Big Data Technology Architecture

To reduce storage costs and improve efficiency, JD's HDFS team ported the EC feature into their production cluster, developed a data-lifecycle management system that automatically handles hot, warm, and cold data, and established a three-dimensional data-verification mechanism to ensure the correctness of EC data.

The EC (Erasure Coding) feature, introduced in HDFS 3.0, stores data with a Reed-Solomon scheme. Under the RS-3-2-1024k policy, a 200 MB file is striped into three data blocks plus two parity blocks, so the file can be reconstructed after the loss of any two blocks. Storage drops from 600 MB under three-way replication to roughly 334 MB, a saving of about 45 %.
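The savings figure can be checked with simple arithmetic; the sketch below uses only the numbers stated above (200 MB file, RS-3-2, three-way replication), and the function names are illustrative:

```python
def ec_storage_mb(file_mb, data_units=3, parity_units=2):
    """Logical storage under Reed-Solomon EC: every data_units
    worth of data carries parity_units worth of parity."""
    return file_mb * (data_units + parity_units) / data_units

def replica_storage_mb(file_mb, replicas=3):
    """Logical storage under plain N-way replication."""
    return file_mb * replicas

file_mb = 200
ec  = ec_storage_mb(file_mb)        # ~333.3 MB for RS-3-2
rep = replica_storage_mb(file_mb)   # 600 MB under 3x replication
saving = 1 - ec / rep               # ~0.44, i.e. roughly 45 %
```

The per-file overhead falls from 200 % (two extra replicas) to about 67 % (two parity units per three data units), which is where the article's 45 % saving comes from.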

Given the extensive modifications required, JD chose to port EC into their existing Hadoop 2.7.1 cluster rather than upgrade to 3.x, defining clear migration principles: transfer code module by module, preserve the community code style, and require all tests to pass.

Quality assurance combined automated integration tests, extensive functional and performance testing, and a custom Ansible-driven cluster-deployment framework, ensuring that the EC integration changed no existing interfaces, commands, or cluster operations.

The upgrade and rollback process leveraged HDFS high availability: NameNode and DataNode instances were upgraded in stages, layout-version compatibility was handled explicitly, and the transition completed without service interruption.
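JD's exact tooling is not public, but the HA-based ordering described above can be sketched as a plan builder: upgrade the standby NameNode first, fail over to it, upgrade the former active, then walk the DataNodes in small batches so the cluster keeps serving throughout. All names and the plan format are illustrative, not JD's actual implementation:

```python
def staged_upgrade(namenodes, datanodes, batch_size=2):
    """Build an HA-friendly rolling-upgrade plan (a sketch; the
    real process must also handle layout-version compatibility
    and a rollback path at every step)."""
    standby, active = namenodes            # assume (standby, active) ordering
    plan = [
        ("upgrade", standby),              # standby first: no client impact
        ("failover", active, standby),     # promote the upgraded standby
        ("upgrade", active),               # then upgrade the former active
    ]
    # DataNodes go in small batches so block availability never drops
    # below what the replication/EC policy can tolerate.
    for i in range(0, len(datanodes), batch_size):
        plan.append(("upgrade_batch", tuple(datanodes[i:i + batch_size])))
    return plan
```

Rolling back follows the same plan in reverse, which is why keeping each step small and independently verifiable matters.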

A bespoke data-lifecycle management system converts warm and cold data to EC storage within the cluster, using a FileConvertCommand, a ConvertTaskBalancer, and atomic file swaps to keep metadata intact and make the conversion transparent to users.
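FileConvertCommand and ConvertTaskBalancer are JD-internal components whose code is not public, but the convert-verify-swap idea can be illustrated with local filesystem operations standing in for HDFS calls. Everything here is a hypothetical sketch: the plain copy stands in for rewriting the file under an EC policy, and the atomic rename stands in for the swap that keeps the path stable for readers:

```python
import hashlib
import os
import shutil
import tempfile

def md5_of(path):
    """Stream a file through MD5 in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def convert_with_atomic_swap(src, ec_dir):
    """Write a copy into the EC-policy directory (stand-in: plain
    copy), verify it byte-for-byte, then atomically swap it into
    place so readers only ever see the old file or the complete
    new one -- never a half-converted state."""
    tmp = os.path.join(ec_dir, os.path.basename(src) + ".converting")
    shutil.copyfile(src, tmp)          # stand-in for re-writing under an EC policy
    if md5_of(tmp) != md5_of(src):     # file-level check before the swap
        os.remove(tmp)
        raise IOError("checksum mismatch, aborting conversion")
    os.replace(tmp, src)               # atomic rename keeps the path stable
```

The key property is that the original file is only replaced after the converted copy has passed verification, so a failure at any earlier step leaves the source untouched.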

To safeguard data integrity, JD implemented multi‑level verification: file‑level MD5 checks, block‑level parity reconstruction using CodecUtil, and a real‑time block‑level monitoring system that detects and reports any checksum mismatches.
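CodecUtil wires up the actual Reed-Solomon coders in Hadoop, which are too involved to reproduce here. As a toy stand-in for the block-level idea, the sketch below uses single-block XOR parity, the degenerate one-parity case: recomputed parity that disagrees with stored parity signals silent corruption, and the same XOR recovers one missing block. All function names are illustrative:

```python
def xor_parity(blocks):
    """XOR equal-length data blocks into one parity block (a
    simplified stand-in for the Reed-Solomon math)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def verify_stripe(data_blocks, parity_block):
    """Recompute parity and compare with the stored parity; a
    mismatch means some block in the stripe is corrupt."""
    return xor_parity(data_blocks) == parity_block

def reconstruct_missing(surviving_blocks, parity_block):
    """With exactly one data block lost, XOR of the parity and
    the surviving blocks recovers it exactly."""
    return xor_parity(surviving_blocks + [parity_block])
```

Real RS-3-2 extends this to two parity blocks and tolerates two losses, but the verification loop is the same shape: recompute, compare, alert on mismatch.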

Overall, the project delivered a stable EC-enabled HDFS platform that now stores hundreds of petabytes of tiered data, saving the equivalent of thousands of servers, and the team has contributed numerous patches back to the Hadoop community.

Tags: Big Data · Testing · Storage Optimization · Erasure Coding · HDFS · Data Lifecycle
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
