Implementing Erasure Coding in HDFS: Migration, Testing, and Data Lifecycle Management at JD
This article details JD's end‑to‑end implementation of HDFS erasure coding, covering the migration from replication to EC, the three‑phase upgrade and rollback process, comprehensive automated testing, a custom data‑lifecycle management system for hot‑warm‑cold data, and multi‑layer integrity safeguards to achieve significant storage cost reduction while maintaining reliability.
To reduce storage costs and improve efficiency, JD's HDFS team ported the EC feature into its production clusters, developed a data‑lifecycle management system that automatically handles hot, warm, and cold data, and established a three‑dimensional data‑verification mechanism to ensure the correctness of EC data.
The EC (erasure coding) feature, introduced in HDFS 3.0, can use the RS‑3‑2‑1024k policy, which stripes a file into 1024 KB cells across three data blocks and two parity blocks, tolerating the loss of any two blocks in a stripe. For a 200 MB file, raw storage drops from 600 MB under three‑way replication to about 334 MB, a saving of roughly 44 %.
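The storage arithmetic behind that claim is simple to check. A minimal sketch (function names are illustrative, not from the article):

```python
# Storage cost of 3-replica vs RS-3-2 erasure coding, in MB.
# Numbers follow the article's 200 MB example.

def replicated_size(file_mb: float, replicas: int = 3) -> float:
    """Total raw storage under plain replication."""
    return file_mb * replicas

def ec_size(file_mb: float, data_units: int = 3, parity_units: int = 2) -> float:
    """Total raw storage under an RS(data, parity) scheme:
    every data byte carries (data + parity) / data overhead."""
    return file_mb * (data_units + parity_units) / data_units

rep = replicated_size(200)   # 600 MB under three-way replication
ec = ec_size(200)            # ~333.3 MB under RS-3-2
saving = 1 - ec / rep        # ~0.444, i.e. roughly 44 % saved
print(rep, round(ec, 1), round(saving * 100, 1))
```

The same formula shows why wider schemes such as RS‑6‑3 or RS‑10‑4 save even more: overhead shrinks toward 1× as the data‑to‑parity ratio grows.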
Given the extensive modifications involved, JD chose to port EC into its existing Hadoop 2.7.1 cluster rather than upgrade to 3.x, following clear migration principles: module‑by‑module code transfer, preserving the community code style, and requiring all tests to pass.
Quality assurance combined automated integration tests, extensive functional and performance testing, and a custom Ansible‑driven cluster‑deployment framework, verifying that EC integration does not alter existing interfaces, commands, or cluster operations.
The upgrade and rollback process leveraged HDFS high availability, performing a staged upgrade of NameNode and DataNode instances while handling layout‑version compatibility and ensuring a seamless transition without service interruption.
A bespoke data‑lifecycle management system was built to convert warm/cold data to EC storage in‑cluster, using a FileConvertCommand, a ConvertTaskBalancer, and atomic file swaps to maintain metadata and user transparency.
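The convert‑then‑swap idea can be sketched on a local filesystem. This is a hedged illustration, not JD's implementation: `convert_to_ec` is a hypothetical stand‑in for the real re‑encode step, and `os.replace` plays the role of HDFS's overwrite rename in making the swap atomic.

```python
import os
import shutil

def convert_to_ec(src: str, dst: str) -> None:
    """Hypothetical stand-in for the real EC re-encode step;
    here we just copy the bytes so the sketch is runnable."""
    shutil.copyfile(src, dst)

def convert_with_atomic_swap(path: str) -> None:
    """Write the EC copy to a temporary name, then atomically
    replace the original so readers never observe a partial file."""
    tmp = path + ".__ec_tmp__"
    convert_to_ec(path, tmp)
    # An integrity check on tmp would run here, before the swap.
    os.replace(tmp, path)  # atomic on POSIX: old name now maps to the EC copy
```

The key property is that the user‑visible path never disappears or dangles: at every instant it refers either to the old replicated copy or to the fully written EC copy.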
To safeguard data integrity, JD implemented multi‑level verification: file‑level MD5 checks, block‑level parity reconstruction using CodecUtil, and a real‑time block‑level monitoring system that detects and reports any checksum mismatches.
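The file‑level layer of such a scheme reduces to comparing digests of the source and the converted copy. A minimal sketch, assuming MD5 as in the article (function names are illustrative):

```python
import hashlib

def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MB chunks so large files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_conversion(original: str, converted: str) -> bool:
    """File-level check: the EC copy must be byte-for-byte
    identical to the source before the swap is allowed."""
    return md5_of(original) == md5_of(converted)
```

The block‑level and real‑time layers described above operate below this check, catching corruption that a one‑time file digest would miss after conversion.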
Overall, the project delivered a stable EC‑enabled HDFS platform that now stores hundreds of petabytes of tiered data, saving thousands of servers, and the team contributed numerous patches back to the Hadoop community.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies