Applying Erasure Coding in HDFS: Strategies, Performance, and Repair Techniques
This article explains how Zhihu adopted HDFS erasure coding to reduce storage costs, outlines cold‑hot file tiering policies, describes the EC conversion workflow and the custom EC Worker tool, and details methods for detecting and repairing damaged EC files in a Hadoop environment.
1. Introduction
Erasure Coding (EC) is an encoding technique used to ensure data reliability, commonly in RAID and communication systems; EC files can be decoded even when partially damaged.
In Hadoop 2, HDFS relied on triple replication, storing every block three times (3× the raw data). Hadoop 3 introduced EC policies such as RS-6-3-1024k, which store six data units plus three parity units per stripe (1.5× the raw data), cutting storage by up to 50 % compared with replication.
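The 50 % figure follows directly from the coding overhead; a quick back-of-the-envelope comparison (plain arithmetic, no Hadoop required):

```python
# Storage multiplier: bytes stored on disk per byte of raw data.
def storage_multiplier_replication(replicas: int = 3) -> float:
    return float(replicas)

def storage_multiplier_rs(data_units: int, parity_units: int) -> float:
    # An RS(d, p) stripe stores d data units plus p parity units.
    return (data_units + parity_units) / data_units

rep = storage_multiplier_replication(3)   # 3.0x raw data
ec = storage_multiplier_rs(6, 3)          # 1.5x raw data for RS-6-3
savings = 1 - ec / rep                    # fraction saved vs replication
print(rep, ec, savings)                   # 3.0 1.5 0.5 -> 50% savings
```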
2. EC Strategy
2.1 Limitations
While EC saves storage, it brings performance overhead: reads require multiple blocks from different DataNodes, increasing request volume, and EC files do not support append or truncate, making them unsuitable for hot or frequently updated data.
Under low concurrency, EC read/write performance is worse than replication; under high concurrency, EC's striped layout spreads each read across many DataNodes, which can reduce overall latency.
Consequently, we prefer EC for colder data.
2.2 Cold‑Hot File Tiering
File tiering is based on three attributes: creation time, access time, and access frequency. Files created more than i days ago and read fewer than k times in the last j days are marked cold; all others are hot. The thresholds i, j, and k are configurable per directory or Hive partition.
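The rule can be sketched as a small predicate; this is an illustrative interpretation in which k is the read-count threshold over the trailing j-day window (the function names and signature are hypothetical, not part of the actual pipeline):

```python
from datetime import datetime, timedelta

def is_cold(created: datetime, reads_last_j_days: int,
            now: datetime, i: int, j: int, k: int) -> bool:
    """Illustrative cold/hot rule: a file is cold if it was created more
    than i days ago and was read fewer than k times in the last j days.
    (reads_last_j_days is assumed to be pre-aggregated over the j-day
    window from audit logs; i, j, k would be per-directory config.)"""
    old_enough = now - created > timedelta(days=i)
    rarely_read = reads_last_j_days < k
    return old_enough and rarely_read

now = datetime(2023, 6, 1)
print(is_cold(datetime(2023, 1, 1), 0, now, i=90, j=30, k=3))    # True: old, unread
print(is_cold(datetime(2023, 5, 20), 50, now, i=90, j=30, k=3))  # False: recent, hot
```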
Creation time is extracted from the fsimage; access frequency is derived from NameNode audit logs (only getBlockLocations calls count as reads).
We ingest fsimage and audit logs via Kafka and Flink into Hive tables to compute EC‑eligible files.
2.3 File Conversion to EC
HDFS does not provide a direct conversion, so we use cp or distcp to rewrite files into a temporary EC‑enabled directory, then replace the original.
1. Create a temporary directory with the EC policy applied.
2. Copy the files into it, preserving metadata (owner, group, permissions).
3. Replace the original files with the copies.
Conversion can be performed at file granularity or directory granularity; we chose directory granularity to leverage distcp without additional complexity.
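A minimal sketch of the directory-level conversion, expressed as the shell commands an orchestration script might issue (the paths, the `.ec_tmp_` prefix, and the non-atomic delete-then-rename swap are illustrative; a real tool such as EC Worker also verifies the copy before the swap):

```python
def ec_conversion_commands(src_dir: str, policy: str = "RS-6-3-1024k") -> list:
    """Build the command sequence for converting one directory to EC:
    create a temp dir, apply the EC policy, distcp-copy with metadata
    preserved, then swap the temp dir in for the original."""
    parent, name = src_dir.rsplit("/", 1)
    tmp_dir = f"{parent}/.ec_tmp_{name}"   # same nameservice as the source
    return [
        f"hdfs dfs -mkdir {tmp_dir}",
        f"hdfs ec -setPolicy -path {tmp_dir} -policy {policy}",
        # -pugp preserves user, group, and permissions; -skipcrccheck is
        # needed because replicated and EC files report different checksums
        f"hadoop distcp -pugp -skipcrccheck -update {src_dir} {tmp_dir}",
        f"hdfs dfs -rm -r {src_dir}",
        f"hdfs dfs -mv {tmp_dir} {src_dir}",
    ]

for cmd in ec_conversion_commands("/a/b/c"):
    print(cmd)
```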
3. EC Data Conversion Tool
We built an EC Worker based on distcp with the following capabilities:
Automatic conversion by subscribing to EC‑policy‑generated directory lists.
User impersonation to submit distcp tasks with the correct owner.
Precise concurrency control for simultaneous distcp tasks and per‑task map slots.
Automatic fault tolerance and alerting on failures.
Pre‑ and post‑conversion file verification.
Automatic generation of repair commands for damaged EC files.
EC Worker runs on a modest 4‑core, 16 GB node and can handle thousands of concurrent distcp tasks.
When using distcp for EC conversion, CRC checks must be disabled (replicated and EC files report different file checksums, so distcp's built-in verification would always fail) and custom verification logic added instead. For federated HDFS clusters, the temporary directory should reside in the same nameservice as the original so the final rename stays within one namespace, e.g., using the prefix .ec_tmp_ (e.g., /a/b/.ec_tmp_c ).
4. EC Damaged File Repair
4.1 Detecting Damage
EC files are stored as stripes; for RS-6-3 each stripe group contains 9 blocks (6 data, 3 parity). To detect corruption, we read each stripe group, recompute the parity units (treating them as erasures and decoding with Hadoop's org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder ), and compare the result with the stored parity. Any mismatch indicates a damaged EC file.
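The recompute-and-compare structure can be illustrated with a toy single-parity code standing in for Reed–Solomon (the real check decodes over GF(2^8) with RSRawDecoder; byte-wise XOR here is only a stand-in to show the shape of the check):

```python
def xor_parity(blocks: list) -> bytes:
    """Toy parity: byte-wise XOR of all data blocks (a stand-in for the
    three RS parity units of a real RS-6-3 stripe)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for idx, byte in enumerate(blk):
            out[idx] ^= byte
    return bytes(out)

def stripe_is_damaged(data_blocks: list, stored_parity: bytes) -> bool:
    """Recompute parity from the data units and compare with the stored
    parity; any mismatch means the stripe is damaged somewhere."""
    return xor_parity(data_blocks) != stored_parity

data = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity = xor_parity(data)
print(stripe_is_damaged(data, parity))      # False: stripe intact
data[1] = b"\xff\x04"                       # corrupt one data unit
print(stripe_is_damaged(data, parity))      # True: damage detected
```

Note that this check only says *whether* the stripe is damaged, not *which* block is bad; localization is the next step.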
4.2 Locating Damaged Blocks
For a corrupted stripe, we mark candidate blocks as erased, decode to recover them from the remaining units, and compare each recovered block with the stored original to pinpoint which blocks are damaged. RS-6-3 tolerates at most three erasures, so if more than three blocks in a stripe group are damaged, the file is unrecoverable without additional techniques.
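The erase-decode-compare loop can be sketched with the decoder abstracted away (in the real system the decoder is RSRawDecoder; here a trivially mirrored (2,2) "code", where each parity unit is a copy of a data unit, stands in so the logic is runnable):

```python
def locate_damaged(stripe: list, decode, n_data: int) -> list:
    """Erase each data unit in turn, ask `decode` (the erasure decoder)
    to reconstruct it from the surviving units, and flag units whose
    reconstruction disagrees with what is stored on disk."""
    damaged = []
    for i in range(n_data):
        recovered = decode(stripe, erased=i)
        if recovered != stripe[i]:
            damaged.append(i)
    return damaged

# Toy decoder for a mirrored (2,2) code: parity unit i is a copy of
# data unit i. This is only a stand-in for RS decoding.
def mirror_decode(stripe, erased):
    return stripe[2 + erased]

stripe = [b"\x01", b"\x02", b"\x01", b"\x02"]   # 2 data + 2 "parity" units
stripe[1] = b"\xff"                             # damage data unit 1
print(locate_damaged(stripe, mirror_decode, n_data=2))   # [1]
```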
4.3 Fast Block Location
DataNode provides the RPC org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol#getBlockLocalPathInfo , which returns the on‑disk path of a block. Access requires the user to be authorized via the dfs.block.local-path-access.user property (e.g., set to hdfs ).
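Authorizing the lookup is a one-property change in hdfs-site.xml on the DataNodes (the user name here is an example; it must match the account the tooling runs as):

```xml
<!-- hdfs-site.xml on each DataNode -->
<property>
  <name>dfs.block.local-path-access.user</name>
  <value>hdfs</value>
</property>
```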
EC Worker integrates this lookup and automatically generates commands to delete the damaged block, triggering reconstruction.
5. Summary
Zhihu migrated roughly 25 % of its files to EC, saving about 12 % of total storage. EC works well when block reconstruction is infrequent; however, older Hadoop versions contain many bugs during large‑scale rebuilds. Upgrading to a newer Hadoop release (e.g., 3.3.4) or applying relevant community patches is strongly recommended.
6. Appendix
List of community patches applied to Hadoop 3.2.1 (links omitted for brevity).
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.