Applying Erasure Coding in HDFS: Strategies, Performance, and Repair Techniques
This article explains how Zhihu adopted HDFS erasure coding to reduce storage costs, outlines cold‑hot file tiering policies, describes the EC conversion workflow and the custom EC Worker tool, and details methods for detecting and repairing damaged EC files in a Hadoop environment.
1. Introduction
Erasure Coding (EC) is an encoding technique used to ensure data reliability, commonly in RAID and communication systems; EC files can be decoded even when partially damaged.
In Hadoop 2, HDFS relied on triple replication, storing every block three times (3× the raw data). Hadoop 3 introduced EC policies such as RS-6-3-1024k, which store six data units plus three parity units per stripe (1.5× the raw data), cutting storage by up to 50 % compared with replication.
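The 50 % figure follows directly from the coding overhead; a quick back-of-the-envelope comparison (plain arithmetic, no Hadoop required):

```python
# Storage multiplier: bytes stored on disk per byte of raw data.
def storage_multiplier_replication(replicas: int = 3) -> float:
    return float(replicas)

def storage_multiplier_rs(data_units: int, parity_units: int) -> float:
    # An RS(d, p) stripe stores d data units plus p parity units.
    return (data_units + parity_units) / data_units

rep = storage_multiplier_replication(3)   # 3.0x raw data
ec = storage_multiplier_rs(6, 3)          # 1.5x raw data for RS-6-3
savings = 1 - ec / rep                    # fraction saved vs replication
print(rep, ec, savings)                   # 3.0 1.5 0.5 -> 50% savings
```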
2. EC Strategy
2.1 Limitations
While EC saves storage, it brings performance overhead: reads require multiple blocks from different DataNodes, increasing request volume, and EC files do not support append or truncate, making them unsuitable for hot or frequently updated data.
Under low concurrency, EC read/write performance is worse than replication; under high concurrency, EC's striped layout spreads each read across many DataNodes, which can reduce overall latency.
Consequently, we prefer EC for colder data.
2.2 Cold‑Hot File Tiering
File tiering is based on three attributes: creation time, access time, and access frequency. Files created more than i days ago and read fewer than k times in the last j days are marked cold; all others are hot. The thresholds i, j, and k are configurable per directory or Hive partition.
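The rule can be sketched as a small predicate; this is an illustrative interpretation in which k is the read-count threshold over the trailing j-day window (the function names and signature are hypothetical, not part of the actual pipeline):

```python
from datetime import datetime, timedelta

def is_cold(created: datetime, reads_last_j_days: int,
            now: datetime, i: int, j: int, k: int) -> bool:
    """Illustrative cold/hot rule: a file is cold if it was created more
    than i days ago and was read fewer than k times in the last j days.
    (reads_last_j_days is assumed to be pre-aggregated over the j-day
    window from audit logs; i, j, k would be per-directory config.)"""
    old_enough = now - created > timedelta(days=i)
    rarely_read = reads_last_j_days < k
    return old_enough and rarely_read

now = datetime(2023, 6, 1)
print(is_cold(datetime(2023, 1, 1), 0, now, i=90, j=30, k=3))    # True: old, unread
print(is_cold(datetime(2023, 5, 20), 50, now, i=90, j=30, k=3))  # False: recent, hot
```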
Creation time is extracted from the fsimage; access frequency is derived from NameNode audit logs (only getBlockLocations calls count as reads).
We ingest fsimage and audit logs via Kafka and Flink into Hive tables to compute EC‑eligible files.
2.3 File Conversion to EC
HDFS does not provide a direct conversion, so we use cp or distcp to rewrite files into a temporary EC‑enabled directory, then replace the original.
1. Create a temporary directory with the EC policy applied.
2. Copy the files into it, preserving metadata (owner, group, permissions).
3. Replace the original files with the copies.
Conversion can be performed at file granularity or directory granularity; we chose directory granularity to leverage distcp without additional complexity.
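A minimal sketch of the directory-level conversion, expressed as the shell commands an orchestration script might issue (the paths, the `.ec_tmp_` prefix, and the non-atomic delete-then-rename swap are illustrative; a real tool such as EC Worker also verifies the copy before the swap):

```python
def ec_conversion_commands(src_dir: str, policy: str = "RS-6-3-1024k") -> list:
    """Build the command sequence for converting one directory to EC:
    create a temp dir, apply the EC policy, distcp-copy with metadata
    preserved, then swap the temp dir in for the original."""
    parent, name = src_dir.rsplit("/", 1)
    tmp_dir = f"{parent}/.ec_tmp_{name}"   # same nameservice as the source
    return [
        f"hdfs dfs -mkdir {tmp_dir}",
        f"hdfs ec -setPolicy -path {tmp_dir} -policy {policy}",
        # -pugp preserves user, group, and permissions; -skipcrccheck is
        # needed because replicated and EC files report different checksums
        f"hadoop distcp -pugp -skipcrccheck -update {src_dir} {tmp_dir}",
        f"hdfs dfs -rm -r {src_dir}",
        f"hdfs dfs -mv {tmp_dir} {src_dir}",
    ]

for cmd in ec_conversion_commands("/a/b/c"):
    print(cmd)
```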
3. EC Data Conversion Tool
We built an EC Worker based on distcp with the following capabilities:
Automatic conversion by subscribing to EC‑policy‑generated directory lists.
User impersonation to submit distcp tasks with the correct owner.
Precise concurrency control for simultaneous distcp tasks and per‑task map slots.
Automatic fault tolerance and alerting on failures.
Pre‑ and post‑conversion file verification.
Automatic generation of repair commands for damaged EC files.
EC Worker runs on a modest 4‑core, 16 GB node and can handle thousands of concurrent distcp tasks.
When using distcp for EC conversion, CRC checks must be disabled (replicated and EC files report different file checksums, so distcp's built-in verification would always fail) and custom verification logic added instead. For federated HDFS clusters, the temporary directory should reside in the same nameservice as the original so the final rename stays within one namespace, e.g., using the prefix .ec_tmp_ (e.g., /a/b/.ec_tmp_c ).
4. EC Damaged File Repair
4.1 Detecting Damage
EC files are stored as stripes; for RS-6-3 each stripe group contains 9 blocks (6 data, 3 parity). To detect corruption, we read each stripe group, recompute the parity units (treating them as erasures and decoding with Hadoop's org.apache.hadoop.io.erasurecode.rawcoder.RSRawDecoder ), and compare the result with the stored parity. Any mismatch indicates a damaged EC file.
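The recompute-and-compare structure can be illustrated with a toy single-parity code standing in for Reed–Solomon (the real check decodes over GF(2^8) with RSRawDecoder; byte-wise XOR here is only a stand-in to show the shape of the check):

```python
def xor_parity(blocks: list) -> bytes:
    """Toy parity: byte-wise XOR of all data blocks (a stand-in for the
    three RS parity units of a real RS-6-3 stripe)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for idx, byte in enumerate(blk):
            out[idx] ^= byte
    return bytes(out)

def stripe_is_damaged(data_blocks: list, stored_parity: bytes) -> bool:
    """Recompute parity from the data units and compare with the stored
    parity; any mismatch means the stripe is damaged somewhere."""
    return xor_parity(data_blocks) != stored_parity

data = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity = xor_parity(data)
print(stripe_is_damaged(data, parity))      # False: stripe intact
data[1] = b"\xff\x04"                       # corrupt one data unit
print(stripe_is_damaged(data, parity))      # True: damage detected
```

Note that this check only says *whether* the stripe is damaged, not *which* block is bad; localization is the next step.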
4.2 Locating Damaged Blocks
For a corrupted stripe, we mark candidate blocks as erased, decode to recover them from the remaining units, and compare each recovered block with the stored original to pinpoint which blocks are damaged. RS-6-3 tolerates at most three erasures, so if more than three blocks in a stripe group are damaged, the file is unrecoverable without additional techniques.
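The erase-decode-compare loop can be sketched with the decoder abstracted away (in the real system the decoder is RSRawDecoder; here a trivially mirrored (2,2) "code", where each parity unit is a copy of a data unit, stands in so the logic is runnable):

```python
def locate_damaged(stripe: list, decode, n_data: int) -> list:
    """Erase each data unit in turn, ask `decode` (the erasure decoder)
    to reconstruct it from the surviving units, and flag units whose
    reconstruction disagrees with what is stored on disk."""
    damaged = []
    for i in range(n_data):
        recovered = decode(stripe, erased=i)
        if recovered != stripe[i]:
            damaged.append(i)
    return damaged

# Toy decoder for a mirrored (2,2) code: parity unit i is a copy of
# data unit i. This is only a stand-in for RS decoding.
def mirror_decode(stripe, erased):
    return stripe[2 + erased]

stripe = [b"\x01", b"\x02", b"\x01", b"\x02"]   # 2 data + 2 "parity" units
stripe[1] = b"\xff"                             # damage data unit 1
print(locate_damaged(stripe, mirror_decode, n_data=2))   # [1]
```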
4.3 Fast Block Location
DataNode provides the RPC org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol#getBlockLocalPathInfo , which returns the on‑disk path of a block. Access requires the user to be authorized via the dfs.block.local-path-access.user property (e.g., set to hdfs ).
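Authorizing the lookup is a one-property change in hdfs-site.xml on the DataNodes (the user name here is an example; it must match the account the tooling runs as):

```xml
<!-- hdfs-site.xml on each DataNode -->
<property>
  <name>dfs.block.local-path-access.user</name>
  <value>hdfs</value>
</property>
```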
EC Worker integrates this lookup and automatically generates commands to delete the damaged block, triggering reconstruction.
5. Summary
Zhihu migrated roughly 25 % of its files to EC, saving about 12 % of total storage. EC works well when block reconstruction is infrequent; however, older Hadoop versions contain many bugs during large‑scale rebuilds. Upgrading to a newer Hadoop release (e.g., 3.3.4) or applying relevant community patches is strongly recommended.
6. Appendix
List of community patches applied to Hadoop 3.2.1 (links omitted for brevity).
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.