Bilibili HDFS Erasure Coding Strategy and Implementation
Bilibili cut petabyte‑scale storage costs by back‑porting erasure‑coding (EC) patches to its HDFS 2.8.4 clients and deploying a parallel EC‑enabled cluster, supported by a data‑proxy service, intelligent routing, block checking, and automated cold‑data migration. The team notes EC's write overhead and plans native acceleration.
Background: With Bilibili’s rapid business growth, the amount of data generated reaches petabyte scale daily. Cold data, which accounts for more than 30% of total storage, occupies a large portion of high‑density storage despite having limited access frequency. To further reduce storage costs, Bilibili decided to adopt Erasure Coding (EC) for HDFS.
What is EC? Erasure Coding (EC) is a forward error correction technique that achieves higher data reliability with lower redundancy than traditional triple‑replication. EC encodes n original data blocks into n+m blocks; any n of those blocks can reconstruct the original data, so the scheme tolerates up to m block failures.
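The "any n of n+m blocks suffice" property can be illustrated with the simplest possible scheme: a single XOR parity block (the m = 1 case). This is only a toy sketch of the idea; production HDFS uses Reed‑Solomon codes such as RS‑6‑3, not plain XOR.

```python
# Toy illustration of the erasure-coding idea with single XOR parity (m = 1).
# Production HDFS uses Reed-Solomon (e.g. RS-6-3); this only shows the principle.

def encode(data_blocks):
    """Append one parity block: the XOR of all data blocks."""
    parity = bytes(len(data_blocks[0]))
    for block in data_blocks:
        parity = bytes(a ^ b for a, b in zip(parity, block))
    return data_blocks + [parity]

def reconstruct(blocks, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(blocks) if i != lost_index]
    rebuilt = bytes(len(survivors[0]))
    for block in survivors:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, block))
    return rebuilt

data = [b"AAAA", b"BBBB", b"CCCC"]       # n = 3 data blocks
stripe = encode(data)                    # n + m = 4 blocks stored
assert reconstruct(stripe, 1) == b"BBBB" # any single lost block is recoverable
```

Reed‑Solomon generalizes this: instead of one XOR parity, m parity blocks are computed over a finite field, so any m of the n+m blocks may be lost.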
Impact on storage: For a 256 MB file, triple‑replication consumes 768 MB, while an RS‑6‑3‑1024k EC scheme reduces the consumption to 384 MB, cutting storage by about 50%.
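The savings follow directly from the replication factors: triple‑replication stores 3x the data, while an n+m EC scheme stores (n+m)/n of it. A quick check of the figures above:

```python
def storage_cost(file_mb, data_units, parity_units):
    """Raw MB consumed for a file under an n+m EC scheme."""
    return file_mb * (data_units + parity_units) / data_units

replication = 256 * 3                # 3 full copies -> 768 MB
ec_rs_6_3 = storage_cost(256, 6, 3)  # 256 MB * 9/6  -> 384 MB
print(replication, ec_rs_6_3)        # 768 384.0
```

So RS‑6‑3 carries an effective 1.5x overhead versus 3x for replication, which is the ~50% reduction cited above.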
2. HDFS EC Overall Solution
2.1 Problems
The current HDFS cluster is version 2.8.4, which does not support EC. Upgrading to 3.x is complex and resource‑intensive.
Various HDFS clients (Java, C++, Presto) are in use, making a unified client version unrealistic in the short term.
2.2 Design Choice
To avoid a full cluster upgrade, Bilibili back‑ported necessary EC patches to the 2.8.4 codebase and deployed a parallel HDFS 3.3 EC cluster. The architecture is illustrated in Figure 2‑1.
2.3 Overall Process
The workflow (Figure 2‑2) includes deploying the EC‑enabled 3.3 cluster, adding EC mount points in the HDFS Router, back‑porting EC patches to the 2.8.4 client, and building an EC Data Proxy service for legacy clients.
3. HDFS EC Implementation Details
3.1 DN Proxy Implementation
Bilibili developed a Data Proxy service that allows all 2.8.4 clients (including C++ and Python) to read EC data transparently. The process is:
Patch the 2.8.4 client to support EC read/write and embed client version information in CallerContext.
The EC‑enabled NameNode validates the client version; older or unknown clients are redirected to the Data Proxy, while newer clients connect directly to DataNode Xceiver.
On the DataNode side, an ErasureCodingDataXceiver with a dedicated port aggregates blocks from multiple DNs and returns a reconstructed stream following the Replication‑Based Data Transfer Protocol.
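The redirect decision in step 2 can be sketched as a version check against the CallerContext. Everything here is illustrative: the field name `clientVersion`, the helper names, and the assumption that patched clients advertise a version at or above the EC‑capable cutoff are not taken from the actual Bilibili or Hadoop code.

```python
# Hypothetical sketch of the NameNode-side redirect logic described above.
# Field and function names are invented for illustration.

MIN_EC_CLIENT_VERSION = (2, 8, 4)  # assumed cutoff for EC-capable (patched) clients

def parse_caller_context(ctx: str):
    """Extract a version tuple from a CallerContext like 'clientVersion:2.8.4'."""
    for field in ctx.split(","):
        key, _, value = field.partition(":")
        if key.strip() == "clientVersion":
            return tuple(int(p) for p in value.split("."))
    return None  # no version field -> unknown client

def read_target(ctx: str) -> str:
    version = parse_caller_context(ctx)
    if version is None or version < MIN_EC_CLIENT_VERSION:
        return "data-proxy"        # legacy/unknown clients go through the Data Proxy
    return "datanode-xceiver"      # patched clients stream from DataNodes directly

print(read_target("clientVersion:2.8.4"))  # datanode-xceiver
print(read_target("app:presto"))           # data-proxy
```

The key point is that routing happens server‑side, so unpatched C++ and Python clients need no changes at all.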
3.2 Intelligent Data Routing System
Based on the existing HDFS Router, a custom EC mount point was added to route new writes to hot namespaces while keeping historical EC data accessible. An automated EC migration tool moves cold data to EC‑enabled namespaces during low‑traffic periods, guided by metadata analysis.
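A mount‑point lookup of the kind the Router performs amounts to a longest‑prefix match from path to namespace. The paths and namespace names below are invented placeholders, not Bilibili's actual mount table:

```python
# Minimal sketch of an EC-aware mount-table lookup, loosely modeled on the
# HDFS Router idea described above. Paths and namespace names are invented.

MOUNT_TABLE = {
    "/warehouse/hot":  "ns-hot",  # new writes land on the hot namespace
    "/warehouse/cold": "ns-ec",   # migrated cold data lives on the EC namespace
}

def resolve_namespace(path: str) -> str:
    """Longest-prefix match of path against the mount table."""
    best = ""
    for mount in MOUNT_TABLE:
        if path.startswith(mount) and len(mount) > len(best):
            best = mount
    return MOUNT_TABLE.get(best, "ns-default")

print(resolve_namespace("/warehouse/cold/2023/part-0"))  # ns-ec
print(resolve_namespace("/tmp/scratch"))                 # ns-default
```

Because clients resolve paths through the Router, cold data can be moved to the EC namespace without changing any client‑visible path.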
3.3 Block Checker
The Block Checker component ensures the integrity of reconstructed EC blocks. It records CRC checksums of each block’s metadata in a KV store after finalization, then recomputes checksums during reconstruction and compares them to detect corruption.
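The record‑then‑verify cycle can be sketched as follows. The dict standing in for the KV store and the function names are illustrative; the source does not specify which KV store or checksum plumbing Bilibili uses beyond per‑block CRCs.

```python
import zlib

# Sketch of the Block Checker idea: record each finalized block's CRC in a
# KV store, then verify reconstructed blocks against it. A plain dict stands
# in for the external KV store here.

crc_store = {}

def record_block(block_id: str, data: bytes) -> None:
    """Called at block finalization: persist the block's CRC32."""
    crc_store[block_id] = zlib.crc32(data)

def verify_reconstructed(block_id: str, data: bytes) -> bool:
    """Called after EC reconstruction: compare against the recorded CRC."""
    return crc_store.get(block_id) == zlib.crc32(data)

record_block("blk_1001", b"original bytes")
print(verify_reconstructed("blk_1001", b"original bytes"))   # True
print(verify_reconstructed("blk_1001", b"corrupted bytes"))  # False
```

A mismatch flags the reconstructed block as corrupt before it can be served to readers.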
3.4 Intelligent EC Cold‑Backup Strategy
Cold data is identified by three criteria: (1) file age exceeds a cold‑backup threshold, (2) the recent OPEN operation count is below a threshold, and (3) the recent read count is below a threshold. The conversion workflow (Figure 3‑4) records cold‑backup flags in the HDFS metastore, selects only partitions larger than 3 MB (small files gain little from EC), and re‑converts partitions that have since been altered by other data‑governance jobs.
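The three criteria combine into a single predicate; a file must satisfy all of them to qualify. The threshold values below are illustrative placeholders, not Bilibili's production settings:

```python
from dataclasses import dataclass

# Sketch of the three cold-data criteria listed above.
# Threshold defaults are illustrative, not production values.

@dataclass
class FileStats:
    age_days: int    # time since the file was written
    open_count: int  # recent OPEN operations (e.g. from audit logs)
    read_count: int  # recent read operations

def is_cold(stats: FileStats,
            min_age_days: int = 90,
            max_opens: int = 5,
            max_reads: int = 5) -> bool:
    """A file is cold only when all three criteria hold."""
    return (stats.age_days >= min_age_days
            and stats.open_count < max_opens
            and stats.read_count < max_reads)

print(is_cold(FileStats(age_days=120, open_count=0, read_count=1)))  # True
print(is_cold(FileStats(age_days=10, open_count=0, read_count=0)))   # False
```

Requiring all three conditions avoids converting files that are old but still actively read, for which EC's slower reconstruction path would hurt.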
4. Summary and Outlook
Since the rollout, hundreds of petabytes have been converted to EC, saving the cost of thousands of servers. While EC reduces storage cost, it introduces write‑performance overhead, is unsuitable for very small files, and increases the number of blocks per DataNode. Future work includes native acceleration of EC encoding/decoding, further stability improvements, and contributing back to the open‑source Hadoop community.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.