HDFS Architecture, Optimizations, and Future Plans at Bilibili
Bilibili’s HDFS now runs a three‑tier architecture—access, metadata, and data layers—enhanced with a custom MergeFS router, observer NameNode, dynamic load balancing, fast‑failover pipelines, and storage‑aware policies, while future work targets transparent erasure coding, tiered data routing, lock refinements, and a Hadoop 3.x migration.
Last week we introduced the practical application of YARN scheduling at Bilibili; this week we present the architecture, optimizations, and outlook of HDFS at Bilibili.
HDFS is the offline storage platform underlying Hadoop big‑data computation. It has been in production at Bilibili for over five years and now stores nearly an exabyte of data, with hundreds of billions of metadata entries, about 20 NameSpaces, and tens of thousands of nodes, handling dozens of PB of read/write traffic per day.
Overall Architecture
The platform originally consisted of a metadata layer and a data layer. Recently a third "access layer" based on HDFS Router was added, forming a three‑tier architecture:
Access Layer: the HDFS Router provides a unified metadata view, maps mount points to clusters, and forwards user requests to the appropriate NameSpace.
Metadata Layer: NameNodes store file metadata and manage read/write operations; this is the heart of the system.
Data Layer: DataNodes store the actual file blocks.
Below are the key explorations and practices for each layer.
Access Layer
(1) MergeFS for Fast Metadata Expansion
Rapid growth of metadata made a single NameSpace insufficient. After evaluating HDFS federation and community solutions, we built a custom MergeFS on top of the community 3.3 Router, allowing a mount point to map to two NameSpaces and enabling rapid addition of new NameSpaces. We also created a NameSpace Balancer to asynchronously migrate historical data during low‑traffic periods.
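The core mechanics can be pictured as longest‑prefix mount‑table resolution in which one mount point may fan out to more than one NameSpace. The sketch below is illustrative only (the mount table, hash rule, and names are assumptions, not Bilibili's MergeFS code):

```python
# Sketch (not the production MergeFS code): longest-prefix mount-table lookup
# where a mount point may map to two NameSpaces. The hash-based placement of
# subtrees is an illustrative assumption.
import hashlib

MOUNT_TABLE = {
    "/warehouse": ["ns1"],
    "/warehouse/ods": ["ns2", "ns3"],   # MergeFS-style: one mount, two NameSpaces
}

def resolve(path: str) -> str:
    """Pick the longest matching mount point, then a NameSpace under it."""
    best = max((m for m in MOUNT_TABLE if path == m or path.startswith(m + "/")),
               key=len, default=None)
    if best is None:
        raise FileNotFoundError(f"no mount point for {path}")
    targets = MOUNT_TABLE[best]
    if len(targets) == 1:
        return targets[0]
    # Hash the first child under the mount so a whole subtree stays together.
    child = path[len(best):].lstrip("/").split("/", 1)[0]
    digest = int(hashlib.md5(child.encode()).hexdigest(), 16)
    return targets[digest % len(targets)]
```

Because placement is a pure function of the path, the Router needs no per‑file state, and the NameSpace Balancer can migrate subtrees asynchronously without client involvement.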
Results: expanded to 14 NameSpaces handling roughly 9 billion metadata objects, raised QPS capacity from 50 K to 177.8 K, and cut migration effort from about 1 person‑week to 0.1 person‑weeks.
(2) Access‑Layer Rate‑Limiting
To protect high‑priority jobs, we propagate user and job IDs via CallerContext, evaluate permits at the Router, and return StandbyException for throttled requests. Clients retry with exponential back‑off (up to 30 retries). On the NameNode side we enable FairCallQueue with a job‑based cost provider for RPC throttling.
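The client‑side retry loop might look like the following sketch. The delay constants and the `ThrottledError` stand‑in are assumptions; only the retry cap of 30 comes from the text:

```python
# Sketch (assumed behavior, not the actual client patch): retry a throttled
# request with capped exponential back-off plus jitter, mirroring the
# StandbyException-based throttling at the Router.
import random

MAX_RETRIES = 30          # retry cap described in the text
BASE_DELAY_S = 0.05       # illustrative constants
MAX_DELAY_S = 10.0

class ThrottledError(Exception):
    """Stand-in for the StandbyException the Router returns when throttling."""

def call_with_backoff(call, sleep=None):
    sleep = sleep or (lambda s: None)   # injectable for testing
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call()
        except ThrottledError:
            if attempt == MAX_RETRIES:
                raise
            delay = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            sleep(delay * (0.5 + random.random() / 2))  # jittered back-off
```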
(3) Quota Strategy
We moved quota calculation from the NameNode to the Router, caching path‑quota information from the HDFS metadata warehouse (updated daily). A future real‑time quota service based on a NameNode Observer is under development.
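A Router‑side quota gate against that daily‑refreshed cache could be as simple as the sketch below; the cache layout and field names are assumptions, not the real schema:

```python
# Sketch of a Router-side quota check against a cached snapshot; the cache is
# assumed to be refreshed daily from the HDFS metadata warehouse.
QUOTA_CACHE = {
    "/warehouse/ods": {"ns_quota": 1_000_000, "ns_used": 1_000_000},  # full
    "/tmp": {"ns_quota": 500, "ns_used": 10},
}

def check_create(path: str) -> None:
    """Reject a create if the nearest cached ancestor quota is exhausted."""
    node = path
    while node and node != "/":
        q = QUOTA_CACHE.get(node)
        if q is not None:
            if q["ns_used"] + 1 > q["ns_quota"]:
                raise PermissionError(f"namespace quota exceeded under {node}")
            return
        node = node.rsplit("/", 1)[0]
```

Because the snapshot is only refreshed daily, enforcement can lag actual usage by up to a day, which is why a real‑time Observer‑based quota service is the planned follow‑up.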
(4) Staging Directory Sharding
High‑volume temporary directories (YARN, Spark, Hive, logs) are distributed across multiple NameSpaces using consistent‑hash mount points, balancing QPS load.
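A minimal consistent‑hash ring, in the spirit of those mount points (illustrative, not the production Router code), looks like this:

```python
# Minimal consistent-hash ring for spreading a staging directory's children
# across NameSpaces: adding a NameSpace remaps only a small share of paths.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, namespaces, vnodes=100):
        # Each NameSpace gets `vnodes` points on the ring to smooth the split.
        self._ring = sorted(
            (self._hash(f"{ns}#{i}"), ns)
            for ns in namespaces for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def locate(self, path: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(path)) % len(self._ring)
        return self._ring[idx][1]
```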
Metadata Layer
(1) Observer NameNode
We integrated the community Observer architecture (based on RBF) into our 2.8.4 cluster, merging Hadoop 3.x code. This provides transparent read‑only nodes, improving QPS and reducing NameNode queue time (e.g., queue time dropped from ~23 ms to ~11 ms).
(2) Dynamic Load Balancing
DataNode heartbeats report IO, load, bandwidth, and disk usage. NameNode aggregates scores and selects the less‑loaded DataNode when allocating new blocks.
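The aggregation could be a weighted score over normalized heartbeat metrics; the weights below are invented for illustration, not Bilibili's tuning:

```python
# Illustrative scoring of DataNode heartbeat metrics (weights are made up):
# lower score = less loaded, so the NameNode prefers low-score nodes when
# placing new block replicas.
def load_score(m: dict) -> float:
    return (0.35 * m["io_util"]        # disk IO utilization, 0..1
          + 0.25 * m["cpu_load"]       # normalized load average, 0..1
          + 0.20 * m["net_util"]       # bandwidth utilization, 0..1
          + 0.20 * m["disk_used"])     # capacity used, 0..1

def pick_datanodes(heartbeats: dict, replicas: int = 3) -> list:
    ranked = sorted(heartbeats, key=lambda dn: load_score(heartbeats[dn]))
    return ranked[:replicas]
```

In practice this ranking would apply after rack‑awareness constraints, not instead of them.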
(3) Delete Protection
We enforce a minimum directory depth for deletions, maintain a dynamic protected‑path list, convert deletions to moves to the recycle bin, and restrict recycle‑bin cleanup to admins or automated processes.
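The layered checks chain together roughly as follows; the depth threshold and protected list here are illustrative values, not Bilibili's settings:

```python
# Sketch of the layered delete protections: minimum depth, protected-path
# list, and conversion of deletes into recycle-bin moves. Values are examples.
MIN_DELETE_DEPTH = 3                       # e.g. forbid deleting /warehouse
PROTECTED_PATHS = {"/warehouse/ods", "/warehouse/dwd"}   # dynamically managed

def guard_delete(path: str) -> str:
    parts = [p for p in path.split("/") if p]
    if len(parts) < MIN_DELETE_DEPTH:
        raise PermissionError(f"refusing delete above depth {MIN_DELETE_DEPTH}")
    if any(path == p or path.startswith(p + "/") for p in PROTECTED_PATHS):
        raise PermissionError(f"{path} is under a protected path")
    # Deletion becomes a rename into the recycle bin; only admins or
    # automated cleanup may purge the bin itself.
    return "/user/.Trash/Current" + path
```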
(4) Large‑Delete Optimization
Deletion of massive directories now removes directory entries synchronously, collects block info, and asynchronously instructs DataNodes to delete blocks, reducing lock time from minutes to seconds.
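The split between the fast, locked part and the slow, asynchronous part can be sketched like this (the queue and thread handling are simplified stand‑ins for the NameNode internals):

```python
# Sketch of the split delete path: the namespace entry is unlinked under the
# lock, while block invalidation is drained by a background worker.
import queue
import threading

_invalidate_q: "queue.Queue" = queue.Queue()

def delete_directory(namespace: dict, path: str) -> None:
    """Fast part, done under the namespace lock: unlink and collect blocks."""
    blocks = namespace.pop(path, [])
    for b in blocks:              # O(#blocks) enqueue, no DataNode RPCs here
        _invalidate_q.put(b)

def invalidation_worker(send_to_datanode) -> None:
    """Slow part, off the lock: tell DataNodes to drop the blocks."""
    while True:
        blk = _invalidate_q.get()
        if blk is None:           # shutdown sentinel
            return
        send_to_datanode(blk)
```

The lock is held only for the unlink and enqueue, which is what shrinks lock time from minutes to seconds on huge directories.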
(5) Parallel FsImage Loading
We split the large INode sections of FsImage into sub‑sections and load them in parallel, cutting NameNode startup time from ~80 min to ~50 min.
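For reference, upstream Hadoop 3.x exposes the equivalent feature through the `dfs.image.parallel.*` settings (HDFS-14617); the knobs in a backported 2.8.4 build may differ:

```xml
<!-- Upstream Hadoop 3.x settings for parallel FsImage load/save (HDFS-14617);
     shown as a reference point, values are illustrative. -->
<property>
  <name>dfs.image.parallel.load</name>
  <value>true</value>
</property>
<property>
  <name>dfs.image.parallel.target.sections</name>
  <value>12</value>
</property>
<property>
  <name>dfs.image.parallel.threads</name>
  <value>4</value>
</property>
```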
(6) HDFS Metadata Warehouse
We aggregate file size, read/write, and audit logs into a wide table, enabling data‑growth analysis, governance, and cold‑storage decisions.
(7) Smart Storage Manager (SSM)
Intel’s open‑source SSM integrates with the metadata warehouse to automate cold‑storage migration, EC, and compression policies.
Data Layer
(1) Client Pipeline Recovery & FastFailover
Clients monitor packet latency; slow packets trigger either pipeline reconstruction (Pipeline Recovery) or block termination and start a new block (FastFailover). These strategies reduced slow‑node latency from minutes to milliseconds.
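One way to frame the choice between the two strategies is by how much of the block has already been written; the thresholds below are invented for illustration:

```python
# Illustrative decision logic (thresholds are assumptions): when a packet ack
# is slow, either rebuild the pipeline around the slow DataNode, or, if enough
# of the block is already written, seal it early and open a new block.
SLOW_ACK_MS = 5_000
FAILOVER_MIN_WRITTEN = 0.5   # seal early only if >=50% of the block is written

def on_slow_packet(ack_ms: float, bytes_written: int, block_size: int) -> str:
    if ack_ms < SLOW_ACK_MS:
        return "continue"
    if bytes_written / block_size >= FAILOVER_MIN_WRITTEN:
        return "fast_failover"       # end this block, start a new one elsewhere
    return "pipeline_recovery"       # replace the slow node, keep the block
```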
(2) FastSwitchRead
Two mechanisms mitigate slow DataNodes during reads: (a) time‑window throughput‑based node switching, and (b) dynamic read‑socket timeout scaling (starting at 128 ms, doubling up to 30 s).
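The escalating timeout in (b) follows the constants given above; the function shape is an illustration:

```python
# The escalating read-socket timeout described above: start at 128 ms and
# double on each retry, capped at 30 s.
def read_timeout_schedule(retries: int, base_ms: int = 128,
                          cap_ms: int = 30_000) -> list:
    return [min(cap_ms, base_ms * (2 ** i)) for i in range(retries)]
```

A short first timeout lets the client abandon a slow DataNode quickly, while the cap avoids spuriously failing reads under genuine cluster‑wide load.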
(3) DataNode Lock Splitting
We async‑ify the heavy processCommand path and replace exclusive locks with read‑write locks, improving RPC throughput (WriteBlockAvgTime significantly reduced).
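The read‑write split can be illustrated with a minimal readers‑writer lock (a teaching sketch, not the DataNode's actual lock; writer‑starvation handling is omitted):

```python
# Minimal readers-writer lock: many concurrent readers, one exclusive writer.
# The real change swaps the DataNode dataset's exclusive lock for a read-write
# lock so block reads no longer serialize behind each other.
import threading

class RWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()
```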
Multi‑Datacenter Deployment
To overcome single‑data‑center capacity limits, we deployed identical HDFS/YARN clusters in remote data centers, using Router mount points and an RMProxy to route jobs based on user, queue, and data locality, with a Transfer service handling cross‑site data replication and throttling.
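A toy routing rule in that spirit (all names and the pinning table are assumptions, not the RMProxy's real policy):

```python
# Toy job-routing rule: honor static user/queue pinning first, otherwise
# follow data locality and run where most of the input already lives.
PINNED_QUEUES = {"ads_report": "dc1"}    # illustrative static pinning

def route_job(user_queue: str, input_bytes_by_dc: dict) -> str:
    if user_queue in PINNED_QUEUES:
        return PINNED_QUEUES[user_queue]
    # Data locality: the Transfer service replicates whatever remainder the
    # chosen data center is missing.
    return max(input_bytes_by_dc, key=input_bytes_by_dc.get)
```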
Future Plans
Transparent Erasure Coding (EC) : Develop EC Data Proxy, integrate EC patches into the client, deploy a dedicated 3.x HDFS cluster for EC data, and add Router support for EC mount points.
Hot/Cold/Tiered Data Routing : Introduce Alluxio caching, extend Router mount points to support three temperature tiers, and automate tiering based on metadata‑warehouse analysis.
NameNode Lock Optimization : Refine directory lock granularity to further boost read/write performance.
Hadoop 3.x Upgrade : Migrate from 2.x to 3.x clusters to leverage built‑in improvements and support EC.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.