HDFS Architecture, Optimizations, and Future Plans at Bilibili
Bilibili’s HDFS now runs a three‑tier architecture—access, metadata, and data layers—enhanced with a custom MergeFS router, observer NameNode, dynamic load balancing, fast‑failover pipelines, and storage‑aware policies, while future work targets transparent erasure coding, tiered data routing, lock refinements, and a Hadoop 3.x migration.
Last week we introduced the practical application of YARN scheduling at Bilibili; this week we present the architecture, optimizations, and outlook of HDFS at Bilibili.
HDFS is the offline storage platform underlying Hadoop big‑data computation. It has been in production at Bilibili for over five years and now stores nearly an exabyte of data, with hundreds of billions of metadata entries, about 20 NameSpaces, and tens of thousands of nodes, handling dozens of PB of read/write traffic per day.
Overall Architecture
The platform originally consisted of a metadata layer and a data layer. Recently a third "access layer" based on HDFS Router was added, forming a three‑tier architecture:
Access Layer: the HDFS Router provides a unified metadata view, maps mount points to clusters, and forwards user requests to the appropriate NameSpace.
Metadata Layer: NameNodes store file metadata and manage read/write operations; this is the heart of the system.
Data Layer: DataNodes store the actual file blocks.
Below are the key explorations and practices for each layer.
Access Layer
(1) MergeFS for Fast Metadata Expansion
Rapid growth of metadata made a single NameSpace insufficient. After evaluating HDFS federation and community solutions, we built a custom MergeFS on top of the community 3.3 Router, allowing a mount point to map to two NameSpaces and enabling rapid addition of new NameSpaces. We also created a NameSpace Balancer to asynchronously migrate historical data during low‑traffic periods.
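The core mechanics can be pictured as longest‑prefix mount‑table resolution in which one mount point may fan out to more than one NameSpace. The sketch below is illustrative only (the mount table, hash rule, and names are assumptions, not Bilibili's MergeFS code):

```python
# Sketch (not the production MergeFS code): longest-prefix mount-table lookup
# where a mount point may map to two NameSpaces. The hash-based placement of
# subtrees is an illustrative assumption.
import hashlib

MOUNT_TABLE = {
    "/warehouse": ["ns1"],
    "/warehouse/ods": ["ns2", "ns3"],   # MergeFS-style: one mount, two NameSpaces
}

def resolve(path: str) -> str:
    """Pick the longest matching mount point, then a NameSpace under it."""
    best = max((m for m in MOUNT_TABLE if path == m or path.startswith(m + "/")),
               key=len, default=None)
    if best is None:
        raise FileNotFoundError(f"no mount point for {path}")
    targets = MOUNT_TABLE[best]
    if len(targets) == 1:
        return targets[0]
    # Hash the first child under the mount so a whole subtree stays together.
    child = path[len(best):].lstrip("/").split("/", 1)[0]
    digest = int(hashlib.md5(child.encode()).hexdigest(), 16)
    return targets[digest % len(targets)]
```

Because placement is a pure function of the path, the Router needs no per‑file state, and the NameSpace Balancer can migrate subtrees asynchronously without client involvement.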
Results: expanded to 14 NameSpaces handling roughly 9 billion metadata objects, raised QPS capacity from 50 K to 177.8 K, and cut migration effort from about 1 person‑week to 0.1 person‑weeks.
(2) Access‑Layer Rate‑Limiting
To protect high‑priority jobs, we propagate user and job IDs via CallerContext, evaluate permits at the Router, and return StandbyException for throttled requests. Clients retry with exponential back‑off (up to 30 retries). On the NameNode side we enable FairCallQueue with a job‑based cost provider for RPC throttling.
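The client‑side retry loop might look like the following sketch. The delay constants and the `ThrottledError` stand‑in are assumptions; only the retry cap of 30 comes from the text:

```python
# Sketch (assumed behavior, not the actual client patch): retry a throttled
# request with capped exponential back-off plus jitter, mirroring the
# StandbyException-based throttling at the Router.
import random

MAX_RETRIES = 30          # retry cap described in the text
BASE_DELAY_S = 0.05       # illustrative constants
MAX_DELAY_S = 10.0

class ThrottledError(Exception):
    """Stand-in for the StandbyException the Router returns when throttling."""

def call_with_backoff(call, sleep=None):
    sleep = sleep or (lambda s: None)   # injectable for testing
    for attempt in range(MAX_RETRIES + 1):
        try:
            return call()
        except ThrottledError:
            if attempt == MAX_RETRIES:
                raise
            delay = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            sleep(delay * (0.5 + random.random() / 2))  # jittered back-off
```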
(3) Quota Strategy
We moved quota calculation from the NameNode to the Router, caching path‑quota information from the HDFS metadata warehouse (updated daily). A future real‑time quota service based on a NameNode Observer is under development.
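A Router‑side quota gate against that daily‑refreshed cache could be as simple as the sketch below; the cache layout and field names are assumptions, not the real schema:

```python
# Sketch of a Router-side quota check against a cached snapshot; the cache is
# assumed to be refreshed daily from the HDFS metadata warehouse.
QUOTA_CACHE = {
    "/warehouse/ods": {"ns_quota": 1_000_000, "ns_used": 1_000_000},  # full
    "/tmp": {"ns_quota": 500, "ns_used": 10},
}

def check_create(path: str) -> None:
    """Reject a create if the nearest cached ancestor quota is exhausted."""
    node = path
    while node and node != "/":
        q = QUOTA_CACHE.get(node)
        if q is not None:
            if q["ns_used"] + 1 > q["ns_quota"]:
                raise PermissionError(f"namespace quota exceeded under {node}")
            return
        node = node.rsplit("/", 1)[0]
```

Because the snapshot is only refreshed daily, enforcement can lag actual usage by up to a day, which is why a real‑time Observer‑based quota service is the planned follow‑up.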
(4) Staging Directory Sharding
High‑volume temporary directories (YARN, Spark, Hive, logs) are distributed across multiple NameSpaces using consistent‑hash mount points, balancing QPS load.
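A minimal consistent‑hash ring, in the spirit of those mount points (illustrative, not the production Router code), looks like this:

```python
# Minimal consistent-hash ring for spreading a staging directory's children
# across NameSpaces: adding a NameSpace remaps only a small share of paths.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, namespaces, vnodes=100):
        # Each NameSpace gets `vnodes` points on the ring to smooth the split.
        self._ring = sorted(
            (self._hash(f"{ns}#{i}"), ns)
            for ns in namespaces for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def locate(self, path: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(path)) % len(self._ring)
        return self._ring[idx][1]
```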
Metadata Layer
(1) Observer NameNode
We integrated the community Observer architecture (based on RBF) into our 2.8.4 cluster, merging Hadoop 3.x code. This provides transparent read‑only nodes, improving QPS and reducing NameNode queue time (e.g., queue time dropped from ~23 ms to ~11 ms).
(2) Dynamic Load Balancing
DataNode heartbeats report IO, load, bandwidth, and disk usage. NameNode aggregates scores and selects the less‑loaded DataNode when allocating new blocks.
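The aggregation could be a weighted score over normalized heartbeat metrics; the weights below are invented for illustration, not Bilibili's tuning:

```python
# Illustrative scoring of DataNode heartbeat metrics (weights are made up):
# lower score = less loaded, so the NameNode prefers low-score nodes when
# placing new block replicas.
def load_score(m: dict) -> float:
    return (0.35 * m["io_util"]        # disk IO utilization, 0..1
          + 0.25 * m["cpu_load"]       # normalized load average, 0..1
          + 0.20 * m["net_util"]       # bandwidth utilization, 0..1
          + 0.20 * m["disk_used"])     # capacity used, 0..1

def pick_datanodes(heartbeats: dict, replicas: int = 3) -> list:
    ranked = sorted(heartbeats, key=lambda dn: load_score(heartbeats[dn]))
    return ranked[:replicas]
```

In practice this ranking would apply after rack‑awareness constraints, not instead of them.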
(3) Delete Protection
We enforce a minimum directory depth for deletions, maintain a dynamic protected‑path list, convert deletions to moves to the recycle bin, and restrict recycle‑bin cleanup to admins or automated processes.
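The layered checks chain together roughly as follows; the depth threshold and protected list here are illustrative values, not Bilibili's settings:

```python
# Sketch of the layered delete protections: minimum depth, protected-path
# list, and conversion of deletes into recycle-bin moves. Values are examples.
MIN_DELETE_DEPTH = 3                       # e.g. forbid deleting /warehouse
PROTECTED_PATHS = {"/warehouse/ods", "/warehouse/dwd"}   # dynamically managed

def guard_delete(path: str) -> str:
    parts = [p for p in path.split("/") if p]
    if len(parts) < MIN_DELETE_DEPTH:
        raise PermissionError(f"refusing delete above depth {MIN_DELETE_DEPTH}")
    if any(path == p or path.startswith(p + "/") for p in PROTECTED_PATHS):
        raise PermissionError(f"{path} is under a protected path")
    # Deletion becomes a rename into the recycle bin; only admins or
    # automated cleanup may purge the bin itself.
    return "/user/.Trash/Current" + path
```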
(4) Large‑Delete Optimization
Deletion of massive directories now removes directory entries synchronously, collects block info, and asynchronously instructs DataNodes to delete blocks, reducing lock time from minutes to seconds.
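The split between the fast, locked part and the slow, asynchronous part can be sketched like this (the queue and thread handling are simplified stand‑ins for the NameNode internals):

```python
# Sketch of the split delete path: the namespace entry is unlinked under the
# lock, while block invalidation is drained by a background worker.
import queue
import threading

_invalidate_q: "queue.Queue" = queue.Queue()

def delete_directory(namespace: dict, path: str) -> None:
    """Fast part, done under the namespace lock: unlink and collect blocks."""
    blocks = namespace.pop(path, [])
    for b in blocks:              # O(#blocks) enqueue, no DataNode RPCs here
        _invalidate_q.put(b)

def invalidation_worker(send_to_datanode) -> None:
    """Slow part, off the lock: tell DataNodes to drop the blocks."""
    while True:
        blk = _invalidate_q.get()
        if blk is None:           # shutdown sentinel
            return
        send_to_datanode(blk)
```

The lock is held only for the unlink and enqueue, which is what shrinks lock time from minutes to seconds on huge directories.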
(5) Parallel FsImage Loading
We split the large INode sections of FsImage into sub‑sections and load them in parallel, cutting NameNode startup time from ~80 min to ~50 min.
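For reference, upstream Hadoop 3.x exposes the equivalent feature through the `dfs.image.parallel.*` settings (HDFS-14617); the knobs in a backported 2.8.4 build may differ:

```xml
<!-- Upstream Hadoop 3.x settings for parallel FsImage load/save (HDFS-14617);
     shown as a reference point, values are illustrative. -->
<property>
  <name>dfs.image.parallel.load</name>
  <value>true</value>
</property>
<property>
  <name>dfs.image.parallel.target.sections</name>
  <value>12</value>
</property>
<property>
  <name>dfs.image.parallel.threads</name>
  <value>4</value>
</property>
```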
(6) HDFS Metadata Warehouse
We aggregate file size, read/write, and audit logs into a wide table, enabling data‑growth analysis, governance, and cold‑storage decisions.
(7) Smart Storage Manager (SSM)
Intel’s open‑source SSM integrates with the metadata warehouse to automate cold‑storage migration, EC, and compression policies.
Data Layer
(1) Client Pipeline Recovery & FastFailover
Clients monitor packet latency; slow packets trigger either pipeline reconstruction (Pipeline Recovery) or block termination and start a new block (FastFailover). These strategies reduced slow‑node latency from minutes to milliseconds.
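One way to frame the choice between the two strategies is by how much of the block has already been written; the thresholds below are invented for illustration:

```python
# Illustrative decision logic (thresholds are assumptions): when a packet ack
# is slow, either rebuild the pipeline around the slow DataNode, or, if enough
# of the block is already written, seal it early and open a new block.
SLOW_ACK_MS = 5_000
FAILOVER_MIN_WRITTEN = 0.5   # seal early only if >=50% of the block is written

def on_slow_packet(ack_ms: float, bytes_written: int, block_size: int) -> str:
    if ack_ms < SLOW_ACK_MS:
        return "continue"
    if bytes_written / block_size >= FAILOVER_MIN_WRITTEN:
        return "fast_failover"       # end this block, start a new one elsewhere
    return "pipeline_recovery"       # replace the slow node, keep the block
```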
(2) FastSwitchRead
Two mechanisms mitigate slow DataNodes during reads: (a) time‑window throughput‑based node switching, and (b) dynamic read‑socket timeout scaling (starting at 128 ms, doubling up to 30 s).
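The escalating timeout in (b) follows the constants given above; the function shape is an illustration:

```python
# The escalating read-socket timeout described above: start at 128 ms and
# double on each retry, capped at 30 s.
def read_timeout_schedule(retries: int, base_ms: int = 128,
                          cap_ms: int = 30_000) -> list:
    return [min(cap_ms, base_ms * (2 ** i)) for i in range(retries)]
```

A short first timeout lets the client abandon a slow DataNode quickly, while the cap avoids spuriously failing reads under genuine cluster‑wide load.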
(3) DataNode Lock Splitting
We async‑ify the heavy processCommand path and replace exclusive locks with read‑write locks, improving RPC throughput (WriteBlockAvgTime significantly reduced).
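The read‑write split can be illustrated with a minimal readers‑writer lock (a teaching sketch, not the DataNode's actual lock; writer‑starvation handling is omitted):

```python
# Minimal readers-writer lock: many concurrent readers, one exclusive writer.
# The real change swaps the DataNode dataset's exclusive lock for a read-write
# lock so block reads no longer serialize behind each other.
import threading

class RWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()
```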
Multi‑Datacenter Deployment
To overcome single‑data‑center capacity limits, we deployed identical HDFS/YARN clusters in remote data centers, using Router mount points and an RMProxy to route jobs based on user, queue, and data locality, with a Transfer service handling cross‑site data replication and throttling.
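A toy routing rule in that spirit (all names and the pinning table are assumptions, not the RMProxy's real policy):

```python
# Toy job-routing rule: honor static user/queue pinning first, otherwise
# follow data locality and run where most of the input already lives.
PINNED_QUEUES = {"ads_report": "dc1"}    # illustrative static pinning

def route_job(user_queue: str, input_bytes_by_dc: dict) -> str:
    if user_queue in PINNED_QUEUES:
        return PINNED_QUEUES[user_queue]
    # Data locality: the Transfer service replicates whatever remainder the
    # chosen data center is missing.
    return max(input_bytes_by_dc, key=input_bytes_by_dc.get)
```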
Future Plans
Transparent Erasure Coding (EC) : Develop EC Data Proxy, integrate EC patches into the client, deploy a dedicated 3.x HDFS cluster for EC data, and add Router support for EC mount points.
Hot/Cold/Tiered Data Routing : Introduce Alluxio caching, extend Router mount points to support three temperature tiers, and automate tiering based on metadata‑warehouse analysis.
NameNode Lock Optimization : Refine directory lock granularity to further boost read/write performance.
Hadoop 3.x Upgrade : Migrate from 2.x to 3.x clusters to leverage built‑in improvements and support EC.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.