
Fine-Grained Lock Optimization for HDFS NameNode to Improve Metadata Read/Write Performance

To overcome the NameNode write bottleneck caused by a single global read/write lock in Bilibili’s massive HDFS deployment, the team introduced hierarchical fine‑grained locking—splitting the lock into Namespace, BlockPool, and per‑INode levels. This yielded up to a threefold gain in write throughput and a 90% drop in RPC queue time, shifting the performance limit from lock contention to log synchronization.

Bilibili Tech

Background: With rapid business growth, the volume of HDFS metadata access requests has increased exponentially. Existing solutions such as HDFS Federation, Router mechanisms, and Observer NameNode read/write separation have partially alleviated the pressure, but once the number of NameSpaces grew beyond 30, the traditional Active+Standby+Observer architecture could no longer satisfy all read/write scenarios.

The overall architecture of Bilibili's offline storage (Figure 1‑1) shows more than 30 NameSpaces, EB‑scale storage, and over 20 billion daily requests. Certain workloads (e.g., Flink checkpoint, Spark/MR log upload, data back‑fill) generate far higher write demand, making the NameNode a write‑performance bottleneck.

Problem Statement: The NameNode uses a single global read/write lock (implemented with Java's ReentrantReadWriteLock). While this simplifies the lock model, it becomes a major performance limiter because any write request blocks all other requests.
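A minimal sketch of this coarse‑grained pattern (class and method names here are illustrative, not the actual `FSNamesystem` code): every metadata operation funnels through one `ReentrantReadWriteLock`, so a single writer excludes all concurrent readers and writers.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch: one global lock guards all NameNode metadata,
// so any write request serializes against every other request.
class GlobalLockNamesystem {
    private final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

    String readOp(String path) {
        fsLock.readLock().lock();   // shared with other readers only
        try {
            return "listing of " + path; // e.g. getListing
        } finally {
            fsLock.readLock().unlock();
        }
    }

    void writeOp(String path) {
        fsLock.writeLock().lock();  // excludes every other request
        try {
            // e.g. create, mkdir, rename ...
        } finally {
            fsLock.writeLock().unlock();
        }
    }
}
```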

Design Options: The proposed solution splits the global lock into finer‑grained locks in three steps:

Separate the global lock into a Namespace‑level lock and a BlockPool‑level lock.

Further split the Namespace lock into per‑INode locks.

Split the BlockPool lock into finer‑grained locks as well.

Implementation Details:

Identified three main data structures protected by the global lock: Namespace directory tree, BlockPool data block collection, and cluster node information.

Introduced a BlockManagerLock to handle BlockPool‑level events.

Implemented lock acquisition policies based on request type (Namespace‑only, BlockPool‑only, or both) and ensured a consistent lock ordering (Namespace lock first, then BlockPool lock) to avoid deadlocks.
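The ordering rule above can be sketched as follows (a hypothetical simplification, not the actual HDFS classes): any request that needs both locks always takes the Namespace lock before the BlockManager (BlockPool) lock, so two threads can never hold them in opposite order and deadlock.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the consistent lock ordering:
// Namespace lock first, then BlockPool lock.
class TwoLevelLocks {
    final ReentrantReadWriteLock namespaceLock = new ReentrantReadWriteLock(true);
    final ReentrantReadWriteLock blockManagerLock = new ReentrantReadWriteLock(true);

    // A request touching both structures (e.g. completing a file).
    void namespaceAndBlockPoolWrite(Runnable op) {
        namespaceLock.writeLock().lock();        // 1. Namespace first
        try {
            blockManagerLock.writeLock().lock(); // 2. then BlockPool
            try {
                op.run();
            } finally {
                blockManagerLock.writeLock().unlock();
            }
        } finally {
            namespaceLock.writeLock().unlock();
        }
    }

    // A BlockPool-only request (e.g. a block report) skips the
    // Namespace lock entirely, so it no longer blocks namespace RPCs.
    void blockPoolOnlyWrite(Runnable op) {
        blockManagerLock.writeLock().lock();
        try {
            op.run();
        } finally {
            blockManagerLock.writeLock().unlock();
        }
    }
}
```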

For the INode‑level lock, a LockPool (inspired by Alluxio) is used to limit memory consumption; an INodeLockManager maps INodes to lock objects.
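A reference‑counted pool along these lines could look like the sketch below. All names (`INodeLockManager`, `PooledLock`) are illustrative assumptions, not the actual HDFS or Alluxio classes: locks are created on demand per INode id and returned to the pool once no request holds them, bounding memory instead of keeping one lock object alive per INode forever.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of an Alluxio-style lock pool for INode locks.
class INodeLockManager {
    static final class PooledLock {
        final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        final AtomicInteger refs = new AtomicInteger();
    }

    private final ConcurrentHashMap<Long, PooledLock> pool = new ConcurrentHashMap<>();

    // Get (or create) the lock for an INode and bump its refcount.
    PooledLock acquire(long inodeId) {
        return pool.compute(inodeId, (id, existing) -> {
            PooledLock l = (existing != null) ? existing : new PooledLock();
            l.refs.incrementAndGet();
            return l;
        });
    }

    // Drop a reference; evict the lock object once unreferenced.
    void release(long inodeId, PooledLock l) {
        if (l.refs.decrementAndGet() == 0) {
            // Remove only if still unreferenced; a concurrent acquire
            // that bumped the count in the meantime keeps the entry.
            pool.computeIfPresent(inodeId, (id, cur) ->
                cur.refs.get() == 0 ? null : cur);
        }
    }

    int pooledLockCount() { return pool.size(); }
}
```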

Defined three lock types (Read, Write_INode, Write_Edge) and described lock acquisition sequences for common RPCs such as getListing, create, mkdir, rename, and block reports.
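An acquisition sequence for a mutating RPC such as mkdir might look like the following sketch (illustrative only, not the actual HDFS code): read‑lock each ancestor from the root down, write‑lock the final component, and release in reverse order. The Write_Edge type used by rename‑style operations, which changes parent/child links, is omitted for brevity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical per-INode path locking: read locks on ancestors,
// a write lock on the target, released deepest-first.
class PathLocker {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks =
        new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String inode) {
        return locks.computeIfAbsent(inode, k -> new ReentrantReadWriteLock());
    }

    // For mkdir /a/b pass ["/", "a", "b"]:
    // read("/"), read("a"), write("b").
    Deque<Runnable> lockPathForWrite(String... components) {
        Deque<Runnable> unlockers = new ArrayDeque<>();
        for (int i = 0; i < components.length; i++) {
            ReentrantReadWriteLock l = lockFor(components[i]);
            if (i < components.length - 1) {
                l.readLock().lock();                       // ancestor: shared
                unlockers.push(() -> l.readLock().unlock());
            } else {
                l.writeLock().lock();                      // target: exclusive
                unlockers.push(() -> l.writeLock().unlock());
            }
        }
        return unlockers; // pop() releases in reverse (deepest-first) order
    }

    void unlockAll(Deque<Runnable> unlockers) {
        while (!unlockers.isEmpty()) unlockers.pop().run();
    }
}
```

Under this scheme, concurrent writes under different subtrees (say, `/a/b` and `/a/c`) only share read locks on the common ancestors, so they no longer serialize against each other.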

Performance Results: After deploying the first two steps (Namespace‑level and BlockPool‑level lock separation), production tests showed roughly a 50 % improvement in write performance for a single NameSpace. After further splitting the Namespace lock to INode granularity, write throughput increased by about threefold, RPC queue time dropped by 90 %, and overall NameNode performance became limited by edit log and audit log synchronization rather than lock contention.

Conclusion and Outlook: The lock‑splitting optimizations have been stable in production and significantly improved metadata access performance. Future work includes further fine‑graining the BlockPool lock and exploring metadata persistence to RocksDB or KV stores (or adopting Ozone) to address the memory limitation of in‑memory metadata, especially for small‑file workloads.

Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
