Fine-Grained Lock Optimization for HDFS NameNode to Improve Metadata Read/Write Performance
To overcome the NameNode write bottleneck caused by a single global read/write lock in Bilibili's massive HDFS deployment, the team introduced hierarchical fine-grained locking that splits the lock into Namespace, BlockPool, and per-INode levels. The change yielded up to a threefold gain in write throughput and a 90% drop in RPC queue time, shifting the performance limit from lock contention to log synchronization.
Background: With rapid business growth, the volume of HDFS metadata access requests has grown exponentially. Existing solutions such as HDFS Federation, the Router mechanism, and Observer NameNode read/write separation have partially relieved the pressure, but once the number of NameSpaces grew beyond 30, the traditional Active+Standby+Observer architecture could no longer satisfy every read/write scenario.
The overall architecture of Bilibili's offline storage (Figure 1‑1) shows more than 30 NameSpaces, EB‑scale storage, and over 20 billion daily requests. Certain workloads (e.g., Flink checkpoint, Spark/MR log upload, data back‑fill) generate far higher write demands, causing the NameNode write‑performance bottleneck.
Problem Statement: The NameNode uses a single global read/write lock (implemented with Java's ReentrantReadWriteLock). While this simplifies the lock model, it becomes a major performance limiter because any write request blocks all other requests.
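The pattern is easy to see in a minimal sketch. The class and method names below are illustrative, not the actual FSNamesystem code: the point is that every metadata operation, read or write, funnels through one ReentrantReadWriteLock, so a single writer excludes all concurrent readers and writers.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the single-global-lock pattern (names are illustrative,
// not the real FSNamesystem fields).
public class GlobalLockSketch {
    // One fair read/write lock guards the entire in-memory metadata.
    private static final ReentrantReadWriteLock fsLock = new ReentrantReadWriteLock(true);

    static String getFileInfo(String path) {
        fsLock.readLock().lock();          // many readers may hold this together
        try {
            return "info:" + path;         // stand-in for a directory-tree lookup
        } finally {
            fsLock.readLock().unlock();
        }
    }

    static void mkdir(String path) {
        fsLock.writeLock().lock();         // excludes every other reader and writer
        try {
            // mutate the in-memory directory tree here
        } finally {
            fsLock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        mkdir("/tmp/a");
        System.out.println(getFileInfo("/tmp/a"));
    }
}
```

Because even unrelated paths contend on `fsLock`, a burst of writes (e.g., Flink checkpoints) stalls all reads cluster-wide on that NameSpace.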
Design Options: The proposed solution splits the global lock into finer‑grained locks in three steps:
Separate the global lock into a Namespace‑level lock and a BlockPool‑level lock.
Further split the Namespace lock into per‑INode locks.
Split the BlockPool lock into finer‑grained locks as well.
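The first step can be sketched as two independent locks plus a fixed acquisition order. This is a hypothetical simplification (the lock names and helper methods are invented for illustration); what matters is that namespace-only and blockpool-only requests no longer contend, and requests that need both always lock in the same order.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of splitting one global lock into a
// Namespace-level lock and a BlockPool-level lock.
public class TwoLevelLockSketch {
    static final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();
    static final ReentrantReadWriteLock bpLock = new ReentrantReadWriteLock();

    // A namespace-only op (e.g., mkdir) takes only the namespace lock.
    static void namespaceOnlyWrite(Runnable op) {
        nsLock.writeLock().lock();
        try { op.run(); } finally { nsLock.writeLock().unlock(); }
    }

    // A blockpool-only op (e.g., handling a block report) takes only the
    // blockpool lock, so it no longer blocks pure namespace traffic.
    static void blockPoolOnlyWrite(Runnable op) {
        bpLock.writeLock().lock();
        try { op.run(); } finally { bpLock.writeLock().unlock(); }
    }

    // An op touching both structures always acquires namespace first,
    // then blockpool; the fixed order prevents deadlock. Returns the
    // acquisition order as a trace for demonstration.
    static String combinedWriteTrace(Runnable op) {
        StringBuilder order = new StringBuilder();
        nsLock.writeLock().lock();
        order.append("ns");
        try {
            bpLock.writeLock().lock();
            order.append("->bp");
            try {
                op.run();                  // mutate both structures here
            } finally {
                bpLock.writeLock().unlock();
            }
        } finally {
            nsLock.writeLock().unlock();
        }
        return order.toString();
    }
}
```

With two locks, block reports from DataNodes (which dominate BlockPool traffic) stop competing with client-facing namespace RPCs.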
Implementation Details:
Identified three main data structures protected by the global lock: Namespace directory tree, BlockPool data block collection, and cluster node information.
Introduced a BlockManagerLock to handle BlockPool‑level events.
Implemented lock acquisition policies based on request type (Namespace‑only, BlockPool‑only, or both) and ensured a consistent lock ordering (Namespace lock first, then BlockPool lock) to avoid deadlocks.
For the INode‑level lock, a LockPool (inspired by Alluxio) is used to limit memory consumption; an INodeLockManager maps INodes to lock objects.
Defined three lock types (Read, Write_INode, Write_Edge) and described lock acquisition sequences for common RPCs such as getListing, create, mkdir, rename, and block reports.
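The pool-based INode locking above can be sketched as follows. This is a simplified, hypothetical version: lock objects are created on demand per inode id, so memory is bounded by the number of inodes under concurrent access rather than the size of the whole tree, and a path-style write takes read locks on ancestors and a write lock only on the target. The real INodeLockManager also evicts idle locks and distinguishes Write_INode from Write_Edge; both are omitted here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of an INode lock pool (inspired by the
// Alluxio-style LockPool described in the article; no eviction shown).
public class INodeLockPoolSketch {
    private final Map<Long, ReentrantReadWriteLock> pool = new ConcurrentHashMap<>();

    // Lock objects are shared per inode id and created lazily.
    ReentrantReadWriteLock lockFor(long inodeId) {
        return pool.computeIfAbsent(inodeId, id -> new ReentrantReadWriteLock());
    }

    // For an op like create("/a/b/c"): read-lock every ancestor, write-lock
    // only the final component. Acquiring top-down (root first) keeps the
    // order consistent across threads and avoids deadlock.
    void withPathWriteLock(long[] ancestorIds, long targetId, Runnable op) {
        for (long id : ancestorIds) lockFor(id).readLock().lock();
        lockFor(targetId).writeLock().lock();
        try {
            op.run();
        } finally {
            lockFor(targetId).writeLock().unlock();
            for (int i = ancestorIds.length - 1; i >= 0; i--)
                lockFor(ancestorIds[i]).readLock().unlock();
        }
    }

    int pooledLocks() {
        return pool.size();
    }
}
```

Under this scheme, two `create` calls in disjoint subtrees only share read locks on their common ancestors, so they proceed concurrently instead of serializing on one global write lock.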
Performance Results: After deploying the first two steps (Namespace‑level and BlockPool‑level lock separation), production tests showed roughly a 50 % improvement in write performance for a single NameSpace. After further splitting the Namespace lock to INode granularity, write throughput increased by about threefold, RPC queue time dropped by 90 %, and overall NameNode performance became limited by edit log and audit log synchronization rather than lock contention.
Conclusion and Outlook: The lock‑splitting optimizations have been stable in production and significantly improved metadata access performance. Future work includes further fine‑graining the BlockPool lock and exploring metadata persistence to RocksDB or KV stores (or adopting Ozone) to address the memory limitation of in‑memory metadata, especially for small‑file workloads.
Bilibili Tech