How Hierarchical Namespace Boosts Cloud‑Native Data Lake Performance
This article examines the performance challenges of cloud‑native data lakes built on flat object storage and explains how a hierarchical‑namespace design improves directory operations, reduces request amplification, and delivers significant speedups for big‑data and AI workloads.
1. Challenges of Cloud‑Native Data Lakes on Object Storage
Over the past decade, cloud‑native data lakes built on object storage have become the de facto standard, for three main reasons:
Resources are provisioned on demand, with multiple price‑performance tiers and massive throughput.
They leverage mature cloud ecosystems for governance, risk control, archiving, encryption, and basic compute.
Cloud providers supply professional operations staff, greatly lowering operations staffing costs.
However, large‑scale analytics and AI training expose several performance problems:
Flat namespace metadata makes directory‑style operations (list, rename, stat) inefficient.
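Object storage is typically accessed over HTTP/1.1, which cannot multiplex concurrent requests on one connection (pipelining exists in the specification but is rarely usable in practice); HTTP/2 and HTTP/3 are not yet ubiquitous, so per‑request latency stays high.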
Separate compute and storage introduce additional network latency compared with HDFS inside a VPC.
Together, these factors can turn a single stat, rename, or list call into hundreds of underlying requests, which lengthens overall job duration.
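To make the amplification concrete, here is a minimal Python sketch of how a directory rename is commonly emulated on a flat namespace: list the prefix, then copy and delete each object. The `FlatStore` class, its method names, and the 1,000‑key page size are illustrative assumptions, not any provider's API.

```python
class FlatStore:
    """Toy flat-namespace store (hypothetical): full keys in a dict, every call counted."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data=b""):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix, page_size=1000):
        keys = sorted(k for k in self.objects if k.startswith(prefix))
        # Assume one request per page of results (at least one, even if empty).
        self.requests += max(1, -(-len(keys) // page_size))
        return keys

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_dir(self, src, dst):
        """Emulated rename: list the prefix, then copy + delete every object."""
        for key in self.list_prefix(src):
            self.copy(key, dst + key[len(src):])
            self.delete(key)


store = FlatStore()
for i in range(500):
    store.put(f"warehouse/tmp/part-{i:05d}")

store.requests = 0  # count only the rename itself
store.rename_dir("warehouse/tmp/", "warehouse/final/")
flat_rename_requests = store.requests
print(flat_rename_requests)  # 1 list + 500 copies + 500 deletes = 1001
```

In this toy model one logical rename of a 500‑object directory costs 1,001 requests, and the count grows linearly with directory size. The emulated rename is also not atomic: a failure midway leaves objects under both prefixes.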
2. Hierarchical Namespace Technical Practice
Baidu Cloud Storage’s hierarchical‑namespace object storage stores metadata as a directory tree, which makes list, rename, and delete operations more efficient than on flat‑namespace storage.
The service shares the same data plane as flat storage but uses a separate high‑performance metadata cluster.
The hierarchical model compresses metadata, supports billions of objects per bucket, and guarantees strong consistency via Raft‑based multi‑replica replication.
2.1 Metadata Model
Metadata is stored in a tree, allowing a single‑scan directory listing, atomic rename/delete, and one‑hop stat checks, eliminating the multiple RPCs required by flat storage.
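The following is a minimal in‑memory sketch of a tree‑shaped namespace, intended only to illustrate why these operations become cheap. The `Node` and `Namespace` classes and all method names are hypothetical; the real service stores this tree in a replicated metadata cluster, which the sketch does not model.

```python
class Node:
    def __init__(self, is_dir):
        self.is_dir = is_dir
        self.children = {} if is_dir else None  # name -> Node, directories only


class Namespace:
    """Hypothetical directory-tree metadata model (single process, no replication)."""

    def __init__(self):
        self.root = Node(is_dir=True)

    @staticmethod
    def _parts(path):
        return [p for p in path.split("/") if p]

    def _walk(self, parts):
        node = self.root
        for part in parts:
            if not node.is_dir or part not in node.children:
                return None
            node = node.children[part]
        return node

    def create(self, path, is_dir=False):
        *parent, name = self._parts(path)
        parent_node = self._walk(parent)
        if parent_node is None or not parent_node.is_dir:
            raise FileNotFoundError("/".join(parent))
        parent_node.children[name] = Node(is_dir)

    def stat(self, path):
        """One lookup along the path; no prefix listing needed."""
        node = self._walk(self._parts(path))
        if node is None:
            return None
        return "dir" if node.is_dir else "file"

    def list_dir(self, path):
        """A single scan over one directory's children."""
        node = self._walk(self._parts(path))
        if node is None or not node.is_dir:
            raise NotADirectoryError(path)
        return sorted(node.children)

    def rename(self, src, dst):
        """Move one subtree pointer; cost does not depend on subtree size.
        The sketch omits the check against moving a directory into itself."""
        *src_parent, src_name = self._parts(src)
        *dst_parent, dst_name = self._parts(dst)
        src_node = self._walk(src_parent)
        dst_node = self._walk(dst_parent)
        if src_node is None or not src_node.is_dir or src_name not in src_node.children:
            raise FileNotFoundError(src)
        if dst_node is None or not dst_node.is_dir:
            raise FileNotFoundError("/".join(dst_parent))
        dst_node.children[dst_name] = src_node.children.pop(src_name)


ns = Namespace()
ns.create("warehouse", is_dir=True)
ns.create("warehouse/tmp", is_dir=True)
for i in range(3):
    ns.create(f"warehouse/tmp/part-{i}")

ns.rename("warehouse/tmp", "warehouse/final")
listing = ns.list_dir("warehouse/final")
print(listing)  # ['part-0', 'part-1', 'part-2']
```

Because the rename only re‑links one entry in the parent directory, it touches the same amount of metadata whether the directory holds three objects or three million.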
2.2 Data and API Model
Clients can create nested directories implicitly and delete non‑empty directories atomically, while preserving compatibility with existing flat‑namespace APIs.
2.3 Performance Optimizations
Write paths use a Raft‑based KV store, batch commits, and path‑level locking to achieve high concurrency. Read paths employ a strongly consistent directory cache, MVCC snapshot reads, and leader‑lease mechanisms to ensure linearizability.
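As a rough illustration of two of these ideas, batch commits and MVCC snapshot reads, consider the single‑process sketch below. `MVCCStore` and its methods are hypothetical names; Raft replication, path‑level locking, caching, and leader leases are deliberately left out, so this shows the versioning idea only, not the service's implementation.

```python
class MVCCStore:
    """Hypothetical multi-version KV store: every key keeps its commit history."""

    def __init__(self):
        self.version = 0
        self.history = {}  # key -> list of (commit_version, value), ascending

    def commit(self, batch):
        """Apply a batch of writes under one new version, so readers see all or none.
        A value of None acts as a delete marker."""
        self.version += 1
        for key, value in batch.items():
            self.history.setdefault(key, []).append((self.version, value))
        return self.version

    def snapshot(self):
        """A snapshot is just the latest committed version number."""
        return self.version

    def get(self, key, at_version):
        """Return the newest value committed at or before at_version."""
        for commit_version, value in reversed(self.history.get(key, [])):
            if commit_version <= at_version:
                return value
        return None


kv = MVCCStore()
kv.commit({"/a/x": "v1", "/a/y": "v1"})
snap = kv.snapshot()

# A later batch overwrites one entry and deletes another.
kv.commit({"/a/x": "v2", "/a/y": None})

old_x = kv.get("/a/x", snap)           # still "v1" in the old snapshot
new_x = kv.get("/a/x", kv.snapshot())  # "v2" at the latest version
old_y = kv.get("/a/y", snap)           # "v1" before the delete
new_y = kv.get("/a/y", kv.snapshot())  # None after the delete
```

A reader holding `snap` keeps a stable view of the directory while writers continue to commit, which is the property that lets long listings run without blocking writes.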
2.4 Compatibility Layer
The compatibility layer keeps existing flat‑namespace clients working through:
Automatic creation of parent directories.
Prefix‑based listing that respects lexical order.
Full support for object‑storage ecosystem features such as lifecycle management, image processing, and event callbacks.
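The first two items can be sketched in a few lines of Python. The functions `put_object` and `list_keys` and the nested‑dict representation are assumptions made for illustration. The sketch also assumes keys do not end in "/" and that no name is used as both a file and a directory.

```python
def put_object(tree, key):
    """Insert a flat key, creating any missing parent directories on the way."""
    *dirs, name = key.split("/")
    node = tree
    for d in dirs:
        node = node.setdefault(d, {})  # a directory is a dict
    node[name] = None                  # a file is None


def list_keys(tree, prefix="", base=""):
    """Yield full keys under prefix in the order a flat store would return them."""

    def sort_key(item):
        name, child = item
        # A directory sorts as "name/", the way its children's full keys do,
        # so "a-b" correctly comes before "a/x" ('-' sorts before '/').
        return name + "/" if child is not None else name

    for name, child in sorted(tree.items(), key=sort_key):
        path = base + name
        if child is None:
            if path.startswith(prefix):
                yield path
        else:
            dir_path = path + "/"
            # Descend only into subtrees that can contain matching keys.
            if dir_path.startswith(prefix) or prefix.startswith(dir_path):
                yield from list_keys(child, prefix, dir_path)


tree = {}
keys = ["a/x", "a-b", "a/b/c", "b"]
for k in keys:
    put_object(tree, k)

all_keys = list(list_keys(tree))
under_a = list(list_keys(tree, prefix="a/"))
print(all_keys)  # ['a-b', 'a/b/c', 'a/x', 'b']
print(under_a)   # ['a/b/c', 'a/x']
```

The sort key matters: a naive depth‑first walk that sorts siblings by bare name would emit `a/x` before `a-b`, which differs from the flat lexical order existing clients expect.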
2.5 Evaluation
Benchmarks (API latency, NNBench, TPC‑DS) show the hierarchical namespace improving directory‑heavy workloads by up to 100% while matching flat storage for pure data reads.
Baidu Intelligent Cloud Tech Hub
