How Hierarchical Namespace Boosts Cloud‑Native Data Lake Performance

This article examines the performance challenges of cloud‑native data lakes built on flat object storage and explains how a hierarchical‑namespace design improves directory operations, reduces request amplification, and delivers significant speedups for big‑data and AI workloads.

Baidu Intelligent Cloud Tech Hub

Jun 27, 2023

1. Challenges of Cloud‑Native Data Lake on Object Storage

Over the past decade, cloud‑native data lakes built on object storage have become the de facto standard, for three main reasons:

Resources are provisioned on demand, with multiple price‑performance tiers and massive throughput.

They leverage mature cloud ecosystems for governance, risk control, archiving, encryption, and basic compute.

Cloud providers supply professional operations staff, which greatly lowers operations staffing costs.

However, large‑scale analytics and AI training expose several performance problems:

Flat namespace metadata makes directory‑style operations (list, rename, stat) inefficient.

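Object storage is typically accessed over HTTP/1.1, which cannot multiplex requests on one connection (pipelining is defined but rarely enabled in practice); HTTP/2 and HTTP/3 are not yet ubiquitous, so per‑request latency stays high.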

Separating compute from storage adds network latency compared with running HDFS inside the same VPC.

Together, these factors can turn a single logical operation such as stat, rename, or list into hundreds of storage requests, lengthening overall job duration.
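To make the amplification concrete, here is a minimal Python sketch of how a directory rename is commonly emulated on a flat namespace: list the prefix, then copy and delete every object. The `FlatStore` class, its request counter, and the page size of 1000 are illustrative assumptions, not the API of any real object store.

```python
class FlatStore:
    """Toy flat-namespace object store that counts simulated requests."""

    def __init__(self, keys, page_size=1000):
        self.objects = {k: b"" for k in keys}
        self.page_size = page_size  # keys returned per list request (assumed)
        self.requests = 0

    def list_prefix(self, prefix):
        keys = sorted(k for k in self.objects if k.startswith(prefix))
        # One request per page of results, and at least one even if empty.
        self.requests += max(1, -(-len(keys) // self.page_size))
        return keys

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_dir(self, src, dst):
        """Emulate a directory rename: list, then copy and delete each object."""
        for key in self.list_prefix(src):
            self.copy(key, dst + key[len(src):])
            self.delete(key)


store = FlatStore(f"logs/2023/part-{i:05d}" for i in range(500))
store.rename_dir("logs/2023/", "archive/2023/")
print(store.requests)  # 1 list + 500 copies + 500 deletes = 1001
```

In this toy model one logical rename of a 500‑object directory costs 1001 requests, and the count grows linearly with the number of objects under the prefix.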

2. Hierarchical Namespace Technical Practice

Baidu Cloud Storage’s hierarchical‑namespace object storage stores metadata as a directory tree, which makes list, rename, and delete operations far more efficient than in flat‑namespace storage.

The service shares the same data plane as flat storage but uses a separate high‑performance metadata cluster.

The hierarchical model compresses metadata, supports billions of objects per bucket, and guarantees strong consistency via Raft‑based multi‑replica replication.

2.1 Metadata Model

Metadata is stored in a tree, allowing a single‑scan directory listing, atomic rename/delete, and one‑hop stat checks, eliminating the multiple RPCs required by flat storage.
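The following Python sketch illustrates why a tree helps. It is an in‑memory toy, not Baidu's implementation: the `Node` and `TreeNamespace` names are hypothetical, and persistence, replication, and locking are omitted. Listing reads one node's children, and renaming a directory moves a single entry between parents no matter how many objects sit beneath it.

```python
class Node:
    """A namespace entry; directories hold children, files hold None."""

    def __init__(self, is_dir):
        self.children = {} if is_dir else None


class TreeNamespace:
    def __init__(self):
        self.root = Node(is_dir=True)

    @staticmethod
    def _parts(path):
        return [p for p in path.split("/") if p]

    def _lookup(self, parts):
        node = self.root
        for part in parts:
            if node.children is None or part not in node.children:
                raise FileNotFoundError("/".join(parts))
            node = node.children[part]
        return node

    def put(self, path):
        """Create a file, creating missing parent directories on the way."""
        *parents, name = self._parts(path)
        node = self.root
        for part in parents:
            node = node.children.setdefault(part, Node(is_dir=True))
        node.children[name] = Node(is_dir=False)

    def stat(self, path):
        node = self._lookup(self._parts(path))
        return "dir" if node.children is not None else "file"

    def list_dir(self, path):
        # A single scan of one directory's children, not a prefix scan.
        return sorted(self._lookup(self._parts(path)).children)

    def rename(self, src, dst):
        """Move one entry between parents; the subtree is never touched."""
        *src_parents, src_name = self._parts(src)
        *dst_parents, dst_name = self._parts(dst)
        src_parent = self._lookup(src_parents)
        dst_parent = self._lookup(dst_parents)
        dst_parent.children[dst_name] = src_parent.children.pop(src_name)


ns = TreeNamespace()
ns.put("tmp/job1/part-0")
ns.put("tmp/job1/part-1")
ns.put("out/_keep")
ns.rename("tmp/job1", "out/job1")
print(ns.list_dir("out/job1"))  # ['part-0', 'part-1']
print(ns.stat("out/job1"))      # dir
```

For simplicity the sketch resolves a path component by component, and it does not model the one‑hop stat or the atomicity guarantees described above.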

2.2 Data and API Model

Clients can create nested directories implicitly and delete non‑empty directories atomically, while preserving compatibility with existing flat‑namespace APIs.

2.3 Performance Optimizations

Write paths use a Raft‑based KV store, batch commits, and path‑level locking to achieve high concurrency. Read paths employ a strongly consistent directory cache, MVCC snapshot reads, and leader‑lease mechanisms to ensure linearizability.
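As a rough illustration of two of these ideas, batch commits and MVCC snapshot reads, the Python sketch below keeps every version of a key and lets a reader pin a timestamp. The `MVCCStore` class is a hypothetical single‑process toy: Raft replication, path‑level locks, the directory cache, and leader leases are all left out, and nothing here is taken from the actual service.

```python
import bisect


class MVCCStore:
    """Toy multi-version KV store: writers append versions, readers pin a snapshot."""

    def __init__(self):
        self.commit_ts = 0
        self.versions = {}  # key -> ([commit timestamps, ascending], [values])

    def commit(self, batch):
        """Apply a batch of writes under one new commit timestamp."""
        self.commit_ts += 1
        for key, value in batch.items():
            ts_list, values = self.versions.setdefault(key, ([], []))
            ts_list.append(self.commit_ts)
            values.append(value)  # None acts as a delete marker
        return self.commit_ts

    def snapshot(self):
        """Return a timestamp that later reads can be pinned to."""
        return self.commit_ts

    def get(self, key, snapshot_ts):
        """Return the newest value whose commit timestamp is <= snapshot_ts."""
        if key not in self.versions:
            return None
        ts_list, values = self.versions[key]
        idx = bisect.bisect_right(ts_list, snapshot_ts)
        return values[idx - 1] if idx else None


store = MVCCStore()
store.commit({"/a/x": "v1", "/a/y": "v1"})
snap = store.snapshot()
store.commit({"/a/x": "v2", "/a/y": None})  # overwrite x, delete y

print(store.get("/a/x", snap))              # v1
print(store.get("/a/x", store.snapshot()))  # v2
print(store.get("/a/y", store.snapshot()))  # None
```

A reader holding `snap` keeps seeing the state as of the first commit, so a long directory scan is not disturbed by concurrent writers.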

2.4 Compatibility Layer

To keep existing flat‑namespace clients working, the compatibility layer provides:

Automatic creation of parent directories.

Prefix‑based listing that returns keys in lexicographic order.

Full support for object‑storage ecosystem features such as lifecycle, image processing, and event callbacks.
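The first two behaviours can be sketched in a few lines of Python. This is a toy under stated assumptions: nested dicts stand in for the directory tree, and `put`, `walk`, and `list_prefix` are hypothetical helper names. The subtle part is ordering: a plain depth‑first walk sorted by child name would emit `a/x` before `a-b`, whereas sorting the flat key strings puts `a-b` first because `-` sorts before `/`. Sorting directories as `name/` restores the flat order.

```python
def put(tree, key):
    """Insert a flat-style key, creating missing parent directories implicitly."""
    *parents, name = key.split("/")
    node = tree
    for part in parents:
        node = node.setdefault(part, {})
    node[name] = None  # None marks a file; a dict marks a directory


def walk(node, base=""):
    """Yield full keys in the same order as sorting the flat key strings."""

    def sort_key(name):
        # Compare directories as "name/" so siblings such as "a-b" and
        # "a.txt" come before everything stored under "a/".
        return name + "/" if isinstance(node[name], dict) else name

    for name in sorted(node, key=sort_key):
        child = node[name]
        if isinstance(child, dict):
            yield from walk(child, base + name + "/")
        else:
            yield base + name


def list_prefix(tree, prefix):
    # Simplification: scans the whole tree instead of descending to the prefix.
    return [key for key in walk(tree) if key.startswith(prefix)]


keys = ["a/x", "a-b", "a/c/y", "a.txt", "b"]
tree = {}
for key in keys:
    put(tree, key)

print(list_prefix(tree, "a"))  # ['a-b', 'a.txt', 'a/c/y', 'a/x']
```

The listing matches what a flat store would return for the same keys, which is what lets existing prefix‑based clients run unchanged.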

2.5 Evaluation

Benchmarks (API latency, NNBench, TPC‑DS) show the hierarchical namespace delivering up to 100% improvement for directory‑heavy workloads while matching flat storage for pure data reads.
