Can Data Lakes Combine Compute and Storage? Exploring HDFS, S3A, and UMStor Hadapter

This article examines the evolution of data lake architectures, comparing the compute‑storage fusion model of HDFS, the compute‑storage separation approach of S3A on Ceph, and a new UMStor Hadapter plugin that aims to unite their strengths while addressing performance bottlenecks.

UCloud Tech

May 22, 2018

Introduction

For many people born before the new millennium, the allure of martial‑arts stories mirrors the excitement of the tech world, where rivals such as Windows vs. Linux or private cloud vs. IOE (IBM, Oracle, EMC) clash like competing schools.

Two Factions in Data Lakes

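The term “data lake” is generally credited to James Dixon, then CTO of Pentaho, who introduced it around 2010; Dan Woods helped popularize it in a 2011 Forbes article. A data lake is an information system that (1) can store massive amounts of data in parallel and (2) allows computation on that data without moving it elsewhere.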

In practice, data lakes fall into three patterns:

1. Compute‑Storage Fusion (Unified Cluster)

Compute and storage resources are combined in a single cluster. As business lines grow, they compete for compute resources, and scaling requires expanding both compute and storage together, which can be inconvenient.

2. Compute‑Storage Fusion Pro (Isolated Clusters)

Each business line receives its own isolated data‑lake cluster, eliminating compute contention but creating data silos because the same dataset must be copied to multiple clusters, increasing storage cost and replication overhead.

3. Compute‑Storage Separation

Compute clusters are separate from a shared storage pool, solving the data‑duplication problem and allowing independent scaling of compute and storage, which aligns with elastic computing principles.
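The storage cost of the isolated‑cluster pattern can be made concrete with a little arithmetic. The sketch below uses made‑up figures (a 100 TB dataset, three business lines, three‑way replication); none of them come from the article, and the helper name `raw_capacity_tb` is purely illustrative.

```python
def raw_capacity_tb(dataset_tb, replicas, copies):
    """Raw disk needed to hold `copies` independent copies of a dataset,
    each stored with `replicas`-way replication."""
    return dataset_tb * replicas * copies


# Hypothetical figures, for illustration only.
DATASET_TB = 100
REPLICAS = 3
BUSINESS_LINES = 3

# Pattern 2: every business line keeps its own full copy.
isolated = raw_capacity_tb(DATASET_TB, REPLICAS, copies=BUSINESS_LINES)
# Pattern 3: all compute clusters read one shared copy.
shared = raw_capacity_tb(DATASET_TB, REPLICAS, copies=1)

print(f"isolated clusters: {isolated} TB raw")
print(f"shared pool:       {shared} TB raw")
```

Under these assumptions the isolated layout needs three times the raw capacity of the shared pool, and the gap grows with every additional business line that needs the same data.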

Compute‑Storage Fusion – HDFS

HDFS (Hadoop Distributed File System) exemplifies the fusion model. Writing data involves six steps:

1. The client sends a create‑file request to the NameNode.

2. The NameNode validates and authorizes the request.

3. The client splits the file into blocks (128 MB by default).

4. The NameNode returns the DataNode locations for each block (typically three replicas).

5. The client streams each block to the chosen DataNodes through a pipeline.

6. Steps 4 and 5 repeat until every block of the file is stored.
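The block‑splitting and placement steps above can be sketched with a toy model. This is not HDFS code: the round‑robin placement is a simplifying assumption (real HDFS placement is rack‑aware), and the function names and DataNode names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size, in bytes
REPLICATION = 3                 # typical replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block for a file of `file_size` bytes."""
    if file_size <= 0:
        return []
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy NameNode: give each block a pipeline of `replication` distinct
    DataNodes, chosen round-robin (an assumption, not the HDFS policy)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    placement = []
    for i in range(num_blocks):
        start = i % len(datanodes)
        pipeline = [datanodes[(start + k) % len(datanodes)]
                    for k in range(replication)]
        placement.append(pipeline)
    return placement


# A hypothetical 300 MB file on a four-node cluster.
blocks = split_into_blocks(300 * 1024 * 1024)
pipelines = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for size, pipeline in zip(blocks, pipelines):
    print(f"{size // (1024 * 1024)} MB -> {pipeline}")
```

A 300 MB file yields two full 128 MB blocks plus a 44 MB tail block, and each block gets its own three‑node write pipeline.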

HDFS relies on data locality: computation is scheduled on nodes that hold the required data, reducing network traffic and improving read performance. However, large block sizes can cause poor data balance and under‑utilize cluster storage.
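Data locality can be illustrated with a minimal scheduling sketch. It assumes a simplified scheduler that knows which nodes hold a block and how many free task slots each node has; real Hadoop schedulers also consider rack locality and queueing, and the name `pick_node` is hypothetical.

```python
def pick_node(block_replicas, free_slots):
    """Pick a node for a task that reads one block.

    Prefers a node that already holds a replica (local read); otherwise
    falls back to any node with a free slot (remote read over the network).
    Returns (node, is_local), or (None, False) if the cluster is full.
    """
    for node in block_replicas:
        if free_slots.get(node, 0) > 0:
            return node, True
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


replicas = ["dn2", "dn3", "dn4"]  # nodes holding the block (hypothetical)

# dn2 is busy, dn3 has a free slot: the task runs next to its data.
print(pick_node(replicas, {"dn1": 2, "dn2": 0, "dn3": 1, "dn4": 1}))

# Only dn1 has capacity: the block must be read remotely.
print(pick_node(replicas, {"dn1": 1, "dn2": 0, "dn3": 0, "dn4": 0}))
```

The second case is what every read looks like in a separated architecture, which is why the storage path's network cost matters so much there.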

Compute‑Storage Separation – S3A on Ceph

In private‑cloud deployments, Ceph’s object storage (accessed via RGW’s S3 interface) provides a shared backend. Applications use the S3A connector to read/write data. Because metadata no longer resides on a NameNode, the NameNode bottleneck disappears.
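As a rough illustration of what pointing S3A at an RGW endpoint involves, the sketch below assembles the relevant Hadoop properties and renders them as `spark-submit` flags. The `fs.s3a.*` property names are standard S3A settings; the endpoint URL, the credentials, and the helper names are placeholders assumed for this example, not values from the article.

```python
def s3a_conf(endpoint, access_key, secret_key):
    """Build Hadoop S3A properties for an S3-compatible endpoint such as RGW."""
    use_ssl = endpoint.startswith("https://")
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Path-style URLs avoid per-bucket DNS names, which private
        # deployments often do not configure.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": "true" if use_ssl else "false",
    }


def to_spark_flags(conf):
    """Render the properties as spark-submit flags (spark.hadoop.* prefix)."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"spark.hadoop.{key}={value}"]
    return flags


# Placeholder endpoint and credentials, for illustration only.
conf = s3a_conf("http://rgw.example.internal:7480", "ACCESS_KEY", "SECRET_KEY")
print(" ".join(to_spark_flags(conf)))
```

With settings along these lines, a job addresses data as `s3a://bucket/path` rather than `hdfs://namenode/path`, and no NameNode is involved in the request.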

Ceph offers features such as cloud sync, lifecycle management, and erasure coding for redundancy, which are more mature than HDFS’s optional erasure‑coding support.

Data upload via S3A follows these steps:

1. The client packages the request as HTTP and sends it to RGW.

2. RGW translates the request into a RADOS operation and forwards it to the Ceph cluster.

The extra RGW hop can become a performance bottleneck, and operations such as List Objects and Rename are slower than their HDFS counterparts. In addition, the open‑source RGW lacks support for append‑type uploads.
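One reason Rename is slow is that the S3 API has no native rename, so a connector has to emulate a directory rename by listing the source prefix and then copying and deleting each object. The toy model below counts requests to show how that cost scales; it is an in‑memory stand‑in with hypothetical names, not S3A's or RGW's actual code.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store."""

    def __init__(self):
        self.objects = {}
        self.requests = 0  # every call below counts as one request

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]


def rename_prefix(store, src, dst):
    """Emulate a directory rename: list, then copy and delete each object."""
    keys = store.list(src)
    for key in keys:
        store.copy(key, dst + key[len(src):])
        store.delete(key)
    return len(keys)


store = ToyObjectStore()
for i in range(3):
    store.put(f"tmp/job/part-{i}", b"data")

before = store.requests
moved = rename_prefix(store, "tmp/job/", "final/job/")
print(moved, "objects moved using", store.requests - before, "requests")
```

In this model the request count grows linearly with the number of objects (one list plus a copy and a delete per object), whereas an HDFS rename is a single metadata update on the NameNode regardless of directory size.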

UMStor Hadapter – Bridging the Gap

Leveraging librgw (a Ceph client library), UMStor built a Hadoop storage plugin called Hadapter. The core library libuds wraps librgw, allowing Hadoop clients to use a uds:// scheme. Requests bypass RGW and go directly to librados, eliminating one network hop.

Hadapter is distributed as a simple JAR, easy to deploy on Hadoop nodes, and adds append‑upload support to overcome S3A’s limitation.

Performance Comparison

Benchmark tests (WordCount on a 10 GB dataset) ranked the three options, from fastest to slowest, as HDFS, Hadapter, S3A. In a controlled environment with identical node configurations and three‑replica redundancy, HDFS performed best thanks to data locality, Hadapter finished only about 35 seconds behind it, and S3A took roughly twice as long as HDFS.

Customer Case

A major video‑service provider (≈35 PB) adopted Hadapter to support HBase, Hive, Spark, Flume, and YARN. The solution is now live, demonstrating the plugin’s suitability for large‑scale, production‑grade big‑data platforms.

Conclusion

HDFS remains the flagship of the compute‑storage fusion camp, while Hadapter showcases the promise of compute‑storage separation with better manageability. The UMStor team is working on Hadapter 2.0 to improve compatibility and performance, hinting at a future where the strengths of both models are unified.
