Can Data Lakes Combine Compute and Storage? Exploring HDFS, S3A, and UMStor Hadapter
This article examines the evolution of data lake architectures, comparing the compute‑storage fusion model of HDFS, the compute‑storage separation approach of S3A on Ceph, and a new UMStor Hadapter plugin that aims to unite their strengths while addressing performance bottlenecks.
Introduction
For many people born before the new millennium, the tech world holds the same allure as martial-arts stories: rival camps such as Windows and Linux, or private cloud and IOE (IBM, Oracle, EMC), clash like competing schools.
Two Factions in Data Lakes
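The term “data lake” is attributed to Dan Woods, who wrote about it in 2011 (James Dixon of Pentaho is also widely credited with introducing it around 2010). A data lake is an information system that (1) stores massive amounts of data in a parallel system and (2) allows computation on that data without moving it elsewhere.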
In practice, data lakes fall into three patterns:
1. Compute‑Storage Fusion (Unified Cluster)
Compute and storage resources are combined in a single cluster. As business lines grow, they compete for the same compute resources, and because every node supplies both compute and storage, the two can only be scaled together, even when only one of them is the actual constraint.
2. Compute‑Storage Fusion Pro (Isolated Clusters)
Each business line receives its own isolated data‑lake cluster, eliminating compute contention but creating data silos because the same dataset must be copied to multiple clusters, increasing storage cost and replication overhead.
3. Compute‑Storage Separation
Compute clusters are separate from a shared storage pool, solving the data‑duplication problem and allowing independent scaling of compute and storage, which aligns with elastic computing principles.
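The storage cost of the isolated-cluster pattern compared with a shared pool can be estimated with simple arithmetic. The following is a minimal sketch, not a sizing tool; the dataset size, replica count, and number of business lines are hypothetical values chosen only for illustration.

```python
def raw_storage_tb(dataset_tb, replicas, clusters):
    """Raw capacity needed when each of `clusters` clusters keeps
    its own fully replicated copy of the dataset."""
    return dataset_tb * replicas * clusters


# Hypothetical numbers: a 100 TB dataset, 3 replicas, 4 business lines.
isolated = raw_storage_tb(100, replicas=3, clusters=4)  # one copy per cluster
shared = raw_storage_tb(100, replicas=3, clusters=1)    # one shared storage pool

print(isolated, shared)
```

Under these assumptions the isolated clusters need four times the raw capacity of the shared pool, which is the duplication that compute-storage separation removes.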
Compute‑Storage Fusion – HDFS
HDFS (Hadoop Distributed File System) exemplifies the fusion model. Writing data involves six steps:
1. The client sends a create-file request to the NameNode.
2. The NameNode validates and authorizes the request.
3. The client splits the file into blocks (default 128 MB).
4. The NameNode returns DataNode locations for each block (typically three replicas).
5. The client streams each block to the chosen DataNodes via a pipeline.
6. The process repeats until the entire file is stored.
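The write steps above can be sketched as a toy model. This is an illustration only, not HDFS code: the function names and the DataNode names are hypothetical, and the round-robin placement stands in for the NameNode's real placement policy, which is more involved.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # default block size mentioned above
REPLICAS = 3                    # typical replica count mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size of each block; only the last one may be smaller."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def plan_write(file_size, datanodes, replicas=REPLICAS):
    """Toy NameNode: give every block a pipeline of `replicas` DataNodes.

    Assumption: plain round-robin placement, used here only to keep
    the sketch short.
    """
    if replicas > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replicas")
    ring = itertools.cycle(datanodes)
    plan = {}
    for index, _size in enumerate(split_into_blocks(file_size)):
        plan[index] = [next(ring) for _ in range(replicas)]
    return plan


MB = 1024 * 1024
sizes = split_into_blocks(300 * MB)
plan = plan_write(300 * MB, ["dn1", "dn2", "dn3", "dn4"])
for block, pipeline in plan.items():
    print(block, sizes[block] // MB, "MB ->", pipeline)
```

A 300 MB file yields two full 128 MB blocks and one 44 MB block, each with its own three-node pipeline.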
HDFS relies on data locality: computation is scheduled on nodes that already hold the required data, which reduces network traffic and improves read performance. However, large blocks are a coarse unit of placement, so data can end up unevenly spread across DataNodes and part of the cluster's storage is left under-utilized.
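Data locality can be illustrated with a small greedy scheduler. This is a sketch under simplifying assumptions (one task per block, a fixed number of free slots per node, first-fit choice); it is not how YARN or any real scheduler is implemented, and all names are hypothetical.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring a node that holds the block.

    block_locations: block id -> list of nodes storing a replica
    free_slots:      node -> number of tasks it can still accept
    """
    slots = dict(free_slots)  # work on a copy
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if slots.get(n, 0) > 0]
        # Fall back to any node with capacity if no replica holder is free.
        candidates = local or [n for n, s in slots.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free slots left")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = node
    return assignment


locations = {"b0": ["dn1", "dn2"], "b1": ["dn2", "dn3"], "b2": ["dn1"]}
assignment = schedule(locations, {"dn1": 1, "dn2": 1, "dn3": 1})
local_tasks = sum(1 for b, n in assignment.items() if n in locations[b])
print(assignment, local_tasks)
```

In this example two of the three tasks run where their data lives; the third has to read its block over the network because the only node holding it is already busy.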
Compute‑Storage Separation – S3A on Ceph
In private‑cloud deployments, Ceph’s object storage (accessed via RGW’s S3 interface) provides a shared backend. Applications use the S3A connector to read/write data. Because metadata no longer resides on a NameNode, the NameNode bottleneck disappears.
Ceph offers features such as cloud sync, lifecycle management, and erasure coding for redundancy, which are more mature than HDFS’s optional erasure‑coding support.
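The capacity difference between replication and erasure coding comes down to a short calculation. The sketch below assumes a hypothetical erasure-coding profile of 4 data chunks plus 2 coding chunks and a hypothetical 100 TB of user data; neither figure comes from the article.

```python
def replication_raw_tb(data_tb, replicas=3):
    """Raw capacity when every object is stored `replicas` times."""
    return data_tb * replicas


def erasure_raw_tb(data_tb, k, m):
    """Raw capacity with k data chunks and m coding chunks per object."""
    return data_tb * (k + m) / k


replicated = replication_raw_tb(100)        # three full copies
erasure_coded = erasure_raw_tb(100, k=4, m=2)  # hypothetical 4+2 profile

print(replicated, erasure_coded)
```

Both layouts in this example survive the loss of any two copies or chunks of an object, but the 4+2 profile needs half the raw capacity of three-way replication, at the price of extra computation when encoding and rebuilding data.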
Data upload via S3A follows these steps:
1. The client packages the request as HTTP and sends it to RGW.
2. RGW translates the request into a RADOS operation and forwards it to the Ceph cluster.
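To point a Hadoop client at RGW through S3A, the client needs a handful of `fs.s3a.*` settings. The sketch below collects them in a dictionary and renders them as `-D` options for a `hadoop fs` command. The endpoint, credentials, and bucket name are placeholders, and the choice of path-style access and plain HTTP is an assumption about a typical private RGW deployment, not something stated in the article.

```python
def s3a_properties(endpoint, access_key, secret_key, use_ssl=False):
    """Hadoop S3A settings for an S3-compatible endpoint such as RGW."""
    return {
        "fs.s3a.endpoint": endpoint,
        "fs.s3a.access.key": access_key,
        "fs.s3a.secret.key": secret_key,
        # Assumption: the gateway is addressed as host/bucket, not bucket.host.
        "fs.s3a.path.style.access": "true",
        "fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


def hadoop_fs_command(props, *fs_args):
    """Build an argument list such as: hadoop fs -D key=value ... -ls s3a://..."""
    cmd = ["hadoop", "fs"]
    for key, value in props.items():
        cmd += ["-D", f"{key}={value}"]
    return cmd + list(fs_args)


# Placeholder endpoint, credentials, and bucket.
props = s3a_properties("http://rgw.example.internal:7480", "ACCESS", "SECRET")
cmd = hadoop_fs_command(props, "-ls", "s3a://demo-bucket/")
print(" ".join(cmd))
```

In practice the same properties usually live in `core-site.xml`, and credentials are better supplied through a credential provider than on the command line.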
The extra RGW hop can become a performance bottleneck, and operations such as List Objects and Rename are slower than their HDFS equivalents. In addition, the open-source RGW does not support append-type uploads.
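One reason Rename is slow on an object store is that a "directory" is only a key prefix, so renaming it means copying and deleting every object underneath, whereas a file-system namespace can rename a directory with a single metadata update. The toy classes below illustrate that difference; they are hypothetical models, not the S3A or HDFS implementations.

```python
class ToyObjectStore:
    """Flat key -> data map, like a bucket: directories are just prefixes."""

    def __init__(self):
        self.objects = {}
        self.ops = 0

    def rename_prefix(self, src, dst):
        for key in [k for k in self.objects if k.startswith(src)]:
            self.objects[dst + key[len(src):]] = self.objects.pop(key)
            self.ops += 2  # one copy plus one delete per object


class ToyNamespace:
    """Directory -> file list map, like a file-system namespace."""

    def __init__(self):
        self.dirs = {}
        self.ops = 0

    def rename_dir(self, src, dst):
        self.dirs[dst] = self.dirs.pop(src)
        self.ops += 1  # a single metadata update


names = [f"part-{i:05d}" for i in range(1000)]

store = ToyObjectStore()
store.objects = {f"tmp/{name}": b"x" for name in names}
store.rename_prefix("tmp/", "out/")

namespace = ToyNamespace()
namespace.dirs["tmp"] = list(names)
namespace.rename_dir("tmp", "out")

print(store.ops, namespace.ops)
```

This matters for jobs that write to a temporary directory and rename it on commit: the cost grows with the number of output files on the object store but stays constant in the namespace model.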
UMStor Hadapter – Bridging the Gap
Leveraging librgw (a Ceph client library), UMStor built a Hadoop storage plugin called Hadapter. The core library libuds wraps librgw, allowing Hadoop clients to use a uds:// scheme. Requests bypass RGW and go directly to librados, eliminating one network hop.
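The difference between the two request paths can be written down as a tiny model. The node labels are hypothetical names for the components described above; the libraries libuds, librgw, and librados run inside the Hadoop client process, so they add no network hop.

```python
# Each list is the chain of machines a request crosses on the network.
S3A_PATH = ["hadoop-client", "rgw", "rados-cluster"]
# Hadapter: libuds -> librgw -> librados are in-process on the client.
HADAPTER_PATH = ["hadoop-client", "rados-cluster"]


def network_hops(path):
    """Number of network transfers needed to reach the last node."""
    return len(path) - 1


saved = network_hops(S3A_PATH) - network_hops(HADAPTER_PATH)
print(network_hops(S3A_PATH), network_hops(HADAPTER_PATH), saved)
```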
Hadapter is distributed as a simple JAR, easy to deploy on Hadoop nodes, and adds append‑upload support to overcome S3A’s limitation.
Performance Comparison
A WordCount benchmark on a 10 GB dataset ranked the three options, from fastest to slowest, as HDFS, Hadapter, and S3A. The tests ran in a controlled environment with identical node configurations and three-replica redundancy. HDFS achieved the best read performance thanks to data locality, Hadapter finished only about 35 seconds behind HDFS, and S3A took roughly twice as long as HDFS.
Customer Case
A major video‑service provider (≈35 PB) adopted Hadapter to support HBase, Hive, Spark, Flume, and YARN. The solution is now live, demonstrating the plugin’s suitability for large‑scale, production‑grade big‑data platforms.
Conclusion
HDFS remains the flagship of the compute‑storage fusion camp, while Hadapter showcases the promise of compute‑storage separation with better manageability. The UMStor team is working on Hadapter 2.0 to improve compatibility and performance, hinting at a future where the strengths of both models are unified.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
