How US3 Hadoop Adapter Cuts Big Data Storage Costs and Boosts Performance
This article explains how UCloud's US3 object storage, combined with a custom Hadoop adapter, separates compute and storage, optimizes file system operations, and leverages caching and specialized APIs to dramatically reduce storage costs and improve read/write performance for large‑scale Hadoop workloads.
In the era of information explosion, reducing the cost of massive data storage has become a critical step for enterprises handling big‑data workloads. UCloud’s self‑developed next‑generation object storage service US3 offers compute‑storage separation and backup solutions tailored for Hadoop scenarios.
Overall Design Idea
Hadoop interacts with storage through the generic FileSystem base class. The US3Hadoop adapter implements this class via US3FileSystem, similar to HDFS’s DistributedFileSystem and AWS S3’s S3AFileSystem. All I/O and metadata requests are sent directly to US3, as shown in the architecture diagram below.
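Hadoop locates such an implementation through configuration: the `fs.<scheme>.impl` property maps a URI scheme to a FileSystem class. A hypothetical core-site.xml binding for US3 might look like the following — the class name and scheme shown here are illustrative assumptions, not the adapter's documented configuration:

```xml
<!-- Hypothetical core-site.xml snippet: binds the us3:// scheme to the adapter. -->
<configuration>
  <property>
    <name>fs.us3.impl</name>
    <value>cn.ucloud.us3.fs.US3FileSystem</value>
  </property>
</configuration>
```

With such a binding in place, any Hadoop job that opens a `us3://bucket/path` URI is routed through the adapter rather than HDFS.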
The adapter distinguishes index‑only APIs (e.g., HEADFile, ListObjects, Rename, DeleteFile, Copy) from data‑transfer APIs (GetFile; PutFile for files smaller than 4 MB; and, for larger files, the multipart suite InitiateMultipartUpload, UploadPart, FinishMultipartUpload, AbortMultipartUpload).
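The 4 MB threshold above can be sketched as a simple upload planner. Everything here — the class name, and a part size equal to the small-file threshold — is an illustrative assumption in plain Java, not the adapter's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class UploadPlanner {
    static final long SMALL_FILE_LIMIT = 4L * 1024 * 1024; // 4 MB threshold from the article
    static final long PART_SIZE = 4L * 1024 * 1024;        // per-part size (assumed)

    /** Returns the sequence of US3 API calls issued for a file of the given size. */
    public static List<String> plan(long fileSize) {
        List<String> calls = new ArrayList<>();
        if (fileSize < SMALL_FILE_LIMIT) {
            calls.add("PutFile");                  // single-shot upload for small files
        } else {
            calls.add("InitiateMultipartUpload");
            long parts = (fileSize + PART_SIZE - 1) / PART_SIZE; // ceiling division
            for (long i = 0; i < parts; i++) calls.add("UploadPart");
            calls.add("FinishMultipartUpload");    // AbortMultipartUpload on failure instead
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(plan(1024));                // small file
        System.out.println(plan(10L * 1024 * 1024));   // 10 MB, split into parts
    }
}
```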
Key FileSystem Methods Overridden
initialize
create
rename
getFileStatus
open
listStatus
mkdirs
setOwner
setPermission
setReplication
setWorkingDirectory
getWorkingDirectory
getScheme
getUri
getDefaultBlockSize
delete
Methods that cannot be reasonably simulated, such as append, are overridden to throw an unsupported‑operation exception.
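A minimal sketch of how such a method fails fast, using no Hadoop types (the class name and message are illustrative; the real override has `FileSystem.append`'s signature):

```java
public class AppendStub {
    /** append cannot be simulated on immutable objects, so the call fails immediately. */
    public static Object append(String path) {
        throw new UnsupportedOperationException(
            "append is not supported by the US3 adapter: " + path);
    }

    public static void main(String[] args) {
        try {
            append("/logs/app.log");
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast here is deliberate: a silent no-op or a copy-based emulation would hide data-loss risks from callers that expect HDFS append semantics.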
getFileStatus’s Time‑Space Tradeoff
Index operations account for over 70 % of calls in big‑data workloads, with getFileStatus being the most frequent. US3 stores a directory as a key ending with ‘/’, while Hadoop’s Path objects omit the trailing slash, so resolving a path would normally require HEADFile checks against both the slashless and the slash‑terminated key. To avoid the extra round trip, the adapter writes zero‑byte placeholder objects whose MIME type encodes the entry type: “file/path” for the file key and “application/x-director” for the directory key, so a single HEADFile can tell files and directories apart.
Additionally, a 3‑second cache stores FileStatus results. Operations on the same key within this window reuse the cached entry, while delete and rename invalidate or update it accordingly. The cache can cut getFileStatus latency by a factor of hundreds, at the cost of a brief staleness window that is acceptable for read‑heavy Hadoop jobs.
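A minimal sketch of such a TTL cache in plain Java, assuming the 3‑second window and string‑valued statuses for illustration (the real adapter caches FileStatus objects keyed by path):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusCache {
    private static final long TTL_MS = 3_000; // 3-second window from the article

    private static final class Entry {
        final String status;
        final long storedAt;
        Entry(String status, long storedAt) { this.status = status; this.storedAt = storedAt; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public void put(String key, String status) {
        cache.put(key, new Entry(status, System.currentTimeMillis()));
    }

    /** Returns the cached status, or null if absent or older than the TTL. */
    public String get(String key) {
        Entry e = cache.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() - e.storedAt > TTL_MS) {
            cache.remove(key); // expired: drop it so callers re-query US3
            return null;
        }
        return e.status;
    }

    /** delete/rename must invalidate the stale entry, as described above. */
    public void invalidate(String key) { cache.remove(key); }
}
```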
ListObjects Consistency Issue
US3’s ListObjects currently provides eventual consistency, similar to other object‑storage services. To achieve stronger guarantees, the adapter performs an internal “reconciliation” step after create/rename/delete operations, repeatedly invoking ListObjects until the expected state is observed, effectively delivering read‑your‑writes consistency for most Hadoop use cases.
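The reconciliation step can be sketched as a bounded retry against a listing function. US3's actual ListObjects API is stood in for by a `Supplier`, and the retry budget and backoff values are illustrative assumptions:

```java
import java.util.Set;
import java.util.function.Supplier;

public class ListReconciler {
    /**
     * Re-invokes a (stand-in) ListObjects call until the expected key is
     * visible or the retry budget is exhausted. Returns true once visible.
     */
    public static boolean awaitVisible(Supplier<Set<String>> listObjects,
                                       String expectedKey,
                                       int maxAttempts,
                                       long backoffMs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (listObjects.get().contains(expectedKey)) return true;
            try {
                Thread.sleep(backoffMs); // wait for the listing to catch up
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // caller decides whether to fail or proceed
    }
}
```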
Deep Customization of Rename
Unlike generic object‑storage solutions that implement rename via a copy‑then‑delete sequence, US3 offers a dedicated rename API. The US3Hadoop adapter leverages this API, keeping rename latency in the millisecond range even for large files.
Ensuring Efficient Reads
Read operations dominate big‑data workloads. The adapter adds a prefetch buffer to the underlying HTTP stream, reducing system‑call frequency and latency for sequential reads. For random reads, the adapter distinguishes two scenarios:
A backward seek (target before the current read position) uses a delayed stream reopen: if the target offset still lies within the prefetch buffer, only the buffer’s consumption pointer is moved back; otherwise the stream is reopened at the target on the next read.
A forward seek follows the same delayed‑open logic, but if the distance ahead is less than the remaining buffered bytes plus 16 KB, the adapter simply advances the buffer’s consumption pointer (discarding the skipped bytes) instead of reopening the stream.
When the underlying stream is unexpectedly closed (e.g., TCP reset), the adapter transparently reopens the stream at the last read position to maintain availability.
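The two seek scenarios and the 16 KB slack can be condensed into one decision function. The parameter names and string results below are illustrative; in the real adapter the same delayed-reopen path is also taken when the stream is closed unexpectedly:

```java
public class SeekPolicy {
    static final long FORWARD_SLACK = 16 * 1024; // extra 16 KB window from the article

    /**
     * Decides how a seek is served, given the current read position, the
     * already-consumed bytes still held in the buffer (behind), and the
     * buffered-but-unread bytes (ahead).
     */
    public static String decide(long currentPos, long bufferedBehind,
                                long bufferedAhead, long target) {
        long delta = target - currentPos;
        if (delta < 0 && -delta <= bufferedBehind) {
            return "ADJUST_BUFFER";   // backward seek lands inside the buffer
        }
        if (delta >= 0 && delta < bufferedAhead + FORWARD_SLACK) {
            return "ADJUST_BUFFER";   // forward seek within buffer + 16 KB slack
        }
        return "DELAYED_REOPEN";      // reopen at the target on the next read
    }

    public static void main(String[] args) {
        System.out.println(decide(100, 50, 200, 120)); // short forward hop
        System.out.println(decide(100, 50, 200, 10));  // far backward jump
    }
}
```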
Conclusion
The US3Hadoop adapter, built on open‑source concepts, resolves key performance and reliability challenges when Hadoop accesses US3, acting as a vital bridge in many customer deployments. While further improvements are planned—especially around index latency, I/O delay, and atomicity—the current US3Vmds solution already narrows the performance gap with native HDFS, delivering substantial read/write efficiency gains for big‑data scenarios.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.