How US3 Hadoop Adapter Cuts Big Data Storage Costs and Boosts Performance
This article explains how UCloud's US3 object storage, combined with a custom Hadoop adapter, separates compute and storage, optimizes file system operations, and leverages caching and specialized APIs to dramatically reduce storage costs and improve read/write performance for large‑scale Hadoop workloads.
In the era of information explosion, reducing the cost of massive data storage has become a critical step for enterprises handling big‑data workloads. UCloud’s self‑developed next‑generation object storage service US3 offers compute‑storage separation and backup solutions tailored for Hadoop scenarios.
Overall Design Idea
Hadoop interacts with storage through the generic FileSystem base class. The US3Hadoop adapter implements this class via US3FileSystem, similar to HDFS’s DistributedFileSystem and AWS S3’s S3AFileSystem. All I/O and metadata requests are sent directly to US3, as shown in the architecture diagram below.
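Hadoop locates such an implementation through configuration: the `fs.<scheme>.impl` property maps a URI scheme to a FileSystem class. A hypothetical core-site.xml binding for US3 might look like the following — the class name and scheme shown here are illustrative assumptions, not the adapter's documented configuration:

```xml
<!-- Hypothetical core-site.xml snippet: binds the us3:// scheme to the adapter. -->
<configuration>
  <property>
    <name>fs.us3.impl</name>
    <value>cn.ucloud.us3.fs.US3FileSystem</value>
  </property>
</configuration>
```

With such a binding in place, any Hadoop job that opens a `us3://bucket/path` URI is routed through the adapter rather than HDFS.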
The adapter distinguishes index‑only APIs (e.g., HEADFile, ListObjects, Rename, DeleteFile, Copy) from data‑transfer APIs (GetFile; PutFile for files smaller than 4 MB; and, for larger files, the multipart suite InitiateMultipartUpload, UploadPart, FinishMultipartUpload, AbortMultipartUpload).
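The 4 MB threshold above can be sketched as a simple upload planner. Everything here — the class name, and a part size equal to the small-file threshold — is an illustrative assumption in plain Java, not the adapter's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class UploadPlanner {
    static final long SMALL_FILE_LIMIT = 4L * 1024 * 1024; // 4 MB threshold from the article
    static final long PART_SIZE = 4L * 1024 * 1024;        // per-part size (assumed)

    /** Returns the sequence of US3 API calls issued for a file of the given size. */
    public static List<String> plan(long fileSize) {
        List<String> calls = new ArrayList<>();
        if (fileSize < SMALL_FILE_LIMIT) {
            calls.add("PutFile");                  // single-shot upload for small files
        } else {
            calls.add("InitiateMultipartUpload");
            long parts = (fileSize + PART_SIZE - 1) / PART_SIZE; // ceiling division
            for (long i = 0; i < parts; i++) calls.add("UploadPart");
            calls.add("FinishMultipartUpload");    // AbortMultipartUpload on failure instead
        }
        return calls;
    }

    public static void main(String[] args) {
        System.out.println(plan(1024));                // small file
        System.out.println(plan(10L * 1024 * 1024));   // 10 MB, split into parts
    }
}
```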
Key FileSystem Methods Overridden
initialize
create
rename
getFileStatus
open
listStatus
mkdirs
setOwner
setPermission
setReplication
setWorkingDirectory
getWorkingDirectory
getScheme
getUri
getDefaultBlockSize
delete
Methods that cannot be reasonably simulated, such as append, are overridden to throw an unsupported‑operation exception.
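A minimal sketch of how such a method fails fast, using no Hadoop types (the class name and message are illustrative; the real override has `FileSystem.append`'s signature):

```java
public class AppendStub {
    /** append cannot be simulated on immutable objects, so the call fails immediately. */
    public static Object append(String path) {
        throw new UnsupportedOperationException(
            "append is not supported by the US3 adapter: " + path);
    }

    public static void main(String[] args) {
        try {
            append("/logs/app.log");
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast here is deliberate: a silent no-op or a copy-based emulation would hide data-loss risks from callers that expect HDFS append semantics.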
getFileStatus’s Time‑Space Tradeoff
Index operations account for over 70 % of calls in big‑data workloads, with getFileStatus being the most frequent. US3 stores a directory as a key ending with ‘/’, while Hadoop’s Path objects omit the trailing slash, so resolving a path would normally require HEADFile checks against both the slashless and the slash‑terminated key. To avoid the extra round trip, the adapter writes zero‑byte placeholder objects whose MIME type encodes the entry type: “file/path” for the file key and “application/x-director” for the directory key, so a single HEADFile can tell files and directories apart.
Additionally, a 3‑second cache stores FileStatus results. Operations on the same key within this window reuse the cached entry, while delete and rename invalidate or update it accordingly. The cache can cut getFileStatus latency by a factor of hundreds, at the cost of a brief staleness window that is acceptable for read‑heavy Hadoop jobs.
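A minimal sketch of such a TTL cache in plain Java, assuming the 3‑second window and string‑valued statuses for illustration (the real adapter caches FileStatus objects keyed by path):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StatusCache {
    private static final long TTL_MS = 3_000; // 3-second window from the article

    private static final class Entry {
        final String status;
        final long storedAt;
        Entry(String status, long storedAt) { this.status = status; this.storedAt = storedAt; }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    public void put(String key, String status) {
        cache.put(key, new Entry(status, System.currentTimeMillis()));
    }

    /** Returns the cached status, or null if absent or older than the TTL. */
    public String get(String key) {
        Entry e = cache.get(key);
        if (e == null) return null;
        if (System.currentTimeMillis() - e.storedAt > TTL_MS) {
            cache.remove(key); // expired: drop it so callers re-query US3
            return null;
        }
        return e.status;
    }

    /** delete/rename must invalidate the stale entry, as described above. */
    public void invalidate(String key) { cache.remove(key); }
}
```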
ListObjects Consistency Issue
US3’s ListObjects currently provides eventual consistency, similar to other object‑storage services. To achieve stronger guarantees, the adapter performs an internal “reconciliation” step after create/rename/delete operations, repeatedly invoking ListObjects until the expected state is observed, effectively delivering read‑your‑writes consistency for most Hadoop use cases.
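The reconciliation step can be sketched as a bounded retry against a listing function. US3's actual ListObjects API is stood in for by a `Supplier`, and the retry budget and backoff values are illustrative assumptions:

```java
import java.util.Set;
import java.util.function.Supplier;

public class ListReconciler {
    /**
     * Re-invokes a (stand-in) ListObjects call until the expected key is
     * visible or the retry budget is exhausted. Returns true once visible.
     */
    public static boolean awaitVisible(Supplier<Set<String>> listObjects,
                                       String expectedKey,
                                       int maxAttempts,
                                       long backoffMs) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (listObjects.get().contains(expectedKey)) return true;
            try {
                Thread.sleep(backoffMs); // wait for the listing to catch up
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false; // caller decides whether to fail or proceed
    }
}
```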
Deep Customization of Rename
Unlike generic object‑storage solutions that implement rename via a copy‑then‑delete sequence, US3 offers a dedicated rename API. The US3Hadoop adapter leverages this API, keeping rename latency in the millisecond range even for large files.
Ensuring Efficient Reads
Read operations dominate big‑data workloads. The adapter adds a prefetch buffer to the underlying HTTP stream, reducing system‑call frequency and latency for sequential reads. For random reads, the adapter distinguishes two scenarios:
A backward seek (target before the current read position) uses a delayed stream reopen: if the target offset still lies within the prefetch buffer, only the buffer’s consumption pointer is moved back; otherwise the stream is reopened at the target on the next read.
A forward seek follows the same delayed‑open logic, but if the distance ahead is less than the remaining buffered bytes plus 16 KB, the adapter simply advances the buffer’s consumption pointer (discarding the skipped bytes) instead of reopening the stream.
When the underlying stream is unexpectedly closed (e.g., TCP reset), the adapter transparently reopens the stream at the last read position to maintain availability.
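The two seek scenarios and the 16 KB slack can be condensed into one decision function. The parameter names and string results below are illustrative; in the real adapter the same delayed-reopen path is also taken when the stream is closed unexpectedly:

```java
public class SeekPolicy {
    static final long FORWARD_SLACK = 16 * 1024; // extra 16 KB window from the article

    /**
     * Decides how a seek is served, given the current read position, the
     * already-consumed bytes still held in the buffer (behind), and the
     * buffered-but-unread bytes (ahead).
     */
    public static String decide(long currentPos, long bufferedBehind,
                                long bufferedAhead, long target) {
        long delta = target - currentPos;
        if (delta < 0 && -delta <= bufferedBehind) {
            return "ADJUST_BUFFER";   // backward seek lands inside the buffer
        }
        if (delta >= 0 && delta < bufferedAhead + FORWARD_SLACK) {
            return "ADJUST_BUFFER";   // forward seek within buffer + 16 KB slack
        }
        return "DELAYED_REOPEN";      // reopen at the target on the next read
    }

    public static void main(String[] args) {
        System.out.println(decide(100, 50, 200, 120)); // short forward hop
        System.out.println(decide(100, 50, 200, 10));  // far backward jump
    }
}
```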
Conclusion
The US3Hadoop adapter, built on open‑source concepts, resolves key performance and reliability challenges when Hadoop accesses US3, acting as a vital bridge in many customer deployments. While further improvements are planned—especially around index latency, I/O delay, and atomicity—the current US3Vmds solution already narrows the performance gap with native HDFS, delivering substantial read/write efficiency gains for big‑data scenarios.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.