OrangeFS: A Cloud‑Native Multi‑Protocol Distributed Data Lake Storage System
OrangeFS is Didi’s cloud‑native, multi‑protocol distributed data‑lake storage system. It unifies POSIX, S3, and HDFS access on a single logical hierarchy, integrates with Kubernetes via a CSI plugin, supports both on‑premise and public‑cloud backends, and provides multi‑tenant isolation. For petabyte‑scale workloads such as ride‑hailing logs, machine‑learning training, finance, and analytics, it dramatically improves elasticity, utilization, and latency.
In 2015, Didi launched the GIFT small‑object storage project to handle small files and images. As the business grew, the storage architecture evolved multiple times, leading to new challenges illustrated in the accompanying diagram.
Two major trends—cloud‑native technology strategy and emerging new services—create additional pressures on the storage system.
For cloud‑native workloads, extreme elasticity requires lightweight containers and compute‑storage separation. Existing elastic‑cloud containers rely on local disks for logs and data, causing strong host coupling, low disk utilization (~30%), and risk of data loss after container migration. Logs also need to be duplicated to HDFS, increasing cost.
New services such as autonomous driving, machine learning, internationalization, and finance generate massive edge‑to‑cloud data. Machine‑learning training currently uses S3 protocol to upload data to Didi’s GIFT object store and mounts it via S3FS, which suffers from long latency, high overhead, and limited support for append/rename operations.
Solution Idea
To meet both cloud‑native and new‑service demands, the elastic‑cloud K8s mounts network disks via the POSIX protocol for logs/data and concurrently queries logs via S3 or HDFS, eliminating data loss from drift and reducing long collection pipelines while improving disk utilization. Machine‑learning training can also use S3 for upload and POSIX for mounting, shortening the workflow and lowering latency.
Didi therefore built a multi‑protocol fused cloud‑native distributed storage system, internally named the OrangeFS Cloud‑Native Data Lake Storage System. Its core technologies include:
Multi‑Protocol Fusion: A unified file organization structure enables POSIX, S3, and HDFS protocols to operate on the same logical hierarchy.
Cloud‑Native: Integration with Kubernetes via a CSI plugin for seamless volume provisioning.
Multi‑Cloud Storage Engine: Supports on‑premise DFS as well as public‑cloud backends (AWS S3, Alibaba OSS, Tencent COS, Google Cloud, etc.) to keep the architecture consistent across cloud and on‑premise.
Multi‑Tenant: Fine‑grained tenant isolation reduces deployment cost.
Core Technology – File Organization Structure
Two common multi‑protocol models are examined: (1) building a file system on top of object storage (e.g., S3FS) and (2) using a unified file organization that supports POSIX, S3, and HDFS (e.g., JuiceFS, CubeFS). OrangeFS adopts the latter.
In S3FS, atomic rename and random writes are not supported. OrangeFS’s unified structure stores files as Chunk → Blob → Block:
Each Chunk is a fixed‑size logical segment.
A Blob (Binary Large Object) groups one or more writes within a Chunk.
A Block is a fixed‑size data block stored in the underlying DFS or public‑cloud storage.
Metadata (Chunk and Blob) resides in a custom MDS service, while Blocks are stored in the DFS service or public‑cloud object stores.
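The Chunk/Block sizes below are illustrative assumptions (the article does not state the sizes OrangeFS actually uses), but they show how a file extent maps through the three layers:

```go
package main

import "fmt"

// Assumed sizes, for illustration only.
const (
	chunkSize int64 = 64 << 20 // fixed-size logical Chunk
	blockSize int64 = 4 << 20  // fixed-size Block in DFS/object storage
)

// chunkRange returns the inclusive range of Chunk indices covered by a
// file extent at (offset, length).
func chunkRange(offset, length int64) (first, last int64) {
	return offset / chunkSize, (offset + length - 1) / chunkSize
}

// blockRange maps a Blob's extent within a Chunk onto the fixed-size
// Blocks that back it in the storage engine.
func blockRange(blobOff, blobLen int64) (first, last int64) {
	return blobOff / blockSize, (blobOff + blobLen - 1) / blockSize
}

func main() {
	f, l := chunkRange(100<<20, 30<<20) // 30 MiB write at offset 100 MiB
	fmt.Println(f, l)                   // spans Chunks 1 and 2
}
```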
Core Technology – Multi‑Protocol Fusion
OrangeFS implements a VFS layer for POSIX and a PathFS layer for S3/HDFS. Both layers invoke unified read/write interfaces.
Fusion Write Process
Locate the file’s inode, derive the relevant Chunk set from offset and length.
Reuse an existing Blob if possible; otherwise create a new Blob.
Copy the Blob’s data to the corresponding Block in the storage backend.
Update inode length and asynchronously acknowledge the write.
Upload the Block to the data‑storage service.
Commit Blob metadata and inode version.
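The write steps above can be sketched in memory. The type and field names here are hypothetical, not OrangeFS’s real MDS schema, and Block upload is stubbed out:

```go
package main

import "fmt"

const chunkSize int64 = 64 << 20 // assumed Chunk size, for illustration

// Minimal stand-ins for the MDS-side metadata.
type blob struct {
	chunk int64  // which Chunk this Blob belongs to
	off   int64  // offset within that Chunk
	data  []byte // payload (a real client uploads this as Blocks)
}

type inode struct {
	length  int64
	version int64
	blobs   []blob // committed in write order
}

// write sketches the fusion write path for a write that fits in one Chunk
// (a multi-Chunk write repeats the same steps per Chunk).
func (ino *inode) write(off int64, p []byte) {
	// 1. Derive the Chunk from offset and length.
	c := off / chunkSize
	// 2. Create a new Blob (reusing an existing Blob is elided here).
	b := blob{chunk: c, off: off % chunkSize, data: append([]byte(nil), p...)}
	// 3-5. Upload the Blob's data as Blocks (stubbed) and extend the inode.
	if end := off + int64(len(p)); end > ino.length {
		ino.length = end
	}
	// 6. Commit the Blob metadata and bump the inode version.
	ino.blobs = append(ino.blobs, b)
	ino.version++
}

func main() {
	var ino inode
	ino.write(0, []byte("hello"))
	fmt.Println(ino.length, ino.version) // 5 1
}
```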
Fusion Read Process
Locate the inode and derive the Chunk set.
Fetch the Blob list for each Chunk (cache‑first, fallback to RDS).
Merge overlapping Blobs, constructing a coherent Blob view.
Translate Blob+offset+length into a list of Blocks.
Read Blocks from the storage service (cache‑first, then multi‑cloud backend).
Assemble Block data and return a successful read to FUSE.
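The key step in the read path is merging overlapping Blobs into a coherent view. A minimal sketch, under the same hypothetical in-memory metadata as above (caching and Block fetching elided):

```go
package main

import "fmt"

const chunkSize int64 = 64 << 20 // assumed Chunk size, for illustration

// Hypothetical in-memory metadata, mirroring the write sketch.
type blob struct {
	chunk int64
	off   int64
	data  []byte
}

type inode struct {
	blobs []blob // in commit order
}

// read overlays Blobs in commit order so a later Blob shadows earlier
// ones where they overlap (the "coherent Blob view"), then returns the
// requested range.
func (ino *inode) read(off, n int64) []byte {
	out := make([]byte, n)
	for _, b := range ino.blobs { // later entries win on overlap
		base := b.chunk*chunkSize + b.off
		for i := int64(0); i < int64(len(b.data)); i++ {
			if p := base + i - off; p >= 0 && p < n {
				out[p] = b.data[i]
			}
		}
	}
	return out
}

func main() {
	ino := inode{blobs: []blob{
		{0, 0, []byte("xxxxx")},
		{0, 1, []byte("abc")}, // a later write overlapping bytes 1-3
	}}
	fmt.Printf("%s\n", ino.read(0, 5)) // → xabcx
}
```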
Cloud‑Native Technology
OrangeFS provides a FUSE‑based POSIX client (OrangeFS‑Posix) that runs in user space, avoiding kernel‑level security risks and simplifying debugging. It supports two types of network disks: data disks and log disks, the latter offering weak‑sync, black‑hole, and auto‑timeout features for high availability.
OFS‑CSI Plugin
The CSI node‑driver is split into a super‑agent managed component and a provisioner/kubelet component that communicate via Unix Domain Sockets, enabling seamless driver upgrades without disrupting existing mounts.
OFS‑Posix Client
Implemented in user space using FUSE, it handles file operations, interacts with the MDS cluster for metadata, and communicates with on‑premise DFS or public‑cloud OSS/S3/COS for data blocks. It offers QoS controls (bandwidth, QPS, black‑hole) and caches both metadata and data.
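The article only names the client’s QoS knobs (bandwidth, QPS, black‑hole) without describing their implementation, so the following token‑bucket QPS limiter is purely an illustrative sketch:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// qpsLimiter is a minimal token-bucket sketch of a per-volume QPS
// control; names and structure are illustrative, not OrangeFS's code.
type qpsLimiter struct {
	mu     sync.Mutex
	tokens float64
	burst  float64
	rate   float64 // tokens added per second
	last   time.Time
}

func newQPSLimiter(qps float64) *qpsLimiter {
	return &qpsLimiter{tokens: qps, burst: qps, rate: qps, last: time.Now()}
}

// allow reports whether one more request may proceed right now.
func (l *qpsLimiter) allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	l.tokens += now.Sub(l.last).Seconds() * l.rate // refill since last call
	l.last = now
	if l.tokens > l.burst {
		l.tokens = l.burst
	}
	if l.tokens < 1 {
		return false
	}
	l.tokens--
	return true
}

func main() {
	lim := newQPSLimiter(2) // 2 requests/second, burst of 2
	fmt.Println(lim.allow(), lim.allow(), lim.allow()) // third call is throttled
}
```

A bandwidth limiter follows the same shape with bytes instead of requests as the token unit.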
Timeout Mechanism
Heavy operations (Read, Write, Flush, FSync, FAllocate) are wrapped in goroutines. If a timeout occurs, the client either returns success and discards the pending write (black‑hole mode) or reports an EINTR/EBUSY error.
Black‑Hole Mechanism
When enabled, writes that exceed the timeout are considered successful but the data is dropped, leaving a “hole” in the file that indicates the missing range.
Weak Sync
In single‑client scenarios, Flush and FSync are no‑ops; periodic background flushes ensure eventual consistency.
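A minimal sketch of that mode, with illustrative names and an integer counter standing in for the actual write-back to DFS:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// weakSyncFile sketches weak-sync mode: Fsync becomes a no-op and a
// background ticker writes dirty data back. Safe only while a single
// client owns the file.
type weakSyncFile struct {
	mu      sync.Mutex
	dirty   []byte
	flushed int // bytes written back so far (stand-in for DFS upload)
}

func (f *weakSyncFile) Write(p []byte) {
	f.mu.Lock()
	f.dirty = append(f.dirty, p...)
	f.mu.Unlock()
}

// Fsync is a no-op under weak sync; durability comes from flushOnce.
func (f *weakSyncFile) Fsync() error { return nil }

// flushOnce writes back everything dirty; the ticker goroutine calls it.
func (f *weakSyncFile) flushOnce() {
	f.mu.Lock()
	f.flushed += len(f.dirty)
	f.dirty = f.dirty[:0]
	f.mu.Unlock()
}

func (f *weakSyncFile) startFlusher(interval time.Duration, stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				f.flushOnce()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	var f weakSyncFile
	f.Write([]byte("log line"))
	_ = f.Fsync() // returns immediately; nothing is forced out
	f.flushOnce() // the periodic flush eventually persists the data
	fmt.Println(f.flushed)
}
```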
Simple Applications
Typical usage scenarios include mounting the volume in a container or Linux host, then accessing data via POSIX, S3, or HDFS protocols.
/home/ofs/bin/orangefs posix mount -debug=true -log-dir=/home/ofs/log -rs-addr=10.0.0.1:8030 -rs-model=mds -volume-name=ofs2 -mount-point=/home/ofs/mount

Examples of operations:
List files on the mounted volume.
Download data using the S3 protocol.
Upload data using the S3 protocol.
Query files and data via POSIX.
Access data via the HDFS protocol.
All operations are demonstrated with screenshots in the original article.
Conclusion
The article explains why Didi developed a data‑lake storage solution, outlines the architectural approach, core design points, and simple usage examples. OrangeFS now serves real‑time business workloads—including ride‑hailing logs, machine learning, finance, performance monitoring, service discovery, and big‑data analytics—with a total capacity of hundreds of petabytes.
Future articles will dive deeper into metadata services, multi‑cloud storage, and hot‑upgrade techniques.
Didi Tech
Official Didi technology account