OrangeFS: A Cloud‑Native Multi‑Protocol Distributed Data Lake Storage System
OrangeFS is Didi’s cloud‑native, multi‑protocol distributed data‑lake storage system. It unifies POSIX, S3, and HDFS access on a single logical hierarchy, integrates with Kubernetes via a CSI plugin, supports both on‑premise and public‑cloud backends, and provides multi‑tenant isolation. For petabyte‑scale workloads such as ride‑hailing logs, machine‑learning training, finance, and analytics, it dramatically improves elasticity, utilization, and latency.
In 2015, Didi launched the GIFT small‑object storage project to handle small files and images. As the business grew, the storage architecture evolved multiple times, leading to new challenges illustrated in the accompanying diagram.
Two major trends—cloud‑native technology strategy and emerging new services—create additional pressures on the storage system.
For cloud‑native workloads, extreme elasticity requires lightweight containers and compute‑storage separation. Existing elastic‑cloud containers rely on local disks for logs and data, causing strong host coupling, low disk utilization (~30%), and risk of data loss after container migration. Logs also need to be duplicated to HDFS, increasing cost.
New services such as autonomous driving, machine learning, internationalization, and finance generate massive edge‑to‑cloud data. Machine‑learning training currently uses S3 protocol to upload data to Didi’s GIFT object store and mounts it via S3FS, which suffers from long latency, high overhead, and limited support for append/rename operations.
Solution Idea
To meet both cloud‑native and new‑service demands, the elastic‑cloud K8s mounts network disks via the POSIX protocol for logs/data and concurrently queries logs via S3 or HDFS, eliminating data loss from drift and reducing long collection pipelines while improving disk utilization. Machine‑learning training can also use S3 for upload and POSIX for mounting, shortening the workflow and lowering latency.
Didi therefore built a multi‑protocol fused cloud‑native distributed storage system, internally named the OrangeFS Cloud‑Native Data Lake Storage System. Its core technologies include:
Multi‑Protocol Fusion: A unified file organization structure enables POSIX, S3, and HDFS protocols to operate on the same logical hierarchy.
Cloud‑Native: Integration with Kubernetes via a CSI plugin for seamless volume provisioning.
Multi‑Cloud Storage Engine: Supports on‑premise DFS as well as public‑cloud backends (AWS S3, Alibaba OSS, Tencent COS, Google Cloud, etc.) to keep the architecture consistent across cloud and on‑premise.
Multi‑Tenant: Fine‑grained tenant isolation reduces deployment cost.
Core Technology – File Organization Structure
Two common multi‑protocol models are examined: (1) building a file system on top of object storage (e.g., S3FS) and (2) using a unified file organization that supports POSIX, S3, and HDFS (e.g., JuiceFS, CubeFS). OrangeFS adopts the latter.
In S3FS, atomic rename and random writes are not supported. OrangeFS’s unified structure stores files as Chunk → Blob → Block:
Each Chunk is a fixed‑size logical segment.
A Blob (Binary Large Object) groups one or more writes within a Chunk.
A Block is a fixed‑size data block stored in the underlying DFS or public‑cloud storage.
Metadata (Chunk and Blob) resides in a custom MDS service, while Blocks are stored in the DFS service or public‑cloud object stores.
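The Chunk/Block sizes below are illustrative assumptions (the article does not state the sizes OrangeFS actually uses), but they show how a file extent maps through the three layers:

```go
package main

import "fmt"

// Assumed sizes, for illustration only.
const (
	chunkSize int64 = 64 << 20 // fixed-size logical Chunk
	blockSize int64 = 4 << 20  // fixed-size Block in DFS/object storage
)

// chunkRange returns the inclusive range of Chunk indices covered by a
// file extent at (offset, length).
func chunkRange(offset, length int64) (first, last int64) {
	return offset / chunkSize, (offset + length - 1) / chunkSize
}

// blockRange maps a Blob's extent within a Chunk onto the fixed-size
// Blocks that back it in the storage engine.
func blockRange(blobOff, blobLen int64) (first, last int64) {
	return blobOff / blockSize, (blobOff + blobLen - 1) / blockSize
}

func main() {
	f, l := chunkRange(100<<20, 30<<20) // 30 MiB write at offset 100 MiB
	fmt.Println(f, l)                   // spans Chunks 1 and 2
}
```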
Core Technology – Multi‑Protocol Fusion
OrangeFS implements a VFS layer for POSIX and a PathFS layer for S3/HDFS. Both layers invoke unified read/write interfaces.
Fusion Write Process
Locate the file’s inode, derive the relevant Chunk set from offset and length.
Reuse an existing Blob if possible; otherwise create a new Blob.
Copy the Blob’s data to the corresponding Block in the storage backend.
Update inode length and asynchronously acknowledge the write.
Upload the Block to the data‑storage service.
Commit Blob metadata and inode version.
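The write steps above can be sketched in memory. The type and field names here are hypothetical, not OrangeFS’s real MDS schema, and Block upload is stubbed out:

```go
package main

import "fmt"

const chunkSize int64 = 64 << 20 // assumed Chunk size, for illustration

// Minimal stand-ins for the MDS-side metadata.
type blob struct {
	chunk int64  // which Chunk this Blob belongs to
	off   int64  // offset within that Chunk
	data  []byte // payload (a real client uploads this as Blocks)
}

type inode struct {
	length  int64
	version int64
	blobs   []blob // committed in write order
}

// write sketches the fusion write path for a write that fits in one Chunk
// (a multi-Chunk write repeats the same steps per Chunk).
func (ino *inode) write(off int64, p []byte) {
	// 1. Derive the Chunk from offset and length.
	c := off / chunkSize
	// 2. Create a new Blob (reusing an existing Blob is elided here).
	b := blob{chunk: c, off: off % chunkSize, data: append([]byte(nil), p...)}
	// 3-5. Upload the Blob's data as Blocks (stubbed) and extend the inode.
	if end := off + int64(len(p)); end > ino.length {
		ino.length = end
	}
	// 6. Commit the Blob metadata and bump the inode version.
	ino.blobs = append(ino.blobs, b)
	ino.version++
}

func main() {
	var ino inode
	ino.write(0, []byte("hello"))
	fmt.Println(ino.length, ino.version) // 5 1
}
```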
Fusion Read Process
Locate the inode and derive the Chunk set.
Fetch the Blob list for each Chunk (cache‑first, fallback to RDS).
Merge overlapping Blobs, constructing a coherent Blob view.
Translate Blob+offset+length into a list of Blocks.
Read Blocks from the storage service (cache‑first, then multi‑cloud backend).
Assemble Block data and return a successful read to FUSE.
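The key step in the read path is merging overlapping Blobs into a coherent view. A minimal sketch, under the same hypothetical in-memory metadata as above (caching and Block fetching elided):

```go
package main

import "fmt"

const chunkSize int64 = 64 << 20 // assumed Chunk size, for illustration

// Hypothetical in-memory metadata, mirroring the write sketch.
type blob struct {
	chunk int64
	off   int64
	data  []byte
}

type inode struct {
	blobs []blob // in commit order
}

// read overlays Blobs in commit order so a later Blob shadows earlier
// ones where they overlap (the "coherent Blob view"), then returns the
// requested range.
func (ino *inode) read(off, n int64) []byte {
	out := make([]byte, n)
	for _, b := range ino.blobs { // later entries win on overlap
		base := b.chunk*chunkSize + b.off
		for i := int64(0); i < int64(len(b.data)); i++ {
			if p := base + i - off; p >= 0 && p < n {
				out[p] = b.data[i]
			}
		}
	}
	return out
}

func main() {
	ino := inode{blobs: []blob{
		{0, 0, []byte("xxxxx")},
		{0, 1, []byte("abc")}, // a later write overlapping bytes 1-3
	}}
	fmt.Printf("%s\n", ino.read(0, 5)) // → xabcx
}
```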
Cloud‑Native Technology
OrangeFS provides a FUSE‑based POSIX client (OrangeFS‑Posix) that runs in user space, avoiding kernel‑level security risks and simplifying debugging. It supports two types of network disks: data disks and log disks, the latter offering weak‑sync, black‑hole, and auto‑timeout features for high availability.
OFS‑CSI Plugin
The CSI node‑driver is split into a super‑agent managed component and a provisioner/kubelet component that communicate via Unix Domain Sockets, enabling seamless driver upgrades without disrupting existing mounts.
OFS‑Posix Client
Implemented in user space using FUSE, it handles file operations, interacts with the MDS cluster for metadata, and communicates with on‑premise DFS or public‑cloud OSS/S3/COS for data blocks. It offers QoS controls (bandwidth, QPS, black‑hole) and caches both metadata and data.
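The article only names the client’s QoS knobs (bandwidth, QPS, black‑hole) without describing their implementation, so the following token‑bucket QPS limiter is purely an illustrative sketch:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// qpsLimiter is a minimal token-bucket sketch of a per-volume QPS
// control; names and structure are illustrative, not OrangeFS's code.
type qpsLimiter struct {
	mu     sync.Mutex
	tokens float64
	burst  float64
	rate   float64 // tokens added per second
	last   time.Time
}

func newQPSLimiter(qps float64) *qpsLimiter {
	return &qpsLimiter{tokens: qps, burst: qps, rate: qps, last: time.Now()}
}

// allow reports whether one more request may proceed right now.
func (l *qpsLimiter) allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	l.tokens += now.Sub(l.last).Seconds() * l.rate // refill since last call
	l.last = now
	if l.tokens > l.burst {
		l.tokens = l.burst
	}
	if l.tokens < 1 {
		return false
	}
	l.tokens--
	return true
}

func main() {
	lim := newQPSLimiter(2) // 2 requests/second, burst of 2
	fmt.Println(lim.allow(), lim.allow(), lim.allow()) // third call is throttled
}
```

A bandwidth limiter follows the same shape with bytes instead of requests as the token unit.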
Timeout Mechanism
Heavy operations (Read, Write, Flush, FSync, FAllocate) are wrapped in goroutines. If a timeout occurs, the client either returns success and discards the pending write (black‑hole mode) or reports an EINTR/EBUSY error.
Black‑Hole Mechanism
When enabled, writes that exceed the timeout are considered successful but the data is dropped, leaving a “hole” in the file that indicates the missing range.
Weak Sync
In single‑client scenarios, Flush and FSync are no‑ops; periodic background flushes ensure eventual consistency.
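A minimal sketch of that mode, with illustrative names and an integer counter standing in for the actual write-back to DFS:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// weakSyncFile sketches weak-sync mode: Fsync becomes a no-op and a
// background ticker writes dirty data back. Safe only while a single
// client owns the file.
type weakSyncFile struct {
	mu      sync.Mutex
	dirty   []byte
	flushed int // bytes written back so far (stand-in for DFS upload)
}

func (f *weakSyncFile) Write(p []byte) {
	f.mu.Lock()
	f.dirty = append(f.dirty, p...)
	f.mu.Unlock()
}

// Fsync is a no-op under weak sync; durability comes from flushOnce.
func (f *weakSyncFile) Fsync() error { return nil }

// flushOnce writes back everything dirty; the ticker goroutine calls it.
func (f *weakSyncFile) flushOnce() {
	f.mu.Lock()
	f.flushed += len(f.dirty)
	f.dirty = f.dirty[:0]
	f.mu.Unlock()
}

func (f *weakSyncFile) startFlusher(interval time.Duration, stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				f.flushOnce()
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	var f weakSyncFile
	f.Write([]byte("log line"))
	_ = f.Fsync() // returns immediately; nothing is forced out
	f.flushOnce() // the periodic flush eventually persists the data
	fmt.Println(f.flushed)
}
```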
Simple Applications
Typical usage scenarios include mounting the volume in a container or Linux host, then accessing data via POSIX, S3, or HDFS protocols.
/home/ofs/bin/orangefs posix mount -debug=true -log-dir=/home/ofs/log -rs-addr=10.0.0.1:8030 -rs-model=mds -volume-name=ofs2 -mount-point=/home/ofs/mount

Examples of operations:
List files on the mounted volume.
Download data using the S3 protocol.
Upload data using the S3 protocol.
Query files and data via POSIX.
Access data via the HDFS protocol.
All operations are demonstrated with screenshots in the original article.
Conclusion
The article explains why Didi developed a data‑lake storage solution, outlines the architectural approach, core design points, and simple usage examples. OrangeFS now serves real‑time business workloads—including ride‑hailing logs, machine learning, finance, performance monitoring, service discovery, and big‑data analytics—with a total capacity of hundreds of petabytes.
Future articles will dive deeper into metadata services, multi‑cloud storage, and hot‑upgrade techniques.
Didi Tech
Official Didi technology account