How Didi Built a Multi‑Protocol, Petabyte‑Scale Storage System for AI Training
Facing petabyte-scale data, billions of small files, and the need for POSIX, S3, and HDFS compatibility, Didi designed a new generation of unstructured storage, OrangeFS, by analyzing its internal systems, experimenting with combinations of existing storage, reusing GIFT technology, and building a high-performance metadata service, multi-protocol fusion, and robust scalability features.
Project Background
With the rapid rise of AI, machine‑learning training has become a key driver for industry digital transformation. Large models demand massive data, complex model structures, and high‑frequency reads/writes, exposing the limits of traditional unstructured storage.
Business Requirements
Support >100 PB of data with billions of small files per volume.
Provide low‑latency metadata (<10 ms write, <2 ms read) and high bandwidth (>100 GB/s).
Offer both object‑storage ease of use and POSIX file‑system semantics.
Enable multi‑tenant isolation, QoS, and cloud‑native deployment via CSI.
Allow seamless multi‑protocol access (POSIX, S3, HDFS) without data duplication.
Exploration of Existing Solutions
Didi evaluated its own GIFT object storage (2.0/3.0), Ceph, HDFS, and GlusterFS. All fell short: GIFT lacked POSIX/HDFS support, Ceph separated object and file layers, HDFS focused on large files, and GlusterFS missed multi‑tenant, S3, and HDFS compatibility.
Multi‑Storage Combination Attempts
Combining GIFT with GlusterFS or an S3FS-like layer partially bridged the protocol gap, but it introduced copy overhead between systems, limited scalability, and effectively required storing the same data twice.
Industry Solution – JuiceFS + RDS + GIFT
JuiceFS offers POSIX, S3, and HDFS compatibility and cloud‑native support. However, its reliance on RDS for metadata caused read/write latency spikes and consistency issues under high load.
OrangeFS – New Generation Storage Design
OrangeFS integrates lessons from GIFT, Ceph, HDFS, CubeFS, JuiceFS, and SeaweedFS, adding multi‑tenant support, a configuration center, and CSI‑based Kubernetes integration.
Reusing GIFT Technology
Service entry supporting S3 and GIFT V2 protocols.
RDS‑backed metadata service.
Config center for tenant, traffic, and bucket management.
BS storage engine for object data, with Facebook‑Haystack‑inspired block management.
Data Model – Chunk / Blob / Block
Files are split into fixed-size Chunks. Each Chunk may contain multiple variable-size Blobs, which are composed of fixed-length Blocks. Chunk/Blob metadata is stored in a custom MDS service, while Blocks reside in local storage engines or public-cloud S3/OSS/COS.
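To make the layering concrete, the sketch below models the Chunk/Blob/Block relationship in Go. The constants, field names, and location strings are illustrative assumptions, not OrangeFS's actual formats.

```go
package datamodel

// Illustrative sizes; the real chunk and block sizes are not specified here.
const (
	ChunkSize = 64 << 20 // fixed-size chunk, e.g. 64 MiB (assumed)
	BlockSize = 4 << 20  // fixed-length block, e.g. 4 MiB (assumed)
)

// Block is the fixed-length unit actually persisted in a local storage
// engine or in public-cloud object storage (S3/OSS/COS).
type Block struct {
	ID       uint64
	Location string // e.g. "bs://node-3/vol-7" or "s3://bucket/key" (hypothetical)
}

// Blob is a variable-size run of data inside a chunk, composed of
// fixed-length blocks.
type Blob struct {
	Offset int64 // offset of the blob within its chunk
	Length int64
	Blocks []Block
}

// Chunk is a fixed-size slice of a file; its metadata (including its blobs)
// lives in the MDS, while block payloads live in the data layer.
type Chunk struct {
	Index int64 // chunk index within the file: fileOffset / ChunkSize
	Blobs []Blob
}
```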
Write Path
By default, each Chunk holds a single Blob to keep the Blob count low.
A new Blob is created only when the write region overlaps an existing Blob.
Data that does not overlap existing Blocks is written append-only to the current Blob.
Large Blocks are flushed to remote storage automatically.
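The sketch below captures the decision logic of the write-path rules above. The helper names, the overlap test, and the omission of Block allocation and remote flushing are all simplifying assumptions.

```go
package writepath

// Minimal stand-ins for the Chunk/Blob structures from the data-model sketch.
type Blob struct{ Offset, Length int64 }
type Chunk struct{ Blobs []Blob }

// overlaps reports whether the byte range [off, off+length) intersects blob b.
func overlaps(b Blob, off, length int64) bool {
	return off < b.Offset+b.Length && b.Offset < off+length
}

// applyWrite follows the rules listed above: keep one Blob per Chunk by
// default, append when the new data does not overlap existing Blocks, and
// create a new Blob only when the write region overlaps an existing Blob.
func applyWrite(c *Chunk, off, length int64) {
	if len(c.Blobs) == 0 {
		// Default case: a single Blob per Chunk keeps the Blob count low.
		c.Blobs = []Blob{{Offset: off, Length: length}}
		return
	}
	for _, b := range c.Blobs {
		if overlaps(b, off, length) {
			// Overlapping update: record it as a new Blob so the existing
			// Blocks stay immutable.
			c.Blobs = append(c.Blobs, Blob{Offset: off, Length: length})
			return
		}
	}
	// Non-overlapping write: extend the newest Blob append-only.
	last := &c.Blobs[len(c.Blobs)-1]
	if end := off + length; end > last.Offset+last.Length {
		last.Length = end - last.Offset
	}
}
```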
Read Path
When reading a Chunk, all associated Blobs are merged into a single Blob, and its Blocks are returned to the caller.
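A minimal sketch of that merge step, assuming Blobs later in the slice are newer and therefore win where they overlap; Block lookup and data transfer are omitted. The builtin min/max calls require Go 1.21+.

```go
package readpath

// Blob is a minimal stand-in; later entries in a slice are assumed newer.
type Blob struct{ Offset, Length int64 }

// segment maps part of a read request to the Blob that serves it.
type segment struct {
	Off, Len int64
	BlobIdx  int
}

// mergeView assigns each part of the requested range [off, off+length) to the
// newest Blob covering it, which is the essence of merging a Chunk's Blobs
// into a single logical Blob before resolving Blocks.
func mergeView(blobs []Blob, off, length int64) []segment {
	type gap struct{ off, end int64 }
	gaps := []gap{{off, off + length}}
	var view []segment
	for i := len(blobs) - 1; i >= 0 && len(gaps) > 0; i-- { // newest first
		b := blobs[i]
		var rest []gap
		for _, g := range gaps {
			lo := max(g.off, b.Offset)          // Go 1.21+ builtin
			hi := min(g.end, b.Offset+b.Length) // Go 1.21+ builtin
			if lo >= hi {
				rest = append(rest, g) // no overlap with this Blob
				continue
			}
			view = append(view, segment{Off: lo, Len: hi - lo, BlobIdx: i})
			if g.off < lo {
				rest = append(rest, gap{g.off, lo}) // uncovered left part
			}
			if hi < g.end {
				rest = append(rest, gap{hi, g.end}) // uncovered right part
			}
		}
		gaps = rest
	}
	return view
}
```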
Multi‑Protocol Fusion
POSIX uses a relative‑path VFS layer, while S3/HDFS use an absolute‑path PathFS layer. Both resolve to the same inode, offset, and data, enabling seamless cross‑protocol access.
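The sketch below shows how the two resolution styles can end at the same inode. The MetaStore interface and its method names are hypothetical stand-ins for the MDS lookup API.

```go
package fusion

import "strings"

// MetaStore is a toy stand-in for the MDS lookup interface (method names assumed).
type MetaStore interface {
	// LookupChild resolves one path component relative to a parent inode
	// (the POSIX/VFS style of resolution).
	LookupChild(parent uint64, name string) (uint64, error)
	// LookupPath resolves a full absolute path in one call
	// (the S3/HDFS PathFS style of resolution).
	LookupPath(path string) (uint64, error)
}

// ResolvePOSIX walks component by component from the root inode.
func ResolvePOSIX(ms MetaStore, root uint64, relPath string) (uint64, error) {
	cur := root
	for _, name := range strings.Split(strings.Trim(relPath, "/"), "/") {
		next, err := ms.LookupChild(cur, name)
		if err != nil {
			return 0, err
		}
		cur = next
	}
	return cur, nil
}

// ResolveS3 resolves the same object via an absolute key; because both
// layers end at the same inode ID, data written through POSIX is readable
// through S3/HDFS without duplication.
func ResolveS3(ms MetaStore, key string) (uint64, error) {
	return ms.LookupPath("/" + strings.TrimPrefix(key, "/"))
}
```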
MDS Metadata Service
The MDS consists of Root Server (stateless config management), Meta Server (metadata & directory tree), and PD Server (Raft control). It supports Multi‑Raft groups, in‑memory transactions, and fast leader‑read consistency.
Performance Optimizations
Reduced RPCs: a single RPC per operation instead of multiple.
Queue + thread-pool model for Raft propose, moving heavy work out of the critical path (sketched after this list).
Replaced gRPC with raw TCP plus asynchronous batch sending, halving CPU usage.
Raft batch & pipeline to improve throughput and lower latency.
Concurrency Control
Implemented optimistic in‑memory transactions with conflict detection, serializing high‑conflict POSIX writes and S3 puts via hash‑based queues.
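A sketch of the hash-based serialization idea: operations on the same key land on the same single-goroutine queue and therefore run in order, while unrelated keys proceed in parallel. The queue count and key choice are assumptions.

```go
package conc

import "hash/fnv"

// KeyedExecutor serializes operations that touch the same key (for example a
// hot directory entry or a hot object) by hashing the key onto one of N
// single-goroutine queues.
type KeyedExecutor struct {
	queues []chan func()
}

func NewKeyedExecutor(n int) *KeyedExecutor {
	e := &KeyedExecutor{queues: make([]chan func(), n)}
	for i := range e.queues {
		q := make(chan func(), 1024)
		e.queues[i] = q
		go func() {
			for task := range q {
				task() // tasks on one queue run strictly in order
			}
		}()
	}
	return e
}

// Submit routes the task for key onto its queue; high-conflict keys are
// serialized instead of repeatedly aborting optimistic transactions.
func (e *KeyedExecutor) Submit(key string, task func()) {
	h := fnv.New32a()
	h.Write([]byte(key))
	e.queues[int(h.Sum32())%len(e.queues)] <- task
}
```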
Scalability
Added OP‑level follower reads and dynamic Learner expansion to distribute read load across Raft nodes, preventing leader bottlenecks at high QPS.
Stability Enhancements
Snapshot recovery switched to SSTable ingestion.
Log compaction paused during snapshot sync with timeout.
Follower-read throttling based on LogIndexGap (see the routing sketch after this list).
Busy/Slow queues isolate problematic volumes.
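The sketch below combines the OP-level follower reads from the Scalability section with the LogIndexGap throttle above: a read is routed to a follower or Learner only if its applied index is close enough to the leader's, otherwise it falls back to the leader. Field names and the gap threshold are assumed.

```go
package followerread

// Replica exposes the Raft progress needed to decide whether it may serve a
// follower read (field names illustrative).
type Replica struct {
	Addr         string
	AppliedIndex uint64
	IsLeader     bool
}

// maxLogIndexGap is the staleness budget: a replica lagging the leader by
// more than this many log entries is skipped (value assumed).
const maxLogIndexGap = 1000

// pickReadTarget spreads read load across followers and learners while
// throttling ones that have fallen too far behind.
func pickReadTarget(replicas []Replica, leaderApplied uint64) Replica {
	var leader Replica
	for _, r := range replicas {
		if r.IsLeader {
			leader = r
			continue
		}
		if r.AppliedIndex+maxLogIndexGap >= leaderApplied {
			return r // fresh enough: serve the read here
		}
	}
	return leader // everyone is lagging: read from the leader
}
```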
High‑Efficiency Metadata Design
Optimized rename by using relative‑path keys and directory‑tree caching.
Readdir combines Dentry and Attr in a single table with a Link table for rare hard‑link cases, enabling one‑scan reads.
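A sketch of a key layout supporting both optimizations, assuming a sorted key-value store underneath the MDS: the key embeds only the parent inode (a relative-path style key), so renaming a directory does not rewrite descendant keys, and a single prefix scan returns names together with attributes. The key format and row fields are illustrative assumptions.

```go
package meta

import "fmt"

// dentryKey encodes a directory entry as <parentInode>/<name>, so all entries
// of one directory are contiguous in the key space and a rename of the parent
// does not touch the children's keys.
func dentryKey(parentIno uint64, name string) string {
	return fmt.Sprintf("d/%020d/%s", parentIno, name)
}

// dirPrefix is the scan prefix covering every entry under parentIno;
// one range scan over it implements readdir.
func dirPrefix(parentIno uint64) string {
	return fmt.Sprintf("d/%020d/", parentIno)
}

// DentryRow is the combined Dentry + Attr record returned by that scan;
// rare hard links would be resolved through a separate Link table instead.
type DentryRow struct {
	ChildIno uint64
	Mode     uint32
	Size     int64
	Mtime    int64
}
```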
POSIX Client (OFS) Features
FUSE-based client with full POSIX compatibility and seconds-level hot upgrades.
Cache‑driven high throughput (tens of GB/s) for AI training.
Read‑write decoupling via lightweight memory snapshots.
Hot-upgrade workflow: exec the new process, hand off the FUSE fd, flush caches, and switch over without service interruption.
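A sketch of the fd-handoff step of that workflow on Linux, using SCM_RIGHTS over a Unix socket so the new process inherits the live /dev/fuse connection and the mount point stays usable. The framing and socket handling are assumptions; only the fd-passing calls are standard library.

```go
package hotupgrade

import (
	"net"
	"os"
	"syscall"
)

// sendFuseFd passes the mounted /dev/fuse descriptor to the freshly exec'd
// client process over a Unix domain socket.
func sendFuseFd(sock *net.UnixConn, fuseFile *os.File) error {
	rights := syscall.UnixRights(int(fuseFile.Fd()))
	_, _, err := sock.WriteMsgUnix([]byte("fuse-fd"), rights, nil)
	return err
}

// recvFuseFd is the receiving side in the new process: it reads the control
// message and reconstructs an *os.File for the same kernel FUSE connection.
func recvFuseFd(sock *net.UnixConn) (*os.File, error) {
	buf := make([]byte, 16)
	oob := make([]byte, syscall.CmsgSpace(4))
	_, oobn, _, _, err := sock.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return nil, err
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return nil, err
	}
	return os.NewFile(uintptr(fds[0]), "/dev/fuse"), nil
}
```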
Multi‑Team Collaboration
Sub-directory TTL for automatic data expiration using timestamp indexes (see the sweep sketch after this list).
Bucket/Volume and sub‑directory read‑only modes.
Fine‑grained permission control (AK/SK, POSIX, mknod, read/write).
Recycle‑bin (Trash) storing deleted and expired files with metadata‑only operations; supports 7‑day retention and bulk recovery.
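A minimal sketch of how a TTL sweep over a timestamp index could feed the trash with metadata-only moves; the entry structure and the moveToTrash callback are hypothetical.

```go
package ttl

import "time"

// expiredEntry is one record found by scanning the timestamp index
// (structure assumed for illustration).
type expiredEntry struct {
	Path  string
	Mtime time.Time
}

// sweepExpired moves entries older than the directory's TTL into the trash;
// because the move is metadata-only, no data is copied, and files remain
// recoverable until the retention window expires.
func sweepExpired(entries []expiredEntry, ttl time.Duration, moveToTrash func(path string) error) error {
	cutoff := time.Now().Add(-ttl)
	for _, e := range entries {
		if e.Mtime.Before(cutoff) {
			if err := moveToTrash(e.Path); err != nil {
				return err
			}
		}
	}
	return nil
}
```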
Overall, OrangeFS delivers petabyte‑scale, multi‑protocol, cloud‑native storage that powers Didi’s AI training, big‑data analytics, finance, and international services with high performance, strong isolation, and operational robustness.