How Didi Built a Multi‑Protocol, Petabyte‑Scale Storage System for AI Training
Facing petabyte-scale data, billions of small files, and the need for POSIX, S3, and HDFS compatibility, Didi designed a new generation of unstructured storage, OrangeFS, by analyzing its internal systems, experimenting with combinations of existing storage, reusing GIFT technology, and building a high-performance metadata service, multi-protocol fusion, and robust scalability features.
Project Background
With the rapid rise of AI, machine‑learning training has become a key driver for industry digital transformation. Large models demand massive data, complex model structures, and high‑frequency reads/writes, exposing the limits of traditional unstructured storage.
Business Requirements
Support >100 PB of data with billions of small files per volume.
Provide low‑latency metadata (<10 ms write, <2 ms read) and high bandwidth (>100 GB/s).
Offer both object‑storage ease of use and POSIX file‑system semantics.
Enable multi‑tenant isolation, QoS, and cloud‑native deployment via CSI.
Allow seamless multi‑protocol access (POSIX, S3, HDFS) without data duplication.
Exploration of Existing Solutions
Didi evaluated its own GIFT object storage (2.0/3.0), Ceph, HDFS, and GlusterFS. All fell short: GIFT lacked POSIX/HDFS support, Ceph separated object and file layers, HDFS focused on large files, and GlusterFS missed multi‑tenant, S3, and HDFS compatibility.
Multi‑Storage Combination Attempts
Combining GIFT with GlusterFS or an S3FS-like layer partially bridged the protocol gap, but it introduced copy overhead between systems, limited scalability, and effectively required storing the same data twice.
Industry Solution – JuiceFS + RDS + GIFT
JuiceFS offers POSIX, S3, and HDFS compatibility and cloud‑native support. However, its reliance on RDS for metadata caused read/write latency spikes and consistency issues under high load.
OrangeFS – New Generation Storage Design
OrangeFS integrates lessons from GIFT, Ceph, HDFS, CubeFS, JuiceFS, and SeaweedFS, adding multi‑tenant support, a configuration center, and CSI‑based Kubernetes integration.
Reusing GIFT Technology
Service entry supporting S3 and GIFT V2 protocols.
RDS‑backed metadata service.
Config center for tenant, traffic, and bucket management.
BS storage engine for object data, with Facebook‑Haystack‑inspired block management.
Data Model – Chunk / Blob / Block
Files are split into fixed-size Chunks. Each Chunk may contain multiple variable-size Blobs, which are composed of fixed-length Blocks. Chunk/Blob metadata is stored in a custom MDS service, while Blocks reside in local storage engines or public-cloud S3/OSS/COS.
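To make the layering concrete, the sketch below models the Chunk/Blob/Block relationship in Go. The constants, field names, and location strings are illustrative assumptions, not OrangeFS's actual formats.

```go
package datamodel

// Illustrative sizes; the real chunk and block sizes are not specified here.
const (
	ChunkSize = 64 << 20 // fixed-size chunk, e.g. 64 MiB (assumed)
	BlockSize = 4 << 20  // fixed-length block, e.g. 4 MiB (assumed)
)

// Block is the fixed-length unit actually persisted in a local storage
// engine or in public-cloud object storage (S3/OSS/COS).
type Block struct {
	ID       uint64
	Location string // e.g. "bs://node-3/vol-7" or "s3://bucket/key" (hypothetical)
}

// Blob is a variable-size run of data inside a chunk, composed of
// fixed-length blocks.
type Blob struct {
	Offset int64 // offset of the blob within its chunk
	Length int64
	Blocks []Block
}

// Chunk is a fixed-size slice of a file; its metadata (including its blobs)
// lives in the MDS, while block payloads live in the data layer.
type Chunk struct {
	Index int64 // chunk index within the file: fileOffset / ChunkSize
	Blobs []Blob
}
```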
Write Path
By default, each Chunk holds a single Blob to keep the Blob count low.
A new Blob is created only when the write region overlaps an existing Blob.
Data that does not overlap existing Blocks is written append-only to the current Blob.
Large Blocks are flushed to remote storage automatically.
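The sketch below captures the decision logic of the write-path rules above. The helper names, the overlap test, and the omission of Block allocation and remote flushing are all simplifying assumptions.

```go
package writepath

// Minimal stand-ins for the Chunk/Blob structures from the data-model sketch.
type Blob struct{ Offset, Length int64 }
type Chunk struct{ Blobs []Blob }

// overlaps reports whether the byte range [off, off+length) intersects blob b.
func overlaps(b Blob, off, length int64) bool {
	return off < b.Offset+b.Length && b.Offset < off+length
}

// applyWrite follows the rules listed above: keep one Blob per Chunk by
// default, append when the new data does not overlap existing Blocks, and
// create a new Blob only when the write region overlaps an existing Blob.
func applyWrite(c *Chunk, off, length int64) {
	if len(c.Blobs) == 0 {
		// Default case: a single Blob per Chunk keeps the Blob count low.
		c.Blobs = []Blob{{Offset: off, Length: length}}
		return
	}
	for _, b := range c.Blobs {
		if overlaps(b, off, length) {
			// Overlapping update: record it as a new Blob so the existing
			// Blocks stay immutable.
			c.Blobs = append(c.Blobs, Blob{Offset: off, Length: length})
			return
		}
	}
	// Non-overlapping write: extend the newest Blob append-only.
	last := &c.Blobs[len(c.Blobs)-1]
	if end := off + length; end > last.Offset+last.Length {
		last.Length = end - last.Offset
	}
}
```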
Read Path
When reading a Chunk, all associated Blobs are merged into a single Blob, and its Blocks are returned to the caller.
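A minimal sketch of that merge step, assuming Blobs later in the slice are newer and therefore win where they overlap; Block lookup and data transfer are omitted. The builtin min/max calls require Go 1.21+.

```go
package readpath

// Blob is a minimal stand-in; later entries in a slice are assumed newer.
type Blob struct{ Offset, Length int64 }

// segment maps part of a read request to the Blob that serves it.
type segment struct {
	Off, Len int64
	BlobIdx  int
}

// mergeView assigns each part of the requested range [off, off+length) to the
// newest Blob covering it, which is the essence of merging a Chunk's Blobs
// into a single logical Blob before resolving Blocks.
func mergeView(blobs []Blob, off, length int64) []segment {
	type gap struct{ off, end int64 }
	gaps := []gap{{off, off + length}}
	var view []segment
	for i := len(blobs) - 1; i >= 0 && len(gaps) > 0; i-- { // newest first
		b := blobs[i]
		var rest []gap
		for _, g := range gaps {
			lo := max(g.off, b.Offset)          // Go 1.21+ builtin
			hi := min(g.end, b.Offset+b.Length) // Go 1.21+ builtin
			if lo >= hi {
				rest = append(rest, g) // no overlap with this Blob
				continue
			}
			view = append(view, segment{Off: lo, Len: hi - lo, BlobIdx: i})
			if g.off < lo {
				rest = append(rest, gap{g.off, lo}) // uncovered left part
			}
			if hi < g.end {
				rest = append(rest, gap{hi, g.end}) // uncovered right part
			}
		}
		gaps = rest
	}
	return view
}
```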
Multi‑Protocol Fusion
POSIX uses a relative‑path VFS layer, while S3/HDFS use an absolute‑path PathFS layer. Both resolve to the same inode, offset, and data, enabling seamless cross‑protocol access.
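The sketch below shows how the two resolution styles can end at the same inode. The MetaStore interface and its method names are hypothetical stand-ins for the MDS lookup API.

```go
package fusion

import "strings"

// MetaStore is a toy stand-in for the MDS lookup interface (method names assumed).
type MetaStore interface {
	// LookupChild resolves one path component relative to a parent inode
	// (the POSIX/VFS style of resolution).
	LookupChild(parent uint64, name string) (uint64, error)
	// LookupPath resolves a full absolute path in one call
	// (the S3/HDFS PathFS style of resolution).
	LookupPath(path string) (uint64, error)
}

// ResolvePOSIX walks component by component from the root inode.
func ResolvePOSIX(ms MetaStore, root uint64, relPath string) (uint64, error) {
	cur := root
	for _, name := range strings.Split(strings.Trim(relPath, "/"), "/") {
		next, err := ms.LookupChild(cur, name)
		if err != nil {
			return 0, err
		}
		cur = next
	}
	return cur, nil
}

// ResolveS3 resolves the same object via an absolute key; because both
// layers end at the same inode ID, data written through POSIX is readable
// through S3/HDFS without duplication.
func ResolveS3(ms MetaStore, key string) (uint64, error) {
	return ms.LookupPath("/" + strings.TrimPrefix(key, "/"))
}
```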
MDS Metadata Service
The MDS consists of Root Server (stateless config management), Meta Server (metadata & directory tree), and PD Server (Raft control). It supports Multi‑Raft groups, in‑memory transactions, and fast leader‑read consistency.
Performance Optimizations
Reduced RPCs: a single RPC per operation instead of multiple.
Queue + thread-pool model for Raft propose, moving heavy work out of the critical path (sketched after this list).
Replaced gRPC with raw TCP plus asynchronous batch sending, halving CPU usage.
Raft batch & pipeline to improve throughput and lower latency.
Concurrency Control
Implemented optimistic in‑memory transactions with conflict detection, serializing high‑conflict POSIX writes and S3 puts via hash‑based queues.
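A sketch of the hash-based serialization idea: operations on the same key land on the same single-goroutine queue and therefore run in order, while unrelated keys proceed in parallel. The queue count and key choice are assumptions.

```go
package conc

import "hash/fnv"

// KeyedExecutor serializes operations that touch the same key (for example a
// hot directory entry or a hot object) by hashing the key onto one of N
// single-goroutine queues.
type KeyedExecutor struct {
	queues []chan func()
}

func NewKeyedExecutor(n int) *KeyedExecutor {
	e := &KeyedExecutor{queues: make([]chan func(), n)}
	for i := range e.queues {
		q := make(chan func(), 1024)
		e.queues[i] = q
		go func() {
			for task := range q {
				task() // tasks on one queue run strictly in order
			}
		}()
	}
	return e
}

// Submit routes the task for key onto its queue; high-conflict keys are
// serialized instead of repeatedly aborting optimistic transactions.
func (e *KeyedExecutor) Submit(key string, task func()) {
	h := fnv.New32a()
	h.Write([]byte(key))
	e.queues[int(h.Sum32())%len(e.queues)] <- task
}
```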
Scalability
Added OP‑level follower reads and dynamic Learner expansion to distribute read load across Raft nodes, preventing leader bottlenecks at high QPS.
Stability Enhancements
Snapshot recovery switched to SSTable ingestion.
Log compaction paused during snapshot sync with timeout.
Follower-read throttling based on LogIndexGap (see the routing sketch after this list).
Busy/Slow queues isolate problematic volumes.
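The sketch below combines the OP-level follower reads from the Scalability section with the LogIndexGap throttle above: a read is routed to a follower or Learner only if its applied index is close enough to the leader's, otherwise it falls back to the leader. Field names and the gap threshold are assumed.

```go
package followerread

// Replica exposes the Raft progress needed to decide whether it may serve a
// follower read (field names illustrative).
type Replica struct {
	Addr         string
	AppliedIndex uint64
	IsLeader     bool
}

// maxLogIndexGap is the staleness budget: a replica lagging the leader by
// more than this many log entries is skipped (value assumed).
const maxLogIndexGap = 1000

// pickReadTarget spreads read load across followers and learners while
// throttling ones that have fallen too far behind.
func pickReadTarget(replicas []Replica, leaderApplied uint64) Replica {
	var leader Replica
	for _, r := range replicas {
		if r.IsLeader {
			leader = r
			continue
		}
		if r.AppliedIndex+maxLogIndexGap >= leaderApplied {
			return r // fresh enough: serve the read here
		}
	}
	return leader // everyone is lagging: read from the leader
}
```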
High‑Efficiency Metadata Design
Optimized rename by using relative‑path keys and directory‑tree caching.
Readdir combines Dentry and Attr in a single table with a Link table for rare hard‑link cases, enabling one‑scan reads.
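A sketch of a key layout supporting both optimizations, assuming a sorted key-value store underneath the MDS: the key embeds only the parent inode (a relative-path style key), so renaming a directory does not rewrite descendant keys, and a single prefix scan returns names together with attributes. The key format and row fields are illustrative assumptions.

```go
package meta

import "fmt"

// dentryKey encodes a directory entry as <parentInode>/<name>, so all entries
// of one directory are contiguous in the key space and a rename of the parent
// does not touch the children's keys.
func dentryKey(parentIno uint64, name string) string {
	return fmt.Sprintf("d/%020d/%s", parentIno, name)
}

// dirPrefix is the scan prefix covering every entry under parentIno;
// one range scan over it implements readdir.
func dirPrefix(parentIno uint64) string {
	return fmt.Sprintf("d/%020d/", parentIno)
}

// DentryRow is the combined Dentry + Attr record returned by that scan;
// rare hard links would be resolved through a separate Link table instead.
type DentryRow struct {
	ChildIno uint64
	Mode     uint32
	Size     int64
	Mtime    int64
}
```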
POSIX Client (OFS) Features
FUSE-based client with full POSIX compatibility and seconds-level hot upgrades.
Cache‑driven high throughput (tens of GB/s) for AI training.
Read‑write decoupling via lightweight memory snapshots.
Hot-upgrade workflow: exec the new process, hand off the FUSE fd, flush caches, and switch over without service interruption.
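A sketch of the fd-handoff step of that workflow on Linux, using SCM_RIGHTS over a Unix socket so the new process inherits the live /dev/fuse connection and the mount point stays usable. The framing and socket handling are assumptions; only the fd-passing calls are standard library.

```go
package hotupgrade

import (
	"net"
	"os"
	"syscall"
)

// sendFuseFd passes the mounted /dev/fuse descriptor to the freshly exec'd
// client process over a Unix domain socket.
func sendFuseFd(sock *net.UnixConn, fuseFile *os.File) error {
	rights := syscall.UnixRights(int(fuseFile.Fd()))
	_, _, err := sock.WriteMsgUnix([]byte("fuse-fd"), rights, nil)
	return err
}

// recvFuseFd is the receiving side in the new process: it reads the control
// message and reconstructs an *os.File for the same kernel FUSE connection.
func recvFuseFd(sock *net.UnixConn) (*os.File, error) {
	buf := make([]byte, 16)
	oob := make([]byte, syscall.CmsgSpace(4))
	_, oobn, _, _, err := sock.ReadMsgUnix(buf, oob)
	if err != nil {
		return nil, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return nil, err
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return nil, err
	}
	return os.NewFile(uintptr(fds[0]), "/dev/fuse"), nil
}
```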
Multi‑Team Collaboration
Sub-directory TTL for automatic data expiration using timestamp indexes (see the sweep sketch after this list).
Bucket/Volume and sub‑directory read‑only modes.
Fine‑grained permission control (AK/SK, POSIX, mknod, read/write).
Recycle‑bin (Trash) storing deleted and expired files with metadata‑only operations; supports 7‑day retention and bulk recovery.
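A minimal sketch of how a TTL sweep over a timestamp index could feed the trash with metadata-only moves; the entry structure and the moveToTrash callback are hypothetical.

```go
package ttl

import "time"

// expiredEntry is one record found by scanning the timestamp index
// (structure assumed for illustration).
type expiredEntry struct {
	Path  string
	Mtime time.Time
}

// sweepExpired moves entries older than the directory's TTL into the trash;
// because the move is metadata-only, no data is copied, and files remain
// recoverable until the retention window expires.
func sweepExpired(entries []expiredEntry, ttl time.Duration, moveToTrash func(path string) error) error {
	cutoff := time.Now().Add(-ttl)
	for _, e := range entries {
		if e.Mtime.Before(cutoff) {
			if err := moveToTrash(e.Path); err != nil {
				return err
			}
		}
	}
	return nil
}
```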
Overall, OrangeFS delivers petabyte‑scale, multi‑protocol, cloud‑native storage that powers Didi’s AI training, big‑data analytics, finance, and international services with high performance, strong isolation, and operational robustness.