Ceph Distributed Storage System: Architecture, CRUSH Algorithm, and Storage Backend Evolution
This article introduces Ceph's distributed storage architecture, explains the CRUSH data placement algorithm, compares FileStore and BlueStore backends, and discusses performance optimizations and emerging storage technologies for cloud‑native environments, including considerations of scalability, fault tolerance, and hardware integration.
Ceph originated in Sage Weil's doctoral research at UC Santa Cruz, with the seminal paper published in 2006. It addresses metadata bottlenecks in distributed file systems by using the CRUSH algorithm to map data directly to storage nodes without a central metadata server. Client support was merged into the Linux kernel in 2010 (version 2.6.34), and in 2014 Red Hat acquired Inktank, the company behind Ceph, after which Ceph became a core component of many private cloud platforms.
The Ceph cluster is built on RADOS, which provides a highly reliable, high‑performance, fully distributed object storage service. Objects are stored on OSDs (Object Storage Devices), while MDS (Metadata Servers) handle CephFS metadata. RADOS supports custom failure domains and dynamic load balancing across heterogeneous hardware.
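Before any placement decision is made, RADOS groups objects into placement groups (PGs) by hashing the object name; CRUSH then maps each PG onto OSDs. A minimal sketch of that first step, using Python's `hashlib` in place of Ceph's actual rjenkins hash and a hypothetical pool size:

```python
import hashlib

def object_to_pg(obj_name: str, pg_num: int) -> int:
    """Map an object name to a placement group id.

    Ceph uses its rjenkins hash plus a "stable mod" so PG counts can
    grow without remapping everything; this sketch uses a plain hash
    modulo pg_num just to show the idea.
    """
    h = int.from_bytes(hashlib.md5(obj_name.encode()).digest()[:4], "little")
    return h % pg_num

# The mapping is deterministic: clients and OSDs compute the same PG
# for the same object name, with no metadata-server lookup.
print(object_to_pg("rbd_data.1234", 128))
```

Because the PG is pure function of the object name and pool settings, any client can recompute it locally, which is what removes the central metadata server from the data path.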
Data placement relies on the CRUSH algorithm, a scalable hash‑based method that lets clients compute the location of objects and access OSDs directly. CRUSH improves scalability and performance, supports multiple replica and erasure‑coding schemes, and offers bucket types (Uniform, List, Tree, Straw) to adapt to diverse hardware deployments.
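The Straw bucket in particular behaves like highest-random-weight (rendezvous) hashing: each OSD draws a pseudo-random "straw" for a given PG, scaled by its weight, and the longest straws win. A toy sketch of weighted replica selection in that spirit (this is not Ceph's actual CRUSH implementation, and it ignores failure domains and bucket hierarchies):

```python
import hashlib

def straw_select(pg_id: int, osds: dict[str, float], replicas: int) -> list[str]:
    """Pick `replicas` OSDs for a PG by drawing one weighted straw per OSD."""
    draws = []
    for osd, weight in osds.items():
        seed = hashlib.sha256(f"{pg_id}:{osd}".encode()).digest()
        r = int.from_bytes(seed[:8], "big") / 2**64  # uniform in [0, 1)
        # Raising r to 1/weight biases the draw so higher-weight OSDs
        # win proportionally more often, without any global coordination.
        draws.append((r ** (1.0 / weight), osd))
    draws.sort(reverse=True)
    return [osd for _, osd in draws[:replicas]]

# Hypothetical map: osd.2 has twice the capacity of the others.
osds = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 2.0, "osd.3": 1.0}
print(straw_select(42, osds, replicas=3))
```

Note the property that matters for Ceph: adding or removing one OSD only changes placements whose straws involved that OSD, so rebalancing is proportional to the change rather than to the cluster size.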
Ceph offers unified access interfaces: object storage via librados and RGW (compatible with S3 and Swift), block storage via RBD, and file system access via CephFS. Each interface leverages the underlying RADOS objects, providing features such as snapshots, replication, and high availability.
Initially Ceph used FileStore, which stored objects on local file systems (XFS, ext4, Btrfs) and suffered from metadata-transaction overhead, double-write penalties, and poor performance at scale. In 2015 the community began developing BlueStore, which writes directly to raw devices, bypasses the local file system, and stores its metadata in RocksDB on top of BlueFS, a minimal purpose-built file system; it became the default backend in the Luminous release (2017) and delivers significant read/write performance gains.
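The double-write penalty can be shown with a back-of-the-envelope model: FileStore commits every write to its journal and then applies it to the file system, so the payload hits the disk twice, while BlueStore writes the data once to the raw device and only persists a small metadata record. A rough sketch (the byte counts and the 64-byte metadata figure are illustrative assumptions, not measured values):

```python
def filestore_bytes_written(payload: int) -> int:
    # Journal write, then apply to the backing file system:
    # the payload lands on disk twice.
    return payload + payload

def bluestore_bytes_written(payload: int, metadata: int = 64) -> int:
    # Data goes straight to the raw device; only a small
    # key-value metadata record follows.
    return payload + metadata

payload = 4 * 1024 * 1024  # a 4 MiB object write
print(filestore_bytes_written(payload) / bluestore_bytes_written(payload))
# → close to 2x write amplification for large writes
```

The model overstates BlueStore slightly: small overwrites still take a deferred-write path through RocksDB's WAL, which is one reason the double-write issue is reduced rather than eliminated.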
Despite its advantages, BlueStore faces challenges with SSD/NVMe optimization, memory-intensive metadata structures, and residual double-write behavior for small overwrites. To better support flash devices, the community proposed SeaStore, the segment-based, log-structured backend for the next-generation Crimson OSD, whose segment cleaning is designed to cooperate with flash garbage collection, enabling efficient writes on NVMe SSDs.
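A segment-based layout of this kind can be modeled as a log: writes append into fixed-size segments, overwrites invalidate old blocks in place, and a cleaner later copies the surviving live blocks out of the dirtiest segment so the whole segment can be erased and reused. A toy model (the segment size and greedy victim policy are illustrative assumptions, not SeaStore's actual parameters):

```python
class SegmentLog:
    """Toy log-structured store: append-only segments plus greedy GC."""

    def __init__(self, segment_size: int = 4):
        self.segment_size = segment_size
        self.segments: list[list] = [[]]        # None marks a dead block
        self.index: dict[str, tuple[int, int]] = {}  # key -> (segment, slot)

    def write(self, key: str) -> None:
        if key in self.index:                   # overwrite: invalidate old copy
            seg, slot = self.index[key]
            self.segments[seg][slot] = None
        if len(self.segments[-1]) == self.segment_size:
            self.segments.append([])            # open a fresh segment
        self.segments[-1].append(key)
        self.index[key] = (len(self.segments) - 1, len(self.segments[-1]) - 1)

    def gc(self) -> None:
        """Relocate live blocks out of the dirtiest full segment."""
        full = list(range(len(self.segments) - 1))
        if not full:
            return
        victim = max(full, key=lambda i: self.segments[i].count(None))
        for key in [k for k in self.segments[victim] if k is not None]:
            self.write(key)                      # copy live data to the log head
        self.segments[victim] = []               # whole segment is reclaimed

log = SegmentLog()
for key in ["a", "b", "a", "c", "d", "a"]:
    log.write(key)
log.gc()
```

The point of the design is that the device only ever sees sequential segment-sized writes and whole-segment frees, which matches how flash erase blocks behave and keeps the SSD's own garbage collector out of the hot path.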
Other specialized backends include PFStore, a user-space storage engine built on SPDK that uses append-only writes and RocksDB for metadata, while ongoing research targets hardware-specific optimizations for open-channel SSDs, 3D XPoint, persistent memory (NVM), and SMR drives.
The article concludes by noting that Ceph’s design—high scalability, fault tolerance, and flexible storage interfaces—makes it a versatile solution for cloud‑native environments, while continuous backend innovations aim to further improve performance on emerging storage hardware.
Architects' Tech Alliance