
Ceph Distributed Storage System: Architecture, CRUSH Algorithm, and Backend Evolution

This article provides a comprehensive overview of Ceph, covering its origins, cluster architecture, CRUSH data placement algorithm, unified access interfaces, the transition from FileStore to BlueStore, and emerging storage back‑ends such as SeaStore and PFStore, highlighting performance characteristics and design trade‑offs.

Architects' Tech Alliance

Ceph was created at the University of California, Santa Cruz, with its founding paper published in 2006, to address the metadata bottlenecks of distributed file systems such as Lustre, and it introduced the CRUSH algorithm for data placement. The Ceph client was merged into the Linux kernel in 2010 (version 2.6.34), and Inktank, the company built around Ceph, was acquired by Red Hat in 2014. Ceph exposes three storage interfaces and is widely used in private cloud platforms.

The system’s design goal is high-performance, highly scalable, and highly available distributed storage. It is built on RADOS, which provides a unified object interface and self-managing node membership, and it eliminates the need for a central metadata server by letting clients compute data locations with CRUSH.
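The client-side location computation can be sketched in miniature: an object name is hashed to a placement group (PG), and the PG is mapped to a set of OSDs. This is an illustrative sketch only; the real implementation uses the rjenkins hash, a "stable mod" on pg_num, and the full CRUSH hierarchy rather than the stand-in hashing below.

```python
import hashlib

def object_to_pg(object_name: str, pool_id: int, pg_num: int) -> str:
    """Map an object name to a placement group id ("<pool>.<pg-hex>").

    Ceph hashes the object name and folds it into the pool's pg_num;
    md5 here is an illustrative stand-in for the real rjenkins hash.
    """
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return f"{pool_id}.{h % pg_num:x}"

def pg_to_osds(pg: str, osd_ids, replicas=3):
    """Stand-in for CRUSH: deterministically pick `replicas` distinct OSDs
    for a PG. Real CRUSH walks the cluster hierarchy under placement rules
    (e.g. one replica per failure domain); this sketch just orders OSDs by
    a per-PG hash so the choice is stable and needs no lookup table."""
    ordered = sorted(
        osd_ids,
        key=lambda o: int(hashlib.md5(f"{pg}-{o}".encode()).hexdigest(), 16),
    )
    return ordered[:replicas]
```

Because both steps are pure functions of the object name and the (compact, rarely changing) cluster map, any client can locate any object without consulting a metadata server.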

Cluster Architecture – RADOS offers reliable, high‑performance object storage across heterogeneous clusters. Monitor (MON) daemons maintain the authoritative cluster map, OSDs are the basic storage units, and MDS servers handle metadata for CephFS; multiple MDS instances can share the metadata query load.

Data Placement Algorithm – CRUSH replaces a central metadata lookup with a scalable, hash‑based pseudo‑random placement function: clients compute object locations directly, placement rules can express varied replica‑distribution policies, and both multi‑replica and erasure‑coded pools are supported. The original paper defines four bucket types (uniform, list, tree, straw); a refined straw2 type was added later.
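The straw buckets' selection idea can be shown with a toy straw2-style draw: each item independently draws a pseudo-random "straw" scaled by its weight and the longest straw wins, so each item is chosen in proportion to its weight and reweighting one item only moves data to or from that item. This is a simplified sketch, not Ceph's actual straw2 code.

```python
import hashlib
import math

def straw2_select(pg_id: int, items: dict) -> str:
    """Toy straw2-style selection.

    draw = ln(u) / weight with u uniform in (0, 1): -ln(u)/w is an
    exponential race, so item i wins with probability w_i / sum(w).
    """
    best, best_draw = None, -math.inf
    for name, weight in items.items():
        h = int(hashlib.sha256(f"{pg_id}:{name}".encode()).hexdigest(), 16)
        u = (h % (2**32) + 1) / (2**32 + 1)   # deterministic uniform in (0, 1)
        draw = math.log(u) / weight           # heavier items draw longer straws
        if draw > best_draw:
            best, best_draw = name, draw
    return best
```

Because every item draws independently, doubling one OSD's weight shifts data only onto that OSD instead of reshuffling placements among unrelated items.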

While CRUSH provides excellent scalability, it can suffer from weight imbalance, unnecessary data migration when OSDs are added or removed, and uneven capacity utilization. The Luminous release therefore introduced the upmap mechanism, which records explicit per‑PG placement overrides on top of CRUSH’s computed mapping.
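In practice, upmap entries can be set by hand or generated by the built-in balancer. A sketch of the relevant CLI (assuming a Luminous-or-newer cluster; the PG and OSD ids below are placeholders):

```shell
# Require clients new enough to understand upmap entries.
ceph osd set-require-min-compat-client luminous

# Manually override PG 1.7: move the replica on osd.3 to osd.5.
ceph osd pg-upmap-items 1.7 3 5

# Or let the balancer module compute upmap overrides automatically.
ceph balancer mode upmap
ceph balancer on
```

The overrides are stored in the OSDMap, so clients still locate data without a metadata server; they simply apply the recorded exceptions after the CRUSH computation.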

Unified Access Interfaces – RADOS underpins object, block, and file storage. librados provides direct object access, RGW offers S3‑compatible object services, RBD provides block storage with snapshots and replication, and CephFS delivers a POSIX‑compatible file system via FUSE or kernel mounts.

FileStore vs. BlueStore – FileStore relied on local file systems (XFS, ext4, etc.) and suffered from metadata/data separation, journal‑induced double‑write overhead, and poor performance. BlueStore, whose development began in 2015 (it became the default backend with the 2017 Luminous release), bypasses the local file system, manages raw devices directly with a key‑value index (RocksDB) for metadata, and significantly improves read/write performance, especially with erasure coding.
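The double-write problem can be made concrete with a toy accounting model, assuming the simplified picture above: FileStore journals the full payload before writing it again through the file system, while BlueStore writes data once to the raw device and only a small metadata record to the KV store. The 64-byte metadata figure is illustrative, not a measured value.

```python
class FileStoreSketch:
    """Toy FileStore write path: full payload to the journal first,
    then again to the backing file system -- every byte written twice."""
    def __init__(self):
        self.journal_bytes = 0
        self.fs_bytes = 0

    def write(self, data: bytes):
        self.journal_bytes += len(data)  # 1st copy: journal
        self.fs_bytes += len(data)       # 2nd copy: file system

class BlueStoreSketch:
    """Toy BlueStore write path: one data copy to the raw device plus a
    small extent-map record in the RocksDB key-value store."""
    METADATA_BYTES = 64                  # illustrative fixed overhead

    def __init__(self):
        self.device_bytes = 0
        self.kv_bytes = 0

    def write(self, data: bytes):
        self.device_bytes += len(data)   # single data copy
        self.kv_bytes += self.METADATA_BYTES
```

For a 4 MiB object write, the FileStore model issues ~8 MiB of I/O versus ~4 MiB for the BlueStore model, which is the intuition behind BlueStore's bandwidth advantage on large writes.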

BlueStore’s design, however, faces challenges with SSD/NVMe adaptation, complex metadata structures, and memory consumption.

Emerging Back‑Ends – SeaStore proposes a segment‑based layout for NVMe devices to improve garbage collection, while PFStore (based on SPDK) uses user‑space storage engines and RocksDB for metadata, launching multiple OSD instances for performance.
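The garbage-collection concern behind a segment-based layout can be sketched with a classic log-structured cleaning policy: pick the segment with the least live data, copy its live bytes forward, and reclaim the rest. This is a generic sketch of segment cleaning, not SeaStore's actual implementation.

```python
def pick_gc_segment(segments):
    """Greedy cleaning policy for a log-structured layout: the segment
    with the least live data costs the least to copy forward."""
    return min(segments, key=lambda s: s["live_bytes"])

def clean(segments, segment_size):
    """Clean one segment; return (victim id, bytes copied, bytes freed)."""
    victim = pick_gc_segment(segments)
    copied = victim["live_bytes"]        # live data rewritten elsewhere
    reclaimed = segment_size - copied    # dead space returned to the free pool
    return victim["id"], copied, reclaimed
```

Aligning segment boundaries with the device's erase/reclaim granularity is what lets such a layout cooperate with NVMe flash management instead of fighting it.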

The article concludes by noting that future content will explore optimizations for new hardware such as Open‑Channel SSDs, 3D XPoint, non‑volatile memory (NVM), and SMR drives.

Tags: cloud storage, distributed storage, BlueStore, Ceph, object storage, CRUSH
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
