Understanding Ceph: Architecture, CRUSH Algorithm, and BlueStore Evolution
This article explores Ceph's origins, its RADOS-based cluster architecture, the CRUSH data placement algorithm, the transition from FileStore to BlueStore, and the unified object, block, and file interfaces that make Ceph a versatile solution for modern cloud storage platforms.
Introduction
Ceph grew out of Sage Weil's doctoral research at the University of California, Santa Cruz, with the core design published in 2006. Weil designed the CRUSH algorithm to remove the metadata bottlenecks and scalability limits of earlier distributed file systems such as Lustre. The Ceph client was merged into the Linux kernel in version 2.6.34, released in May 2010, and Red Hat acquired Inktank, the company behind Ceph, in 2014.
Cluster Architecture
Ceph's RADOS layer provides a highly reliable, high-performance, fully distributed object storage service. Objects are placed across OSDs (Object Storage Daemons) based on current cluster state and customizable failure domains, enabling dynamic load balancing and strong consistency.
Each OSD manages a storage device (typically one disk per OSD) and offers local object storage with strong consistency. Metadata servers (MDS) handle CephFS metadata requests, translating file operations into object operations, and multiple MDS instances can share the metadata workload.
Data Placement Algorithm
The CRUSH algorithm removes the need for a central lookup table or metadata node for data placement. Clients compute which OSDs hold an object and contact those OSDs directly, which improves scalability and performance.
CRUSH distributes data using hierarchical buckets (Uniform, List, Tree, Straw, and the later Straw2) and supports both replicated and erasure-coded pools. It also enables placement rules that consider failure domains such as racks or data centers.
While CRUSH offers fast data location, its pseudo-random placement can leave OSDs unevenly utilized relative to their weights, and OSD changes can trigger more data migration than strictly necessary. The upmap feature introduced in the 2017 Luminous release allows explicit per-PG (placement group) mapping overrides to improve balance.
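As a rough illustration of how a client can compute placement without a lookup table, the sketch below imitates straw2-style selection: every candidate gets a deterministic pseudo-random "draw" scaled by its weight, and the largest draw wins. This is a simplified model, not Ceph's implementation. It uses SHA-256 instead of Ceph's own hash, skips already-chosen items instead of retrying, and the names `straw2_select` and `place` are hypothetical.

```python
import hashlib
import math


def _unit_hash(*parts):
    """Map the given parts to a deterministic float in (0, 1]."""
    digest = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value + 1) / 2**64


def straw2_select(key, weights, count):
    """Pick up to `count` distinct items; higher weight means more wins.

    weights: dict mapping item name -> positive weight.
    """
    chosen = []
    for attempt in range(count):
        best_item, best_draw = None, -math.inf
        for item, weight in weights.items():
            if item in chosen or weight <= 0:
                continue
            # ln(u) <= 0, so dividing by a larger weight moves the draw toward 0.
            draw = math.log(_unit_hash(key, item, attempt)) / weight
            if draw > best_draw:
                best_item, best_draw = item, draw
        if best_item is None:
            break
        chosen.append(best_item)
    return chosen


def place(pg_id, racks, replicas):
    """Choose one OSD in each of `replicas` different racks (failure domains).

    racks: dict mapping rack name -> {osd name: weight}.
    """
    rack_weights = {rack: sum(osds.values()) for rack, osds in racks.items()}
    result = []
    for rack in straw2_select(pg_id, rack_weights, replicas):
        result.extend(straw2_select((pg_id, rack), racks[rack], 1))
    return result
```

Because the result depends only on the placement group ID and the cluster map, any client holding the same map computes the same OSD list, which is the property that lets CRUSH avoid a central directory.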
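The idea behind upmap can be sketched as a small exception table applied on top of the computed placement. The function below is a minimal model, loosely based on the (from, to) OSD pairs accepted by the `ceph osd pg-upmap-items` command. The name `apply_upmap` and the data layout are assumptions, and the validation a real cluster performs against CRUSH rules is omitted.

```python
def apply_upmap(pg_id, computed_osds, upmap_items):
    """Return the OSD list for a PG after applying explicit remap pairs.

    computed_osds: OSD ids produced by the placement algorithm.
    upmap_items: dict mapping pg_id -> list of (from_osd, to_osd) pairs.
    """
    result = list(computed_osds)
    for src, dst in upmap_items.get(pg_id, []):
        # Only remap when the source is present and the target is not,
        # so the PG never ends up with the same OSD twice.
        if src in result and dst not in result:
            result[result.index(src)] = dst
    return result
```

For example, `apply_upmap("1.7", [3, 5, 9], {"1.7": [(5, 12)]})` moves the second replica of PG 1.7 from OSD 5 to OSD 12 while every other PG keeps its computed placement.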
Unified Access Interfaces
RADOS provides a distributed object store that underpins block (RBD) and file (CephFS) services. The librados library enables direct object access, while the RADOS Gateway (RGW) offers S3-compatible and OpenStack Swift-compatible object APIs.
Block storage (RBD) delivers thin‑provisioned, snapshot‑capable volumes that can be accessed via kernel modules or librbd. File system access (CephFS) stores both data and metadata as objects, with clients mounting via FUSE or kernel drivers and communicating with MDS for directory information.
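To make the block-to-object relationship concrete, the sketch below splits a byte range of an RBD image into per-object extents. It assumes the default layout (4 MiB objects, i.e. order 22, with no custom striping) and the `rbd_data.<image id>.<object number>` naming used by format 2 images. The function name `rbd_extents` and the image ID `abc123` are hypothetical.

```python
def rbd_extents(offset, length, image_id, order=22):
    """Split a byte range into (object_name, object_offset, length) pieces.

    order: log2 of the object size; 22 corresponds to 4 MiB objects.
    """
    object_size = 1 << order
    extents = []
    end = offset + length
    while offset < end:
        object_no, object_off = divmod(offset, object_size)
        # Stop at the object boundary or at the end of the request.
        chunk = min(object_size - object_off, end - offset)
        name = f"rbd_data.{image_id}.{object_no:016x}"
        extents.append((name, object_off, chunk))
        offset += chunk
    return extents


# A 300-byte write that straddles the first object boundary.
for extent in rbd_extents(4 * 1024 * 1024 - 100, 300, "abc123"):
    print(extent)
```

Under this model, thin provisioning follows naturally: an object only needs to exist once some write touches its range, so an untouched region of the image consumes no space.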
From FileStore to BlueStore
Early Ceph versions used FileStore, which stored objects as files on a local file system (XFS, ext4, or Btrfs). Mapping object data and metadata onto a general-purpose file system was awkward, and the write-ahead journal meant every write reached the device twice, which hurt performance.
BlueStore development began in 2015, and it became the default storage backend in the Luminous release. BlueStore writes directly to raw devices, bypassing the local file system. It keeps metadata in a key-value store (RocksDB) that runs on a lightweight internal file system (BlueFS), eliminating double writes and significantly improving read/write performance.
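The effect of removing the double write can be shown with a back-of-the-envelope model. The numbers below are illustrative only: the 512-byte metadata commit is a made-up placeholder, both function names are hypothetical, and the model ignores file system overhead on the FileStore side as well as any cases where BlueStore itself stages small writes in its key-value store.

```python
def filestore_device_bytes(payload):
    """Journal write plus file system write: the payload hits the device twice."""
    return 2 * payload


def bluestore_device_bytes(payload, metadata_bytes=512):
    """Payload written once to the raw device plus a small KV metadata commit.

    metadata_bytes is an assumed placeholder, not a measured value.
    """
    return payload + metadata_bytes


payload = 4 * 1024 * 1024  # one 4 MiB object write
ratio = filestore_device_bytes(payload) / bluestore_device_bytes(payload)
print(f"FileStore writes about {ratio:.2f}x the bytes BlueStore does")
```

For large writes the ratio approaches 2, which is the intuition behind the "no double write" claim; real gains depend on the workload and hardware.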
BlueStore has been reported to roughly double performance over FileStore in three-replica setups and to achieve up to threefold gains with erasure coding.
However, BlueStore faces challenges on newer hardware such as NVMe SSDs and mixed (hybrid) storage configurations, and its metadata structures can be memory-intensive.
Emerging Storage Engines
To address SSD characteristics, the Ceph community began developing SeaStore, a storage layout that partitions device space into large segments for efficient garbage collection. It targets NVMe devices and integrates with SPDK/DPDK for near-zero-copy data paths.
Other projects such as PFStore (based on SPDK) and optimizations for Open-Channel SSDs, 3D XPoint and other non-volatile memory (NVM), and SMR drives are also being explored.
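The appeal of large segments for garbage collection can be seen in a toy log-structured store. The sketch below is a generic greedy cleaner, not SeaStore's actual policy: writes append to an active segment, overwrites leave dead copies behind, and cleaning reclaims the segment with the fewest live keys. The class `SegmentedLog` and its methods are hypothetical.

```python
class SegmentedLog:
    """Toy append-only store split into fixed-size segments."""

    def __init__(self, num_segments, blocks_per_segment):
        self.blocks_per_segment = blocks_per_segment
        self.segments = [[] for _ in range(num_segments)]  # keys in append order
        self.location = {}  # key -> index of the segment holding its live copy
        self.active = 0

    def _next_empty(self):
        for index, segment in enumerate(self.segments):
            if not segment:
                return index
        raise RuntimeError("no empty segment; run clean() first")

    def write(self, key):
        """Append a key; any older copy of it becomes dead space."""
        if len(self.segments[self.active]) >= self.blocks_per_segment:
            self.active = self._next_empty()
        self.segments[self.active].append(key)
        self.location[key] = self.active

    def live_keys(self, index):
        return {k for k in self.segments[index] if self.location.get(k) == index}

    def clean(self):
        """Reclaim the non-active segment with the fewest live keys."""
        candidates = [i for i, seg in enumerate(self.segments)
                      if i != self.active and seg]
        if not candidates:
            return None
        victim = min(candidates, key=lambda i: len(self.live_keys(i)))
        survivors = sorted(self.live_keys(victim))
        self.segments[victim] = []  # the whole segment is freed at once
        for key in survivors:
            self.write(key)  # relocate live data to the active segment
        return victim
```

Freeing a whole segment at a time matches how flash prefers large sequential erases, and choosing the segment with the least live data keeps the amount of relocated data small.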
