Ceph Architecture Overview, Core Components, IO Flow, Heartbeat Mechanism, and CRUSH Algorithm
This article provides a comprehensive technical overview of Ceph, covering its architecture, key components, storage types, IO processes, heartbeat detection, communication framework, CRUSH data placement algorithm, and customizable RBD QoS mechanisms, with detailed diagrams and pseudo‑code explanations.
1. Ceph Architecture Overview and Usage Scenarios
Ceph is a unified distributed storage system designed for high performance, reliability, and scalability. It originated in 2004 as Sage Weil's doctoral research project and is now widely adopted as a block, file, and object storage back-end for cloud platforms such as OpenStack, with commercial distributions from vendors such as Red Hat.
1.1 Ceph Features
Key features include high performance through the CRUSH algorithm, high availability with flexible replica counts and automatic self‑healing, linear scalability across thousands of nodes, and rich interfaces supporting block, file, and object storage.
1.2 Core Components
Core components are Monitors (which maintain the authoritative cluster maps and membership state), OSDs (Object Storage Daemons, which store the actual data), MDS (the metadata server, used only by CephFS), PGs (Placement Groups, the unit of data distribution), RADOS (the Reliable Autonomic Distributed Object Store underlying all interfaces), and client interfaces such as RBD, RGW, and CephFS.
2. Ceph IO Process and Data Distribution
In the normal IO flow, the client creates a cluster handle, reads its configuration, connects to a monitor to obtain the cluster map, uses the CRUSH algorithm to locate the primary OSD for the object's placement group, and sends the write to the primary, which replicates it to the replica OSDs before the write is acknowledged to the client.
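The flow above can be sketched in a few lines of Python. Everything here is illustrative, not the real librados API: `Osd`, `crush_locate`, and the SHA-1-based hash are toy stand-ins (Ceph actually uses the rjenkins hash and the real CRUSH algorithm).

```python
import hashlib

class Osd:
    """Toy OSD: stores objects in a dict instead of on disk."""
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def write(self, name, data):
        self.store[name] = data

def stable_hash(name):
    # Stable hash of the object name (sha1 here for brevity).
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def crush_locate(pg, osds, size=3):
    # Toy stand-in for CRUSH: deterministically pick `size` distinct OSDs.
    n = len(osds)
    return [osds[(pg + i) % n] for i in range(size)]

def write_object(name, data, num_pg, osds):
    pg = stable_hash(name) % num_pg        # object -> placement group
    targets = crush_locate(pg, osds)       # PG -> ordered OSD list
    primary, replicas = targets[0], targets[1:]
    primary.write(name, data)              # primary persists first,
    for osd in replicas:                   # then fans the write out
        osd.write(name, data)
    return primary, replicas               # client is acked afterwards

osds = [Osd(i) for i in range(6)]
primary, replicas = write_object("rbd_data.1", b"hello", num_pg=128, osds=osds)
```

The key point the sketch captures is that the client talks only to the primary; replication to the remaining OSDs is the primary's responsibility.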
2.1 New Primary IO Flow
When a new OSD becomes primary for a PG but does not yet hold that PG's data, it reports to the monitor, and a temporary primary (typically an OSD that still holds the data) serves IO while the new primary synchronizes; once backfill completes, the primary role is handed back to the new OSD.
2.2 IO Algorithm Pseudo‑code
locator = object_name
obj_hash = hash(locator)        # stable hash of the object name
pg = obj_hash % num_pg          # map the hash to a placement group
osds_for_pg = crush(pg)         # CRUSH returns an ordered list of OSDs
primary = osds_for_pg[0]        # the first OSD acts as the primary
replicas = osds_for_pg[1:]      # the remaining OSDs hold the replicas
2.3 RBD IO Flow
Clients create a pool with a defined number of PGs, mount an RBD image, split data into 4 MiB objects, map objects to PGs, and store each object on three OSDs using the CRUSH algorithm.
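The striping step above can be sketched as follows. The object-naming scheme (`rbd_data.<image id>.<16-hex-digit index>`) mirrors RBD's convention, but the function itself is an illustration, not librbd code:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # RBD's default object size: 4 MiB

def split_into_objects(image_id, offset, data):
    """Split an image write at `offset` into per-object chunks."""
    objects = []
    pos = 0
    while pos < len(data):
        abs_off = offset + pos
        index = abs_off // OBJECT_SIZE      # which 4 MiB object
        inner = abs_off % OBJECT_SIZE       # offset inside that object
        take = min(OBJECT_SIZE - inner, len(data) - pos)
        name = f"rbd_data.{image_id}.{index:016x}"
        objects.append((name, inner, data[pos:pos + take]))
        pos += take
    return objects

# A 20-byte write straddling an object boundary is split in two.
chunks = split_into_objects("abc123", offset=OBJECT_SIZE - 10, data=b"x" * 20)
```

Each resulting object name is then hashed to a PG and placed on its OSDs exactly as in the pseudo-code of section 2.2.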
3. Ceph Heartbeat Mechanism
Heartbeat messages are exchanged among OSD peers (over both the public and cluster networks) and between OSDs and monitors to detect failures quickly while tolerating transient network jitter. Each OSD periodically pings its peers, and the monitors mark an OSD down only after aggregating failure reports from enough distinct reporters.
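The monitor-side aggregation can be sketched as below. The threshold plays the role of Ceph's `mon_osd_min_down_reporters` option; the class itself is a simplified illustration, not the monitor's actual logic:

```python
from collections import defaultdict

MIN_DOWN_REPORTERS = 2  # distinct reporters required before marking down

class Monitor:
    def __init__(self):
        self.reports = defaultdict(set)  # target osd -> set of reporters
        self.down = set()

    def report_failure(self, reporter, target):
        self.reports[target].add(reporter)
        # Only act once enough independent peers agree, so a single
        # OSD with a flaky link cannot take a healthy peer down.
        if len(self.reports[target]) >= MIN_DOWN_REPORTERS:
            self.down.add(target)

mon = Monitor()
mon.report_failure("osd.1", "osd.7")  # one report is not enough
mon.report_failure("osd.2", "osd.7")  # a second peer confirms: osd.7 is down
```

Requiring multiple reporters is what lets the cluster distinguish a genuinely failed OSD from one that is merely unreachable by a single jittery peer.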
4. Ceph Communication Framework
Three network models are supported: SimpleMessenger, a thread-per-connection model; AsyncMessenger, an event-driven model based on I/O multiplexing (e.g. epoll); and XioMessenger, built on the Accelio (XIO) RDMA library. The framework follows a publish/subscribe pattern in which a Messenger receives messages over connections (Pipes in the simple model), queues them on a DispatchQueue, and delivers them to registered Dispatchers.
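The dispatch pattern can be sketched as follows. The class names mirror Ceph's concepts (Dispatcher, DispatchQueue), but this is a minimal Python illustration of the pattern, not the C++ implementation:

```python
from queue import Queue

class Dispatcher:
    """Subscriber interface: return True if the message was handled."""
    def ms_dispatch(self, msg):
        raise NotImplementedError

class DispatchQueue:
    def __init__(self):
        self.queue = Queue()
        self.dispatchers = []

    def add_dispatcher(self, d):
        self.dispatchers.append(d)

    def enqueue(self, msg):
        self.queue.put(msg)

    def drain(self):
        while not self.queue.empty():
            msg = self.queue.get()
            for d in self.dispatchers:
                if d.ms_dispatch(msg):  # first dispatcher to accept wins
                    break

class PingDispatcher(Dispatcher):
    def __init__(self):
        self.seen = []

    def ms_dispatch(self, msg):
        if msg.get("type") == "ping":
            self.seen.append(msg)
            return True
        return False

dq = DispatchQueue()
ping = PingDispatcher()
dq.add_dispatcher(ping)
dq.enqueue({"type": "ping", "from": "osd.3"})
dq.drain()
```

Decoupling message receipt (the Messenger side) from message handling (the Dispatcher side) is what lets the same upper-layer code run over any of the three transport models.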
5. Ceph CRUSH Algorithm
CRUSH (Controlled Scalable Decentralized Placement of Replicated Data) maps objects to placement groups and then to OSDs using hierarchical cluster maps and placement rules, enabling balanced data distribution, fault domain isolation, and efficient scaling.
5.1 Placement Rules Example
A sample rule set defines a replicated pool, minimum and maximum replica counts, and a leaf‑selection strategy that chooses hosts and OSDs based on the cluster hierarchy.
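A rule of the kind described above might look like the following in decompiled CRUSH map syntax (the rule name `replicated_ruleset` and the root bucket `default` are the conventional defaults, used here as placeholders):

```
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

The step `chooseleaf firstn 0 type host` selects as many distinct hosts as the pool's replica count and one OSD under each, which is what keeps replicas in separate failure domains.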
6. Customizable Ceph RBD QoS
QoS is implemented with a token-bucket style rate limiter combined with the mClock scheduling model: each client's IO is tagged with a reservation (guaranteed minimum), a weight (proportional share), and a limit (hard cap), ensuring fair resource allocation across users.
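The limit-enforcement side can be sketched with a plain token bucket. This is a minimal illustration of the mechanism, not Ceph's implementation; the real mClock/dmClock scheduler additionally handles reservations and weights:

```python
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate       # tokens added per second (the IOPS limit)
        self.burst = burst     # bucket capacity (allowed burst size)
        self.tokens = burst
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # consume one token per IO
            return True
        return False           # over the limit: the IO must wait

bucket = TokenBucket(rate=100, burst=10)
allowed = [bucket.allow(now=0.0) for _ in range(12)]
# The first 10 IOs drain the burst; the last 2 are throttled until refill.
```

Lowering `rate` caps a client's sustained IOPS, while `burst` bounds how far it can briefly exceed that rate, which is exactly the shape of control the per-client limits described above provide.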
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.