
Ceph Architecture Overview, Core Components, IO Flow, Heartbeat Mechanism, and CRUSH Algorithm

This article provides a comprehensive technical overview of Ceph, covering its architecture, key components, storage types, IO processes, heartbeat detection, communication framework, CRUSH data placement algorithm, and customizable RBD QoS mechanisms, with detailed diagrams and pseudo‑code explanations.

Architects' Tech Alliance

1. Ceph Architecture Overview and Usage Scenarios

Ceph is a unified distributed storage system designed for high performance, reliability, and scalability. It originated in 2004 and is now widely used as a storage back end for cloud platforms such as OpenStack, with commercial support from vendors such as Red Hat, providing block, file, and object storage interfaces.

1.1 Ceph Features

Key features include high performance through the CRUSH algorithm, high availability with flexible replica counts and automatic self‑healing, linear scalability across thousands of nodes, and rich interfaces supporting block, file, and object storage.

1.2 Core Components

Core components are Monitors (cluster membership and map management), OSDs (Object Storage Daemons, which store the actual data), MDS (metadata server for CephFS), PGs (Placement Groups), RADOS (reliable autonomic distributed object store), and various client interfaces such as RBD, RGW, and CephFS.

2. Ceph IO Process and Data Distribution

The normal IO flow involves the client creating a handler, reading configuration, connecting to monitors to obtain the cluster map, using the CRUSH map to locate the primary OSD, and writing data to primary and replica OSDs before acknowledging the client.

2.1 New Primary IO Flow

When a new OSD becomes primary without existing PG data, it reports to the monitor, a temporary primary OSD takes over, synchronizes data, and later hands over the primary role once synchronization completes.

2.2 IO Algorithm Pseudo‑code

locator = object_name
obj_hash = hash(locator)          # stable hash of the object name
pg = obj_hash % num_pg            # map the hash onto a placement group
osds_for_pg = crush(pg)           # CRUSH returns an ordered list of OSDs
primary = osds_for_pg[0]          # the first OSD acts as the primary
replicas = osds_for_pg[1:]        # the rest hold the replicas
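The pseudo-code above can be made concrete with a self-contained sketch. The `crush()` stand-in below draws a straw2-flavoured pseudo-random value per OSD and picks the largest; real Ceph uses the rjenkins hash and walks a weighted bucket hierarchy, so the cluster layout, hash choice, and PG count here are illustrative assumptions only.

```python
import hashlib

NUM_PG = 128
OSDS = list(range(6))        # toy cluster: 6 OSDs with equal weights
REPLICAS = 3

def stable_hash(s: str) -> int:
    # Deterministic hash for the sketch; real Ceph uses rjenkins, not md5.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def crush(pg: int, size: int = REPLICAS):
    # Straw2-flavoured toy: each OSD draws a pseudo-random "straw" for
    # this PG and the longest straws win. Real CRUSH also honours
    # weights and failure domains from the cluster map.
    straws = sorted(OSDS, key=lambda osd: stable_hash(f"{pg}:{osd}"), reverse=True)
    return straws[:size]

def locate(object_name: str):
    pg = stable_hash(object_name) % NUM_PG
    osds_for_pg = crush(pg)
    return pg, osds_for_pg[0], osds_for_pg[1:]   # pg, primary, replicas
```

Because both hashes are pure functions of their inputs, any client computes the same placement without consulting a central lookup table, which is the property the IO flow above relies on.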

2.3 RBD IO Flow

Clients create a pool with a defined number of PGs, map an RBD image, split its data into fixed-size objects (4 MiB by default), map those objects to PGs, and store each object on a set of OSDs (three replicas by default) chosen by the CRUSH algorithm.
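The striping step can be sketched as a small offset calculation. The name format below follows the modern `rbd_data.<image id>.<object number>` scheme with a 16-digit hex object number; the image id is a made-up example.

```python
OBJECT_SIZE = 4 * 1024 * 1024          # default RBD object size (4 MiB)

def rbd_object_for_offset(image_id: str, offset: int):
    """Map a byte offset in an RBD image to its backing RADOS object."""
    object_no = offset // OBJECT_SIZE   # which 4 MiB slice of the image
    in_object = offset % OBJECT_SIZE    # position inside that object
    # v2 image data objects are named rbd_data.<id>.<object no, 16 hex digits>
    return f"rbd_data.{image_id}.{object_no:016x}", in_object

# A write at 9 MiB lands 1 MiB into the third object (index 2).
name, off = rbd_object_for_offset("10226b8b4567", 9 * 1024 * 1024)
```

Each resulting object name then goes through the object-to-PG-to-OSD mapping described in the IO pseudo-code above.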

3. Ceph Heartbeat Mechanism

Heartbeat messages are exchanged between OSDs, monitors, and clients to detect failures quickly while balancing load and tolerating network jitter. OSDs send periodic pings, and monitors aggregate failure reports before marking OSDs down.
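The aggregation step can be sketched as follows. The parameter names echo Ceph's `mon_osd_min_down_reporters` and `osd_heartbeat_grace` options, but the logic is a deliberate simplification of the monitor's actual failure handling, assumed here only to illustrate why a single noisy report does not mark an OSD down.

```python
class FailureAggregator:
    """Toy monitor-side view: mark an OSD down only after enough
    distinct peers report it and the grace period has elapsed."""

    def __init__(self, min_reporters=2, grace=20.0):
        self.min_reporters = min_reporters   # cf. mon_osd_min_down_reporters
        self.grace = grace                   # cf. osd_heartbeat_grace (seconds)
        self.reports = {}                    # target osd -> {reporter: first report time}

    def report_failure(self, target_osd, reporter_osd, now):
        # Remember only the first time each peer reported this OSD.
        self.reports.setdefault(target_osd, {}).setdefault(reporter_osd, now)

    def should_mark_down(self, target_osd, now):
        reps = self.reports.get(target_osd, {})
        if len(reps) < self.min_reporters:
            return False                     # not enough independent reporters
        # The oldest report must predate `now` by at least the grace period.
        return now - min(reps.values()) >= self.grace
```

Requiring multiple reporters plus a grace window is what lets the cluster tolerate network jitter without flapping OSDs up and down.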

4. Ceph Communication Framework

Three network models are supported: a simple thread‑per‑connection model, an asynchronous I/O multiplexing model, and an XIO‑based model using the Accelio library. The framework follows a publish/subscribe pattern with Messengers, Pipes, Dispatchers, and DispatchQueues handling message flow.
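The Messenger-to-Dispatcher handoff can be sketched as a queue drained by a worker thread. The class and method names (`Dispatcher`, `ms_dispatch`, `DispatchQueue`) echo Ceph's C++ classes, but this Python version is a bare-bones analogy, not a port.

```python
import queue
import threading

class Dispatcher:
    # Subscribers implement ms_dispatch(), as in Ceph's Dispatcher interface.
    def ms_dispatch(self, msg):
        raise NotImplementedError

class DispatchQueue:
    """Toy DispatchQueue: a worker thread drains incoming messages and
    offers each one to the registered dispatchers in order."""

    def __init__(self):
        self.q = queue.Queue()
        self.dispatchers = []
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def add_dispatcher(self, d):
        self.dispatchers.append(d)

    def enqueue(self, msg):
        self.q.put(msg)

    def _run(self):
        while True:
            msg = self.q.get()
            for d in self.dispatchers:
                d.ms_dispatch(msg)
            self.q.task_done()
```

Decoupling receipt from processing through the queue is what lets a messenger keep reading from the wire while earlier messages are still being handled.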

5. Ceph CRUSH Algorithm

CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data) maps objects to placement groups and then to OSDs using hierarchical cluster maps and placement rules, enabling balanced data distribution, fault domain isolation, and efficient scaling.
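Fault domain isolation can be sketched by selecting at most one OSD per host. The two-host-per-node hierarchy and md5-based draw below are illustrative assumptions; real CRUSH encodes the hierarchy as a weighted bucket tree and uses the rjenkins hash.

```python
import hashlib

# Toy hierarchy: host -> OSDs. Real CRUSH encodes this as a bucket tree.
HOSTS = {"host-a": [0, 1], "host-b": [2, 3], "host-c": [4, 5]}

def draw(pg: int, item: str) -> int:
    # Deterministic pseudo-random draw per (pg, item) pair.
    return int(hashlib.md5(f"{pg}:{item}".encode()).hexdigest(), 16)

def select(pg: int, replicas: int = 3):
    """Pick one OSD per host, hosts ordered by their draw, so each
    replica lands in a distinct failure domain (host)."""
    ordered_hosts = sorted(HOSTS, key=lambda h: draw(pg, h), reverse=True)
    chosen = []
    for host in ordered_hosts[:replicas]:
        # Within the winning host, pick the OSD with the largest draw.
        chosen.append(max(HOSTS[host], key=lambda o: draw(pg, f"osd.{o}")))
    return chosen
```

Because the hierarchy is consulted level by level, losing a whole host costs at most one replica of any PG, which is the isolation property the rules are meant to guarantee.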

5.1 Placement Rules Example

A sample rule set defines a replicated pool, minimum and maximum replica counts, and a leaf‑selection strategy that chooses hosts and OSDs based on the cluster hierarchy.
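A rule of the shape described might look like the following, using the syntax of a decompiled CRUSH map (the `min_size`/`max_size` fields appear in older releases; `chooseleaf … type host` is the leaf-selection step that descends from hosts to OSDs):

```
rule replicated_rule {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

Here `firstn 0` means "select as many items as the pool's replica count", and `type host` sets the failure domain, so no two replicas of a PG share a host.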

6. Customizable Ceph RBD QoS

QoS is implemented by rate-limiting client I/O: librbd applies token-bucket limits to per-image IOPS and bandwidth, while the mClock scheduler allocates OSD capacity through per-client reservations, weights, and limits, ensuring fair resource allocation across users.
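The token-bucket side of this can be sketched generically. This is not Ceph's implementation; it is a minimal limiter under the usual assumptions (tokens refill continuously at `rate` per second, capped at `burst`, and an I/O is admitted only if its cost in tokens is available).

```python
class TokenBucket:
    """Minimal token-bucket limiter for IOPS or bandwidth caps."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate          # tokens added per second
        self.burst = burst        # bucket capacity (max saved-up tokens)
        self.tokens = burst       # start full, so short bursts pass freely
        self.last = 0.0           # timestamp of the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill proportionally to elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost   # admit the IOP and spend its tokens
            return True
        return False              # over the limit: throttle or queue
```

The `burst` parameter is what distinguishes this from a hard rate cap: idle clients bank tokens and can briefly exceed `rate`, which suits bursty virtual-machine workloads on RBD.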

Tags: cloud computing, distributed storage, Ceph, CRUSH, IO process
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
