Ceph Distributed Storage System – Architecture, IO Processes, Heartbeat, Communication Framework, CRUSH Algorithm, and Custom QoS
The article comprehensively explains Ceph’s distributed storage architecture—including monitors, OSDs, MDS, and RADOS—its block, file, and object services, its detailed I/O and heartbeat processes, the publish/subscribe communication framework, the deterministic CRUSH placement algorithm, and a token‑bucket based custom QoS for RBD.
Ceph is a unified distributed storage system designed for high performance, reliability, and scalability. It originated from Sage Weil's doctoral research in 2004 and is now backed by many vendors and cloud platforms, such as Red Hat and OpenStack.
The article provides a detailed index covering Ceph architecture, usage scenarios, core components, three storage types (block, file, object), IO processes, heartbeat mechanisms, communication framework, CRUSH algorithm, and custom QoS implementation.
1. Architecture and components – Ceph consists of monitors (Mon), object storage daemons (OSD), metadata servers (MDS), RADOS, librados, CRUSH, RBD, RGW, and CephFS. Monitors maintain the cluster maps, OSDs store the objects, and MDS handles CephFS metadata.
2. Storage types – Block storage (e.g., disks, RAID, LVM), file storage (FTP/NFS), and object storage (S3/Swift compatible) are described with their advantages, disadvantages and typical use cases.
3. IO flow – Normal IO and new‑primary IO processes are illustrated with flowcharts. The steps include the client creating a cluster handle, reading the config, connecting to monitors, locating OSDs via CRUSH, writing to the primary and replica OSDs, and confirming completion.
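The write path above can be sketched as a small simulation. This is a hypothetical, simplified model, not Ceph code: the names (`OSD`, `Primary`, `object_to_pg`) are illustrative, and the real primary acknowledges the client only after all replicas have committed.

```python
# Hypothetical, simplified simulation of Ceph's normal write path: the client
# hashes the object name to a PG, sends the write to the primary OSD, and the
# primary replicates to the secondaries before acknowledging the client.
import hashlib

NUM_PG = 128

class OSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def write(self, obj, data):
        self.store[obj] = data
        return True  # local commit acknowledged

class Primary(OSD):
    def handle_client_write(self, obj, data, replicas):
        self.write(obj, data)                          # write locally first
        acks = [r.write(obj, data) for r in replicas]  # replicate to peers
        return all(acks)                               # ack client when all done

def object_to_pg(obj_name, num_pg=NUM_PG):
    # stable hash of the object name, modulo the PG count
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % num_pg

osds = [OSD(i) for i in range(1, 3)]
primary = Primary(0)
pg = object_to_pg("rbd_data.abc123")
ok = primary.handle_client_write("rbd_data.abc123", b"payload", osds)
```

The key property the flowcharts emphasize is that the client talks only to the primary; replication is the primary's responsibility.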
4. Heartbeat mechanism – Ceph uses multiple heartbeat channels (public, cluster, front, back) via the hbclient messenger. OSDs exchange heartbeats with peers in the same PG roughly every 6 s; if no reply arrives for 20 s, failure handling is triggered. Monitors collect failure reports from OSDs and decide when to mark an OSD down, with configurable thresholds to tolerate network jitter.
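The thresholds described above correspond to a handful of configuration options. A minimal ceph.conf fragment (values shown are the defaults; exact option names can vary between Ceph releases):

```ini
[osd]
# interval between peer heartbeats (seconds)
osd heartbeat interval = 6
# mark a peer unhealthy after this many seconds without a reply
osd heartbeat grace = 20

[mon]
# require failure reports from this many distinct OSDs
# before marking a peer down (tolerates network jitter)
mon osd min down reporters = 2
```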
5. Communication framework – Three implementation models are described: Simple (one thread per connection), Async (event‑driven, non‑blocking I/O), and XIO (based on the Accelio library). The framework follows a publish/subscribe pattern with Acceptors, Pipes, Messengers, Dispatchers and DispatchQueues. The message structure includes header, payload, middle, data, and footer, as shown in the code snippet:
class Message : public RefCountedObject {
protected:
  ceph_msg_header header;  // envelope: message type, source, sequence numbers
  ceph_msg_footer footer;  // checksums and flags
  bufferlist payload;      // "front" section, filled by the dispatcher
  bufferlist middle;       // optional middle section
  bufferlist data;         // bulk data (e.g., object contents)
  // timestamps and connection fields...
};

Another snippet shows the pseudo‑code for CRUSH object‑to‑OSD mapping:
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg) # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]

6. CRUSH algorithm – CRUSH (Controlled Replication Under Scalable Hashing) provides deterministic, pseudo‑random placement of objects across OSDs based on a hierarchical cluster map and placement rules. Bucket types (list, tree, straw) and placement rule syntax are explained.
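The pseudo‑code above can be made concrete. The sketch below is a hypothetical, simplified stand‑in for CRUSH, not the real algorithm: it mimics a straw‑style bucket, where every OSD "draws a straw" per PG via a stable hash and the highest draws win, so placement is deterministic without any central lookup table.

```python
# Hypothetical straw-style placement sketch (illustrative, not real CRUSH):
# each OSD draws a per-PG hash value, and the `size` highest draws are chosen.
# The same inputs always produce the same mapping.
import hashlib

def stable_hash(*parts):
    # deterministic integer hash of the concatenated parts
    h = hashlib.md5("/".join(map(str, parts)).encode()).hexdigest()
    return int(h, 16)

def crush_like(pg, osds, size=3):
    # rank every OSD by its per-PG draw; take the top `size` as the acting set
    draws = sorted(osds, key=lambda o: stable_hash(pg, o), reverse=True)
    return draws[:size]

num_pg = 128
pg = stable_hash("rbd_data.abc123") % num_pg  # object name -> PG

osds = list(range(6))
mapping = crush_like(pg, osds)
primary, replicas = mapping[0], mapping[1:]
```

Because each draw depends only on the (pg, osd) pair, adding an OSD changes the outcome only for PGs where the newcomer draws a winning straw, which is the data‑movement property the straw bucket is designed for.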
7. Custom QoS for Ceph RBD – To limit IO bandwidth, a token‑bucket algorithm is added to the RBD client: asynchronous client IO enters ImageRequestWQ, is rate‑limited by the TokenBucket, and is then dispatched for processing. Framework diagrams illustrate the integration of the token bucket into the RBD stack.
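A minimal token‑bucket sketch, assuming a simple refill‑on‑demand design (class and parameter names are illustrative, not the actual RBD implementation): tokens accrue at a fixed rate up to the bucket capacity, and an IO is dequeued only when enough tokens are available.

```python
# Minimal token-bucket rate limiter sketch (illustrative, not Ceph's code).
# Tokens refill continuously at `rate` per second, capped at `capacity`;
# an IO costing `n` tokens proceeds only when enough tokens are available.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # burst ceiling
        self.tokens = float(capacity)    # start full
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n=1):
        # non-blocking: dispatch the IO only if tokens are available
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=100, capacity=10)  # ~100 IOPS, burst of 10
burst = sum(bucket.try_acquire() for _ in range(20))  # only the burst passes
```

The capacity bounds how bursty a client may be, while the refill rate bounds its sustained throughput; requests beyond the budget stay queued in ImageRequestWQ until tokens accumulate.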
Overall, the article serves as a comprehensive guide for engineers and architects working with Ceph, covering both theoretical concepts and practical implementation details.
Didi Tech
Official Didi technology account