Ceph Distributed Storage System – Architecture, IO Processes, Heartbeat, Communication Framework, CRUSH Algorithm, and Custom QoS
The article comprehensively explains Ceph’s distributed storage architecture—including monitors, OSDs, MDS, and RADOS—its block, file, and object services, its detailed I/O and heartbeat processes, the publish/subscribe communication framework, the deterministic CRUSH placement algorithm, and a token‑bucket based custom QoS for RBD.
Ceph is a unified distributed storage system designed for high performance, reliability, and scalability. It originated from Sage Weil's doctoral research in 2004 and is now backed by many vendors and cloud platforms, such as Red Hat and OpenStack.
The article provides a detailed index covering Ceph architecture, usage scenarios, core components, three storage types (block, file, object), IO processes, heartbeat mechanisms, communication framework, CRUSH algorithm, and custom QoS implementation.
1. Architecture and components – Ceph consists of monitors (Mon), object storage daemons (OSD), metadata servers (MDS), RADOS, librados, CRUSH, RBD, RGW, and CephFS. Monitors maintain the cluster maps, OSDs store the objects, and MDS handles CephFS metadata.
2. Storage types – Block storage (e.g., disks, RAID, LVM), file storage (FTP/NFS), and object storage (S3/Swift compatible) are described with their advantages, disadvantages and typical use cases.
3. IO flow – Normal IO and new‑primary IO processes are illustrated with flowcharts. The steps include the client creating a cluster handle, reading the config, connecting to monitors, locating OSDs via CRUSH, writing to the primary and replica OSDs, and confirming completion.
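The write path above can be sketched as a small simulation. This is a hypothetical, simplified model, not Ceph code: the names (`OSD`, `Primary`, `object_to_pg`) are illustrative, and the real primary acknowledges the client only after all replicas have committed.

```python
# Hypothetical, simplified simulation of Ceph's normal write path: the client
# hashes the object name to a PG, sends the write to the primary OSD, and the
# primary replicates to the secondaries before acknowledging the client.
import hashlib

NUM_PG = 128

class OSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def write(self, obj, data):
        self.store[obj] = data
        return True  # local commit acknowledged

class Primary(OSD):
    def handle_client_write(self, obj, data, replicas):
        self.write(obj, data)                          # write locally first
        acks = [r.write(obj, data) for r in replicas]  # replicate to peers
        return all(acks)                               # ack client when all done

def object_to_pg(obj_name, num_pg=NUM_PG):
    # stable hash of the object name, modulo the PG count
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    return h % num_pg

osds = [OSD(i) for i in range(1, 3)]
primary = Primary(0)
pg = object_to_pg("rbd_data.abc123")
ok = primary.handle_client_write("rbd_data.abc123", b"payload", osds)
```

The key property the flowcharts emphasize is that the client talks only to the primary; replication is the primary's responsibility.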
4. Heartbeat mechanism – Ceph uses multiple heartbeat channels (public, cluster, front, back) via the hbclient messenger. OSDs exchange heartbeats with peers in the same PG roughly every 6 s; if no reply arrives for 20 s, failure handling is triggered. Monitors collect failure reports from OSDs and decide when to mark an OSD down, with configurable thresholds to tolerate network jitter.
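The thresholds described above correspond to a handful of configuration options. A minimal ceph.conf fragment (values shown are the defaults; exact option names can vary between Ceph releases):

```ini
[osd]
# interval between peer heartbeats (seconds)
osd heartbeat interval = 6
# mark a peer unhealthy after this many seconds without a reply
osd heartbeat grace = 20

[mon]
# require failure reports from this many distinct OSDs
# before marking a peer down (tolerates network jitter)
mon osd min down reporters = 2
```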
5. Communication framework – Three implementation models are described: Simple (one thread per connection), Async (event‑driven, non‑blocking I/O), and XIO (based on the Accelio library). The framework follows a publish/subscribe pattern with Acceptors, Pipes, Messengers, Dispatchers and DispatchQueues. The message structure includes header, payload, middle, data, and footer, as shown in the code snippet:
class Message : public RefCountedObject {
protected:
  ceph_msg_header header;  // envelope: message type, source, sequence numbers
  ceph_msg_footer footer;  // checksums and flags
  bufferlist payload;      // "front" section, filled by the dispatcher
  bufferlist middle;       // optional middle section
  bufferlist data;         // bulk data (e.g., object contents)
  // timestamps and connection fields...
};

Another snippet shows the pseudo‑code for CRUSH object‑to‑OSD mapping:
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg) # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]

6. CRUSH algorithm – CRUSH (Controlled Replication Under Scalable Hashing) provides deterministic, pseudo‑random placement of objects across OSDs based on a hierarchical cluster map and placement rules. Bucket types (list, tree, straw) and placement rule syntax are explained.
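The pseudo‑code above can be made concrete. The sketch below is a hypothetical, simplified stand‑in for CRUSH, not the real algorithm: it mimics a straw‑style bucket, where every OSD "draws a straw" per PG via a stable hash and the highest draws win, so placement is deterministic without any central lookup table.

```python
# Hypothetical straw-style placement sketch (illustrative, not real CRUSH):
# each OSD draws a per-PG hash value, and the `size` highest draws are chosen.
# The same inputs always produce the same mapping.
import hashlib

def stable_hash(*parts):
    # deterministic integer hash of the concatenated parts
    h = hashlib.md5("/".join(map(str, parts)).encode()).hexdigest()
    return int(h, 16)

def crush_like(pg, osds, size=3):
    # rank every OSD by its per-PG draw; take the top `size` as the acting set
    draws = sorted(osds, key=lambda o: stable_hash(pg, o), reverse=True)
    return draws[:size]

num_pg = 128
pg = stable_hash("rbd_data.abc123") % num_pg  # object name -> PG

osds = list(range(6))
mapping = crush_like(pg, osds)
primary, replicas = mapping[0], mapping[1:]
```

Because each draw depends only on the (pg, osd) pair, adding an OSD changes the outcome only for PGs where the newcomer draws a winning straw, which is the data‑movement property the straw bucket is designed for.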
7. Custom QoS for Ceph RBD – To limit IO bandwidth, a token‑bucket algorithm is added to the RBD client: asynchronous client IO enters ImageRequestWQ, is rate‑limited by the TokenBucket, and is then dispatched for processing. Framework diagrams illustrate the integration of the token bucket into the RBD stack.
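A minimal token‑bucket sketch, assuming a simple refill‑on‑demand design (class and parameter names are illustrative, not the actual RBD implementation): tokens accrue at a fixed rate up to the bucket capacity, and an IO is dequeued only when enough tokens are available.

```python
# Minimal token-bucket rate limiter sketch (illustrative, not Ceph's code).
# Tokens refill continuously at `rate` per second, capped at `capacity`;
# an IO costing `n` tokens proceeds only when enough tokens are available.
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # burst ceiling
        self.tokens = float(capacity)    # start full
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n=1):
        # non-blocking: dispatch the IO only if tokens are available
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=100, capacity=10)  # ~100 IOPS, burst of 10
burst = sum(bucket.try_acquire() for _ in range(20))  # only the burst passes
```

The capacity bounds how bursty a client may be, while the refill rate bounds its sustained throughput; requests beyond the budget stay queued in ImageRequestWQ until tokens accumulate.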
Overall, the article serves as a comprehensive guide for engineers and architects working with Ceph, covering both theoretical concepts and practical implementation details.
Didi Tech
Official Didi technology account