Unlocking Ceph: Deep Dive into Architecture, IO Flow, Heartbeat, CRUSH & QoS
This article provides a comprehensive overview of Ceph, covering its architecture and use cases, detailed IO processes, heartbeat mechanisms, communication framework, the CRUSH data placement algorithm, and customizable RBD QoS policies, illustrated with diagrams and code snippets to help readers understand and implement Ceph in cloud environments.
1. Ceph Architecture Overview and Use Cases
Ceph is a unified distributed storage system designed for performance, reliability, and scalability. It originated in 2004 from Sage Weil's doctoral research and is now supported by many cloud vendors. Red Hat and OpenStack both integrate with Ceph, for example as backend storage for virtual machine images.
1.1 Ceph Features
High performance: uses the CRUSH algorithm for balanced data distribution and parallelism.
High availability: flexible replica count, fault‑domain isolation, self‑healing.
Scalability: decentralized, linear growth with added nodes.
Rich functionality: supports block, file and object interfaces, custom APIs and multiple language bindings.
1.2 Core Components
Monitor (MON): small cluster of monitors that store cluster map metadata using Paxos.
Object Storage Device (OSD): processes client requests and stores objects.
Metadata Server (MDS): provides metadata for CephFS.
Object: lowest‑level storage unit containing data and metadata.
Placement Group (PG): logical grouping of objects for placement.
RADOS: reliable autonomic distributed object store that drives data distribution and failover.
Librados: library used by higher‑level services (RBD, RGW, CephFS).
CRUSH: data placement algorithm.
RBD: block device service.
RGW: object gateway compatible with S3/Swift.
CephFS: POSIX‑compatible file system.
2. Ceph IO Process
2.1 Normal IO Flow
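Client creates a cluster handle.
Client reads the configuration file.
Client connects to a monitor and obtains the cluster map.
Client maps the object to a PG and uses the CRUSH map to find that PG's primary OSD.
Primary OSD writes the data locally and forwards it to the two replica OSDs (assuming three replicas).
Primary OSD waits for acknowledgments from both replicas.
Once every copy is written, the primary acknowledges the client and the I/O completes.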
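The write path above can be sketched as a toy in-memory model. The `OSD` class and `primary_write` function are hypothetical names introduced for illustration; a real OSD persists data to disk and exchanges messages over the network, which this sketch does not attempt to model.

```python
class OSD:
    """Toy in-memory OSD (hypothetical; no disk or network involved)."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write_local(self, oid, data):
        self.store[oid] = data
        return True  # stands in for a commit acknowledgment


def primary_write(acting_set, oid, data):
    """Write via the primary, which fans out to the replicas and reports
    success only after every copy has acknowledged."""
    primary, replicas = acting_set[0], acting_set[1:]
    acks = [primary.write_local(oid, data)]
    acks += [replica.write_local(oid, data) for replica in replicas]
    return all(acks)


osds = [OSD("osd.0"), OSD("osd.1"), OSD("osd.2")]
ok = primary_write(osds, "obj-1", b"hello")
```

The client only ever talks to the first OSD in the list; the replicas are reached through the primary, which is why the client needs a single completion rather than one per copy.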
2.2 New Primary IO Flow
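When a newly added OSD becomes the primary for a PG, it holds none of that PG's data yet and cannot serve I/O. It reports this to the monitor, and an existing replica OSD that already holds the data temporarily acts as primary. The temporary primary serves client I/O while synchronizing the full data set to the new OSD, then hands over the primary role and returns to being a replica.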
2.3 IO Algorithm Flow
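File → Object mapping: the file is cut into objects, and each object ID (oid) combines the file's inode number with the object number. Object → PG mapping: the oid is hashed and the hash is masked down to a PG number within the pool. PG → OSD mapping: CRUSH maps the PG to an ordered list of OSDs.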
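The "hash and mask" step can be sketched as below. This is a minimal illustration, assuming a stable-modulo mapping so that the PG count does not have to be a power of two; `zlib.crc32` is only a stand-in for Ceph's own object-name hash, and the function names are hypothetical.

```python
import zlib


def stable_mod(x, b, bmask):
    """Map hash x into [0, b); when b grows, only part of the range moves."""
    return x & bmask if (x & bmask) < b else x & (bmask >> 1)


def pg_mask(pg_num):
    """Smallest mask of the form 2**n - 1 that covers pg_num - 1."""
    return (1 << (pg_num - 1).bit_length()) - 1


def object_to_pg(pool_id, object_name, pg_num):
    # Assumption: crc32 replaces the real object-name hash for illustration.
    obj_hash = zlib.crc32(object_name.encode())
    seed = stable_mod(obj_hash, pg_num, pg_mask(pg_num))
    return f"{pool_id}.{seed:x}"  # PG IDs are written as <pool>.<hex seed>
```

With `pg_num = 12` the mask is 15, so a hash whose low four bits are 13 falls outside the valid range and is folded back with the narrower mask 7, landing in PG 5.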
2.4 Ceph IO Pseudo Code
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg) # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
2.5 RBD IO Flow
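The client creates a pool with a configured number of PGs, then creates an RBD image in that pool. Data written to the image is split into objects (4 MiB by default), each object maps to a PG, and each PG is stored on three OSDs when the pool keeps three replicas. On each OSD the objects land on the underlying disk, which is typically formatted with XFS.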
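As a rough illustration of how a byte range of an image is cut into objects, the sketch below assumes the simplest layout, in which the image is divided into consecutive fixed-size objects with no additional striping. The function name is hypothetical.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB, the object size quoted above


def split_into_object_extents(offset, length, object_size=OBJECT_SIZE):
    """Return (object_no, offset_in_object, length) tuples for an image byte range."""
    extents = []
    end = offset + length
    while offset < end:
        object_no = offset // object_size
        in_object = offset % object_size
        # Stop at the object boundary or at the end of the request.
        chunk = min(object_size - in_object, end - offset)
        extents.append((object_no, in_object, chunk))
        offset += chunk
    return extents
```

A 2 MiB write starting at offset 3 MiB therefore touches two objects: the last 1 MiB of object 0 and the first 1 MiB of object 1. Each of those objects is then mapped to its own PG independently.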
2.6 RBD IO Framework
Client uses librbd to create a block device, calls librados to map pool → image → object → PG, then communicates with primary OSD, which forwards data to replicas.
2.7 Pool and PG Distribution
Pool acts as a namespace; each pool contains a configurable number of PGs, which are distributed across OSDs. Expansion adds OSDs and rebalances PGs automatically.
2.8 Data Expansion
When OSD count increases, PGs are migrated to new OSDs to keep distribution balanced.
3. Ceph Heartbeat Mechanism
3.1 Overview
Heartbeats detect node failures, balancing detection latency and network load.
3.2 Heartbeat Detection
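Each OSD listens on four ports: public (client traffic), cluster (OSD‑to‑OSD replication traffic), front (heartbeats over the public network), and back (heartbeats over the cluster network). Heartbeats are exchanged roughly every 6 s (osd_heartbeat_interval), and a peer that has not replied for 20 s (osd_heartbeat_grace) is treated as failed.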
3.3 OSD‑OSD Heartbeat
OSDs within the same PG ping each other; lack of response after 20 s adds the peer to the failure queue.
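The grace-period check can be sketched as follows. `HeartbeatPeers` is a hypothetical helper, and time is passed in explicitly so the logic is easy to follow; it is not how the OSD tracks peers internally.

```python
HEARTBEAT_GRACE = 20.0  # seconds without a reply before a peer is reported


class HeartbeatPeers:
    """Tracks the last heartbeat reply seen from each peer OSD."""

    def __init__(self, peers, now):
        self.last_reply = {peer: now for peer in peers}

    def on_reply(self, peer, now):
        self.last_reply[peer] = now

    def failed_peers(self, now, grace=HEARTBEAT_GRACE):
        """Peers whose last reply is older than the grace period."""
        return sorted(p for p, t in self.last_reply.items() if now - t > grace)


hb = HeartbeatPeers(["osd.1", "osd.2"], now=0.0)
hb.on_reply("osd.1", now=18.0)
```

At time 21 only `osd.2` has been silent for more than 20 s, so it alone would be added to the failure queue.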
3.4 OSD‑Monitor Heartbeat
OSDs report events, startup, and periodic status to monitors. Monitors aggregate failure reports and take nodes offline after thresholds are met, using a tolerant approach to network jitter.
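A minimal sketch of the aggregation idea: the monitor marks an OSD down only after enough distinct peers have reported it, so one OSD with a flaky link cannot take a healthy peer offline. `FailureAggregator` and its `min_reporters` parameter are hypothetical and stand in for the monitor's actual thresholds.

```python
class FailureAggregator:
    """Collects failure reports and marks a target down past a threshold."""

    def __init__(self, min_reporters=2):
        self.min_reporters = min_reporters
        self.reports = {}  # target OSD -> set of OSDs that reported it
        self.down = set()

    def report_failure(self, reporter, target):
        """Record a report; return True once the target is considered down."""
        self.reports.setdefault(target, set()).add(reporter)
        if len(self.reports[target]) >= self.min_reporters:
            self.down.add(target)
        return target in self.down

    def cancel_report(self, reporter, target):
        """A reporter withdraws its report, e.g. after heartbeats resume."""
        self.reports.get(target, set()).discard(reporter)
```

Repeated reports from the same peer count once, which is what gives the scheme its tolerance to a single jittery link.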
3.5 Summary
Ceph combines peer OSD failure reports (seconds‑scale) with monitor aggregation (minutes‑scale) to achieve timely detection, low pressure on monitors, tolerance to network jitter, and efficient state propagation.
4. Ceph Communication Framework
4.1 Communication Types
Simple: one thread per direction per connection (high CPU cost).
Async: event‑driven I/O multiplexing (default).
XIO: uses accelio library (experimental).
4.2 Design Pattern
Publish/Subscribe (Observer) model where a Messenger publishes messages and Dispatcher subclasses subscribe to handle them.
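The pattern can be illustrated with a small sketch. The class and method names loosely echo the roles named above, but the bodies are hypothetical and much simpler than the real messenger, which also handles connections, queues, and threading.

```python
class Dispatcher:
    """Subscriber interface: subclasses decide which messages they handle."""

    def ms_dispatch(self, message):
        raise NotImplementedError


class Messenger:
    """Publisher: offers each incoming message to registered dispatchers."""

    def __init__(self):
        self.dispatchers = []

    def add_dispatcher(self, dispatcher):
        self.dispatchers.append(dispatcher)

    def deliver(self, message):
        # The first dispatcher that returns True consumes the message.
        return any(d.ms_dispatch(message) for d in self.dispatchers)


class PingDispatcher(Dispatcher):
    def __init__(self):
        self.seen = []

    def ms_dispatch(self, message):
        if message.get("type") != "ping":
            return False
        self.seen.append(message)
        return True


messenger = Messenger()
pings = PingDispatcher()
messenger.add_dispatcher(pings)
```

The messenger knows nothing about message semantics; adding a new message type means registering another `Dispatcher` subclass rather than changing the transport code.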
4.3 Framework Flow
Accepter creates a Pipe for each peer, Pipe reads/writes messages, Messenger dispatches messages to registered Dispatchers via a DispatchQueue.
4.4 Class Diagram
4.5 Message Format
A message consists of three parts: a header, the user data, and a footer. The user data is carried in three segments: payload (front), middle, and data. The header includes the sequence number, type, priority, version, segment lengths, source entity, and a CRC. The footer contains one CRC per segment, a signature, and flags.
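To make the framing concrete, here is a deliberately simplified sketch in Python: a fixed-size header carrying the three segment lengths, the three segments, and a footer carrying one checksum per segment. The field selection is reduced for illustration, and `zlib.crc32` is an assumption standing in for the checksum the real protocol uses.

```python
import struct
import zlib

HEADER = struct.Struct("<QHIII")  # seq, type, front_len, middle_len, data_len
FOOTER = struct.Struct("<III")    # front_crc, middle_crc, data_crc


def encode(seq, msg_type, front, middle=b"", data=b""):
    header = HEADER.pack(seq, msg_type, len(front), len(middle), len(data))
    footer = FOOTER.pack(zlib.crc32(front), zlib.crc32(middle), zlib.crc32(data))
    return header + front + middle + data + footer


def decode(frame):
    seq, msg_type, *lengths = HEADER.unpack_from(frame, 0)
    pos = HEADER.size
    segments = []
    for length in lengths:
        segments.append(bytes(frame[pos:pos + length]))
        pos += length
    crcs = FOOTER.unpack_from(frame, pos)
    if tuple(zlib.crc32(s) for s in segments) != crcs:
        raise ValueError("segment CRC mismatch")
    return seq, msg_type, segments
```

Because the lengths travel in the header, the receiver can locate the footer without scanning, and the per-segment checksums let it reject a frame whose body was corrupted in transit.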
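class Message : public RefCountedObject {
protected:
    ceph_msg_header header;          // fixed-size message header
    ceph_msg_footer footer;          // CRCs, signature, flags
    bufferlist payload;              // "front" segment
    bufferlist middle;               // "middle" segment
    bufferlist data;                 // bulk data segment
    utime_t recv_stamp;
    utime_t dispatch_stamp;
    utime_t throttle_stamp;
    utime_t recv_complete_stamp;
    ConnectionRef connection;        // connection the message belongs to
    uint32_t magic = 0;
    bi::list_member_hook<> dispatch_q;
};
struct ceph_msg_header {
    __le64 seq;                      /* message sequence number for this session */
    __le64 tid;                      /* transaction id */
    __le16 type;                     /* message type */
    __le16 priority;                 /* higher value == higher priority */
    __le16 version;                  /* version of message encoding */
    __le32 front_len;                /* bytes in main payload */
    __le32 middle_len;               /* bytes in middle payload */
    __le32 data_len;                 /* bytes of data payload */
    __le16 data_off;                 /* offset of the data payload */
    struct ceph_entity_name src;     /* sender */
    __le16 compat_version;           /* oldest code that can decode this */
    __le16 reserved;
    __le32 crc;                      /* header checksum */
} __attribute__ ((packed));
struct ceph_msg_footer {
    __le32 front_crc, middle_crc, data_crc;   /* one CRC per segment */
    __le64 sig;                      /* message signature */
    __u8 flags;
} __attribute__ ((packed));
5. Ceph CRUSH Algorithm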
5.1 Data Distribution Challenges
Need balanced placement, load balancing, scalable cluster growth, and minimal metadata overhead.
5.2 CRUSH Overview
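CRUSH (Controlled Replication Under Scalable Hashing) maps PGs to OSDs using a deterministic pseudo‑random process, so the same PG always maps to the same set of OSDs for a given cluster map. Clients can therefore compute an object's location themselves instead of querying a central lookup table.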
5.3 CRUSH Principles
5.3.1 Hierarchical Cluster Map
The map reflects physical topology (root → datacenter → rack → host → OSD) and enables fault‑domain awareness.
5.3.2 Placement Rules
Rules define where replicas are placed, which failure domains to use, and the search strategy (breadth‑first or depth‑first).
5.3.3 Bucket Types
Uniform buckets: equal weight, few changes.
List buckets: optimal data movement on expansion, O(n) lookup.
Tree buckets: O(log n) lookup, stable IDs.
Straw buckets: random “straw” lengths for fair competition, minimal data movement.
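The straw idea can be sketched in a few lines. This is a simplified, straw2-style illustration and not Ceph's implementation: `hashlib.sha256` replaces the real hash, and `straw_select` is a hypothetical name. Every item draws a pseudo-random straw scaled by its weight, and the longest straw wins.

```python
import hashlib
import math


def _draw(pg_id, item, replica):
    """Deterministic pseudo-random value in (0, 1] for (pg, item, replica)."""
    digest = hashlib.sha256(f"{pg_id}:{item}:{replica}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / 2**64


def straw_select(pg_id, weights, replica=0):
    """Pick the item with the longest weight-scaled straw."""
    best_item, best_straw = None, -math.inf
    for item, weight in weights.items():
        if weight <= 0:
            continue  # zero-weight items never win
        straw = math.log(_draw(pg_id, item, replica)) / weight
        if straw > best_straw:
            best_item, best_straw = item, straw
    return best_item
```

Each item's straw depends only on that item, so adding an OSD can change a PG's winner only if the new OSD itself draws the longest straw. Existing data never shuffles between the old OSDs, which is the "minimal data movement" property noted above.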
5.4 Example Rule
rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
5.5 CRUSH Use Case
Assign high‑priority workloads to SSD‑backed OSDs by creating a pool with a rule that selects only SSD hosts.
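The effect of such a rule can be imitated with a toy placement function: restrict the candidates to SSD devices, then pick at most one OSD per host so that the host remains the failure domain. The `CLUSTER` layout, host names, and `place` function are all hypothetical, and the hash-based ranking is a stand-in for CRUSH's own selection.

```python
import hashlib

# Hypothetical topology: host -> {osd: device class}
CLUSTER = {
    "host-a": {"osd.0": "ssd", "osd.1": "hdd"},
    "host-b": {"osd.2": "ssd", "osd.3": "hdd"},
    "host-c": {"osd.4": "hdd"},
    "host-d": {"osd.5": "ssd"},
}


def _score(pg_id, name):
    """Deterministic pseudo-random rank for a (PG, name) pair."""
    return hashlib.sha256(f"{pg_id}:{name}".encode()).hexdigest()


def place(pg_id, replicas, device_class):
    """Pick `replicas` OSDs of one device class, each on a different host."""
    candidates = {
        host: [osd for osd, cls in osds.items() if cls == device_class]
        for host, osds in CLUSTER.items()
    }
    hosts = sorted(
        (h for h, osds in candidates.items() if osds),
        key=lambda h: _score(pg_id, h),
        reverse=True,
    )
    return [max(candidates[h], key=lambda o: _score(pg_id, o))
            for h in hosts[:replicas]]
```

A pool bound to the SSD rule would see only `osd.0`, `osd.2`, and `osd.5` in this layout, while `host-c` never receives its data because it has no SSD.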
6. Customizable Ceph RBD QoS
6.1 QoS Introduction
QoS (Quality of Service) limits I/O bandwidth and IOPS to guarantee performance for priority tenants.
6.2 IO Types
ClientOp: client read/write requests.
SubOp: OSD‑to‑OSD replication, recovery, etc.
SnapTrim: snapshot deletion.
Scrub/Deep Scrub: data integrity checks.
Recovery: data migration after failures or scaling.
6.3 Official QoS (mClock)
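mClock schedules I/O based on three per‑client parameters: reservation (the minimum rate a client is guaranteed), weight (its proportional share of any remaining capacity), and limit (the maximum rate it may reach).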
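The sketch below follows a simplified reading of the published mClock algorithm and is not Ceph's scheduler: each request receives three tags spaced by the inverse of the reservation, weight, and limit, and the scheduler first serves overdue reservations, then shares spare capacity by weight among clients still under their limit. It omits the tag adjustments the full algorithm performs, and all names are hypothetical.

```python
class MClockClient:
    """Assigns reservation (R), proportional (P), and limit (L) tags."""

    def __init__(self, reservation, weight, limit):
        self.reservation, self.weight, self.limit = reservation, weight, limit
        # -inf makes the first request's tags equal to its arrival time.
        self.r_tag = self.p_tag = self.l_tag = float("-inf")

    def tag_request(self, now):
        self.r_tag = max(self.r_tag + 1.0 / self.reservation, now)
        self.p_tag = max(self.p_tag + 1.0 / self.weight, now)
        self.l_tag = max(self.l_tag + 1.0 / self.limit, now)
        return self.r_tag, self.p_tag, self.l_tag


def pick_next(pending, now):
    """pending maps client -> (r_tag, p_tag, l_tag) of its oldest request."""
    due = [c for c, (r, _, _) in pending.items() if r <= now]
    if due:  # reservation phase: meet guaranteed rates first
        return min(due, key=lambda c: pending[c][0])
    eligible = [c for c, (_, _, l) in pending.items() if l <= now]
    if eligible:  # weight phase: share spare capacity, respecting limits
        return min(eligible, key=lambda c: pending[c][1])
    return None  # everyone is at their limit; stay idle
```

A client with a reservation of 10 requests per second gets R tags 0.1 s apart, so as long as the scheduler serves overdue R tags first, that client sees at least its reserved rate regardless of the weights of its neighbors.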
6.4 Token‑Bucket QoS
6.4.1 Token Bucket Basics
Tokens are added at a configured rate; each I/O consumes tokens proportional to its size. If insufficient tokens exist, the request is delayed.
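A minimal token bucket along these lines might look as follows. The class is a generic sketch with hypothetical names, not the librbd implementation; time is passed in explicitly so the refill arithmetic is visible.

```python
class TokenBucket:
    """Tokens accrue at `rate` per second, capped at `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so short bursts are allowed
        self.last = now

    def _refill(self, now):
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_consume(self, cost, now):
        """Take `cost` tokens if available; otherwise leave the bucket as is."""
        self._refill(now)
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False

    def delay_until_available(self, cost, now):
        """Seconds to wait before `cost` tokens will have accumulated."""
        self._refill(now)
        return max(0.0, (cost - self.tokens) / self.rate)
```

For a bandwidth limit the cost would be the request size in bytes, and for an IOPS limit it would be 1 per request. A request whose cost exceeds `capacity` can never be satisfied by this sketch, so a real limiter needs a policy for oversized requests, such as splitting them or letting the balance go negative.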
6.4.2 RBD Token‑Bucket Flow
Client issues async I/O to an image.
Request enters ImageRequestWQ.
Before execution, the request passes through a TokenBucket.
TokenBucket enforces rate limits, then the request proceeds.
6.4.3 Framework Diagram
6.4.4 Integration Diagram
