
Unlocking Ceph: Deep Dive into Architecture, IO Flow, Heartbeat, CRUSH & QoS

This article provides a comprehensive overview of Ceph, covering its architecture and use cases, detailed IO processes, heartbeat mechanisms, communication framework, the CRUSH data placement algorithm, and customizable RBD QoS policies, illustrated with diagrams and code snippets to help readers understand and implement Ceph in cloud environments.

21CTO

1. Ceph Architecture Overview and Use Cases

Ceph is a unified distributed storage system designed for performance, reliability and scalability. It originated in Sage Weil's doctoral research, begun in 2004, and is now supported by many cloud vendors. Red Hat and OpenStack both integrate with Ceph, where it commonly serves as backend storage for virtual machine images.

1.1 Ceph Features

High performance: uses the CRUSH algorithm for balanced data distribution and parallelism.

High availability: flexible replica count, fault‑domain isolation, self‑healing.

Scalability: decentralized, linear growth with added nodes.

Rich functionality: supports block, file and object interfaces, custom APIs and multiple language bindings.

1.2 Core Components

Monitor (MON) : small cluster of monitors that store cluster map metadata using Paxos.

Object Storage Device (OSD) : processes client requests and stores objects.

Metadata Server (MDS) : provides metadata for CephFS.

Object : lowest‑level storage unit containing data and metadata.

Placement Group (PG) : logical grouping of objects for placement.

RADOS : reliable autonomic distributed object store that drives data distribution and failover.

Librados : library used by higher‑level services (RBD, RGW, CephFS).

CRUSH : data placement algorithm.

RBD : block device service.

RGW : object gateway compatible with S3/Swift.

CephFS : POSIX‑compatible file system.

2. Ceph IO Process

2.1 Normal IO Flow

Client creates a cluster handler.

Client reads configuration.

Client connects to monitor to obtain cluster map.

Client uses CRUSH map to select primary OSD.

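Primary OSD writes the data locally and forwards it to the replica OSDs (two replicas in a three-copy pool).

Primary OSD waits for acknowledgments from the replicas.

Once every copy has been written, the primary acknowledges the client and the write completes.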
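The write path above can be sketched as a small simulation. This is a toy model, not Ceph code: the `Osd` class and `client_write` function are hypothetical names, and the sketch ignores what a real cluster does when a replica is unavailable (re-peering and a new acting set).

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    """A toy OSD that keeps objects in a dict (hypothetical, not Ceph code)."""
    osd_id: int
    up: bool = True
    store: dict = field(default_factory=dict)

    def write(self, name: str, data: bytes) -> bool:
        """Persist the object and return True as the acknowledgment."""
        if not self.up:
            return False
        self.store[name] = data
        return True


def client_write(acting_set: list, name: str, data: bytes) -> bool:
    """Send a write to the primary, which fans it out to the replicas.

    The client sees success only after every OSD in the set has acknowledged.
    """
    primary, replicas = acting_set[0], acting_set[1:]
    if not primary.write(name, data):
        return False
    acks = [replica.write(name, data) for replica in replicas]
    return all(acks)


osds = [Osd(0), Osd(1), Osd(2)]          # osds[0] plays the primary
ok = client_write(osds, "obj-1", b"hello")
print("write acknowledged:", ok)
```

The point of the sketch is the ordering: the client talks only to the primary, and the acknowledgment it receives already covers all copies.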

2.2 New Primary IO Flow

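When a newly added OSD becomes the primary for a PG, it holds none of that PG's data yet, so it reports this to the monitor. An existing replica OSD temporarily acts as primary and keeps serving client I/O while it synchronizes the full data set to the new OSD. Once synchronization completes, the primary role is handed over to the new OSD and the temporary primary returns to being a replica.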

2.3 IO Algorithm Flow

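File → Object mapping: a file is split into objects, and each object ID (oid) combines the file's inode number (ino) with the object number (ono).

Object → PG mapping: the oid is hashed and the result is masked to give the PG ID, pgid = hash(oid) & mask, where mask = pg_num - 1 when pg_num is a power of two.

PG → OSD mapping: CRUSH maps the pgid to an ordered list of OSDs.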

2.4 Ceph IO Pseudo Code

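locator = object_name              # by default the object name is the locator key
obj_hash = hash(locator)           # Ceph's default object hash is rjenkins
pg = obj_hash % num_pg             # simplified; Ceph uses a stable modulo so PG splits move less data
osds_for_pg = crush(pg)            # returns an ordered list of osds
primary = osds_for_pg[0]           # the first OSD in the list acts as primary
replicas = osds_for_pg[1:]         # the remaining OSDs hold the replicas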
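A runnable version of the pseudocode, as a sketch under stated assumptions: `zlib.crc32` stands in for Ceph's rjenkins hash, and `toy_crush` is a hypothetical stand-in that ranks OSDs by a hash score rather than walking a real CRUSH map. `stable_mod` follows the logic of Ceph's `ceph_stable_mod` helper.

```python
import hashlib
import zlib


def stable_mod(x: int, b: int, bmask: int) -> int:
    """Like x % b, but placements stay stable while b grows toward bmask + 1."""
    if (x & bmask) < b:
        return x & bmask
    return x & (bmask >> 1)


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Hash the object name and fold it into the range [0, pg_num)."""
    obj_hash = zlib.crc32(object_name.encode())    # stand-in for rjenkins
    bmask = (1 << (pg_num - 1).bit_length()) - 1   # smallest 2^n - 1 covering pg_num - 1
    return stable_mod(obj_hash, pg_num, bmask)


def toy_crush(pg: int, osd_ids: list, replicas: int) -> list:
    """Stand-in for CRUSH: rank OSDs by a hash of (pg, osd), keep the top N."""
    def score(osd_id: int) -> bytes:
        return hashlib.sha256(f"{pg}:{osd_id}".encode()).digest()
    return sorted(osd_ids, key=score, reverse=True)[:replicas]


pg = object_to_pg("example-object", pg_num=128)
acting = toy_crush(pg, osd_ids=list(range(6)), replicas=3)
primary, replicas = acting[0], acting[1:]
print(f"pg={pg} primary=osd.{primary} replicas={replicas}")
```

Both steps are pure functions of their inputs, which is why any client holding the same cluster map computes the same placement without asking a central lookup service.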

2.5 RBD IO Flow

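The client creates a pool and then an RBD image inside it. Data written to the image is split into objects (4 MiB by default), each object maps to a PG, and each PG is stored on three OSDs when the pool keeps three replicas. With the FileStore backend, each OSD's underlying disk is formatted with a local file system, typically XFS; the newer BlueStore backend writes to raw block devices instead.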
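To make the splitting step concrete, here is a minimal sketch that maps a byte range of an image onto 4 MiB objects. It assumes the simplest layout (no custom striping) and an object naming scheme of `rbd_data.<image id>.<object number as 16 hex digits>`; the image ID `abc123` and the function names are hypothetical.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB, the object size mentioned above


def object_name(image_id: str, object_no: int) -> str:
    """Assumed naming scheme: rbd_data.<image id>.<object number, 16 hex digits>."""
    return f"rbd_data.{image_id}.{object_no:016x}"


def map_extent(image_id: str, offset: int, length: int) -> list:
    """Split an image byte range into (object name, offset in object, length)."""
    extents = []
    end = offset + length
    while offset < end:
        object_no = offset // OBJECT_SIZE
        in_object = offset % OBJECT_SIZE
        chunk = min(OBJECT_SIZE - in_object, end - offset)
        extents.append((object_name(image_id, object_no), in_object, chunk))
        offset += chunk
    return extents


MIB = 1024 * 1024
# A 6 MiB write starting 3 MiB into the image touches three objects.
pieces = map_extent("abc123", offset=3 * MIB, length=6 * MIB)
for name, in_object, chunk in pieces:
    print(name, in_object, chunk)
```

Each resulting object name is then hashed to a PG and placed by CRUSH, exactly like any other RADOS object.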

2.6 RBD IO Framework

Client uses librbd to create a block device, calls librados to map pool → image → object → PG, then communicates with primary OSD, which forwards data to replicas.

2.7 Pool and PG Distribution

Pool acts as a namespace; each pool contains a configurable number of PGs, which are distributed across OSDs. Expansion adds OSDs and rebalances PGs automatically.

2.8 Data Expansion

When OSD count increases, PGs are migrated to new OSDs to keep distribution balanced.
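The effect of expansion can be illustrated with a toy placement function. This sketch does not run CRUSH; it uses highest-hash-score selection as a stand-in (the `place` function is hypothetical) to show the property that matters here: when an OSD is added, only the PGs that land on the new OSD move, and everything else stays where it was.

```python
import hashlib


def place(pg: int, osd_ids: list) -> int:
    """Toy placement: the OSD with the highest hash score for this PG wins."""
    return max(
        osd_ids,
        key=lambda osd: hashlib.sha256(f"{pg}:{osd}".encode()).digest(),
    )


PG_NUM = 1024
before = {pg: place(pg, [0, 1, 2, 3]) for pg in range(PG_NUM)}
after = {pg: place(pg, [0, 1, 2, 3, 4]) for pg in range(PG_NUM)}   # osd.4 added

moved = [pg for pg in range(PG_NUM) if before[pg] != after[pg]]
moved_fraction = len(moved) / PG_NUM
print(f"{len(moved)} of {PG_NUM} PGs moved ({moved_fraction:.1%})")
```

With four OSDs growing to five, roughly one fifth of the PGs should move, which is the minimum needed to give the new OSD an equal share.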

3. Ceph Heartbeat Mechanism

3.1 Overview

Heartbeats detect node failures, balancing detection latency and network load.

3.2 Heartbeat Detection

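OSD nodes listen on four ports: public, cluster, front and back. The front and back ports carry heartbeats over the public and cluster networks respectively. By default, heartbeats are sent roughly every 6 s (osd_heartbeat_interval); a peer that has not replied for 20 s (osd_heartbeat_grace) is treated as failed and reported.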

3.3 OSD‑OSD Heartbeat

OSDs within the same PG ping each other; lack of response after 20 s adds the peer to the failure queue.

3.4 OSD‑Monitor Heartbeat

OSDs report events, startup, and periodic status to monitors. Monitors aggregate failure reports and take nodes offline after thresholds are met, using a tolerant approach to network jitter.

3.5 Summary

Ceph combines peer OSD failure reports (seconds‑scale) with monitor aggregation (minutes‑scale) to achieve timely detection, low pressure on monitors, tolerance to network jitter, and efficient state propagation.
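The two-level detection described in this section can be sketched as follows. It is a simplified model with an explicit clock: `PeerTracker`, `ToyMonitor` and the `MIN_REPORTERS` threshold are hypothetical names and values chosen for illustration, while the 6 s interval and 20 s grace period come from the text above.

```python
HEARTBEAT_INTERVAL = 6.0   # seconds between pings, as described above
HEARTBEAT_GRACE = 20.0     # seconds without a reply before a peer is reported
MIN_REPORTERS = 2          # hypothetical threshold used by the toy monitor


class PeerTracker:
    """Tracks the last reply time of each heartbeat peer of one OSD."""

    def __init__(self, peers, now):
        self.last_reply = {peer: now for peer in peers}

    def on_reply(self, peer, now):
        self.last_reply[peer] = now

    def failed_peers(self, now):
        """Peers whose last reply is older than the grace period."""
        return [p for p, t in self.last_reply.items() if now - t > HEARTBEAT_GRACE]


class ToyMonitor:
    """Marks an OSD down only after enough distinct peers have reported it."""

    def __init__(self):
        self.reports = {}   # failed OSD id -> set of reporter ids
        self.down = set()

    def report_failure(self, reporter, failed):
        self.reports.setdefault(failed, set()).add(reporter)
        if len(self.reports[failed]) >= MIN_REPORTERS:
            self.down.add(failed)


mon = ToyMonitor()
trackers = {0: PeerTracker([1, 2], now=0.0), 1: PeerTracker([0, 2], now=0.0)}
for tick in range(1, 5):                    # t = 6, 12, 18, 24 seconds
    now = tick * HEARTBEAT_INTERVAL
    trackers[0].on_reply(1, now)            # osd.0 and osd.1 keep answering
    trackers[1].on_reply(0, now)            # osd.2 stays silent throughout
    for reporter, tracker in trackers.items():
        for failed in tracker.failed_peers(now):
            mon.report_failure(reporter, failed)

print("down:", mon.down)
```

Requiring several reporters before marking an OSD down is what gives the scheme its tolerance to a single flaky link.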

4. Ceph Communication Framework

4.1 Communication Types

Simple: one thread per direction per connection (high CPU cost).

Async: event‑driven I/O multiplexing (default).

XIO: uses accelio library (experimental).

4.2 Design Pattern

Publish/Subscribe (Observer) model where a Messenger publishes messages and Dispatcher subclasses subscribe to handle them.
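A minimal sketch of this pattern in Python. The method names `ms_dispatch` and `add_dispatcher_tail` are borrowed from the Ceph class names for readability, but the bodies, the `deliver` method and the dict-based message are hypothetical simplifications, not the real C++ interfaces.

```python
class Dispatcher:
    """Base subscriber: ms_dispatch returns True when the message was handled."""

    def ms_dispatch(self, message: dict) -> bool:
        raise NotImplementedError


class Messenger:
    """Publisher: offers each incoming message to its dispatchers in order."""

    def __init__(self):
        self.dispatchers = []

    def add_dispatcher_tail(self, dispatcher: Dispatcher) -> None:
        self.dispatchers.append(dispatcher)

    def deliver(self, message: dict) -> bool:
        """Stop at the first dispatcher that accepts the message."""
        return any(d.ms_dispatch(message) for d in self.dispatchers)


class PingDispatcher(Dispatcher):
    """Example subscriber that only handles messages of type 'ping'."""

    def __init__(self):
        self.seen = []

    def ms_dispatch(self, message: dict) -> bool:
        if message["type"] != "ping":
            return False
        self.seen.append(message)
        return True


messenger = Messenger()
pings = PingDispatcher()
messenger.add_dispatcher_tail(pings)
handled = messenger.deliver({"type": "ping", "from": "osd.1"})
ignored = messenger.deliver({"type": "osd_op"})
print(handled, ignored, len(pings.seen))
```

The messenger knows nothing about message semantics; each subsystem registers a dispatcher and claims only the message types it understands.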

4.3 Framework Flow

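In the Simple messenger, an Accepter listens for incoming connections and creates a Pipe for each peer. Each Pipe reads and writes messages on its connection. Received messages are placed on a DispatchQueue, from which the Messenger hands them to the registered Dispatchers.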

4.4 Class Diagram

4.5 Message Format

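A message consists of a header, user data, and a footer. The user data has three parts: payload (the "front" section), middle, and data. The header includes sequence, type, priority, version, the lengths of the three user-data parts, source entity, and a CRC of the header itself. The footer contains one CRC per user-data part, a signature, and flags.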

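class Message : public RefCountedObject {
protected:
  ceph_msg_header  header;      // message header
  ceph_msg_footer  footer;      // message footer
  bufferlist       payload;     // "front" section
  bufferlist       middle;      // "middle" section
  bufferlist       data;        // bulk data section
  utime_t recv_stamp;           // when reading the message off the wire started
  utime_t dispatch_stamp;       // when dispatch to the endpoints started
  utime_t throttle_stamp;       // when the throttle was acquired
  utime_t recv_complete_stamp;  // when the message was fully read
  ConnectionRef connection;     // connection the message belongs to
  uint32_t magic = 0;           // magic number
  bi::list_member_hook<> dispatch_q;  // Boost intrusive list hook for the dispatch queue
};

struct ceph_msg_header {
    __le64 seq;          // sequence number within the session
    __le64 tid;          // transaction ID
    __le16 type;         // message type
    __le16 priority;     // higher value means higher priority
    __le16 version;      // version of the message encoding
    __le32 front_len;    // bytes in the payload ("front") section
    __le32 middle_len;   // bytes in the middle section
    __le32 data_len;     // bytes in the data section
    __le16 data_off;     // offset of the data section, used for alignment
    struct ceph_entity_name src;  // sender of the message
    __le16 compat_version;        // oldest version able to decode this message
    __le16 reserved;
    __le32 crc;          // CRC of the header
} __attribute__ ((packed));

struct ceph_msg_footer {
    __le32 front_crc, middle_crc, data_crc;  // one CRC per user-data section
    __le64  sig;         // 64-bit message signature
    __u8 flags;          // end-of-message flags
} __attribute__ ((packed));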
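Because both structs are packed and little-endian, their wire layout can be reproduced with Python's `struct` module. The sketch below rests on two assumptions that are not stated in the listing above: that `ceph_entity_name` is a 1-byte type followed by a 64-bit number, and that the header CRC covers every field before it. It also uses `zlib.crc32` as a stand-in, whereas Ceph computes CRC-32C, a different polynomial. The field values in the usage lines are arbitrary.

```python
import struct
import zlib

# Little-endian, no padding, mirroring the packed structs above.
# Assumption: ceph_entity_name = 1-byte type + 64-bit number.
HEADER_BODY = struct.Struct("<QQHHHIIIHBQHH")   # every field except the trailing crc
HEADER_CRC = struct.Struct("<I")
FOOTER = struct.Struct("<IIIQB")


def pack_header(seq, tid, msg_type, priority, version,
                front_len, middle_len, data_len, data_off,
                src_type, src_num, compat_version=1, reserved=0) -> bytes:
    """Serialize the header fields and append a checksum over them."""
    body = HEADER_BODY.pack(seq, tid, msg_type, priority, version,
                            front_len, middle_len, data_len, data_off,
                            src_type, src_num, compat_version, reserved)
    # zlib.crc32 is a stand-in; Ceph uses CRC-32C here.
    return body + HEADER_CRC.pack(zlib.crc32(body))


def unpack_header(raw: bytes) -> tuple:
    """Verify the checksum and return the header fields as a tuple."""
    body = raw[:HEADER_BODY.size]
    (crc,) = HEADER_CRC.unpack(raw[HEADER_BODY.size:])
    if zlib.crc32(body) != crc:
        raise ValueError("header CRC mismatch")
    return HEADER_BODY.unpack(body)


raw = pack_header(seq=1, tid=42, msg_type=7, priority=127, version=1,
                  front_len=16, middle_len=0, data_len=4096, data_off=0,
                  src_type=8, src_num=3)
fields = unpack_header(raw)
print(len(raw), FOOTER.size, fields[:3])
```

The three length fields tell the receiver how many bytes of payload, middle and data follow the header, so it can read the whole message before checking the footer CRCs.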

5. Ceph CRUSH Algorithm

5.1 Data Distribution Challenges

Need balanced placement, load balancing, scalable cluster growth, and minimal metadata overhead.

5.2 CRUSH Overview

CRUSH (Controlled Replication Under Scalable Hashing) maps PGs to OSDs using a deterministic pseudo‑random process, so the same PG always maps to the same set of OSDs for a given cluster map and placement rule.

5.3 CRUSH Principles

5.3.1 Hierarchical Cluster Map

The map reflects physical topology (root → datacenter → rack → host → OSD) and enables fault‑domain awareness.

5.3.2 Placement Rules

Rules define where replicas are placed, which failure domains to use, and the search strategy (breadth‑first or depth‑first).

5.3.3 Bucket Types

Uniform buckets: equal weight, few changes.

List buckets: optimal data movement on expansion, O(n) lookup.

Tree buckets: O(log n) lookup, stable IDs.

Straw buckets: random “straw” lengths for fair competition, minimal data movement.
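The straw idea can be sketched in a few lines. Note that this uses the straw2-style draw, ln(u) / weight with the largest value winning, rather than the original straw length calculation, and it substitutes SHA-256 for CRUSH's own hash; `straw2_choose` is a hypothetical name.

```python
import hashlib
import math


def straw2_choose(key: int, items: dict) -> str:
    """Pick one item; each item's chance is proportional to its weight.

    items maps an item name to a positive weight.
    """
    best_name, best_draw = None, -math.inf
    for name, weight in items.items():
        digest = hashlib.sha256(f"{key}:{name}".encode()).digest()
        # Map the hash to u in (0, 1]; ln(u) <= 0, so a larger weight
        # pulls the draw toward 0 and makes the item more likely to win.
        u = (int.from_bytes(digest[:8], "big") + 1) / 2**64
        draw = math.log(u) / weight
        if draw > best_draw:
            best_name, best_draw = name, draw
    return best_name


weights = {"osd.0": 1.0, "osd.1": 2.0}
picks = [straw2_choose(key, weights) for key in range(3000)]
heavy_share = picks.count("osd.1") / len(picks)
print(f"osd.1 share: {heavy_share:.2f}")
```

With weights 1 and 2, the heavier item should win about two thirds of the draws. Because each item's draw depends only on the key and that item, adding an item can only move keys onto the new item, which is the minimal data movement property noted above.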

5.4 Example Rule

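rule replicated_ruleset {
    ruleset 0                            # rule ID that pools refer to
    type replicated                      # rule for replicated pools
    min_size 1                           # applies to pools with 1 to 10 replicas
    max_size 10
    step take default                    # start from the root bucket named "default"
    step chooseleaf firstn 0 type host   # pick N distinct hosts and one OSD under each; 0 means the pool's replica count
    step emit                            # output the selected OSDs
}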
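What the `chooseleaf firstn 0 type host` step achieves can be sketched with a two-level map. This is a simplified stand-in, not the CRUSH algorithm: the `CLUSTER` layout, host names and hash-score selection are all hypothetical, and the sketch ignores weights, retries and failed devices.

```python
import hashlib

# Hypothetical two-level map: host name -> OSD ids on that host.
CLUSTER = {
    "host-a": [0, 1],
    "host-b": [2, 3],
    "host-c": [4, 5],
    "host-d": [6, 7],
}


def _score(*parts) -> bytes:
    """Deterministic pseudo-random score for a combination of inputs."""
    return hashlib.sha256(":".join(map(str, parts)).encode()).digest()


def chooseleaf_firstn(pg: int, cluster: dict, num_replicas: int) -> list:
    """Pick num_replicas distinct hosts, then one OSD under each chosen host."""
    hosts = sorted(cluster, key=lambda h: _score(pg, h), reverse=True)[:num_replicas]
    return [max(cluster[h], key=lambda osd: _score(pg, h, osd)) for h in hosts]


osds = chooseleaf_firstn(pg=17, cluster=CLUSTER, num_replicas=3)
print(osds)
```

Choosing hosts first and OSDs second guarantees that no two replicas of a PG share a host, so losing one host costs at most one copy.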

5.5 CRUSH Use Case

Assign high‑priority workloads to SSD‑backed OSDs by creating a pool with a rule that selects only SSD hosts.

6. Customizable Ceph RBD QoS

6.1 QoS Introduction

QoS (Quality of Service) limits I/O bandwidth and IOPS to guarantee performance for priority tenants.

6.2 IO Types

ClientOp: client read/write requests.

SubOp: OSD‑to‑OSD replication, recovery, etc.

SnapTrim: snapshot deletion.

Scrub/Deep Scrub: data integrity checks.

Recovery: data migration after failures or scaling.

6.3 Official QoS (mClock)

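mClock schedules I/O using three per-client parameters: reservation (the minimum rate a client is guaranteed), weight (its proportional share of spare capacity), and limit (the maximum rate it may reach).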

6.4 Token‑Bucket QoS

6.4.1 Token Bucket Basics

Tokens are added at a configured rate; each I/O consumes tokens proportional to its size. If insufficient tokens exist, the request is delayed.
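A minimal token bucket, written with an explicit clock so its behaviour is easy to follow. This is a generic sketch of the technique, not the librbd implementation; the class and method names are hypothetical.

```python
class TokenBucket:
    """Token bucket with an explicit clock (times are in seconds)."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity      # start with a full bucket
        self.last_refill = now

    def _refill(self, now: float) -> None:
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now

    def try_consume(self, cost: float, now: float) -> bool:
        """Take `cost` tokens if available; otherwise leave the bucket untouched."""
        self._refill(now)
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True

    def wait_time(self, cost: float, now: float) -> float:
        """Seconds until `cost` tokens will be available (0 if available now)."""
        self._refill(now)
        return max(0.0, (cost - self.tokens) / self.rate)


bucket = TokenBucket(rate=100, capacity=200)     # e.g. 100 tokens/s, burst of 200
first = bucket.try_consume(150, now=0.0)         # fits in the initial burst
second = bucket.try_consume(100, now=0.0)        # only 50 tokens left, so delayed
delay = bucket.wait_time(100, now=0.0)           # time until 100 tokens exist
third = bucket.try_consume(100, now=delay)       # succeeds after waiting
print(first, second, delay, third)
```

For bandwidth limiting the cost is the request size in bytes; for IOPS limiting every request costs one token. A request whose cost exceeds the capacity can never be served, so the capacity should be at least the largest I/O size.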

6.4.2 RBD Token‑Bucket Flow

Client issues async I/O to an image.

Request enters ImageRequestWQ.

Before execution, the request passes through a TokenBucket.

TokenBucket enforces rate limits, then the request proceeds.

6.4.3 Framework Diagram

6.4.4 Integration Diagram
