
Ceph Distributed Storage System: Architecture, IO Flow, Heartbeat, CRUSH Algorithm, and Custom QoS

This article provides a comprehensive overview of the Ceph distributed storage system, covering its architecture, storage types, internal IO processes, heartbeat mechanisms, communication framework, CRUSH data placement algorithm, and customizable RBD QoS using token‑bucket scheduling.

Architecture Digest

1. Ceph Architecture and Use Cases

Ceph is a unified distributed storage system offering block, file, and object storage with high performance, high availability, and linear scalability. It is widely integrated with OpenStack and Red Hat products for virtual machine image storage.

1.1 Ceph Overview

Ceph originated from research in 2004 and has evolved into a widely adopted open‑source storage solution.

1.2 Ceph Features

Key features include decentralized data placement using the CRUSH algorithm, flexible replica counts, fault‑domain isolation, and support for thousands of nodes from TB to PB scale.

1.3 Ceph Architecture

Ceph provides three storage interfaces: Object (compatible with S3/Swift), Block (supporting thin provisioning, snapshots, cloning), and File (POSIX with snapshots).

1.4 Core Components and Concepts

Monitors (Mon) maintain cluster maps, OSDs store objects, MDS provides metadata for CephFS, and RADOS is the underlying reliable autonomic distributed object store accessed via librados.

1.5 Storage Types

In general, block storage relies on RAID/LVM for redundancy, file storage is shared over protocols such as FTP/NFS, and object storage serves large volumes of unstructured data such as images and videos.

2. Ceph IO Process

The client creates a cluster handler, reads configuration, connects to monitors to obtain the cluster map, and then uses the CRUSH algorithm to map objects to placement groups (PG) and OSDs. Data is written to a primary OSD and replicated to secondary OSDs before acknowledging the client.

2.1 Normal IO Flow

Client creates cluster handler.

Client reads config file.

Client connects to monitor and gets cluster map.

Client selects OSDs based on CRUSH mapping.

Primary OSD writes data and replicates to two other OSDs.

Client receives acknowledgment after all replicas confirm.
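The steps above can be sketched as primary-copy replication. This is a minimal, hypothetical model (class and function names are illustrative, not Ceph APIs): the primary OSD applies the write locally, fans it out to the replicas, and the client is acknowledged only after every replica confirms.

```python
class OSD:
    """Toy stand-in for an OSD's object store."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, obj, data):
        self.store[obj] = data
        return True  # ack

def replicated_write(obj, data, acting_set):
    """acting_set[0] is the primary; the rest are replicas."""
    primary, replicas = acting_set[0], acting_set[1:]
    primary.write(obj, data)                       # local write on the primary
    acks = [r.write(obj, data) for r in replicas]  # fan out to replicas
    return all(acks)                               # client ack only when all replicas confirm

osds = [OSD(f"osd.{i}") for i in range(3)]
ok = replicated_write("rbd_data.0001", b"hello", osds)
```

In the real system the replica writes happen in parallel over the cluster network, but the ordering guarantee is the same: no client acknowledgment until all copies are durable.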

2.2 New Primary IO Flow

When a new OSD replaces a failed primary, it reports to the monitor, the old primary temporarily takes over, synchronizes data, and after full sync the new OSD becomes the primary.

2.3 IO Algorithm Flow (Pseudo‑code)

locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg)  # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
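The pseudocode can be made runnable with stand-ins for the two hash steps. Note the stand-ins are assumptions for illustration: Ceph actually uses its own rjenkins hash, and real CRUSH walks the bucket hierarchy rather than rotating a flat list; only the shape of the mapping (name → PG → ordered OSD list) is faithful here.

```python
import hashlib

def stable_hash(name: str) -> int:
    # Illustrative stand-in for Ceph's rjenkins object hash.
    return int.from_bytes(hashlib.sha1(name.encode()).digest()[:8], "little")

def crush(pg: int, osd_ids, size: int = 3):
    # Stand-in for CRUSH: deterministically pick `size` distinct OSDs for a PG.
    start = pg % len(osd_ids)
    return [osd_ids[(start + i) % len(osd_ids)] for i in range(size)]

num_pg = 128
osd_ids = list(range(6))

locator = "object_name"
obj_hash = stable_hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg, osd_ids)   # ordered list of OSDs
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
```

The key property both the stand-in and real CRUSH share is determinism: any client with the same cluster map computes the same placement, so no central lookup table is needed.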

2.4 RBD IO Flow

Clients create a pool, specify PG count, create an RBD image, and write data in 4 MiB objects. Each object is placed on three OSDs via CRUSH, and OSDs format underlying disks (typically XFS) to store the objects.
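The 4 MiB striping means that mapping a byte offset in an image to a RADOS object is simple integer arithmetic. A sketch, assuming the v2 `rbd_data.<image id>.<16-hex-digit index>` object-naming convention (the image id here is made up):

```python
OBJ_SIZE = 4 * 1024 * 1024  # default 4 MiB RBD object size

def rbd_object_for(image_id: str, byte_offset: int):
    """Map a byte offset in an RBD image to (object name, offset inside object)."""
    index = byte_offset // OBJ_SIZE
    return f"rbd_data.{image_id}.{index:016x}", byte_offset % OBJ_SIZE

# A write 9 MiB into the image lands in the third object, 1 MiB in:
name, off = rbd_object_for("10ab", 9 * 1024 * 1024)
```

Each such object name then goes through the hash → PG → CRUSH pipeline from section 2.3 to find its three OSDs.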

3. Ceph Heartbeat Mechanism

Heartbeats detect node failures. OSDs listen on public, cluster, front, and back ports, and exchange PING/PONG messages roughly every 6 seconds. If a peer does not reply within the 20-second grace period, the observing OSD treats it as failed and reports the failure to the monitors.

3.1 OSD‑OSD Heartbeat

OSDs within the same PG exchange heartbeats; failure is added to a failure queue after timeout.

3.2 OSD‑Monitor Heartbeat

OSDs report events, startup status, and periodic health to monitors, which aggregate reports and decide when to mark an OSD down based on thresholds.
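The peer-failure detection from 3.1 can be sketched as a last-reply table plus a grace period. Class and method names here are hypothetical; only the 20-second default (`osd_heartbeat_grace`) comes from the text above.

```python
GRACE = 20.0  # seconds without a reply before a peer is reported

class HeartbeatState:
    def __init__(self):
        self.last_pong = {}      # peer -> timestamp of last PONG
        self.failure_queue = []  # peers to report to the monitor

    def on_pong(self, peer, now):
        self.last_pong[peer] = now

    def check(self, now):
        # Any peer silent longer than the grace period joins the failure queue.
        for peer, t in self.last_pong.items():
            if now - t > GRACE and peer not in self.failure_queue:
                self.failure_queue.append(peer)
        return self.failure_queue

hb = HeartbeatState()
hb.on_pong("osd.1", now=0.0)
hb.on_pong("osd.2", now=15.0)
failed = hb.check(now=21.0)  # osd.1 silent for 21 s > grace; osd.2 only 6 s
```

The monitor then aggregates such reports from multiple OSDs and only marks the target down once enough independent reporters agree, which avoids acting on a single partitioned observer.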

4. Ceph Communication Framework

Three implementation modes exist: Simple (one thread per connection), Async (event‑driven I/O multiplexing, default in recent releases), and XIO (experimental, using the accelio library).

4.1 Design Pattern

The framework follows a publish/subscribe (observer) pattern where messengers publish messages and dispatcher subclasses subscribe to handle them.
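The shape of that pattern can be sketched as follows. This is a simplified model in Python with hypothetical names, not Ceph's C++ API; only the `ms_dispatch` hook name mirrors the real Dispatcher interface.

```python
class Dispatcher:
    """Subscriber base class: subclasses claim the messages they understand."""
    def ms_dispatch(self, msg) -> bool:
        raise NotImplementedError

class Messenger:
    """Publisher: offers each incoming message to its registered dispatchers."""
    def __init__(self):
        self.dispatchers = []

    def add_dispatcher(self, d: Dispatcher):
        self.dispatchers.append(d)

    def deliver(self, msg):
        for d in self.dispatchers:
            if d.ms_dispatch(msg):  # first dispatcher to claim it wins
                return d
        return None

class OSDDispatcher(Dispatcher):
    def __init__(self):
        self.handled = []
    def ms_dispatch(self, msg):
        if msg.get("type") == "osd_op":
            self.handled.append(msg)
            return True
        return False

m = Messenger()
osd_d = OSDDispatcher()
m.add_dispatcher(osd_d)
claimed = m.deliver({"type": "osd_op", "data": b"write"})
```

Decoupling transport (Messenger) from message handling (Dispatcher) is what lets the three wire implementations, Simple, Async, and XIO, be swapped without touching OSD or monitor logic.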

4.2 Message Structure

class Message : public RefCountedObject {
protected:
  ceph_msg_header  header;
  ceph_msg_footer  footer;
  bufferlist       payload;   // front unaligned blob
  bufferlist       middle;    // middle unaligned blob
  bufferlist       data;      // data payload (page‑aligned when possible)
  // timestamps and connection info omitted for brevity
};

struct ceph_msg_header {
  __le64 seq;       // per‑session sequence number
  __le64 tid;       // global unique id
  __le16 type;      // message type
  __le16 priority;  // priority
  __le16 version;   // version
  __le32 front_len; // payload length
  __le32 middle_len;
  __le32 data_len;
  __le16 data_off;  // data offset
  struct ceph_entity_name src; // source entity
  __le16 compat_version;
  __le16 reserved;
  __le32 crc;       // header CRC32C
} __attribute__ ((packed));

struct ceph_msg_footer {
  __le32 front_crc, middle_crc, data_crc; // CRC checksums
  __le64  sig;   // 64‑bit signature
  __u8   flags;  // end‑of‑message flags
} __attribute__ ((packed));
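The footer's three checksums are computed independently over the payload, middle, and data segments, so corruption can be localized. A sketch of that check, with one stated assumption: Ceph uses CRC32C (Castagnoli), while this sketch substitutes the standard library's `zlib.crc32` (a different polynomial) because CPython has no built-in CRC32C.

```python
import zlib

def make_footer(payload: bytes, middle: bytes, data: bytes) -> dict:
    """One CRC per message segment, mirroring ceph_msg_footer's layout."""
    return {
        "front_crc":  zlib.crc32(payload) & 0xFFFFFFFF,
        "middle_crc": zlib.crc32(middle) & 0xFFFFFFFF,
        "data_crc":   zlib.crc32(data) & 0xFFFFFFFF,
    }

def verify(payload, middle, data, footer) -> bool:
    # Receiver recomputes all three CRCs and compares against the footer.
    return make_footer(payload, middle, data) == footer

footer = make_footer(b"front", b"", b"payload-bytes")
ok = verify(b"front", b"", b"payload-bytes", footer)
corrupt = verify(b"front", b"", b"payload-bytez", footer)  # one byte flipped
```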

5. Ceph CRUSH Algorithm

CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data) provides balanced data distribution, fault‑domain awareness, and efficient scaling. It uses a hierarchical cluster map (buckets) and placement rules to deterministically map PGs to OSDs.

5.1 Challenges

Ensuring uniform data distribution, load balancing, and minimal data movement during cluster expansion or contraction.

5.2 Placement Rules

Rules define the starting bucket, failure domain, and search strategy (breadth‑first or depth‑first) for replica placement.

5.3 Bucket Types

Uniform buckets – all items have equal weight; fast lookup, but suited only to static clusters.

List buckets – optimal for cluster expansion, O(n) lookup.

Tree buckets – O(log n) lookup, stable node IDs.

Straw buckets – random‑length “straws” for fair competition, optimal data movement.
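The straw idea can be sketched concretely. This is a simplified straw2-style draw (the hash is an illustrative stand-in for CRUSH's rjenkins, and function names are made up): each item draws a pseudo-random straw keyed on the PG, scaled by its weight, and the longest straw wins, so a double-weight item wins about twice as often while additions or removals reshuffle only the items whose straws change rank.

```python
import hashlib
import math

def draw(pg: int, item: str) -> float:
    # Deterministic pseudo-random uniform value in (0, 1), keyed on (pg, item).
    h = hashlib.sha1(f"{pg}:{item}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / float(2**64 + 1)

def straw2_select(pg: int, items: dict) -> str:
    """items maps item name -> weight; returns the item with the longest straw."""
    best, best_straw = None, -math.inf
    for item, weight in items.items():
        # ln(u) is negative; dividing by a larger weight pulls it toward 0,
        # so heavier items win proportionally more often.
        straw = math.log(draw(pg, item)) / weight
        if straw > best_straw:
            best, best_straw = item, straw
    return best

hosts = {"host-a": 1.0, "host-b": 1.0, "host-c": 2.0}
winners = [straw2_select(pg, hosts) for pg in range(1000)]
# host-c carries half the total weight, so it should win roughly half the draws
```

Because each item's straw depends only on (pg, item, weight), changing one item's weight perturbs only the placements that item wins or loses, which is the "optimal data movement" property noted above.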

5.4 Example Rule

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

6. Customizable Ceph RBD QoS

QoS (Quality of Service) limits I/O bandwidth and IOPS per client to prevent resource contention. Ceph classifies I/O into ClientOp, SubOp, SnapTrim, Scrub, and Recovery.

6.1 Official QoS (mClock)

mClock is a tag‑based I/O scheduler that uses reservation (minimum guarantee), weight (proportional share), and limit (maximum cap) parameters to allocate I/O resources among clients.

6.2 Token‑Bucket QoS Implementation

A token bucket refills at a configured rate; I/O packets consume tokens proportional to their size. If insufficient tokens exist, the request is delayed until tokens are available, enforcing bandwidth caps.
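A minimal sketch of that throttle, with illustrative names and an injectable clock so the example is deterministic: tokens refill at `rate` per second up to `burst`, each I/O consumes tokens equal to its size, and a request that overdraws the bucket learns how long it must wait.

```python
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float, clock=time.monotonic):
        self.rate, self.burst = rate, burst
        self.tokens = burst          # start with a full bucket
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def consume(self, n: float) -> float:
        """Take n tokens; return how long the caller must delay first."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return 0.0
        wait = (n - self.tokens) / self.rate
        self.tokens -= n             # go negative; the debt is repaid by future refills
        return wait

# Fake clock so the numbers below are exact
t = [0.0]
tb = TokenBucket(rate=100.0, burst=100.0, clock=lambda: t[0])
w1 = tb.consume(100)  # full burst fits: no delay
w2 = tb.consume(50)   # bucket now empty: must wait 50/100 = 0.5 s
```

With `rate` expressed in bytes per second this caps sustained bandwidth, and with one token per request it caps IOPS, which is how the same mechanism serves both limits.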

6.3 RBD Token‑Bucket Flow

User issues asynchronous I/O to an RBD image.

Requests enter the ImageRequestWQ queue.

Before execution, each request passes through the TokenBucket filter.

The token bucket throttles the request, then it proceeds to the image handler.

6.4 Token‑Bucket Framework Diagram

Diagrammatic representation omitted for brevity; it shows the interaction between the request queue, token bucket, and I/O processing pipeline.

Author: Li Hang (李航), Expert Engineer at DiDi Cloud (滴滴云), specializing in the Ceph distributed storage system.

Tags: Distributed Storage, Ceph, Heartbeat, IO Flow, QoS, CRUSH