
Ceph Architecture Overview: Features, Core Components, IO Workflow, Heartbeat, Communication Framework, CRUSH Algorithm, and QoS

This article provides a comprehensive technical overview of Ceph, covering its architecture, key components, storage interfaces, IO processes, heartbeat mechanisms, communication framework, CRUSH data placement algorithm, and customizable QoS strategies for distributed storage systems.

Architecture Digest

Ceph Overview

Ceph is a unified distributed storage system designed for high performance, reliability, and scalability. Originating from research in 2004, it is now widely supported by cloud providers and serves as backend VM image storage in platforms such as OpenStack, with commercial distributions from vendors such as Red Hat.

Key Characteristics

High performance using the CRUSH algorithm for balanced data distribution and parallelism.

High availability with flexible replica counts, fault‑domain isolation, and automatic self‑healing.

Scalability through decentralized design, allowing linear growth as nodes are added.

Rich feature set supporting block, file, and object storage interfaces, as well as custom drivers.

Architecture and Core Components

Ceph provides three storage interfaces:

Object: Native API, also compatible with Swift and S3.

Block: Supports thin provisioning, snapshots, and cloning.

File: POSIX interface with snapshot capability.

Core components include:

Monitor (MON): A small cluster of monitor daemons that maintain the authoritative cluster maps, using Paxos for consensus.

Object Storage Device (OSD): Daemons that store data objects on disk and serve client read/write requests.

Metadata Server (MDS): Provides metadata services for CephFS.

RADOS: The underlying reliable, autonomic, distributed object store.

CRUSH: The data placement algorithm.

RBD, RGW, CephFS: The block, object, and file services exposed to users.

Storage Types

Block Storage

Typical devices are disk arrays; advantages include RAID/LVM protection, increased capacity, and improved I/O throughput. Drawbacks are higher cost for SAN fabrics and lack of data sharing between hosts. Common use cases are VM disks, logs, and general file storage.

File Storage

Implemented via FTP or NFS servers, offering low cost and easy file sharing. Limitations are lower read/write and transfer speeds. Used for log storage and hierarchical file systems.

Object Storage

Built on large‑capacity distributed servers with Swift/S3‑compatible interfaces. It combines the high throughput of block storage with the sharing capability of file storage, making it ideal for infrequently modified data such as images and videos.

Ceph IO Workflow

Typical client IO steps:

Client creates a cluster handler and reads the configuration.

Client connects to monitors to obtain the cluster map.

Client uses the CRUSH map to locate the primary OSD for the target PG.

Primary OSD writes data and replicates to secondary OSDs.

After all replicas acknowledge, the client receives completion.
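The replicated write path in the steps above can be sketched as a small simulation. All class and method names here are hypothetical stand-ins for illustration, not Ceph APIs; real Ceph performs this inside the OSD daemons.

```python
# Sketch of the replicated write path: primary writes locally, fans out to
# replicas, and reports completion only after every replica acknowledges.

class ReplicaOSD:
    def __init__(self, osd_id):
        self.osd_id = osd_id
        self.store = {}

    def write(self, oid, data):
        self.store[oid] = data      # persist the object locally
        return "ack"                # acknowledge to the primary

class PrimaryOSD(ReplicaOSD):
    def __init__(self, osd_id, replicas):
        super().__init__(osd_id)
        self.replicas = replicas

    def client_write(self, oid, data):
        self.write(oid, data)                             # primary writes first
        acks = [r.write(oid, data) for r in self.replicas]
        # completion is reported only after all replicas acknowledge
        return all(a == "ack" for a in acks)

replicas = [ReplicaOSD(1), ReplicaOSD(2)]
primary = PrimaryOSD(0, replicas)
assert primary.client_write("rbd0.object1", b"payload")
assert all(o.store["rbd0.object1"] == b"payload" for o in [primary] + replicas)
```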

When a newly added OSD is mapped as primary for a PG but does not yet hold the PG's data, it reports to the monitors and a temporary primary OSD takes over client IO while the new OSD synchronizes; once synchronization finishes, the primary role is handed back.

IO Algorithm Details

File → Object mapping uses inode (ino) and object number (ono) to form an object ID (oid). The oid is hashed and masked to produce a PG ID, which the CRUSH algorithm maps to a set of OSDs.

locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg)  # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
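A runnable version of this pseudocode is sketched below. Note the substitutions: real Ceph uses the rjenkins hash and a stable PG mask rather than md5 and a plain modulo, and `crush()` here is a deterministic stand-in for the actual CRUSH algorithm.

```python
# Runnable sketch of object -> PG -> OSD placement (md5 and the toy crush()
# are illustrative stand-ins, not Ceph's actual rjenkins hash or CRUSH).
import hashlib

def stable_hash(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "little")

def crush(pg: int, osds=(0, 1, 2, 3, 4), size=3):
    # Stand-in for CRUSH: deterministically rank OSDs per PG.
    ranked = sorted(osds, key=lambda o: stable_hash(f"{pg}-{o}"))
    return ranked[:size]

num_pg = 128
oid = "rbd0.object1"
pg = stable_hash(oid) % num_pg          # hash the oid, mask by PG count
osds_for_pg = crush(pg)                 # map the PG to an ordered OSD set
primary, replicas = osds_for_pg[0], osds_for_pg[1:]
# The same object always maps to the same PG and OSD set.
assert crush(pg) == osds_for_pg
```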

RBD IO Process

Clients create a pool with a defined number of PGs, mount an RBD image, and write data in 4 MiB chunks. Each chunk becomes an object that is placed on three OSDs via the CRUSH algorithm. The OSDs format underlying disks (typically XFS) and store objects as files such as rbd0.object1.file .
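The 4 MiB striping can be sketched as simple offset arithmetic; the object-naming scheme below is illustrative, mirroring the rbd0.object1.file example above.

```python
# Sketch: map a byte offset in an RBD image to its 4 MiB backing object.
OBJ_SIZE = 4 * 1024 * 1024  # default RBD object size: 4 MiB

def rbd_object_for(image: str, offset: int):
    ono = offset // OBJ_SIZE          # object number within the image
    off_in_obj = offset % OBJ_SIZE    # byte offset inside that object
    return f"{image}.object{ono}", off_in_obj

assert rbd_object_for("rbd0", 0) == ("rbd0.object0", 0)
assert rbd_object_for("rbd0", 5 * 1024 * 1024) == ("rbd0.object1", 1024 * 1024)
```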

Heartbeat Mechanism

Heartbeats detect node failures. OSDs listen on public, cluster, front, and back ports. They exchange PING/PONG messages within the same PG every ~6 seconds; missing replies for 20 seconds trigger failure handling. Monitors also receive periodic reports from OSDs and aggregate failure information before marking an OSD down.
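The grace-period logic can be sketched as follows, using the intervals above (6 s ping interval, 20 s grace). The class is a hypothetical illustration with an injected clock to keep the example deterministic, not Ceph's actual heartbeat code.

```python
# Sketch of OSD peer failure detection: a peer is suspect once its last
# reply is older than the grace period.
PING_INTERVAL = 6.0   # seconds between PING messages
GRACE = 20.0          # seconds without a reply before failure handling

class PeerMonitor:
    def __init__(self):
        self.last_pong = {}

    def on_pong(self, peer, now):
        self.last_pong[peer] = now

    def failed_peers(self, now):
        # peers whose last reply exceeded the grace period
        return [p for p, t in self.last_pong.items() if now - t > GRACE]

mon = PeerMonitor()
mon.on_pong("osd.1", now=0.0)
mon.on_pong("osd.2", now=0.0)
mon.on_pong("osd.2", now=12.0)   # osd.2 keeps answering pings
assert mon.failed_peers(now=21.0) == ["osd.1"]
```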

Communication Framework

Ceph supports three networking models: simple thread‑per‑connection, async I/O multiplexing (default), and XIO using the Accelio library. The framework follows a publish/subscribe pattern where a Messenger publishes messages to Dispatcher subclasses via Pipe objects. Messages consist of a header, user data (payload, middle, data), and a footer.

class Message : public RefCountedObject {
protected:
  ceph_msg_header  header;   // message header
  ceph_msg_footer  footer;   // message footer
  bufferlist       payload;   // front unaligned blob
  bufferlist       middle;    // middle unaligned blob
  bufferlist       data;      // data payload (page‑aligned when possible)
  utime_t         recv_stamp;      // receive start timestamp
  utime_t         dispatch_stamp;  // dispatch timestamp
  utime_t         throttle_stamp;  // throttle timestamp
  utime_t         recv_complete_stamp; // receive complete timestamp
  ConnectionRef   connection;      // network connection
  uint32_t        magic = 0;       // magic number
  bi::list_member_hook<> dispatch_q; // intrusive list hook
};

struct ceph_msg_header {
  __le64 seq;   // per‑session sequence number
  __le64 tid;   // transaction ID
  __le16 type;  // message type
  __le16 priority;
  __le16 version;
  __le32 front_len;
  __le32 middle_len;
  __le32 data_len;
  __le16 data_off;
  ceph_entity_name src; // source entity
  __le16 compat_version;
  __le16 reserved;
  __le32 crc;   // header CRC32C
} __attribute__((packed));

struct ceph_msg_footer {
  __le32 front_crc, middle_crc, data_crc;
  __le64 sig;   // 64‑bit signature
  __u8  flags;  // end flag
} __attribute__((packed));

CRUSH Algorithm

CRUSH (Controlled, Scalable, Decentralized Placement of Replicated Data) maps placement groups (PGs) to OSDs using a hierarchical cluster map and placement rules. It supports several bucket types (uniform, list, tree, straw, straw2) to achieve balanced data distribution, fault‑domain awareness, and efficient scaling.

rule replicated_ruleset {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
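The idea behind straw-bucket selection can be sketched as follows: every item "draws a straw" derived from a hash of the PG and the item, scaled by the item's weight, and the longest straw wins. Real CRUSH uses the rjenkins hash and precomputed scaling factors; md5 and a plain multiplicative scale stand in here purely for illustration.

```python
# Illustrative straw-bucket selection: deterministic per PG, and heavier
# items win proportionally more often across many PGs.
import hashlib

def draw(pg: int, item: str, weight: float) -> float:
    h = int.from_bytes(hashlib.md5(f"{pg}:{item}".encode()).digest()[:8], "little")
    return (h / 2**64) * weight   # hash in [0, 1), scaled by weight

def straw_select(pg: int, items: dict) -> str:
    return max(items, key=lambda i: draw(pg, i, items[i]))

hosts = {"host-a": 1.0, "host-b": 1.0, "host-c": 2.0}
# Selection is deterministic for a given PG...
assert straw_select(42, hosts) == straw_select(42, hosts)
# ...and the double-weight host wins well over a third of the time.
wins = sum(straw_select(pg, hosts) == "host-c" for pg in range(1000))
assert wins > 400
```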

Customizable QoS

Ceph’s QoS aims to allocate limited I/O resources (bandwidth, IOPS) among users. The official mClock scheduler provides reservation, weight, and limit parameters. A simpler token‑bucket implementation can also enforce rate limits by classifying requests, consuming tokens proportional to request size, and throttling when tokens are insufficient.
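The token-bucket approach described above can be sketched in a few lines. This is a minimal illustration with an injected clock, not Ceph's mClock scheduler or any actual Ceph code.

```python
# Minimal token bucket: refill at a fixed rate, consume tokens proportional
# to request size, and reject (throttle) when tokens are insufficient.
class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate          # tokens added per second
        self.burst = burst        # bucket capacity
        self.tokens = burst
        self.last = 0.0

    def allow(self, size, now):
        # refill proportional to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size   # consume tokens for this request
            return True
        return False              # caller throttles or queues the request

bucket = TokenBucket(rate=100, burst=200)   # 100 tokens/s, 200-token burst
assert bucket.allow(150, now=0.0)           # burst absorbs the first request
assert not bucket.allow(100, now=0.0)       # only 50 tokens remain
assert bucket.allow(100, now=1.0)           # 100 tokens refilled after 1 s
```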

Author Information

Written by Li Hang, a senior developer with extensive experience in high‑performance Nginx, Redis Cluster, and Ceph, currently working at Didi’s infrastructure platform.

Tags: Distributed Storage, Ceph, Heartbeat, QoS, CRUSH, IO Process