Ceph Storage Architecture Overview and Detailed Technical Features
This article provides a comprehensive technical overview of Red Hat Ceph, covering its distributed object storage design, cluster architecture, storage pools, authentication, placement groups, CRUSH algorithm, I/O operations, replication, erasure coding, internal management tasks, high availability, client interfaces, data striping, and encryption mechanisms.
Table of Contents
Chapter 1 Overview
Chapter 2 Storage Cluster Architecture
Chapter 3 Client Architecture
Chapter 4 Encryption
Chapter 1 Overview
Red Hat Ceph is a distributed object storage system designed for high performance, reliability, and scalability, supporting modern and legacy object interfaces such as native language bindings (C/C++, Java, Python), RESTful APIs (S3/Swift), block device interfaces, and file system interfaces.
Ceph can scale to thousands of clients accessing petabytes to exabytes of data, making it suitable for cloud platforms such as Red Hat Enterprise Linux OpenStack Platform (RHEL OSP).
The core of any Ceph deployment is the Ceph storage cluster, which consists of two types of daemon processes: Ceph OSD daemons that store data and perform replication, rebalancing, recovery, and status reporting, and Ceph monitor daemons that maintain a master copy of the cluster map.
Clients interact with the cluster using a configuration file (or cluster name and monitor addresses), a pool name, and user credentials (keyring path). The client does not need to know where an object is stored; the CRUSH algorithm computes the placement group (PG) and primary OSD from the cluster map.
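The client-side placement calculation can be sketched as follows. This is an illustrative simplification: real Ceph hashes object names with rjenkins1 and uses a "stable mod" to map the hash into the PG count, while this sketch uses CRC32 purely to show the shape of the computation.

```python
import zlib

def object_to_pg(object_name: str, pool_id: int, pg_num: int) -> str:
    """Illustrative sketch of mapping an object name to a PG id.

    Ceph itself uses the rjenkins1 hash and a stable-mod reduction;
    CRC32 is used here only to demonstrate the calculation's shape."""
    h = zlib.crc32(object_name.encode())
    pg_seed = h % pg_num               # which PG within the pool
    return f"{pool_id}.{pg_seed:x}"    # PG ids print as pool.hexseed

print(object_to_pg("rbd_data.1234", 2, 128))
```

CRUSH then maps the resulting PG id to an ordered set of OSDs, so no lookup table or central directory is ever consulted.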
Chapter 2 Storage Cluster Architecture
The cluster provides storage and retrieval, data replication, health monitoring, dynamic rebalancing, data integrity checks, and failure recovery. These operations are transparent to clients, while the CRUSH algorithm handles object placement.
2.1 Storage Pools
Pools logically partition data and can be configured for different types (replicated or erasure‑coded). They define pool type, placement groups (PGs), and CRUSH rule sets that control data distribution, fault domains, and performance domains.
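As a concrete illustration, replicated and erasure-coded pools can be created with the `ceph` CLI (pool names and the k/m values below are arbitrary examples; PG counts should be sized for your cluster):

```shell
# Replicated pool with 128 PGs, three copies of each object
ceph osd pool create rep-pool 128 128 replicated
ceph osd pool set rep-pool size 3

# Erasure-coded pool: 4 data chunks + 2 coding chunks per object
ceph osd erasure-code-profile set ec-profile k=4 m=2
ceph osd pool create ec-pool 128 128 erasure ec-profile
```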
2.2 Authentication
Ceph uses the CephX authentication system, which relies on shared secret keys for mutual authentication between clients and monitors. CephX does not provide in‑flight encryption.
2.3 Placement Groups (PGs)
Objects are hashed into PGs, and CRUSH maps each PG to an acting set of OSDs. Proper PG sizing is critical for performance and scalability.
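The traditional sizing guideline from the Ceph documentation is roughly 100 PGs per OSD, divided by the replica count and rounded up to a power of two (newer releases can automate this with the PG autoscaler). A minimal sketch of that rule of thumb:

```python
def recommended_pg_num(num_osds: int, replicas: int,
                       target_pgs_per_osd: int = 100) -> int:
    """Classic rule of thumb: total PGs ~= (OSDs * target per OSD) / replicas,
    rounded up to the next power of two."""
    raw = (num_osds * target_pgs_per_osd) / replicas
    power = 1
    while power < raw:
        power *= 2
    return power

print(recommended_pg_num(9, 3))   # -> 512
```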
2.4 CRUSH
CRUSH rule sets define hierarchical bucket types (e.g., hosts, racks, rows) and enable deterministic, scalable data placement without a central directory.
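A CRUSH rule from a decompiled CRUSH map looks like the following (syntax varies slightly between releases; this rule places each replica on a different host under the `default` root):

```
rule replicated_hosts {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```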
2.5 I/O Operations
Clients obtain the latest cluster map from monitors, then compute the target PG and primary OSD using CRUSH. The client sends I/O to the primary OSD, which coordinates replication to secondary OSDs.
2.5.1 Replicated I/O
The primary OSD writes the object and forwards it to secondary OSDs; an ACK is returned to the client only after all replicas are stored.
2.5.2 Erasure‑Coding I/O
Data is split into K data chunks and M coding chunks (for example, K=10 with M=6 yields 16 chunks in total). The primary OSD encodes the data, distributes the chunks across OSDs, and maintains an authoritative log for recovery.
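The idea can be demonstrated with a toy code: K data chunks plus a single XOR parity chunk (i.e., M=1), from which any one lost data chunk can be rebuilt. Real Ceph erasure profiles use Reed–Solomon codes (e.g., the jerasure plugin), which allow M > 1; this sketch only illustrates the encode/recover principle.

```python
import functools

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Toy erasure code: split into k equal data chunks plus one XOR parity."""
    size = -(-len(data) // k)                        # chunk size, rounded up
    chunks = [data[i*size:(i+1)*size].ljust(size, b'\0') for i in range(k)]
    parity = functools.reduce(xor, chunks)
    return chunks, parity

def recover(chunks, parity, lost: int) -> bytes:
    """Rebuild one missing data chunk by XOR-ing the survivors with the parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return functools.reduce(xor, survivors + [parity])

chunks, parity = encode(b"ceph erasure demo", k=4)
assert recover(chunks, parity, lost=2) == chunks[2]
```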
2.6 Self‑Managed Internal Operations
2.6.1 Heartbeat
OSDs report up/down status to monitors; monitors periodically ping OSDs to verify liveness.
2.6.2 Sync
OSDs synchronize PG state across the acting set to achieve consistency.
2.6.3 Rebalancing and Recovery
When new OSDs join, CRUSH recalculates placement, causing a small fraction of data to migrate for balanced distribution.
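This "small fraction migrates" property can be demonstrated with rendezvous (highest-random-weight) hashing, which is not CRUSH itself but shares the design goal: when an OSD is added, only the objects that the new OSD "wins" are moved, roughly 1/n of the total.

```python
import hashlib

def place(obj: str, osds: list) -> str:
    """Rendezvous (HRW) hashing: each object goes to the OSD with the
    highest hash score. Not CRUSH, but illustrates the same property."""
    return max(osds, key=lambda o: hashlib.md5(f"{obj}:{o}".encode()).hexdigest())

objs = [f"obj-{i}" for i in range(1000)]
before = {o: place(o, ["osd.0", "osd.1", "osd.2"]) for o in objs}
after  = {o: place(o, ["osd.0", "osd.1", "osd.2", "osd.3"]) for o in objs}
moved = sum(1 for o in objs if before[o] != after[o])
print(f"{moved/len(objs):.0%} of objects moved")   # roughly 1/4
```

Note that every migrated object lands on the new OSD; nothing shuffles between the existing ones.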
2.6.4 Scrubbing (Verification)
OSDs periodically compare object metadata and, optionally, data bits to detect corruption.
2.7 High Availability
2.7.1 Data Replication
Typical pools maintain three replicas; writes require at least two clean replicas. The system tolerates one OSD failure for reads/writes and up to two failures for reads only.
2.7.2 Monitor Cluster
Multiple monitors form a quorum using Paxos to avoid a single point of failure.
2.7.3 CephX
CephX provides mutual authentication similar to Kerberos, issuing session keys encrypted with the user's permanent key.
Chapter 3 Client Architecture
Ceph clients use RADOS (reliable autonomic distributed object store) protocols. Required prerequisites are a Ceph config file (or cluster name and monitor addresses), a pool name, and user credentials.
3.1 Native Protocol and Librados
Librados offers direct, parallel object access, supporting pool operations, snapshots, object read/write, xattr management, key/value operations, and compound operations with double-ack semantics (one acknowledgement when the update is in memory, a second when it is committed to stable storage).
3.2 Object Watch and Notify
Clients can register persistent watches on objects, receiving notifications from the primary OSD, enabling objects to serve as synchronization channels.
3.3 Exclusive Locks
Exclusive locks allow a single client to obtain an exclusive lock on an RBD image, preventing concurrent writes and protecting against stale clients. The feature is enabled at image creation with --image-features 5, i.e., feature bits 1 (layering) and 4 (exclusive-lock).
3.4 Object Map Index
When enabled, the client maintains an in-memory index of which backing RADOS objects actually exist, allowing operations such as resize, export, copy, flatten, delete, and read to skip non-existent objects. It is activated with --image-features 13 during image creation, i.e., feature bits 1 (layering), 4 (exclusive-lock), and 8 (object-map); the object map depends on the exclusive lock.
3.5 Data Striping
Ceph provides RAID‑0‑like striping to improve throughput. Parameters include object size, stripe width (unit size), and stripe count (number of objects in a stripe set). Proper tuning is essential for performance.
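The mapping from a logical byte offset to a backing object follows directly from those three parameters. The sketch below uses descriptive parameter names (not the exact librados/libcephfs identifiers) and small values so the arithmetic is easy to follow:

```python
def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
    """Map a logical byte offset to (object number, offset within that object)
    under RAID-0-style striping across stripe_count objects per stripe set."""
    block = offset // stripe_unit              # which stripe unit overall
    stripe = block // stripe_count             # which full stripe
    obj_in_set = block % stripe_count          # which object within the stripe set
    units_per_object = object_size // stripe_unit
    object_set = stripe // units_per_object    # stripe sets fill, then a new set starts
    objectno = object_set * stripe_count + obj_in_set
    obj_off = (stripe % units_per_object) * stripe_unit + offset % stripe_unit
    return objectno, obj_off

# stripe_unit=4, stripe_count=2, object_size=8 (two units per object)
print(locate(4, 4, 2, 8))    # -> (1, 0): second unit lands on the next object
print(locate(8, 4, 2, 8))    # -> (0, 4): third unit wraps back to object 0
print(locate(16, 4, 2, 8))   # -> (2, 0): the set is full, a new object set begins
```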
Chapter 4 Encryption
LUKS disk encryption can protect OSD data partitions. Ceph‑ansible invokes ceph‑disk to create encrypted partitions, a lockbox partition, and a client.osd‑lockbox user that stores the LUKS key. The key is stored in the monitor’s KV store and automatically unlocked at service start.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.