Unlocking Ceph: A Deep Dive into Distributed Storage Architecture and Features

This article provides a comprehensive overview of Red Hat Ceph’s distributed object‑storage architecture, covering storage pools, authentication, placement groups, the CRUSH algorithm, replication, erasure coding, internal operations, high‑availability mechanisms, client interfaces, and encryption, illustrated with diagrams and practical details.

Open Source Linux

Jan 27, 2021

Overview

Red Hat Ceph is a distributed object‑storage system designed for high performance, reliability, and scalability, supporting native language bindings (C/C++, Java, Python), RESTful APIs (S3/Swift), block‑device interfaces, and file‑system interfaces.

It scales from petabytes to exabytes of data and is a core component of cloud platforms such as Red Hat Enterprise Linux OpenStack Platform (RHEL OSP).

Storage Cluster Architecture

The Ceph cluster consists of two main daemon types:

Ceph OSD daemon: Stores data, handles replication, rebalancing, recovery, and status reporting.

Ceph monitor: Maintains a master copy of the cluster map.

Clients interact with the cluster using a configuration file (or cluster name and monitor addresses), a storage‑pool name, and user credentials.

Clients obtain the latest cluster map from a monitor, then use the CRUSH algorithm to map an object name and pool to a placement group (PG) and the primary OSD, allowing direct read/write without an intermediate server.

Key Concepts

Storage Pools

Pools logically partition data and define storage‑pool type (replicated or erasure‑coded), PG count, and CRUSH rules. Replicated pools keep multiple copies of each object; erasure‑coded pools split objects into K data blocks and M coding blocks, tolerating up to M OSD failures.
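The capacity trade-off between the two pool types can be sketched with a little arithmetic. The function names below are hypothetical and the calculation ignores real-world overhead such as metadata and free-space headroom; it only shows how replica count or K and M determine the usable share of raw capacity.

```python
def usable_fraction_replicated(size: int) -> float:
    """Share of raw capacity holding unique data when each object has `size` copies."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return 1.0 / size


def usable_fraction_erasure(k: int, m: int) -> float:
    """Share of raw capacity holding unique data with K data and M coding blocks."""
    if k < 1 or m < 0:
        raise ValueError("k must be >= 1 and m must be >= 0")
    return k / (k + m)


def usable_capacity_tb(raw_tb: float, fraction: float) -> float:
    """Usable capacity for a given raw capacity (simplified, no overhead)."""
    return raw_tb * fraction


if __name__ == "__main__":
    raw_tb = 120.0  # hypothetical cluster size
    print("3 replicas:", usable_capacity_tb(raw_tb, usable_fraction_replicated(3)))
    print("EC 4+2    :", usable_capacity_tb(raw_tb, usable_fraction_erasure(4, 2)))
```

Under these assumptions a three-replica pool stores one third of the raw capacity as unique data, while a K=4, M=2 pool stores two thirds and still tolerates two OSD failures.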

Authentication (CephX)

CephX uses shared secret keys for mutual authentication between clients and monitors and issues session keys for subsequent communication. It handles authentication only: it does not encrypt data in transit or at rest.
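The idea of mutual authentication with a shared secret can be illustrated with a generic challenge-response exchange. This is not the CephX wire protocol, which also involves tickets and session keys; the function names are hypothetical, and HMAC-SHA256 is an assumption chosen only to show that each side can prove knowledge of the secret without sending it.

```python
import hashlib
import hmac
import secrets


def prove(secret: bytes, challenge: bytes) -> bytes:
    """Return a tag showing the caller knows `secret` (HMAC is an assumption here)."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret: bytes, challenge: bytes, proof: bytes) -> bool:
    """Check a proof in constant time against the locally held secret."""
    return hmac.compare_digest(prove(secret, challenge), proof)


def mutual_auth(client_secret: bytes, monitor_secret: bytes) -> bool:
    """Both sides challenge each other; succeed only if the secrets match."""
    # The monitor challenges the client.
    monitor_challenge = secrets.token_bytes(16)
    client_proof = prove(client_secret, monitor_challenge)
    if not verify(monitor_secret, monitor_challenge, client_proof):
        return False
    # The client challenges the monitor.
    client_challenge = secrets.token_bytes(16)
    monitor_proof = prove(monitor_secret, client_challenge)
    return verify(client_secret, client_challenge, monitor_proof)


if __name__ == "__main__":
    key = secrets.token_bytes(32)
    print("matching secrets  :", mutual_auth(key, key))
    print("mismatched secrets:", mutual_auth(key, secrets.token_bytes(32)))
```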

Placement Groups (PGs)

Objects are hashed into PGs, which CRUSH then assigns to an acting set of OSDs. The PG count needs to fit the cluster: too few PGs spread data unevenly across OSDs, while too many add peering and memory overhead on each OSD.
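A commonly cited rule of thumb from the upstream Ceph documentation, not stated in this article, targets roughly 100 PGs per OSD: multiply the OSD count by 100, divide by the pool size, and round up to a power of two. The sketch below applies that rule; the function name is hypothetical, and recent Ceph releases can adjust PG counts automatically, so treat the result as a starting point only.

```python
def suggested_pg_count(num_osds: int, pool_size: int, target_pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG count: (OSDs * target) / pool size, rounded up to a power of two.

    `pool_size` is the replica count for a replicated pool, or K + M for an
    erasure-coded pool.
    """
    if num_osds < 1 or pool_size < 1:
        raise ValueError("num_osds and pool_size must be positive")
    raw = num_osds * target_pgs_per_osd / pool_size
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    print(suggested_pg_count(num_osds=200, pool_size=3))
```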

CRUSH Algorithm

CRUSH deterministically maps PGs to OSDs based on a hierarchical bucket structure that reflects failure and performance domains (e.g., racks, rows, device types). It enables data placement without a central directory and supports dynamic rebalancing when OSDs join or fail.
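To make directory-free placement concrete, here is a toy stand-in for CRUSH. It is not the real algorithm or its bucket types: it uses highest-score hashing over a hypothetical two-level topology (racks containing OSDs) and picks at most one OSD per rack. What it does show is that any client with the same map computes the same placement, and that a failed OSD only changes the slot it occupied.

```python
import hashlib
from typing import Dict, FrozenSet, List

# Hypothetical topology: rack name -> OSD ids in that rack.
TOPOLOGY: Dict[str, List[int]] = {
    "rack-a": [0, 1, 2],
    "rack-b": [3, 4, 5],
    "rack-c": [6, 7, 8],
}


def _score(pg_id: str, item: str) -> int:
    """Deterministic pseudo-random score for a (PG, item) pair."""
    digest = hashlib.sha256(f"{pg_id}/{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place_pg(pg_id: str, topology: Dict[str, List[int]], replicas: int,
             down: FrozenSet[int] = frozenset()) -> List[int]:
    """Pick up to `replicas` OSDs for a PG, one per rack, skipping down OSDs."""
    # Rank racks (the failure domain) by score, then pick the best OSD in each.
    racks = sorted(topology, key=lambda rack: _score(pg_id, rack), reverse=True)
    chosen: List[int] = []
    for rack in racks:
        candidates = [osd for osd in topology[rack] if osd not in down]
        if not candidates:
            continue
        chosen.append(max(candidates, key=lambda osd: _score(pg_id, f"osd.{osd}")))
        if len(chosen) == replicas:
            break
    return chosen


if __name__ == "__main__":
    before = place_pg("4.58", TOPOLOGY, replicas=3)
    after = place_pg("4.58", TOPOLOGY, replicas=3, down=frozenset({before[0]}))
    print("placement        :", before)
    print("after one failure:", after)
```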

I/O Operations

Client supplies pool ID and object ID.

CRUSH hashes the object ID.

Hash modulo PG count yields the PG ID.

CRUSH determines the primary OSD for that PG.

Client contacts the primary OSD to perform read/write.

Both replicated and erasure‑coded pools follow this flow, with the primary OSD coordinating writes to secondary OSDs.
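The hashing steps above can be sketched as follows. Ceph uses its own hash function and a more careful modulo, so zlib.crc32 and the plain `%` operator here are stand-ins chosen only for illustration, and the function name is hypothetical. The PG ID is rendered as the pool ID followed by the PG number in hexadecimal.

```python
import zlib


def object_to_pg(pool_id: int, object_name: str, pg_num: int) -> str:
    """Map an object name to a PG ID of the form '<pool>.<pg-in-hex>'.

    crc32 is a stand-in for Ceph's real object hash (an assumption of this sketch).
    """
    if pg_num < 1:
        raise ValueError("pg_num must be positive")
    object_hash = zlib.crc32(object_name.encode("utf-8"))  # hash the object ID
    pg = object_hash % pg_num                               # hash modulo PG count
    return f"{pool_id}.{pg:x}"                              # prefix with the pool ID


if __name__ == "__main__":
    for name in ("foo", "bar", "baz"):
        print(name, "->", object_to_pg(pool_id=4, object_name=name, pg_num=128))
```

Because the mapping depends only on the object name, the pool, and the PG count, every client computes the same PG without asking a server.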

Replication I/O

The primary OSD writes the object and then forwards it to secondary OSDs. Once all required replicas acknowledge, the client receives a success response.
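A minimal in-memory simulation of this write path is shown below. The OSD class and replicated_write function are hypothetical, and the sketch treats every OSD in the acting set as a required replica; a real cluster would instead mark a failed OSD down and re-peer the PG.

```python
from typing import Dict, List


class OSD:
    """Toy OSD that stores objects in a dictionary."""

    def __init__(self, osd_id: int, up: bool = True) -> None:
        self.osd_id = osd_id
        self.up = up
        self.store: Dict[str, bytes] = {}

    def write(self, name: str, data: bytes) -> bool:
        """Store the object and return True as the acknowledgement."""
        if not self.up:
            return False
        self.store[name] = data
        return True


def replicated_write(acting_set: List[OSD], name: str, data: bytes) -> bool:
    """Primary writes first, forwards to secondaries, acks the client when all ack."""
    primary, *secondaries = acting_set
    if not primary.write(name, data):
        return False
    acks = [osd.write(name, data) for osd in secondaries]
    return all(acks)


if __name__ == "__main__":
    osds = [OSD(0), OSD(1), OSD(2)]
    print("client ack:", replicated_write(osds, "obj1", b"payload"))
```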

Erasure Coding I/O

Objects are split into K data blocks and M coding blocks. The primary OSD encodes and distributes these blocks across OSDs. Reconstruction reads any K blocks to recover the original data, tolerating up to M OSD failures.
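The reconstruction property can be demonstrated with the simplest possible code: K data blocks plus a single XOR parity block (M=1). This is a deliberate simplification and an assumption of the sketch; production erasure-code plugins use codes that support M greater than 1. The function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into K equal blocks (zero-padded) and append one XOR parity block."""
    block_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(block_len * k, b"\0")
    blocks = [padded[i * block_len:(i + 1) * block_len] for i in range(k)]
    return blocks + [reduce(_xor, blocks)]


def decode(blocks: List[Optional[bytes]], k: int, original_len: int) -> bytes:
    """Rebuild the data from any K of the K+1 blocks (None marks a lost block)."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("XOR parity tolerates only one lost block")
    blocks = list(blocks)
    if missing and missing[0] < k:
        # XOR of all surviving blocks equals the lost data block.
        blocks[missing[0]] = reduce(_xor, [b for b in blocks if b is not None])
    return b"".join(blocks[:k])[:original_len]


if __name__ == "__main__":
    payload = b"hello ceph!"
    shards = encode(payload, k=3)
    shards[1] = None  # simulate one failed OSD
    print(decode(shards, k=3, original_len=len(payload)))
```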

Self‑Managed Internal Operations

Heartbeat: OSDs check each other's heartbeats and report up/down status to the monitors.

Sync (peering): The OSDs that store the same PG bring its state into agreement.

Rebalancing & Recovery: Data migrates when OSDs are added or fail.

Scrubbing (or trimming): Periodic consistency checks and cleanup.
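As an illustration of what a consistency check involves, the sketch below compares checksums of one object's replicas and reports the OSDs whose copy disagrees with the most common digest. The majority vote and the function name are assumptions of this sketch, not a description of how Ceph decides which copy is authoritative.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def find_inconsistent_replicas(replicas: Dict[int, bytes]) -> List[int]:
    """Return the OSD ids whose copy differs from the most common checksum."""
    if not replicas:
        return []
    digests = {osd: hashlib.sha256(data).hexdigest() for osd, data in replicas.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(osd for osd, digest in digests.items() if digest != majority_digest)


if __name__ == "__main__":
    copies = {0: b"object-data", 1: b"object-data", 2: b"object-dat\x00"}
    print("inconsistent OSDs:", find_inconsistent_replicas(copies))
```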

High Availability

Ceph maintains service continuity even when a monitor or OSD fails, provided a majority of monitors remain reachable and sufficient replicas or erasure‑coding redundancy exist.
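These availability conditions reduce to a few simple checks. The helper names are hypothetical, and the sketch assumes a replicated pool stays serviceable while at least a configured minimum number of replicas is up; it models only the counting, not failure detection.

```python
def monitors_have_quorum(total_monitors: int, reachable: int) -> bool:
    """A strict majority of monitors must be reachable."""
    return reachable > total_monitors // 2


def replicated_pg_serviceable(replicas_up: int, min_size: int) -> bool:
    """Assumption: I/O continues while at least `min_size` replicas are up."""
    return replicas_up >= min_size


def erasure_pg_recoverable(k: int, m: int, failed_osds: int) -> bool:
    """Data stays recoverable while at least K of the K+M blocks survive."""
    return failed_osds <= m


if __name__ == "__main__":
    print("3 monitors, 2 reachable:", monitors_have_quorum(3, 2))
    print("4 monitors, 2 reachable:", monitors_have_quorum(4, 2))
    print("EC 4+2, 2 OSDs failed  :", erasure_pg_recoverable(4, 2, 2))
```

The even-sized case shows why monitor counts are usually odd: four monitors tolerate only one failure, the same as three.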

Client Architecture

Ceph clients rely on the following interfaces and features:

Native protocols and librados for direct object operations.

Object watch/notify for asynchronous updates.

Exclusive locks to serialize access to RBD images.

Object‑map index to track existing objects and avoid unnecessary operations.

Data striping to improve throughput by spreading writes across multiple objects/OSDs.
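Data striping, the last item in the list, can be sketched as a mapping from a byte offset in the client's data to an object number and an offset inside that object. The parameters stripe_unit, stripe_count, and object_size follow Ceph's striping terminology; the function name is hypothetical and the sketch assumes the object size is a multiple of the stripe unit.

```python
from typing import Tuple


def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int) -> Tuple[int, int]:
    """Return (object number, offset within that object) for a logical byte offset."""
    if object_size % stripe_unit != 0:
        raise ValueError("object_size must be a multiple of stripe_unit")
    stripes_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit       # which stripe unit overall
    stripe_no = block_no // stripe_count   # which row across the object set
    stripe_pos = block_no % stripe_count   # which object within the set
    object_set_no = stripe_no // stripes_per_object
    object_no = object_set_no * stripe_count + stripe_pos
    offset_in_object = (stripe_no % stripes_per_object) * stripe_unit + offset % stripe_unit
    return object_no, offset_in_object


if __name__ == "__main__":
    # Tiny, unrealistic sizes so the layout is easy to follow by hand.
    for off in (0, 4, 8, 13, 16):
        print(off, "->", locate(off, stripe_unit=4, stripe_count=2, object_size=8))
```

Consecutive stripe units land on different objects, so a large sequential write is spread across several objects and, through them, several OSDs.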

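Typical client commands include creating RBD images with specific features. For example, the following command creates an image named myimage in the pool mypool with a size of 102400 MB:

rbd -p mypool create myimage --size 102400 --image-features 13

The feature value 13 is the sum of the feature bits for layering (1), exclusive lock (4), and object map (8).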
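The numeric value passed to --image-features is a bitmask. The short sketch below decodes such a value using the standard RBD feature bits; the helper names are hypothetical.

```python
from typing import List

# RBD feature bits and their names.
RBD_FEATURES = {
    1: "layering",
    2: "striping",
    4: "exclusive-lock",
    8: "object-map",
    16: "fast-diff",
    32: "deep-flatten",
    64: "journaling",
}


def decode_features(mask: int) -> List[str]:
    """List the feature names whose bits are set in `mask`."""
    return [name for bit, name in RBD_FEATURES.items() if mask & bit]


def encode_features(names: List[str]) -> int:
    """Combine feature names into the numeric mask."""
    by_name = {name: bit for bit, name in RBD_FEATURES.items()}
    return sum(by_name[name] for name in set(names))


if __name__ == "__main__":
    print(decode_features(13))
```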

Encryption

Ceph can use LUKS‑encrypted OSD partitions created via ceph‑ansible and ceph‑disk. A small lockbox partition stores the LUKS key, which is placed in the monitor’s KV store and used by each OSD at startup to unlock data and journal partitions.
