Red Hat Ceph Storage Architecture Guide – Overview and Core Concepts
This article provides a comprehensive overview of Red Hat Ceph's distributed object storage architecture, covering storage pools, CRUSH placement, authentication, I/O workflows, internal operations, client interfaces, data striping, erasure coding, high availability, and encryption mechanisms for secure, scalable deployments.
Chapter 1 Overview
Red Hat Ceph is a distributed object storage system designed for performance, reliability, and scalability. It supports multiple client interfaces, including native language bindings (C/C++, Java, Python), a RESTful S3/Swift gateway, a block device, and a file system.
It scales to thousands of clients and petabyte-to-exabyte data volumes. Its core components are Ceph OSD daemons, which handle data replication, rebalancing, recovery, and monitoring, and Ceph Monitors, which maintain the cluster maps.
Chapter 2 Storage Cluster Architecture
2.1 Storage Pools
Storage pools logically partition data and can be configured for replicated or erasure‑coded durability. Pools define the type (replicated or EC), placement groups (PGs) and CRUSH rule sets that control data placement, fault domains and performance domains.
2.2 Authentication (CephX)
CephX provides mutual authentication using shared secret keys, similar to Kerberos, without encrypting data in transit.
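The core idea — each side proving possession of a shared secret without ever sending it over the wire — can be sketched with an HMAC challenge-response. This is a simplification: real CephX issues session keys and tickets, and the key names below are invented for illustration.

```python
import hashlib
import hmac
import os

def prove(secret: bytes, challenge: bytes) -> bytes:
    """Return an HMAC over the challenge using the shared secret."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()

# Monitor and client both hold the shared secret (as with a CephX keyring entry).
secret = os.urandom(32)

# The monitor sends a random challenge; the client answers with an HMAC proof.
challenge = os.urandom(16)
client_proof = prove(secret, challenge)

# The monitor recomputes the HMAC locally; the secret never crosses the wire.
assert hmac.compare_digest(client_proof, prove(secret, challenge))
```

Note that, as the section says, this authenticates the parties but does not encrypt the data path.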
2.3 Placement Groups (PGs)
Objects are mapped to PGs, which are then mapped to an acting set of OSDs via the CRUSH algorithm, enabling dynamic rebalancing and high scalability.
2.4 CRUSH
CRUSH deterministically maps PGs to OSDs based on hierarchical bucket definitions, allowing placement across fault and performance domains.
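Real CRUSH walks a hierarchy of buckets using the rjenkins hash and straw2 selection; as a rough illustration of the key property — placement is *computed*, not looked up in a table — here is a toy Python sketch. The SHA-1 hash and rendezvous-style ranking are stand-ins, not Ceph's actual algorithm.

```python
import hashlib

def pg_for_object(obj_name: str, pg_num: int) -> int:
    # Ceph uses rjenkins hashing; plain SHA-1 stands in here.
    h = int.from_bytes(hashlib.sha1(obj_name.encode()).digest()[:4], "big")
    return h % pg_num

def osds_for_pg(pg: int, osds: list, size: int) -> list:
    # Stand-in for CRUSH: rank OSDs by a deterministic per-PG score and
    # take the top `size` as the acting set.
    def score(osd):
        return hashlib.sha1(f"{pg}:{osd}".encode()).digest()
    return sorted(osds, key=score)[:size]

pg = pg_for_object("myobject", pg_num=128)
acting = osds_for_pg(pg, osds=list(range(10)), size=3)
# The same inputs always yield the same acting set -- no central lookup needed.
assert acting == osds_for_pg(pg, list(range(10)), 3)
```

Because every client computes the same answer from the cluster map, clients can contact the right OSDs directly without consulting a metadata server.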
2.5 I/O Operations
Clients obtain the latest cluster map from monitors, compute the target PG and primary OSD using CRUSH, and interact directly with the primary OSD for reads and writes.
2.5.1 Replicated I/O
The primary OSD writes the data locally and forwards it to the replica OSDs; it acknowledges the write to the client only after the replicas confirm their writes.
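A minimal in-memory simulation of this primary-copy flow (the `OSD` class and its dict-backed store are invented for illustration, not the OSD daemon's interface):

```python
class OSD:
    """Toy stand-in for an OSD daemon with an in-memory object store."""
    def __init__(self, osd_id: int):
        self.osd_id = osd_id
        self.store = {}

    def write(self, oid: str, data: bytes) -> bool:
        self.store[oid] = data
        return True  # acknowledgment

def replicated_write(acting_set: list, oid: str, data: bytes) -> bool:
    """Primary-copy replication: the first OSD in the acting set is primary."""
    primary, *replicas = acting_set
    primary.write(oid, data)                       # primary writes locally
    acks = [r.write(oid, data) for r in replicas]  # and forwards to replicas
    return all(acks)  # client is acked only after every replica acks

osds = [OSD(i) for i in range(3)]
assert replicated_write(osds, "obj1", b"hello")
assert all(o.store["obj1"] == b"hello" for o in osds)
```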
2.5.2 Erasure‑coded I/O
Data is split into K data blocks and M coding blocks; the primary OSD encodes and distributes blocks across OSDs, allowing reconstruction if up to M OSDs fail.
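Ceph's erasure-code plugins (e.g. jerasure) implement general Reed-Solomon codes over arbitrary K and M; the idea can be sketched with the simplest case, K=2 data chunks plus M=1 XOR parity chunk, which survives the loss of any one chunk:

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes):
    """Split into K=2 data chunks and compute an M=1 XOR parity chunk."""
    half = len(data) // 2
    k1, k2 = data[:half], data[half:]
    return k1, k2, xor(k1, k2)

def recover_k1(k2: bytes, parity: bytes) -> bytes:
    # Any single lost chunk can be rebuilt from the remaining two.
    return xor(k2, parity)

k1, k2, parity = encode(b"ABCDEFGH")
assert recover_k1(k2, parity) == k1
```

The storage overhead here is 1.5x, versus 3x for three-way replication, which is the usual motivation for erasure coding on colder data.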
2.6 Internal Operations
2.6.1 Heartbeat
OSDs exchange heartbeats with their peer OSDs and report failures to the monitors, which mark OSDs up or down in the cluster map.
2.6.2 Sync
OSDs synchronize PG state internally without manual intervention.
2.6.3 Data Rebalancing and Recovery
When OSDs are added or fail, CRUSH recalculates placement and only a fraction of data moves, ensuring balanced load.
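The claim that only a fraction of data moves can be illustrated with rendezvous (HRW) hashing, which shares CRUSH's key property that placement is recomputed deterministically rather than stored. This is not CRUSH itself, just a stand-in with similar movement behavior when the OSD set changes:

```python
import hashlib

def place(obj: str, osds: list) -> int:
    """Rendezvous hashing: each object goes to the OSD with the highest
    per-(object, OSD) score. Adding an OSD only remaps the objects that
    the new OSD now wins."""
    return max(osds, key=lambda o: hashlib.sha1(f"{obj}:{o}".encode()).digest())

objs = [f"obj-{i}" for i in range(1000)]
before = {o: place(o, list(range(9))) for o in objs}   # 9 OSDs
after = {o: place(o, list(range(10))) for o in objs}   # one OSD added
moved = sum(before[o] != after[o] for o in objs)
# Only roughly 1/10 of the objects move (those won by the new OSD).
assert 0 < moved < len(objs) // 2
```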
2.6.4 Scrubbing
Periodic scrubbing validates object metadata and data integrity, detecting corruption.
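A deep scrub essentially recomputes checksums of object data on each replica and compares them across the placement group. A toy sketch follows; the replica contents and the majority-vote check are invented for illustration:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

# Deep scrub: recompute each replica's checksum and compare across the PG.
replicas = {0: b"payload", 1: b"payload", 2: b"payl0ad"}  # OSD 2 has bit rot
digests = {osd: checksum(d) for osd, d in replicas.items()}

# A replica whose digest matches no other replica is flagged inconsistent.
inconsistent = {osd for osd, d in digests.items()
                if list(digests.values()).count(d) == 1}
assert inconsistent == {2}
```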
2.7 High Availability
Ceph maintains data availability through multiple replicas, monitor quorum, and the CephX authentication mechanism.
Chapter 3 Client Architecture
3.1 Native Protocol and Librados
librados provides direct, parallel object access, with operations including pool management, snapshots, read/write, extended attributes (xattrs), and key/value handling.
3.2 Object Watch/Notify
Clients can register watches on objects and receive notifications for changes.
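The watch/notify pattern can be sketched as a callback registry. The class below is a hypothetical in-process stand-in, not the librados API; in Ceph the registry lives on the OSD holding the object, and clients such as librbd use it to coordinate (for example, watching an image's header object):

```python
class ObjectWatchRegistry:
    """Toy watch/notify: clients register callbacks on an object name and
    are notified when another client calls notify() on that object."""
    def __init__(self):
        self.watchers = {}

    def watch(self, oid: str, callback):
        self.watchers.setdefault(oid, []).append(callback)

    def notify(self, oid: str, message: str):
        for cb in self.watchers.get(oid, []):
            cb(message)

events = []
reg = ObjectWatchRegistry()
reg.watch("rbd_header.myimage", events.append)   # client registers a watch
reg.notify("rbd_header.myimage", "snapshot created")  # another client notifies
assert events == ["snapshot created"]
```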
3.3 Exclusive Locks
Exclusive locks prevent concurrent writes to the same RBD image, improving consistency.
3.4 Object Map Index
Tracks existence of RADOS objects to avoid unnecessary operations on non‑existent objects.
3.5 Data Striping
Striping splits data across multiple objects to increase throughput; parameters include object size, stripe unit, and stripe count.
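The offset-to-object mapping implied by these three parameters can be sketched as follows. This is a simplified model of RADOS striping, assuming the semantics described above: data is written in stripe-unit chunks, round-robin across a set of `stripe_count` objects, moving to a new object set once each object reaches `object_size`:

```python
def locate(offset: int, object_size: int, stripe_unit: int, stripe_count: int):
    """Map a byte offset in a striped image to (object index, offset in object)."""
    stripe_width = stripe_unit * stripe_count    # bytes per full stripe
    objectset_span = object_size * stripe_count  # bytes per object set
    objectset, rem = divmod(offset, objectset_span)
    stripe_no, in_stripe = divmod(rem, stripe_width)
    unit_no, in_unit = divmod(in_stripe, stripe_unit)
    obj = objectset * stripe_count + unit_no
    return obj, stripe_no * stripe_unit + in_unit

# With 4 MiB objects, 1 MiB stripe units, and 4 objects per set,
# byte 5 MiB lands in object 1 at offset 1 MiB.
MiB = 2**20
assert locate(5 * MiB, 4 * MiB, MiB, 4) == (1, MiB)
```

Sequential writes thus fan out across `stripe_count` objects (and their OSDs) in parallel, which is where the throughput gain comes from.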
rbd -p mypool create myimage --size 102400 --image-features 5
rbd -p mypool create myimage --size 102400 --image-features 13
(Feature value 5 enables layering and exclusive locking; 13 additionally enables the object map.)
Chapter 4 Encryption
Ceph can encrypt OSD data and journal partitions with LUKS; ceph-ansible manages the setup, and the encryption keys are stored securely in the monitors' key/value store.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.