Ceph Storage Architecture Overview and Detailed Technical Features
This article provides a comprehensive technical overview of Red Hat Ceph, covering its distributed object storage design, cluster architecture, storage pools, authentication, placement groups, CRUSH algorithm, I/O operations, replication, erasure coding, internal management tasks, high availability, client interfaces, data striping, and encryption mechanisms.
Table of Contents
Chapter 1 Overview
Chapter 2 Storage Cluster Architecture
Chapter 3 Client Architecture
Chapter 4 Encryption
Chapter 1 Overview
Red Hat Ceph is a distributed object storage system designed for high performance, reliability, and scalability, supporting modern and legacy object interfaces such as native language bindings (C/C++, Java, Python), RESTful APIs (S3/Swift), block device interfaces, and file system interfaces.
Ceph can scale to thousands of clients accessing petabytes to exabytes of data, making it suitable for cloud platforms such as Red Hat Enterprise Linux OpenStack Platform (RHEL OSP).
The core of any Ceph deployment is the Ceph storage cluster, which consists of two types of daemon processes: Ceph OSD daemons that store data and perform replication, rebalancing, recovery, and status reporting, and Ceph monitor daemons that maintain a master copy of the cluster map.
Clients interact with the cluster using a configuration file (or cluster name and monitor addresses), a pool name, and user credentials (keyring path). The client does not need to know where an object is stored; the CRUSH algorithm computes the placement group (PG) and primary OSD from the cluster map.
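The client-side placement calculation can be sketched as follows. This is an illustrative simplification: real Ceph hashes object names with rjenkins1 and uses a "stable mod" to map the hash into the PG count, while this sketch uses CRC32 purely to show the shape of the computation.

```python
import zlib

def object_to_pg(object_name: str, pool_id: int, pg_num: int) -> str:
    """Illustrative sketch of mapping an object name to a PG id.

    Ceph itself uses the rjenkins1 hash and a stable-mod reduction;
    CRC32 is used here only to demonstrate the calculation's shape."""
    h = zlib.crc32(object_name.encode())
    pg_seed = h % pg_num               # which PG within the pool
    return f"{pool_id}.{pg_seed:x}"    # PG ids print as pool.hexseed

print(object_to_pg("rbd_data.1234", 2, 128))
```

CRUSH then maps the resulting PG id to an ordered set of OSDs, so no lookup table or central directory is ever consulted.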
Chapter 2 Storage Cluster Architecture
The cluster provides storage and retrieval, data replication, health monitoring, dynamic rebalancing, data integrity checks, and failure recovery. These operations are transparent to clients, while the CRUSH algorithm handles object placement.
2.1 Storage Pools
Pools logically partition data and can be configured for different types (replicated or erasure‑coded). They define pool type, placement groups (PGs), and CRUSH rule sets that control data distribution, fault domains, and performance domains.
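As a concrete illustration, replicated and erasure-coded pools can be created with the `ceph` CLI (pool names and the k/m values below are arbitrary examples; PG counts should be sized for your cluster):

```shell
# Replicated pool with 128 PGs, three copies of each object
ceph osd pool create rep-pool 128 128 replicated
ceph osd pool set rep-pool size 3

# Erasure-coded pool: 4 data chunks + 2 coding chunks per object
ceph osd erasure-code-profile set ec-profile k=4 m=2
ceph osd pool create ec-pool 128 128 erasure ec-profile
```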
2.2 Authentication
Ceph uses the CephX authentication system, which relies on shared secret keys for mutual authentication between clients and monitors. CephX does not provide in‑flight encryption.
2.3 Placement Groups (PGs)
Objects are hashed into PGs, and CRUSH maps each PG to an acting set of OSDs. Proper PG sizing is critical for performance and scalability.
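The traditional sizing guideline from the Ceph documentation is roughly 100 PGs per OSD, divided by the replica count and rounded up to a power of two (newer releases can automate this with the PG autoscaler). A minimal sketch of that rule of thumb:

```python
def recommended_pg_num(num_osds: int, replicas: int,
                       target_pgs_per_osd: int = 100) -> int:
    """Classic rule of thumb: total PGs ~= (OSDs * target per OSD) / replicas,
    rounded up to the next power of two."""
    raw = (num_osds * target_pgs_per_osd) / replicas
    power = 1
    while power < raw:
        power *= 2
    return power

print(recommended_pg_num(9, 3))   # -> 512
```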
2.4 CRUSH
CRUSH rule sets define hierarchical bucket types (e.g., hosts, racks, rows) and enable deterministic, scalable data placement without a central directory.
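A CRUSH rule from a decompiled CRUSH map looks like the following (syntax varies slightly between releases; this rule places each replica on a different host under the `default` root):

```
rule replicated_hosts {
    id 1
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```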
2.5 I/O Operations
Clients obtain the latest cluster map from monitors, then compute the target PG and primary OSD using CRUSH. The client sends I/O to the primary OSD, which coordinates replication to secondary OSDs.
2.5.1 Replicated I/O
The primary OSD writes the object and forwards it to secondary OSDs; an ACK is returned to the client only after all replicas are stored.
2.5.2 Erasure‑Coding I/O
Data is split into K data chunks and M coding chunks (for example, K=10 with M=6 yields 16 chunks in total). The primary OSD encodes the data, distributes the chunks across OSDs, and maintains an authoritative log for recovery.
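The idea can be demonstrated with a toy code: K data chunks plus a single XOR parity chunk (i.e., M=1), from which any one lost data chunk can be rebuilt. Real Ceph erasure profiles use Reed–Solomon codes (e.g., the jerasure plugin), which allow M > 1; this sketch only illustrates the encode/recover principle.

```python
import functools

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Toy erasure code: split into k equal data chunks plus one XOR parity."""
    size = -(-len(data) // k)                        # chunk size, rounded up
    chunks = [data[i*size:(i+1)*size].ljust(size, b'\0') for i in range(k)]
    parity = functools.reduce(xor, chunks)
    return chunks, parity

def recover(chunks, parity, lost: int) -> bytes:
    """Rebuild one missing data chunk by XOR-ing the survivors with the parity."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return functools.reduce(xor, survivors + [parity])

chunks, parity = encode(b"ceph erasure demo", k=4)
assert recover(chunks, parity, lost=2) == chunks[2]
```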
2.6 Self‑Managed Internal Operations
2.6.1 Heartbeat
OSDs report up/down status to monitors; monitors periodically ping OSDs to verify liveness.
2.6.2 Sync
OSDs synchronize PG state across the acting set to achieve consistency.
2.6.3 Rebalancing and Recovery
When new OSDs join, CRUSH recalculates placement, causing a small fraction of data to migrate for balanced distribution.
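This "small fraction migrates" property can be demonstrated with rendezvous (highest-random-weight) hashing, which is not CRUSH itself but shares the design goal: when an OSD is added, only the objects that the new OSD "wins" are moved, roughly 1/n of the total.

```python
import hashlib

def place(obj: str, osds: list) -> str:
    """Rendezvous (HRW) hashing: each object goes to the OSD with the
    highest hash score. Not CRUSH, but illustrates the same property."""
    return max(osds, key=lambda o: hashlib.md5(f"{obj}:{o}".encode()).hexdigest())

objs = [f"obj-{i}" for i in range(1000)]
before = {o: place(o, ["osd.0", "osd.1", "osd.2"]) for o in objs}
after  = {o: place(o, ["osd.0", "osd.1", "osd.2", "osd.3"]) for o in objs}
moved = sum(1 for o in objs if before[o] != after[o])
print(f"{moved/len(objs):.0%} of objects moved")   # roughly 1/4
```

Note that every migrated object lands on the new OSD; nothing shuffles between the existing ones.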
2.6.4 Scrubbing (Verification)
OSDs periodically compare object metadata and, optionally, data bits to detect corruption.
2.7 High Availability
2.7.1 Data Replication
Typical pools maintain three replicas; writes require at least two clean replicas. The system tolerates one OSD failure for reads/writes and up to two failures for reads only.
2.7.2 Monitor Cluster
Multiple monitors form a quorum using Paxos to avoid a single point of failure.
2.7.3 CephX
CephX provides mutual authentication similar to Kerberos, issuing session keys encrypted with the user's permanent key.
Chapter 3 Client Architecture
Ceph clients use RADOS (reliable autonomic distributed object store) protocols. Required prerequisites are a Ceph config file (or cluster name and monitor addresses), a pool name, and user credentials.
3.1 Native Protocol and Librados
Librados offers direct, parallel object access, supporting pool operations, snapshots, object read/write, xattr management, key/value operations, and compound operations with double-ack semantics (one acknowledgement when the update is in memory, a second when it is committed to stable storage).
3.2 Object Watch and Notify
Clients can register persistent watches on objects, receiving notifications from the primary OSD, enabling objects to serve as synchronization channels.
3.3 Exclusive Locks
Exclusive locks allow a single client to obtain an exclusive lock on an RBD image, preventing concurrent writes and protecting against stale clients. The feature is enabled at image creation with --image-features 5, i.e., feature bits 1 (layering) and 4 (exclusive-lock).
3.4 Object Map Index
When enabled, the client maintains an in-memory index of which backing RADOS objects actually exist, allowing operations such as resize, export, copy, flatten, delete, and read to skip non-existent objects. It is activated with --image-features 13 during image creation, i.e., feature bits 1 (layering), 4 (exclusive-lock), and 8 (object-map); the object map depends on the exclusive lock.
3.5 Data Striping
Ceph provides RAID‑0‑like striping to improve throughput. Parameters include object size, stripe width (unit size), and stripe count (number of objects in a stripe set). Proper tuning is essential for performance.
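The mapping from a logical byte offset to a backing object follows directly from those three parameters. The sketch below uses descriptive parameter names (not the exact librados/libcephfs identifiers) and small values so the arithmetic is easy to follow:

```python
def locate(offset: int, stripe_unit: int, stripe_count: int, object_size: int):
    """Map a logical byte offset to (object number, offset within that object)
    under RAID-0-style striping across stripe_count objects per stripe set."""
    block = offset // stripe_unit              # which stripe unit overall
    stripe = block // stripe_count             # which full stripe
    obj_in_set = block % stripe_count          # which object within the stripe set
    units_per_object = object_size // stripe_unit
    object_set = stripe // units_per_object    # stripe sets fill, then a new set starts
    objectno = object_set * stripe_count + obj_in_set
    obj_off = (stripe % units_per_object) * stripe_unit + offset % stripe_unit
    return objectno, obj_off

# stripe_unit=4, stripe_count=2, object_size=8 (two units per object)
print(locate(4, 4, 2, 8))    # -> (1, 0): second unit lands on the next object
print(locate(8, 4, 2, 8))    # -> (0, 4): third unit wraps back to object 0
print(locate(16, 4, 2, 8))   # -> (2, 0): the set is full, a new object set begins
```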
Chapter 4 Encryption
LUKS disk encryption can protect OSD data partitions. Ceph‑ansible invokes ceph‑disk to create encrypted partitions, a lockbox partition, and a client.osd‑lockbox user that stores the LUKS key. The key is stored in the monitor’s KV store and automatically unlocked at service start.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.