Understanding Ceph Architecture: RADOS, OSD, Monitor, and Data Mapping
This article provides a comprehensive overview of Ceph's distributed storage architecture, explaining the roles of RADOS, OSDs, Monitors, and the Metadata Cluster, and detailing the three-step mapping from file to objects, objects to placement groups, and placement groups to OSDs.
Ceph started in 2004 as a unified distributed storage system designed for excellent performance, reliability, and scalability.
When using RADOS, a client obtains the ClusterMap from the OSDs or Monitors, computes an object's storage location locally, and then communicates directly with the appropriate OSD to perform data operations. This eliminates the need for a separate metadata server on the normal access path.
The ClusterMap is updated only when the system state changes, which typically occurs in two cases: an OSD failure or an expansion of the RADOS cluster. These events are far less frequent than client data accesses.
OSDs rely on the underlying filesystem’s extended attributes (Xattrs) to record object state and metadata. ext4 provides only 4 KB, XFS up to 64 KB, while Btrfs has no limit but is considered less stable; production deployments usually prefer XFS, with Btrfs used for testing.
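As a toy illustration of why the backend filesystem matters, the sketch below (a hypothetical helper, not Ceph code) checks whether a serialized object-metadata blob fits within each filesystem's xattr budget as stated above:

```python
# Toy illustration (not Ceph code): per-filesystem xattr limits
# as described above. None stands for "no practical limit" (Btrfs).
XATTR_LIMITS = {
    "ext4": 4 * 1024,    # 4 KB of xattr space
    "xfs": 64 * 1024,    # up to 64 KB
    "btrfs": None,       # no limit, but considered less stable
}

def fits_in_xattr(fs: str, metadata: bytes) -> bool:
    """Return True if the metadata blob fits the filesystem's xattr budget."""
    limit = XATTR_LIMITS[fs]
    return limit is None or len(metadata) <= limit

blob = b"x" * 8192  # an 8 KB object-state record
print(fits_in_xattr("ext4", blob))  # exceeds ext4's 4 KB budget
print(fits_in_xattr("xfs", blob))   # fits within XFS's 64 KB budget
```

This is why an object-state record that is perfectly safe on XFS can silently hit the ceiling on ext4, and why production deployments favor XFS.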
Client: Deployed on Linux servers, it slices data, uses the CRUSH algorithm to locate objects, and performs read/write operations.
OSD: Stores data and handles replication, recovery, and backfilling, reporting status to the Monitors. A cluster requires at least two OSDs, each typically backed by a physical disk formatted with Btrfs, XFS, or ext4.
Monitor: Monitors OSD health, maintains OSD, Placement Group (PG), and CRUSH mappings.
Metadata Cluster: Manages file metadata and is required only for CephFS.
The logical architecture of Ceph is crucial for understanding data layout:
A Cluster can be divided into multiple Pools.
Each Pool contains several logical Placement Groups (PGs) and defines the replica count.
A physical file is split into multiple Objects.
Each Object maps to a PG; a PG can contain many Objects.
A PG maps to a set of OSDs, with the first OSD acting as the Primary and the others as Secondaries.
Many PGs can map to the same OSD, and an OSD can host hundreds of PGs.
The PG concept decouples OSDs from individual Objects, enabling dynamic remapping when OSDs are added or fail.
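To see why this decoupling makes remapping cheap, consider the minimal sketch below. Plain hashing stands in for CRUSH, and all names are invented for illustration: when an OSD fails, only the PG → OSD step is recomputed, while the Object → PG assignment is untouched.

```python
import hashlib

def stable_hash(key: str) -> int:
    # Deterministic hash so the mapping is reproducible across runs.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def object_to_pg(oid: str, pg_num: int) -> int:
    # Object -> PG: depends only on the object name and the pool's PG count.
    return stable_hash(oid) % pg_num

def pg_to_osds(pgid: int, osds: list, replicas: int) -> list:
    # PG -> OSD: a stand-in for CRUSH; rank OSDs by a hash of (pgid, osd).
    ranked = sorted(osds, key=lambda o: stable_hash(f"{pgid}:{o}"))
    return ranked[:replicas]  # first is the Primary, the rest Secondaries

osds = ["osd.0", "osd.1", "osd.2", "osd.3"]
pg = object_to_pg("obj-42", pg_num=128)
before = pg_to_osds(pg, osds, replicas=2)

# Simulate losing the Primary: only the PG -> OSD step changes.
surviving = [o for o in osds if o != before[0]]
after = pg_to_osds(pg, surviving, replicas=2)
assert object_to_pg("obj-42", pg_num=128) == pg  # Object -> PG unchanged
```

Without the PG layer, every object would carry its own OSD assignment, and an OSD failure would force per-object recomputation instead of per-PG.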
The data addressing process consists of three mapping stages:
1. File → Object mapping: the file is split into fixed-size Objects (typically 2 MB or 4 MB) to allow parallel processing and efficient management by RADOS.
2. Object → PG mapping: each Object is hashed to obtain a PG ID using the formula `Hash(OID) & Mask → PGID`.
3. PG → OSD mapping: the PG ID is fed into the CRUSH algorithm, which selects a set of N OSDs responsible for storing and maintaining the PG's Objects.
In practice, an OSD typically participates in hundreds of PGs, each replicated across multiple OSDs, resulting in thousands of inter-OSD status exchanges. Without PGs, OSDs would have to exchange state for millions of individual Objects, which would be prohibitively costly.
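The three stages can be sketched end to end in a few lines. This is illustrative only: real Ceph uses the rjenkins hash and the full CRUSH algorithm, and the file name, sizes, and helper names below are invented.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB objects, as in step 1
PG_NUM = 256                   # power of two, so Mask = PG_NUM - 1 works
REPLICAS = 3
OSDS = [f"osd.{i}" for i in range(8)]

def file_to_objects(name: str, data: bytes) -> dict:
    # Step 1: split the file into fixed-size objects named <file>.<index>.
    return {f"{name}.{i // OBJECT_SIZE}": data[i:i + OBJECT_SIZE]
            for i in range(0, len(data), OBJECT_SIZE)}

def object_to_pg(oid: str) -> int:
    # Step 2: Hash(OID) & Mask -> PGID.
    h = int(hashlib.md5(oid.encode()).hexdigest(), 16)
    return h & (PG_NUM - 1)

def pg_to_osds(pgid: int) -> list:
    # Step 3: CRUSH stand-in; deterministically pick REPLICAS distinct OSDs.
    ranked = sorted(OSDS,
                    key=lambda o: hashlib.md5(f"{pgid}:{o}".encode()).hexdigest())
    return ranked[:REPLICAS]

objects = file_to_objects("movie.mkv", b"\0" * (10 * 1024 * 1024))
for oid in objects:                       # a 10 MB file yields 3 objects
    primary, *secondaries = pg_to_osds(object_to_pg(oid))
```

Note that no central lookup table appears anywhere: every placement is recomputed from the object name and the ClusterMap, which is exactly what lets clients address data without a metadata server.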