Understanding Ceph Architecture: RADOS, OSD, PG Mapping and Data Placement
This article explains Ceph's distributed storage architecture, covering its origins, RADOS client interactions, cluster map updates, the roles of OSDs, Monitors, metadata clusters, and the three-step mapping process from files to objects, placement groups, and finally to storage devices using the CRUSH algorithm.
Ceph, initiated in 2004, is a unified distributed storage system designed for high performance, reliability, and scalability.
Clients interact with the RADOS system by obtaining a ClusterMap from an OSD or Monitor, computing object locations locally, and then communicating directly with the relevant OSDs. As long as the ClusterMap remains current, no separate metadata server is needed for object lookup.
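The lookup flow above can be sketched in a few lines. This is a hypothetical illustration, not Ceph's real client API: the `ClusterMap` fields, the MD5-based hash, and the modular OSD choice (a stand-in for CRUSH) are all assumptions for demonstration.

```python
import hashlib

class ClusterMap:
    """Hypothetical, simplified view of the map a client caches."""
    def __init__(self, epoch, osds, pg_num):
        self.epoch = epoch    # map version; bumped on OSD failure or expansion
        self.osds = osds      # known OSD addresses
        self.pg_num = pg_num  # placement groups in the pool

def locate(cluster_map, object_id):
    """Purely local computation: no metadata-server round trip."""
    h = int(hashlib.md5(object_id.encode()).hexdigest(), 16)
    pg = h % cluster_map.pg_num                          # object -> PG
    osd = cluster_map.osds[pg % len(cluster_map.osds)]   # PG -> OSD (CRUSH stand-in)
    return pg, osd

cmap = ClusterMap(epoch=42, osds=["osd.0", "osd.1", "osd.2"], pg_num=128)
pg, osd = locate(cmap, "rbd_data.1234.0000000000000000")
# The client now talks to `osd` directly; the same inputs always yield
# the same answer, so every client computes identical placements.
```

Because the computation is deterministic, any client holding the same ClusterMap epoch reaches the same OSD without coordination.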
ClusterMap updates occur only when OSDs fail or the cluster expands, events that are far less frequent than normal data accesses.
OSDs rely on the underlying filesystem's extended attributes (xattrs) to store object state and metadata. ext4 limits xattrs to 4 KB, XFS allows up to 64 KB, and Btrfs imposes no practical limit but is less stable, making XFS the recommended choice for production.
The Ceph logical architecture includes Clients (data slicing and CRUSH-based object location), OSDs (data storage, replication, recovery, and reporting), Monitors (cluster state monitoring and mapping), and optional Metadata Clusters for CephFS.
A cluster is divided into Pools, each containing multiple Placement Groups (PGs); objects are split from files, mapped to PGs, and PGs are then mapped to a set of OSDs via the CRUSH algorithm, enabling dynamic object-to-OSD placement and simplifying data distribution.
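A minimal data model can make the Pool → PG → OSD hierarchy concrete. The pool parameters, OSD ids, and the round-robin placement function below are illustrative assumptions, not Ceph's actual CRUSH behavior.

```python
# Illustrative model of the Pool -> PG -> OSD hierarchy (all values assumed).
pool = {
    "name": "rbd",
    "pg_num": 8,   # placement groups in this pool
    "size": 3,     # replica count
}
osds = list(range(6))  # six OSD ids: 0..5

def pg_to_osds(pg_id, osds, size):
    # Stand-in for CRUSH: pick `size` distinct OSDs deterministically.
    start = pg_id % len(osds)
    return [osds[(start + i) % len(osds)] for i in range(size)]

# One acting set per PG; objects are never tracked individually.
acting_sets = {pg: pg_to_osds(pg, osds, pool["size"])
               for pg in range(pool["pg_num"])}
```

The key property this models: placement state exists per PG, not per object, so remapping a PG to new OSDs moves all of its objects as one unit.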
Without PGs, OSDs would need to exchange information for millions of objects, leading to prohibitive maintenance overhead; PGs reduce this by grouping objects and limiting inter‑OSD communication.
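Some back-of-envelope arithmetic shows the scale of the difference. The cluster sizes below are assumptions chosen only to illustrate the orders of magnitude involved.

```python
# Assumed cluster parameters for a rough comparison.
objects = 10_000_000   # objects stored cluster-wide
osd_count = 100
replicas = 3
pg_num = 4096          # a typical order of magnitude for this OSD count

# Without PGs: replication/peering state would be tracked per object,
# on the order of tens of millions of relations.
per_object_relations = objects * replicas

# With PGs: each OSD only peers about the PGs it hosts,
# a few hundred entries at most.
pgs_per_osd = pg_num * replicas / osd_count
```

With these numbers, per-object tracking means 30 million relations, while each OSD hosts only around 120 PGs: the state OSDs must exchange shrinks by several orders of magnitude.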
The data addressing process involves three mappings: File → Object (splitting files into fixed‑size objects), Object → PG (hashing object IDs to PG IDs), and PG → OSD (using CRUSH to select N OSDs for each PG, typically with at least two replicas).
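The three mappings can be sketched end to end. This is a simplified model: the 4 MiB object size is a common default, but the naming scheme, the SHA-1 hashing, and the rendezvous-style OSD ranking used in place of CRUSH are assumptions for illustration.

```python
import hashlib

OBJECT_SIZE = 4 * 2**20  # 4 MiB objects (a common default)

def file_to_objects(file_id, file_size):
    """Mapping 1, File -> Object: split into fixed-size stripes."""
    count = (file_size + OBJECT_SIZE - 1) // OBJECT_SIZE
    return [f"{file_id}.{i:08x}" for i in range(count)]

def object_to_pg(oid, pg_num):
    """Mapping 2, Object -> PG: stable hash of the object id."""
    return int(hashlib.sha1(oid.encode()).hexdigest(), 16) % pg_num

def pg_to_osds(pg_id, osd_ids, n):
    """Mapping 3, PG -> OSD: stand-in for CRUSH selecting N distinct OSDs."""
    ranked = sorted(osd_ids,
                    key=lambda o: hashlib.sha1(f"{pg_id}:{o}".encode()).digest())
    return ranked[:n]

# A 10 MiB file splits into 3 objects, each landing on 3 of 12 OSDs.
objs = file_to_objects("10000001", 10 * 2**20)
placement = {o: pg_to_osds(object_to_pg(o, 128), range(12), 3) for o in objs}
```

Note that every step is a pure function of its inputs, which is what lets any client recompute the full placement from the ClusterMap alone.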
Proper sizing of PGs and sufficient numbers of OSDs (tens to hundreds) are crucial for balanced data distribution and system performance.
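A commonly cited sizing heuristic from the Ceph documentation targets roughly 100 PGs per OSD: total PGs ≈ (OSD count × 100) / replica count, rounded up to a power of two. A small helper can sketch that calculation; the function name is an assumption, not a Ceph tool.

```python
def suggest_pg_num(osd_count, replicas, target_pgs_per_osd=100):
    """Heuristic pg_num: (OSDs * target) / replicas, rounded up to 2^k."""
    raw = osd_count * target_pgs_per_osd / replicas
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

suggest_pg_num(12, 3)    # 12 OSDs, 3 replicas -> raw 400 -> 512
```

Undersized pg_num leads to uneven data distribution; oversized pg_num raises per-OSD resource use, so the power-of-two rounding is a compromise, not a hard rule.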
Architects' Tech Alliance