How Ceph Maps Files to Objects, PGs, and OSDs: A Deep Dive into RADOS Architecture
This article explains Ceph's distributed storage design, detailing how clients use RADOS to map files to objects, objects to placement groups, and placement groups to OSDs, while covering component roles, metadata handling, Xattr limits, and the impact of PGs on OSD communication overhead.
Ceph, launched in 2004, is a unified distributed storage system built for high performance, reliability, and scalability.
When using the RADOS layer, a client obtains the ClusterMap from the Monitors or OSDs, computes an object's location locally, and then communicates directly with the responsible OSD. No separate metadata server is needed in the data path, and the client's computed locations stay valid for as long as the ClusterMap is unchanged.
ClusterMap updates occur only when an OSD fails or the cluster size changes, events that are far less frequent than normal data accesses.
Key Components
Client: Deployed on Linux servers, it slices data into objects and uses the CRUSH algorithm to locate them for read/write operations.
OSD: Stores data, handles replication, recovery, and backfilling, and reports status to the Monitors; typically one OSD runs per physical disk, on top of Btrfs, XFS, or ext4.
Monitor: Tracks OSD health and maintains the OSD, PG, and CRUSH maps.
Metadata Cluster: Manages file metadata and is required only for CephFS.
Logical Architecture
A Cluster consists of multiple Pools.
Each Pool contains several Placement Groups (PGs) and defines replica counts.
Files are split into multiple Objects.
Each Object maps to a PG; a PG can contain many Objects.
A PG maps to a set of OSDs, with one primary and the rest secondary.
Multiple PGs can map to the same OSD, and an OSD can host hundreds of PGs.
Introducing PGs decouples OSDs from individual objects, simplifying storage management and enabling dynamic object‑to‑OSD mapping.
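The relationships above can be sketched as a small data model. This is a toy illustration only: the class names, the sample object IDs and OSD numbers, and the linear lookup are all hypothetical (a real client computes the PG by hashing rather than searching). It shows the decoupling in practice: remapping one PG relocates every object in it with a single update, and no per-object record changes.

```python
from dataclasses import dataclass, field


@dataclass
class PlacementGroup:
    pg_id: int
    osds: list                                   # first entry is treated as the primary
    objects: set = field(default_factory=set)    # a PG can contain many objects

    @property
    def primary(self):
        return self.osds[0]


@dataclass
class Pool:
    name: str
    replicas: int
    pgs: dict = field(default_factory=dict)      # pg_id -> PlacementGroup

    def osds_for(self, object_id):
        """Find the PG holding the object, then return that PG's OSDs."""
        for pg in self.pgs.values():
            if object_id in pg.objects:
                return pg.osds
        raise KeyError(object_id)


# Hypothetical sample data: two PGs that both use OSD 5.
pool = Pool(name="demo", replicas=3)
pool.pgs[7] = PlacementGroup(pg_id=7, osds=[2, 5, 9], objects={"obj-a", "obj-b"})
pool.pgs[8] = PlacementGroup(pg_id=8, osds=[5, 1, 3], objects={"obj-c"})

before = pool.osds_for("obj-a")
pool.pgs[7].osds = [4, 5, 9]    # remap the whole PG, e.g. after OSD 2 fails
after = pool.osds_for("obj-a")
print(before, "->", after)
```

Both obj-a and obj-b follow the PG to its new OSD set, while obj-c in the other PG is unaffected.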
Mapping Process
1. File → Object: The file is split into fixed-size objects (commonly 2 MB or 4 MB), similar to RAID striping. This turns files of arbitrary size into uniformly sized objects that RADOS can manage, and it lets multiple objects be handled in parallel instead of processing the file serially.
2. Object → PG: Each object ID (OID) is hashed to determine its PG using the formula Hash(OID) & Mask → PGID, where Mask is derived from the number of PGs in the pool.
3. PG → OSD: The CRUSH algorithm takes the PGID and selects N OSDs, where N is the pool's replica count, to store and maintain the objects belonging to that PG.
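The three steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not Ceph's implementation: the object naming scheme and the SHA-256-based hash are stand-ins, the mask is assumed to be the PG count minus one (so the PG count must be a power of two), and step 3 uses rendezvous hashing as a simple substitute for CRUSH, which in reality also accounts for the cluster hierarchy and device weights.

```python
import hashlib

OBJECT_SIZE = 4 * 1024 * 1024  # 4 MB, one of the object sizes mentioned above


def split_into_objects(file_id, file_size, object_size=OBJECT_SIZE):
    """Step 1: return (object_id, offset, length) for each fixed-size slice."""
    objects = []
    for index, offset in enumerate(range(0, file_size, object_size)):
        length = min(object_size, file_size - offset)
        # Hypothetical naming scheme: file ID plus a zero-padded slice index.
        objects.append((f"{file_id}.{index:08x}", offset, length))
    return objects


def stable_hash(text):
    """Stand-in hash (not the one Ceph uses): first 4 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:4], "big")


def object_to_pg(object_id, pg_count):
    """Step 2: Hash(OID) & Mask -> PGID, assuming Mask = pg_count - 1."""
    if pg_count <= 0 or pg_count & (pg_count - 1):
        raise ValueError("pg_count must be a power of two for a simple mask")
    return stable_hash(object_id) & (pg_count - 1)


def pg_to_osds(pg_id, osd_ids, replicas):
    """Step 3 stand-in: rank OSDs by hash(pg, osd) and keep the top N.

    This is rendezvous hashing, used here only in place of CRUSH.
    """
    ranked = sorted(osd_ids, key=lambda osd: stable_hash(f"{pg_id}:{osd}"), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    osds = list(range(12))  # hypothetical cluster of 12 OSDs
    for oid, offset, length in split_into_objects("file-a", 10 * 1024 * 1024):
        pg = object_to_pg(oid, pg_count=128)
        acting = pg_to_osds(pg, osds, replicas=3)
        print(f"{oid} offset={offset} len={length} -> pg {pg} "
              f"-> primary osd.{acting[0]}, secondaries {acting[1:]}")
```

Every function is deterministic, so any client holding the same inputs (the stand-in here for the ClusterMap) computes the same placement without asking a central service.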
Operational Considerations
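Without PGs, replication would have to be tracked per object: an OSD would need to exchange status information with every other OSD that holds a replica of any of its objects. With millions of objects per OSD, that overhead is prohibitive. With PGs, an OSD coordinates only with the OSDs that share its PGs, and it hosts just hundreds of those.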
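A back-of-the-envelope calculation shows the scale of the difference. The figures below are hypothetical, chosen only to match the orders of magnitude in the text (millions of objects versus hundreds of PGs per OSD), and the model is deliberately crude: it counts one tracked relationship per replica peer of each unit an OSD holds.

```python
def replication_relationships(units_on_osd, replicas):
    """Each unit (object or PG) on an OSD is shared with (replicas - 1) peers."""
    return units_on_osd * (replicas - 1)


objects_per_osd = 1_000_000   # hypothetical
pgs_per_osd = 100             # hypothetical
replicas = 3                  # hypothetical pool replica count

without_pgs = replication_relationships(objects_per_osd, replicas)
with_pgs = replication_relationships(pgs_per_osd, replicas)

print(without_pgs)              # 2000000
print(with_pgs)                 # 200
print(without_pgs // with_pgs)  # 10000
```

Under these assumptions, grouping objects into PGs cuts the replication state an OSD has to track by four orders of magnitude.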
OSD metadata is stored in filesystem extended attributes (xattrs). ext4 provides only 4 KB and XFS up to 64 KB, while Btrfs has no practical limit but is less stable; production deployments therefore typically prefer XFS.
The number of OSDs influences how uniformly data is distributed; a healthy Ceph cluster usually runs dozens to hundreds of OSDs, spread across multiple failure domains (rack, room, etc.) to ensure resilience.
Overall, understanding Ceph's logical architecture and the three‑step mapping process is essential for designing efficient, scalable storage solutions.
Architects' Tech Alliance
