Unlocking Ceph: How Distributed Storage Powers Modern Cloud Infrastructures
This article explains the fundamentals of Ceph, a high‑performance, highly available and scalable distributed storage system, covering its architecture, core components, data placement algorithms, storage interfaces, and typical deployment scenarios in cloud environments.
Ceph Overview
What is distributed storage? Imagine many servers, each with multiple disks, combined by software into a single logical storage pool. Users access this pool through a unified interface; files are split into small chunks and stored across different servers and disks, providing redundancy and fault tolerance.
Ceph is a unified distributed storage system designed for performance, reliability, and scalability. It offers file, block, and object storage from a single cluster and can be expanded dynamically. Many Chinese cloud providers use Ceph as the sole storage backend for OpenStack to improve data transfer efficiency.
Ceph originated from the doctoral research of Sage Weil (first results published in 2004) and was later contributed to the open-source community. After years of development, it is now supported by many cloud vendors; Red Hat and OpenStack both integrate with Ceph, for example to store virtual-machine images.
Official site: https://ceph.com/
Documentation: http://docs.ceph.org.cn/rados/
Ceph Features
High Performance
Uses the CRUSH algorithm instead of centralized metadata lookup, achieving balanced data distribution and high parallelism.
Considers fault‑domain isolation, allowing replica placement rules across rooms, racks, etc.
Scales to thousands of storage nodes, handling TB to PB‑level data.
High Availability
Replica count is flexible (typically three copies in production).
Supports fault‑domain separation and strong data consistency.
Automatically repairs various failure scenarios.
No single point of failure; the system automatically restores missing replicas.
High Scalability
Decentralized architecture.
Flexible expansion.
Capacity and throughput grow roughly linearly as nodes are added.
Rich Features
Supports three storage interfaces: block (raw disks), file (POSIX directories), and object (key‑value storage).
Customizable interfaces and multi‑language drivers.
Ceph Application Scenarios
Ceph provides object storage, block device storage, and file system services. Its object storage can back cloud‑drive applications (e.g., ownCloud). Its block storage integrates with IaaS platforms such as OpenStack, CloudStack, ZStack, Eucalyptus, and KVM.
Ceph offers three main functions:
Object Storage (RADOSGW) : RESTful API, compatible with S3 and Swift.
Block Storage (RBD) : Provides virtual disks with built‑in disaster‑recovery.
File System (CephFS) : POSIX‑compatible network file system for high‑performance, large‑capacity storage.
What are block, object, and file system storage?
Object storage : Key‑value store with simple GET/PUT/DELETE APIs (e.g., Swift, S3).
Block storage : Exposes a block‑device interface (e.g., Linux kernel block device, QEMU driver) such as RBD, EBS, etc.
File system storage : POSIX‑compatible interface (e.g., CephFS, GlusterFS, HDFS, NFS, NAS).
Ceph Core Components
Monitors (MON) : Maintain cluster maps, provide authentication and logging.
Metadata Server (MDS) : Stores metadata for CephFS (not needed for block or object storage).
OSD (Object Storage Daemon) : Typically one daemon per disk; stores data as objects and handles replication, recovery, backfilling, and rebalancing.
RADOS : Reliable Autonomic Distributed Object Store, the foundation layer that stores all objects.
librados : Library offering native APIs for applications.
RADOSGW : Gateway providing S3/Swift‑compatible RESTful object storage.
RBD : Block device interface built on top of RADOS.
CephFS : POSIX‑compatible file system built on librados.
Ceph Logical Layer Structure
RADOS System Logical Structure
Ceph Data Storage Process
How a File Is Stored and Retrieved in Ceph
When a user uploads a file, Ceph splits it into fixed-size objects (the last one may be smaller). Each object is hashed into a Placement Group (PG), and each PG is then mapped to one or more OSDs.
All storage types (object, block, file) break data into objects of configurable size (typically 2 MiB or 4 MiB). Each object receives a unique OID composed of the file ID (ino) and the object number (ono).
Example: File ID A split into two objects yields OIDs A0 and A1.
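The file-to-object step can be sketched in a few lines of Python. This is only an illustration of the idea described above: the function name split_into_objects and the plain "ino + ono" string format are assumptions made for this example, not Ceph's actual object naming.

```python
# Illustrative sketch: the OID format here is an assumption, not Ceph's real naming scheme.
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB, one of the typical object sizes mentioned above


def split_into_objects(ino: str, data: bytes, object_size: int = OBJECT_SIZE):
    """Return a list of (oid, chunk) pairs, where oid = file ID (ino) + object number (ono)."""
    objects = []
    for ono, offset in enumerate(range(0, len(data), object_size)):
        oid = f"{ino}{ono}"
        objects.append((oid, data[offset:offset + object_size]))
    return objects


if __name__ == "__main__":
    payload = bytes(6 * 1024 * 1024)  # a 6 MiB file filled with zero bytes
    for oid, chunk in split_into_objects("A", payload):
        print(oid, len(chunk))
```

With a 6 MiB file and 4 MiB objects, file A yields objects A0 (4 MiB) and A1 (the remaining 2 MiB).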
Ceph Logical Mapping Layers
File → Object mapping.
Object → PG mapping using hash(oid) & mask → pgid, where mask is the pool's PG count minus one (the PG count is normally a power of two).
PG → OSD mapping via the CRUSH algorithm.
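A minimal sketch of the object-to-PG step follows. It assumes the PG count is a power of two so that a bit mask works, and it uses CRC32 from the Python standard library as a stand-in hash; Ceph uses its own hash function, so the PG numbers produced here will not match a real cluster. The function name object_to_pg and the "pool.hex" output format are choices made for this example.

```python
import zlib


def object_to_pg(oid: str, pool_id: int, pg_num: int) -> str:
    """Map an object ID to a PG ID via hash(oid) & mask.

    Assumption: pg_num is a power of two, so mask = pg_num - 1.
    CRC32 is only a stand-in for the hash Ceph really uses.
    """
    if pg_num <= 0 or pg_num & (pg_num - 1):
        raise ValueError("this sketch assumes pg_num is a power of two")
    mask = pg_num - 1
    h = zlib.crc32(oid.encode("utf-8"))
    return f"{pool_id}.{h & mask:x}"


if __name__ == "__main__":
    for oid in ("A0", "A1"):
        print(oid, "->", object_to_pg(oid, pool_id=1, pg_num=128))
```

Because the mapping is a pure function of the object ID and the PG count, every client computes the same PG for the same object without consulting a lookup table.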
CRUSH (Controlled Replication Under Scalable Hashing) replaces metadata tables with a deterministic algorithm that computes data placement, understands the cluster topology, and creates multiple replicas for fault tolerance, enabling self‑management and self‑healing.
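The idea of computing placement instead of looking it up can be illustrated with a much simpler technique. The sketch below is not CRUSH: it uses rendezvous (highest-random-weight) hashing over a hypothetical three-rack topology to pick one OSD per rack. The TOPOLOGY dictionary, rack names, and OSD numbers are invented for this example.

```python
import hashlib

# Hypothetical topology: rack name -> OSD ids. A real CRUSH map is richer than this.
TOPOLOGY = {
    "rack-a": [0, 1, 2],
    "rack-b": [3, 4, 5],
    "rack-c": [6, 7, 8],
}


def _score(pgid: str, item: str) -> int:
    """Deterministic pseudo-random score for a (PG, item) pair."""
    digest = hashlib.sha256(f"{pgid}/{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def place_pg(pgid: str, replicas: int = 3, topology=TOPOLOGY):
    """Pick `replicas` OSDs, each from a different rack; the first is treated as primary.

    This is rendezvous hashing used as a stand-in, not the CRUSH algorithm.
    """
    if replicas > len(topology):
        raise ValueError("not enough racks for the requested replica count")
    # Rank racks by score, keep the best `replicas` racks (one fault domain per copy).
    racks = sorted(topology, key=lambda r: _score(pgid, r), reverse=True)[:replicas]
    # Inside each chosen rack, take the OSD with the highest score.
    return [max(topology[r], key=lambda osd: _score(pgid, f"osd.{osd}")) for r in racks]


if __name__ == "__main__":
    print(place_pg("1.2f"))
```

Like CRUSH, this stand-in needs only the PG ID and the topology to compute a placement, and it never puts two copies in the same rack.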
RADOS Advantages Over Traditional Distributed Storage
Maps files to objects and uses CRUSH to locate data, avoiding block‑map lookups.
Leverages OSD intelligence to maximize scalability.
Ceph I/O Flow and Data Distribution
Normal I/O Flow
Steps:
Client creates a cluster handle.
Client reads the configuration file.
Client connects to monitors to obtain the cluster map.
Client issues I/O requests; CRUSH determines the primary OSD.
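Primary OSD writes the data locally and forwards it to the replica OSDs (two of them with the typical three-copy setting).
Primary OSD waits for acknowledgments from the replicas.
After all copies are written, the primary acknowledges the client and the write completes.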
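The write path in the steps above can be mimicked with an in-memory model. This is a toy simulation, not librados code: the OSD class and replicated_write function are hypothetical names, and real OSDs communicate over the network and persist to disk.

```python
class OSD:
    """In-memory stand-in for an object storage daemon."""

    def __init__(self, osd_id: int):
        self.osd_id = osd_id
        self.store = {}

    def write(self, oid: str, data: bytes) -> bool:
        self.store[oid] = data
        return True  # acknowledgment


def replicated_write(acting_set, oid: str, data: bytes) -> bool:
    """Send the write to the primary, which forwards it to every replica.

    Completion is reported only when the primary and all replicas have acknowledged.
    """
    primary, *replicas = acting_set
    primary_ok = primary.write(oid, data)
    replica_acks = [replica.write(oid, data) for replica in replicas]
    return primary_ok and all(replica_acks)


if __name__ == "__main__":
    osds = [OSD(i) for i in range(3)]  # primary plus two replicas
    done = replicated_write(osds, "A0", b"hello")
    print(done, [sorted(o.store) for o in osds])
```

The client in this model talks only to the first OSD in the set; the replicas receive their copy from the primary.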
New Primary I/O Flow
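When a new OSD becomes the primary for a PG (for example, after the previous primary fails or the cluster is expanded), it initially holds no data for that PG and cannot serve I/O. An OSD that already holds the PG data temporarily acts as primary and syncs the data to the new OSD; once synchronization finishes, the new OSD takes over as primary.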
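The handover can be pictured with a small model in which each OSD is just a dictionary of objects. The PlacementGroup class and its attribute names are hypothetical, and the sketch ignores writes that arrive during synchronization, which a real cluster must also handle.

```python
class PlacementGroup:
    """Toy model of the temporary-primary handover for one PG."""

    def __init__(self, data_holder: dict, new_osd: dict):
        self.data_holder = data_holder      # OSD that already stores the PG's objects
        self.new_osd = new_osd              # newly assigned primary, initially empty
        self.acting_primary = data_holder   # serves I/O while the new OSD is empty

    def read(self, oid: str) -> bytes:
        return self.acting_primary[oid]

    def sync(self) -> None:
        """Copy missing objects to the new OSD, then let it take over as primary."""
        for oid, data in self.data_holder.items():
            self.new_osd.setdefault(oid, data)
        self.acting_primary = self.new_osd


if __name__ == "__main__":
    pg = PlacementGroup({"A0": b"x", "A1": b"y"}, {})
    print(pg.read("A0"))   # served by the OSD that already has the data
    pg.sync()
    print(pg.read("A1"))   # now served by the new primary
```

Reads keep working throughout because the acting primary is always an OSD that holds the objects.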
Ceph Pool and PG Distribution
A pool is a logical namespace that contains a configurable number of PGs. Objects within PGs are mapped to OSDs across the cluster. Pools can be used for fault‑domain isolation based on different user scenarios.
Source: https://www.cnblogs.com/shuaiyin/p/11037909.html
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.