Mastering Ceph: Core Architecture, Data Flow, and Easy CephFS Deployment
This article provides a comprehensive overview of Ceph's distributed storage architecture, explains the CRUSH algorithm, data placement, OSD, monitor, and MDS components, and offers step‑by‑step instructions for installing and configuring a basic CephFS cluster.
Overview
Ceph is a distributed storage system created in 2004, originally aimed at building a next‑generation high‑performance distributed file system. With the rise of cloud computing, Ceph gained popularity as a key OpenStack backend.
CRUSH algorithm
The CRUSH algorithm replaces traditional centralized metadata addressing, using consistent hashing with fault‑domain awareness to place replicas across racks, rooms, or data centers, and can scale to thousands of storage nodes.
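To make the idea of fault‑domain‑aware placement concrete, here is a minimal sketch. It is not the real CRUSH implementation: it substitutes rendezvous (highest‑random‑weight) hashing, and the rack names, OSD ids, and the helpers `score`, `place`, and `rack_of` are all hypothetical. It only illustrates how any client can compute replica locations deterministically, with at most one replica per rack and no central lookup table.

```python
import hashlib

# Hypothetical topology (rack name -> OSD ids); not taken from the article.
TOPOLOGY = {
    "rack-a": [0, 1, 2],
    "rack-b": [3, 4, 5],
    "rack-c": [6, 7, 8],
}

def score(key, item):
    """Deterministic pseudo-random score for a (key, item) pair."""
    digest = hashlib.sha256(f"{key}/{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def place(pg_id, replicas, topology=TOPOLOGY):
    """Pick `replicas` OSDs for a PG, never more than one per rack."""
    if replicas > len(topology):
        raise ValueError("not enough racks to isolate every replica")
    key = f"pg-{pg_id}"
    # Rank the racks for this PG, keep the top `replicas` of them,
    # then take the best-scoring OSD inside each chosen rack.
    racks = sorted(topology, key=lambda r: score(key, r), reverse=True)[:replicas]
    return [max(topology[r], key=lambda o: score(key, f"osd.{o}")) for r in racks]

def rack_of(osd, topology=TOPOLOGY):
    """Return the rack that holds a given OSD."""
    return next(r for r, osds in topology.items() if osd in osds)

osds = place(pg_id=42, replicas=3)
print(osds, [rack_of(o) for o in osds])
```

Because the result depends only on the PG id and the topology, every client that holds the same map computes the same answer.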
High availability
Administrators define the number of data replicas, and CRUSH determines their physical locations to isolate failure domains, ensuring strong consistency and automatic parallel recovery.
High scalability
Ceph has no central control node, so client I/O never passes through a single proxy. As the cluster grows, aggregate performance scales roughly linearly with the number of disks.
Rich features
Ceph supports three access interfaces: Object storage, Block storage, and Filesystem mount. All three can be used simultaneously, and many cloud environments use Ceph as the sole OpenStack backend.
Ceph Basic Structure
Basic components diagram
At the bottom lies RADOS, the core distributed storage layer written in C++. Clients use the native Librados API (C/C++) to communicate with the cluster via sockets.
RADOS Gateway (RGW) provides S3/Swift‑compatible RESTful APIs, while RBD offers a block‑device interface commonly used with KVM/QEMU. CephFS supplies a POSIX‑compatible filesystem that can be mounted through the kernel client or through FUSE.
Ceph Core Components
OSD
Stores data as objects and handles replication, recovery, back‑filling, and rebalancing. Each OSD exchanges heartbeats with its peers and reports its state to the monitors.
MDS (optional)
Provides metadata services for CephFS; not required unless the filesystem interface is used.
Monitor
Tracks cluster state and maintains the authoritative cluster map, keeping every node's view of the cluster consistent.
OSD Details
Data storage process
All data is split into objects (typically 2 MiB or 4 MiB). Each object receives a unique OID composed of a file ID (ino) and a chunk number (ono). Objects are placed into Placement Groups (PGs), which act like index buckets for efficient lookup and migration.
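The splitting step can be sketched as follows. This is an illustration under stated assumptions, not Ceph's actual code: it ignores striping options, assumes a fixed 4 MiB object size, and formats the OID as the inode number in hex, a dot, and the chunk number as eight hex digits. The function name `object_ids` is hypothetical.

```python
def object_ids(ino, file_size, object_size=4 * 1024 * 1024):
    """Return (oid, offset, length) for every object backing one file.

    Assumes a simple layout with no striping; the OID format
    "<ino hex>.<ono as 8 hex digits>" is an assumption of this sketch.
    """
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    chunks = []
    offset = 0
    ono = 0  # chunk (object) number within the file
    while offset < file_size:
        length = min(object_size, file_size - offset)
        chunks.append((f"{ino:x}.{ono:08x}", offset, length))
        offset += length
        ono += 1
    return chunks

# A 10 MiB file becomes two full 4 MiB objects plus one 2 MiB tail object.
for oid, offset, length in object_ids(0x10000000000, 10 * 1024 * 1024):
    print(oid, offset, length)
```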
PG assignment is computed as pg_id = hash(oid) % num_pg. The number of PGs influences data distribution uniformity.
locator = object_name
obj_hash = hash(locator)
pg = obj_hash % num_pg
osds_for_pg = crush(pg) # returns a list of osds
primary = osds_for_pg[0]
replicas = osds_for_pg[1:]
PGs are replicated according to the configured replica count and stored on different OSDs via CRUSH.
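The object‑to‑PG step of the pseudocode can be made runnable to see how objects spread across PGs. This is a rough sketch: `zlib.crc32` stands in for Ceph's own hash function, and the helper names `pg_for` and `pg_histogram` are hypothetical.

```python
import zlib
from collections import Counter

def pg_for(object_name, num_pg):
    """Map an object name to a PG id: hash(oid) % num_pg.

    zlib.crc32 is a stand-in hash chosen for this sketch only.
    """
    return zlib.crc32(object_name.encode()) % num_pg

def pg_histogram(names, num_pg):
    """Count how many of the given objects fall into each PG."""
    counts = Counter(pg_for(name, num_pg) for name in names)
    return [counts.get(pg, 0) for pg in range(num_pg)]

names = [f"10000000000.{i:08x}" for i in range(10_000)]
hist = pg_histogram(names, num_pg=64)
print("objects per PG: min", min(hist), "max", max(hist))
```

Running the histogram with different `num_pg` values gives a feel for how the PG count affects how evenly objects are distributed.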
OSD journal
With the FileStore backend, each OSD maintains a journal (default 5 GiB) that buffers writes before they reach the data disk, similar to the redo log in MySQL InnoDB. Placing journals on SSDs improves write performance.
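The journaling idea can be illustrated with a toy model. It is a sketch of the general write‑ahead pattern, not of Ceph's on‑disk format: the class `JournaledStore`, its capacity parameter, and the flush policy (flush when the journal would overflow) are all assumptions made for illustration.

```python
class JournaledStore:
    """Toy write-ahead journal: writes are acknowledged once journaled,
    then flushed to the slower backing store in batches."""

    def __init__(self, journal_capacity):
        self.journal_capacity = journal_capacity  # bytes the journal may hold
        self.journal = []        # (key, data) entries awaiting flush
        self.journal_bytes = 0
        self.store = {}          # stands in for the data disk

    def write(self, key, data):
        if len(data) > self.journal_capacity:
            raise ValueError("write larger than the journal")
        # Make room first if this entry would overflow the journal.
        if self.journal_bytes + len(data) > self.journal_capacity:
            self.flush()
        self.journal.append((key, data))
        self.journal_bytes += len(data)

    def flush(self):
        """Apply journaled writes to the backing store in order."""
        for key, data in self.journal:
            self.store[key] = data
        self.journal.clear()
        self.journal_bytes = 0

    def read(self, key):
        # The newest journaled entry wins over the backing store.
        for k, data in reversed(self.journal):
            if k == key:
                return data
        return self.store[key]

s = JournaledStore(journal_capacity=8)
s.write("a", b"1111")
s.write("b", b"2222")
s.write("c", b"3333")   # journal is full, so "a" and "b" are flushed first
print(sorted(s.store), [k for k, _ in s.journal])
```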
Monitor Nodes
Monitors listen on TCP 6789, store the latest cluster map, and use the Paxos algorithm for consistency. Clients download the map, compute OSD locations via CRUSH, and communicate directly with OSDs.
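Because Paxos needs a majority of monitors to agree, the number of monitors determines how many can fail before the cluster map can no longer be updated. A small sketch of that arithmetic (the function names are hypothetical):

```python
def quorum_size(monitors):
    """Smallest number of monitors that forms a strict majority."""
    if monitors < 1:
        raise ValueError("need at least one monitor")
    return monitors // 2 + 1

def tolerated_failures(monitors):
    """How many monitors can be lost while a majority still exists."""
    return monitors - quorum_size(monitors)

for n in (1, 3, 4, 5):
    print(n, "monitors -> quorum", quorum_size(n), "tolerates", tolerated_failures(n))
```

Three and four monitors both tolerate a single failure, which is why odd monitor counts are the usual choice.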
Recommended Architecture
Separate public and cluster networks to balance client I/O and inter‑OSD traffic.
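As a quick sanity check for such a split, the sketch below classifies host addresses against the two subnets. The subnets are the ones used in the ceph.conf lines of this article; the host addresses and the helpers `classify` and `host_is_dual_homed` are hypothetical.

```python
import ipaddress

# Subnets copied from the ceph.conf lines in this article.
PUBLIC_NETWORK = ipaddress.ip_network("192.168.120.0/24")
CLUSTER_NETWORK = ipaddress.ip_network("10.0.0.0/8")

def classify(address):
    """Report which of the two networks an address belongs to."""
    ip = ipaddress.ip_address(address)
    if ip in PUBLIC_NETWORK:
        return "public"
    if ip in CLUSTER_NETWORK:
        return "cluster"
    return "neither"

def host_is_dual_homed(addresses):
    """True if the host has at least one address on each network."""
    kinds = {classify(a) for a in addresses}
    return {"public", "cluster"} <= kinds

# Hypothetical addresses for one OSD host.
print(host_is_dual_homed(["192.168.120.12", "10.0.0.12"]))
```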
MDS (Metadata Server)
MDS is required only for CephFS; it caches metadata but stores it as objects on OSDs.
Simple CephFS Installation
Prepare password‑less SSH, synchronize hosts, disable firewalls, and install the ceph-deploy tool from the official repository:
yum install -y ceph-deploy
Create a working directory, generate a new cluster with node1 as the first monitor, and configure basic settings in ceph.conf (replica size, networks, etc.):
echo "osd pool default size = 3" >> ceph.conf
echo "osd pool default min size = 2" >> ceph.conf
echo "public network = 192.168.120.0/24" >> ceph.conf
echo "cluster network = 10.0.0.0/8" >> ceph.conf
Deploy monitors, OSDs, and MDS, create pools, and finally create the CephFS filesystem:
ceph-deploy mon create-initial
ceph-deploy osd prepare node2:/dev/sdb1 node3:/dev/sdb1 node4:/dev/sdb1
ceph-deploy osd activate node2:/dev/sdb1 node3:/dev/sdb1 node4:/dev/sdb1
ceph-deploy mds create node1
ceph osd pool create test1 256
ceph osd pool create test2 256
ceph fs new cephfs test2 test1
Verify the cluster status with ceph -s; a HEALTH_OK status indicates a successful deployment.
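For scripted checks, the status can be requested as JSON (`ceph -s --format json`) and parsed. The sketch below works on the JSON text only and does not call the cluster. The field names are assumptions: it looks for `health.status` and falls back to `health.overall_status`, which some older releases used, so verify against the output of your own version. The helper names are hypothetical.

```python
import json

def health_status(status_json):
    """Extract the overall health string from `ceph -s --format json` output.

    The field names used here are assumptions about the JSON layout.
    """
    doc = json.loads(status_json)
    health = doc.get("health", {})
    return health.get("status") or health.get("overall_status") or "UNKNOWN"

def is_healthy(status_json):
    """True only when the cluster reports HEALTH_OK."""
    return health_status(status_json) == "HEALTH_OK"

# Hand-written sample, not captured from a real cluster.
sample = '{"health": {"status": "HEALTH_OK"}}'
print(is_healthy(sample))  # True
```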
MaGe Linux Operations
