
How I Recovered a Crashed Ceph Cluster: A Complete Rescue Guide

This guide walks through Ceph’s architecture, deployment with cephadm, hardware selection, common failure scenarios, and practical performance tuning steps, offering concrete commands and best‑practice recommendations to keep a Ceph cluster stable and efficient.

AI Agent Super App

How Ceph Works

Ceph aggregates the disks of many servers into a single logical storage pool. Example: ten machines with eight 8 TB drives each appear as one 640 TB raw pool (usable capacity is lower once replication is applied) that simultaneously provides block, file, and object interfaces.

Four‑layer architecture

Bottom layer – RADOS : foundation where all data resides, composed of OSDs (Object Storage Daemons) and MONs (Monitors).

Middle layer – LIBRADOS : library used by applications (C/C++/Python) to talk to RADOS.

Top services : RGW (object gateway, S3/Swift compatible), RBD (block device for VMs), CephFS (POSIX‑compatible file system).

Management components : MGR (statistics and monitoring) and MDS (metadata server, required only for CephFS).

CRUSH algorithm – locating data

Client computes placement in two steps:

Object → Placement Group (PG): the object name is hashed and the hash is reduced modulo the pool's PG count, which yields the PG.

PG → OSDs: CRUSH calculates which OSDs should store the PG.
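The two steps above can be sketched in a few lines of Python. This is a toy model only: crc32 and rendezvous hashing stand in for Ceph's own hash and for CRUSH, and every function name here is hypothetical. It shows why any client can compute placement locally without asking a central lookup service.

```python
import hashlib
import zlib


def object_to_pg(object_name: str, pg_num: int) -> int:
    """Step 1: hash the object name and reduce it modulo the PG count.

    Ceph uses its own hash and a "stable mod"; crc32 and a plain
    modulo are stand-ins for illustration.
    """
    return zlib.crc32(object_name.encode()) % pg_num


def pg_to_osds(pg_id: int, osd_ids: list[int], replicas: int = 3) -> list[int]:
    """Step 2 (toy version): rank OSDs by a per-PG pseudo-random score.

    This is rendezvous hashing, NOT the real CRUSH algorithm. It ignores
    weights, hosts, racks and failure domains.
    """
    def score(osd_id: int) -> int:
        digest = hashlib.sha256(f"{pg_id}:{osd_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(osd_ids, key=score, reverse=True)[:replicas]


def locate(object_name: str, pg_num: int, osd_ids: list[int]) -> tuple[int, list[int]]:
    """Combine both steps: object name -> (PG id, ordered OSD list)."""
    pg_id = object_to_pg(object_name, pg_num)
    return pg_id, pg_to_osds(pg_id, osd_ids)


if __name__ == "__main__":
    pg, osds = locate("vm-disk-0001.chunk42", pg_num=128, osd_ids=list(range(24)))
    print(f"PG {pg} -> primary osd.{osds[0]}, replicas {osds[1:]}")
```

The first OSD in the returned list plays the role of the primary. Because the result depends only on the object name, the PG count, and the OSD list, every client arrives at the same answer.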

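PG count formula: PG_total = (OSD_count × 100) / replica_count, rounded to a power of two. Example: 24 OSDs with 3 replicas → 800, which rounds to 1024 PGs. This total is shared across all pools; on recent releases the pg_autoscaler module can manage per-pool PG counts automatically.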
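As a worked version of the PG formula, here is a small Python helper. The function name is hypothetical, and rounding to the nearest power of two is an assumption (some guidance rounds up instead; both agree for this example).

```python
def recommended_pg_total(osd_count: int, replica_count: int, target_per_osd: int = 100) -> int:
    """Apply (OSD count x 100) / replica count, then round to a power of two."""
    raw = osd_count * target_per_osd / replica_count

    # Find the largest power of two that does not exceed the raw value.
    lower = 1
    while lower * 2 <= raw:
        lower *= 2
    upper = lower * 2

    # Pick whichever neighbouring power of two is closer (ties go up).
    return lower if raw - lower < upper - raw else upper


if __name__ == "__main__":
    print(recommended_pg_total(24, 3))  # raw value is 800
```

For 24 OSDs and 3 replicas the raw value is 800; the neighbouring powers of two are 512 and 1024, and 1024 is closer.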

Data write flow

Client splits the data into objects (4 MiB by default for RBD and CephFS).

Each object is hashed to a PG.

CRUSH selects three OSDs (assuming 3‑way replication).

Client writes directly to the primary OSD, which replicates to the two secondary OSDs.

Clients communicate directly with OSDs without a proxy or gateway, which enables high performance.
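The first step of the write flow, splitting data into fixed-size objects, can be illustrated as follows. This is a minimal sketch; the function name and the tuple layout are assumptions for illustration and not Ceph's striping code.

```python
MIB = 1024 * 1024


def split_into_objects(file_size: int, object_size: int = 4 * MIB) -> list[tuple[int, int, int]]:
    """Return (object_index, offset, length) for each object of a file."""
    chunks = []
    offset = 0
    index = 0
    while offset < file_size:
        # The last object may be shorter than object_size.
        length = min(object_size, file_size - offset)
        chunks.append((index, offset, length))
        offset += length
        index += 1
    return chunks


if __name__ == "__main__":
    for index, offset, length in split_into_objects(10 * MIB):
        print(f"object {index}: offset={offset} length={length}")
```

A 10 MiB file yields two full 4 MiB objects and one 2 MiB tail; each of those objects is then hashed to a PG independently, which is what spreads a single file across many OSDs.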

Deploying a Ceph Cluster with cephadm

Environment preparation

3 servers, each with at least 8 GB RAM and 2 data disks.

OS: Ubuntu 22.04 or CentOS Stream 9.

Network: minimum 10 GbE, front‑end and back‑end networks separated.

Time sync: configure chrony or NTP on all nodes.

Kernel optimisation

# /etc/sysctl.conf optimisation
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432
net.core.somaxconn = 4096
vm.swappiness = 10
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
fs.file-max = 2000000
# After saving the file, apply the settings from a root shell:
sysctl -p
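To confirm the settings took effect on a node, a short Python check can compare the desired values with what the kernel reports under /proc/sys. This is a sketch under assumptions: the helper names are hypothetical, and the dot-to-slash path mapping is only valid for keys like the ones listed above (it breaks for keys that embed dotted interface names).

```python
from pathlib import Path


def parse_sysctl(text: str) -> dict[str, str]:
    """Parse 'key = value' lines, skipping comments and non-assignments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Normalise whitespace so "4096  87380" and "4096\t87380" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key: str) -> Path:
    """Map a sysctl key such as vm.swappiness to its /proc/sys file."""
    return Path("/proc/sys") / key.replace(".", "/")


def find_mismatches(wanted: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {key: (current, wanted)} for keys whose live value differs."""
    mismatches = {}
    for key, value in wanted.items():
        path = proc_path(key)
        if not path.exists():
            continue  # key not available on this kernel or platform
        current = " ".join(path.read_text().split())
        if current != value:
            mismatches[key] = (current, value)
    return mismatches


if __name__ == "__main__":
    wanted = parse_sysctl(Path("/etc/sysctl.conf").read_text())
    for key, (current, target) in find_mismatches(wanted).items():
        print(f"{key}: current={current} wanted={target}")
```

Running it on each node after sysctl -p prints nothing when every listed key already matches.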

Bootstrap the cluster

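# Install cephadm (Ubuntu shown; use dnf on CentOS Stream)
apt install -y cephadm
# Bootstrap the cluster (run on the first MON node)
cephadm bootstrap --mon-ip 192.168.1.10
# The ceph commands below run inside "cephadm shell" or after "cephadm install ceph-common"
# Copy the cluster's SSH public key to each new host before adding it
ssh-copy-id -f -i /etc/ceph/ceph.pub root@node2
ssh-copy-id -f -i /etc/ceph/ceph.pub root@node3
# Add the other hosts
ceph orch host add node2 192.168.1.11
ceph orch host add node3 192.168.1.12
# Add OSDs (repeat for each host and each data disk)
ceph orch daemon add osd node1:/dev/sdb
ceph orch daemon add osd node2:/dev/sdb
ceph orch daemon add osd node3:/dev/sdb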

Network separation configuration

[global]
public_network = 192.168.1.0/24
cluster_network = 10.0.10.0/24

Server hardware selection guide

OSD node core requirements

CPU : at least 2 × 2 GHz cores per OSD; NVMe drives need 4 cores per drive. Recommended: Intel Xeon Silver or AMD EPYC.

Memory : minimum 4 GB per OSD (BlueStore). Official baseline: 16 GB + 5 GB per OSD. By that baseline a node with 24 OSDs needs 16 + 24 × 5 = 136 GB, so 128 GB is slightly short; provision the next size up.

Data disks :

Capacity‑oriented: enterprise SATA/SAS HDD (7200 RPM) – e.g., Seagate Exos, Western Digital Ultrastar.

Performance‑oriented: enterprise NVMe SSD – e.g., Intel, Samsung, Micron data‑center models.

Avoid consumer SSDs (e.g., Samsung 870 EVO): their DWPD rating is too low for Ceph's write load, they typically lack power‑loss protection, and they can wear out within months.
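The memory baseline quoted above (16 GB plus 5 GB per OSD) is easy to check per node. A minimal sketch, with a hypothetical function name:

```python
def node_memory_gb(osd_count: int, base_gb: int = 16, per_osd_gb: int = 5) -> int:
    """Baseline RAM for an OSD node: fixed base plus a per-OSD allowance."""
    return base_gb + per_osd_gb * osd_count


if __name__ == "__main__":
    for osds in (8, 12, 24):
        print(f"{osds} OSDs -> {node_memory_gb(osds)} GB")
```

For a 24-OSD node this gives 136 GB, which is why a 128 GB configuration falls just under the baseline.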

Cache/DB disk configuration (BlueStore)

block.db : stores metadata; size 1‑4 % of the data‑disk capacity (1 % enough for RBD, 4 % for object storage).

block.wal : write‑ahead log; 10‑30 GB is sufficient. A separate WAL device is worthwhile only when it is faster than the DB device; otherwise the WAL is simply kept on the DB device.

A 7.68 TB NVMe can be partitioned and shared among multiple OSDs for DB/WAL.
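To see how far a shared NVMe stretches, the 1‑4 % rule can be turned into a quick calculation. This is a sketch only: the 16 TB data-disk size is an assumed example that does not come from the article, decimal gigabytes are used throughout, and the function names are hypothetical.

```python
def block_db_gb(data_disk_gb: int, percent: int) -> int:
    """block.db size as a whole-number percentage of the data disk."""
    return data_disk_gb * percent // 100


def osds_per_db_device(db_device_gb: int, data_disk_gb: int, percent: int, wal_gb: int = 0) -> int:
    """How many OSDs one shared DB/WAL device can serve."""
    per_osd = block_db_gb(data_disk_gb, percent) + wal_gb
    if per_osd <= 0:
        raise ValueError("per-OSD DB/WAL size must be positive")
    return db_device_gb // per_osd


if __name__ == "__main__":
    nvme_gb = 7680        # the 7.68 TB NVMe from the text
    hdd_gb = 16000        # assumed 16 TB data disks
    for pct in (1, 4):
        size = block_db_gb(hdd_gb, pct)
        count = osds_per_db_device(nvme_gb, hdd_gb, pct)
        print(f"{pct}% -> {size} GB per OSD, {count} OSDs per NVMe")
```

Keep in mind that every OSD sharing the device goes down together if that NVMe fails, so the arithmetic maximum is an upper bound and not a target.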

Network requirements

Minimum: 10 GbE; 1 GbE must not be used in production.

Recommended: 25 GbE or higher, with dual‑NIC bonding.

Front‑end network (public) for client traffic; back‑end network (cluster) for OSD replication/recovery.

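Backend bandwidth should be ≥ frontend bandwidth × replica count (e.g., at least 3× for 3‑way replication): every client write is forwarded by the primary OSD to the other replicas over the cluster network, and recovery and backfill traffic share the same links.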
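The backend sizing rule reduces to one multiplication; the sketch below also shows the replication-only share so the headroom is visible. Function names are hypothetical.

```python
def min_backend_gbps(frontend_gbps: float, replica_count: int) -> float:
    """Rule of thumb from the text: frontend bandwidth times replica count."""
    return frontend_gbps * replica_count


def replication_gbps(frontend_gbps: float, replica_count: int) -> float:
    """Traffic the primary forwards: one copy per additional replica."""
    return frontend_gbps * (replica_count - 1)


if __name__ == "__main__":
    front = 10.0
    print("rule of thumb:", min_backend_gbps(front, 3), "Gb/s")
    print("replication alone:", replication_gbps(front, 3), "Gb/s")
```

The gap between the two numbers is what remains for recovery and backfill when clients are saturating the public network with writes.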

MON nodes

CPU: 4 cores.

Memory: 4‑8 GB.

System disk: SSD in RAID 1.

Quantity: 3 or 5 (odd number for quorum).
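Why an odd number of MONs: quorum needs a strict majority, so an even count adds a machine without adding failure tolerance. A short sketch with hypothetical function names:

```python
def quorum_size(mon_count: int) -> int:
    """Smallest strict majority of the monitor set."""
    return mon_count // 2 + 1


def tolerated_failures(mon_count: int) -> int:
    """How many MONs can fail while a majority remains."""
    return mon_count - quorum_size(mon_count)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} MONs: quorum={quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Three and four monitors both tolerate a single failure, while five tolerate two.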

Daily operations – fault diagnosis

Quick‑diagnosis command checklist

ceph -s               # cluster status overview
ceph osd df           # OSD capacity usage
ceph osd perf         # OSD performance metrics
ceph pg stat          # PG statistics
ceph pg dump_stuck unclean # stuck PGs
ceph -w               # real‑time monitoring
ceph progress         # recovery progress
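For scripted checks, the same status is available as JSON via ceph -s --format json. The sketch below summarises the health section; the sample document is hand-written for illustration, the function name is hypothetical, and the exact JSON layout should be treated as an assumption that may vary between releases.

```python
import json


def summarize_health(status_json: str) -> tuple[str, list[str]]:
    """Return (overall status, list of 'CHECK: message') from status JSON."""
    health = json.loads(status_json).get("health", {})
    status = health.get("status", "UNKNOWN")
    messages = [
        f"{name}: {check.get('summary', {}).get('message', '')}"
        for name, check in health.get("checks", {}).items()
    ]
    return status, messages


# Hand-written sample, NOT captured from a real cluster.
SAMPLE = json.dumps({
    "health": {
        "status": "HEALTH_WARN",
        "checks": {
            "OSD_DOWN": {
                "severity": "HEALTH_WARN",
                "summary": {"message": "1 osds down"},
            }
        },
    }
})

if __name__ == "__main__":
    status, messages = summarize_health(SAMPLE)
    print(status)
    for message in messages:
        print(" ", message)
```

In practice the input would come from the command's standard output, for example piped into the script, instead of the embedded sample.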

Handling OSD down

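Check logs: journalctl -u ceph-osd@X -n 100

Check disk health: smartctl -a /dev/sdX

If the process crashed: systemctl restart ceph-osd@X

If the disk failed: follow the replacement procedure.

On cephadm‑managed hosts the systemd units are named ceph-FSID@osd.X rather than ceph-osd@X; there, cephadm logs --name osd.X and ceph orch daemon restart osd.X are the equivalents.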

Full disk replacement procedure

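# Mark the OSD out so its data is re-replicated elsewhere
ceph osd out osd.5
# Stop the daemon, which also marks it down (cephadm: ceph orch daemon stop osd.5)
systemctl stop ceph-osd@5
# Wait for data recovery to finish (all PGs active+clean)
ceph -w
# Confirm the OSD can be removed without reducing durability
ceph osd safe-to-destroy osd.5
# Remove the old OSD: CRUSH entry first, then its key, then the OSD id
ceph osd crush rm osd.5
ceph auth del osd.5
ceph osd rm osd.5
# Add new disk and redeploy
ceph orch daemon add osd node2:/dev/sdf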

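If the replacement disk is already on hand and the swap will be quick, recovery can instead be paused beforehand with ceph osd set norecover (the flag is norecover, not norecovery) and resumed with ceph osd unset norecover once the new disk is in place. In that case skip the wait for recovery, and accept reduced redundancy for the duration of the swap.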

MON quorum loss

Stop all Ceph services: systemctl stop ceph.target.

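Start the remaining MONs: systemctl start ceph-mon@hostname. If fewer than a majority of the monitors listed in the monmap survive, they cannot form quorum on their own: with the MON stopped, extract its monmap (ceph-mon -i ID --extract-monmap FILE), remove the dead monitors with monmaptool FILE --rm NAME, inject the edited map (ceph-mon -i ID --inject-monmap FILE), and then start it.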

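If a single MON’s store.db is corrupted while the others are healthy, the safer route is to remove that MON and redeploy it so that it resynchronises from the quorum; copying store.db from another MON is a fallback and requires both daemons to be stopped.

As a last resort, rebuild the MON database from the OSDs (ceph-objectstore-tool with --op update-mon-db, followed by ceph-monstore-tool rebuild).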

Backup MON maps regularly: ceph mon getmap -o /backup/mon.map.

PG inconsistency

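Deep scrub first: ceph pg deep-scrub PG_ID (reads every replica and verifies each object’s checksum).

Inspect what is inconsistent: rados list-inconsistent-obj PG_ID --format=json-pretty.

Repair: ceph pg repair PG_ID.

If repair fails, manually removing the corrupted replica so that Ceph re‑replicates from healthy copies is a last resort; confirm first which copy is actually bad, and keep an export of the PG before deleting anything.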

Performance tuning in practice

OSD parameter optimisation

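# Per-OSD memory budget; BlueStore sizes its caches to stay within it
ceph config set osd osd_memory_target 4G
# Fixed cache sizes; mainly relevant when bluestore_cache_autotune is disabled
ceph config set osd bluestore_cache_size_ssd 3G
ceph config set osd bluestore_cache_size_hdd 1G
# Client-side RBD cache; keep target_dirty < max_dirty < cache_size
ceph config set client rbd_cache_size 512M
ceph config set client rbd_cache_max_dirty 256M
ceph config set client rbd_cache_target_dirty 128M
# Faster recovery at the expense of client I/O; consider reverting once recovery completes
ceph config set osd osd_recovery_max_active 5
ceph config set osd osd_recovery_sleep 0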
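The three RBD cache values above have to nest: the dirty target sits below the dirty maximum, which sits below the cache size. A small sanity check, with hypothetical helper names and a simplified size parser (K/M/G suffixes only):

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def to_bytes(value: str) -> int:
    """Convert strings such as '512M' or '4G' to bytes (binary units)."""
    value = value.strip().upper()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


def rbd_cache_is_consistent(cache_size: str, max_dirty: str, target_dirty: str) -> bool:
    """True when target_dirty < max_dirty < cache_size."""
    return to_bytes(target_dirty) < to_bytes(max_dirty) < to_bytes(cache_size)


if __name__ == "__main__":
    print(rbd_cache_is_consistent("512M", "256M", "128M"))
```

A check like this is cheap to run before pushing a new set of values to every client.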

Pool planning

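# High‑performance replicated pool
ceph osd pool create rbd_pool 128 128
ceph osd pool set rbd_pool size 3
ceph osd pool set rbd_pool min_size 2
# Tag the pool for RBD use
ceph osd pool application enable rbd_pool rbd
# Erasure‑coded pool (space saving); k=4 m=2 needs six failure domains
# (hosts by default), so it does not fit a three‑node cluster as‑is
ceph osd erasure-code-profile set ec42 k=4 m=2
ceph osd pool create ec_pool 128 128 erasure ec42
# PG total across all pools = (OSD count × 100) / replica count,
# rounded to a power of two
# Example: 24 OSDs, 3 replicas → 24*100/3 = 800 → 1024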
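The space saving of the erasure-coded pool can be quantified by comparing usable fractions. A minimal sketch with hypothetical function names; the 600 TB raw figure is an arbitrary example.

```python
from fractions import Fraction


def replicated_efficiency(size: int) -> Fraction:
    """Usable fraction of raw capacity with N-way replication."""
    return Fraction(1, size)


def erasure_efficiency(k: int, m: int) -> Fraction:
    """Usable fraction with k data chunks and m coding chunks."""
    return Fraction(k, k + m)


def usable_tb(raw_tb: int, efficiency: Fraction) -> float:
    """Usable capacity before leaving headroom for nearfull thresholds."""
    return float(raw_tb * efficiency)


if __name__ == "__main__":
    raw = 600  # arbitrary example, in TB
    print("3x replication:", usable_tb(raw, replicated_efficiency(3)), "TB")
    print("EC k=4 m=2:   ", usable_tb(raw, erasure_efficiency(4, 2)), "TB")
```

With k=4 and m=2 two thirds of raw capacity is usable, against one third for 3‑way replication, while both layouts survive the loss of two chunks or copies.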

Network layer optimisation

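Enable Jumbo Frames (MTU 9000) to reduce CPU interrupt overhead; the MTU must match on every NIC and switch port in the path, otherwise large packets can be dropped.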

Activate multi‑queue on NICs: ethtool -L eth0 combined 8.

Bind OSD threads to specific CPU cores to avoid cross‑core migration.

Disable NIC offload features (TSO/GSO/GRO) on drivers where they degrade performance.

Disk I/O optimisation

IO scheduler: mq-deadline for HDDs, none for SSD/NVMe.

Disable disk power‑saving: hdparm -B 255 /dev/sdX.

Increase readahead: blockdev --setra 4096 /dev/sdX.

Ensure firmware is up‑to‑date; older firmware may contain performance bugs.
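One detail about the readahead setting above: blockdev --setra counts 512‑byte sectors, whereas the sysfs file /sys/block/sdX/queue/read_ahead_kb uses KiB. A tiny conversion sketch, with hypothetical function names:

```python
SECTOR_BYTES = 512


def readahead_kib(sectors: int) -> int:
    """Convert a blockdev --setra value (512-byte sectors) to KiB."""
    return sectors * SECTOR_BYTES // 1024


def sectors_for_kib(kib: int) -> int:
    """Convert a read_ahead_kb value (KiB) to 512-byte sectors."""
    return kib * 1024 // SECTOR_BYTES


if __name__ == "__main__":
    print(readahead_kib(4096), "KiB")
```

So the value 4096 used above corresponds to 2048 KiB, that is 2 MiB of readahead.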

Key monitoring metrics

OSD apply latency > 50 ms – investigate immediately.

Stuck inactive or stuck unclean PGs – address without delay.

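PGs in recovery_toofull – recovery is blocked (not merely throttled) because a target OSD is over its full threshold; free up or add capacity.

Capacity watermarks: nearfull (85 %) warning, backfillfull (90 %) stops backfill onto that OSD, full (95 %) blocks writes.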

Backend network utilisation > 80 % – plan expansion.
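The thresholds in this list translate directly into alert rules. A minimal sketch using the numbers from the text as defaults; the function names are hypothetical and a real deployment would feed these from its monitoring stack.

```python
def capacity_state(used_fraction: float, nearfull: float = 0.85, full: float = 0.95) -> str:
    """Classify OSD utilisation against the nearfull and full watermarks."""
    if used_fraction >= full:
        return "full"
    if used_fraction >= nearfull:
        return "nearfull"
    return "ok"


def latency_needs_attention(apply_latency_ms: float, threshold_ms: float = 50.0) -> bool:
    """True when OSD apply latency exceeds the investigation threshold."""
    return apply_latency_ms > threshold_ms


def network_needs_expansion(utilisation: float, threshold: float = 0.80) -> bool:
    """True when backend network utilisation exceeds the planning threshold."""
    return utilisation > threshold


if __name__ == "__main__":
    print(capacity_state(0.87), latency_needs_attention(62.0), network_needs_expansion(0.55))
```

Keeping the thresholds as parameters makes it easy to tighten them for clusters that need earlier warning.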

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
