Recovering a Ceph 16 Cluster After System Disk Failure

This guide walks through the step‑by‑step process of restoring a Ceph 16 cluster when a node's system disk fails, covering host removal, node re‑initialization, Docker and Cephadm installation, host addition, labeling, OSD recreation, and final verification.

Ops Development Stories

This article explains how to recover a Ceph 16 cluster after a node's system disk fails. Even with RAID1, unexpected failures can bring down the MON, OSD, and MGR services on the failed node. If the MGR on that node was active, a standby node will take over.

Remove the Faulty Host

When a node cannot boot, remove it from the cluster by running the following on a healthy node (in this example the failed host is named node4):

ceph orch host rm node4 --offline --force
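Before moving on, it can help to confirm that the orchestrator no longer lists the failed host. The following is a minimal sketch, assuming the hostname appears in the first column of `ceph orch host ls`; the helper name `host_listed` is hypothetical and not part of Ceph.

```shell
#!/bin/sh
# Read "ceph orch host ls" output on stdin and succeed if HOST
# appears in the first column. (host_listed is a hypothetical helper.)
host_listed() {
    awk -v h="$1" '$1 == h { found = 1 } END { exit !found }'
}

# Example, run from a healthy node:
# if ceph orch host ls | host_listed node4; then
#     echo "node4 is still registered"
# fi
```

The helper only reads text from stdin, so it can be tried against saved command output without touching the cluster.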

Re‑initialize the Node

Replace the failed system disk, reinstall the OS, rename the host (e.g., to node1), assign a new IP, and update /etc/hosts on all three Ceph nodes:

192.168.1.1 node1
192.168.1.2 node2
192.168.1.3 node3
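Since the same entries have to be applied on every node, an idempotent helper avoids duplicate lines when the step is repeated. This is a sketch only: `ensure_host_entry` is a hypothetical name, and it does not rewrite an existing entry whose IP is stale, so an old line for the same hostname still has to be corrected by hand.

```shell
#!/bin/sh
# Append "IP NAME" to a hosts file unless NAME already appears as a
# hostname field on a non-comment line. (Hypothetical helper.)
ensure_host_entry() {
    hosts_file=$1 ip=$2 name=$3
    if awk -v n="$name" '
        !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$hosts_file"; then
        return 0    # entry already present, nothing to do
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Example (as root, on each Ceph node):
# ensure_host_entry /etc/hosts 192.168.1.1 node1
# ensure_host_entry /etc/hosts 192.168.1.2 node2
# ensure_host_entry /etc/hosts 192.168.1.3 node3
```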

Add the Ceph public key to the new host:

ssh-copy-id -f -i /etc/ceph/ceph.pub node1

Install Docker

curl -sSL https://get.daocloud.io/docker | sh
systemctl daemon-reload
systemctl restart docker
systemctl enable docker

Install cephadm and ceph‑common

curl --silent --remote-name --location https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release pacific
./cephadm install
./cephadm install ceph-common
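A quick check that the required tools are on the PATH can catch a failed install before the host is added back. The sketch below assumes the rebuilt node needs `docker`, `cephadm`, `ceph`, `python3`, and `lvm`; the helper name `missing_cmds` is hypothetical.

```shell
#!/bin/sh
# Print each command that is not found on PATH.
# Returns 0 if all are present, 1 otherwise. (Hypothetical helper.)
missing_cmds() {
    rc=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            rc=1
        fi
    done
    return "$rc"
}

# Example on the rebuilt node:
# missing_cmds docker cephadm ceph python3 lvm
```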

Add the New Host to the Cluster

ceph orch host add node1

Verify the host list:

ceph orch host ls

The host will receive MON and crash services automatically, but it cannot manage the cluster until it has the admin keyring. Add the special _admin label to the host so that cephadm distributes ceph.conf and the admin keyring to it:

ceph orch host label add node1 _admin
# or during addition
ceph orch host add node1 --labels=_admin
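Once the label is applied, cephadm should place ceph.conf and ceph.client.admin.keyring under /etc/ceph on the host, though this may take a short while. A minimal sketch for checking that both files have arrived, using a hypothetical helper named `admin_files_present`:

```shell
#!/bin/sh
# Succeed only if both files distributed to _admin hosts exist and are
# non-empty in the given directory (default /etc/ceph).
# (admin_files_present is a hypothetical helper.)
admin_files_present() {
    dir=${1:-/etc/ceph}
    for f in ceph.conf ceph.client.admin.keyring; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            return 1
        fi
    done
    return 0
}

# Example on node1:
# admin_files_present && ceph -s
```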

Create and Activate a New OSD

Create an empty OSD (returns ID 2):

ceph osd create
2

Activate the BlueStore tmpfs directory for the OSD, substituting the OSD ID and the OSD fsid:

ceph-volume lvm activate (osdid) (fsid)

Add the OSD's authentication key and CRUSH map entry, then start the OSD:

ceph auth add osd.2 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-2/keyring
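The command above covers only the authentication step. The CRUSH map step is not shown in the original, so the following is a sketch under assumptions: it uses `ceph osd crush add` with a placeholder weight of 1.0 (by convention the weight is roughly the device size in TiB) and the host bucket node1. The `run` and `crush_add_osd` helpers are hypothetical. How the daemon is then started depends on how the OSD was deployed, so that part is left out.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Place osd.<id> under the given host bucket with the given CRUSH weight.
# The weight here is a placeholder, not a value from the original article.
crush_add_osd() {
    osd_id=$1 weight=$2 host=$3
    run ceph osd crush add "osd.$osd_id" "$weight" "host=$host"
}

# Example: preview the command first, then run it for real.
# DRY_RUN=1 crush_add_osd 2 1.0 node1
# crush_add_osd 2 1.0 node1
```

Previewing with DRY_RUN=1 makes it easy to double-check the OSD ID and host bucket before the CRUSH map is changed.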

If the OSD daemon is not managed by cephadm, delete the OSD, format the underlying disk, and re‑add it.

# View OSD container ID (optional)
ceph orch ps --daemon_type osd
# Remove OSD from cluster
ceph osd out 2
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2
# Clean up device mappings (caution: remove_all targets every device-mapper device on the host)
dmsetup status
dmsetup remove_all
# Format the disk
mkfs -t ext4 /dev/vdb

Re‑add the OSD to the cluster:

ceph orch daemon add osd node1:/dev/vdb

After these steps the Ceph cluster returns to normal operation.
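Recovery and backfill can take a while, so a small polling loop is one way to confirm that the cluster actually reaches HEALTH_OK. This is a sketch, not part of the original procedure: `wait_for_health_ok` is a hypothetical helper, and the default of 30 checks at 10-second intervals is an arbitrary assumption.

```shell
#!/bin/sh
# Poll "ceph health" until it reports HEALTH_OK or the attempts run out.
# Usage: wait_for_health_ok [attempts] [delay_seconds]
# (Hypothetical helper; the defaults are arbitrary.)
wait_for_health_ok() {
    attempts=${1:-30}
    delay=${2:-10}
    i=1
    while [ "$i" -le "$attempts" ]; do
        status=$(ceph health 2>/dev/null)
        case $status in
            HEALTH_OK*)
                echo "cluster healthy after $i check(s)"
                return 0
                ;;
        esac
        echo "check $i/$attempts: ${status:-no response}"
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Example, from any node with the admin keyring:
# wait_for_health_ok 60 10 && ceph osd tree
```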

Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
