Recovering a Ceph 16 Cluster After System Disk Failure
This guide walks through the step‑by‑step process of restoring a Ceph 16 cluster when a node's system disk fails, covering host removal, node re‑initialization, Docker and Cephadm installation, host addition, labeling, OSD recreation, and final verification.
Even with RAID1, an unexpected failure can take a node down, and with it the MON, OSD, and MGR services it hosts. If the active MGR was on that node, a standby MGR on another node takes over.
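Before removing anything, it can help to confirm from a healthy node which OSDs the cluster currently reports as down. The following is a minimal sketch, assuming the default plain-text layout of `ceph osd tree`, in which the status column directly follows the `osd.N` name. The helper name `list_down_osds` is hypothetical and not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph.
# Reads `ceph osd tree` output on stdin and prints the name of every OSD
# whose status field (the field right after "osd.N") is "down".
list_down_osds() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") print $i
        }
    }'
}
```

On a healthy node this could be used as `ceph osd tree | list_down_osds`. Running `ceph orch ps` alongside it lists the daemons cephadm knows about and their state.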
Remove the Faulty Host
When a node cannot boot, remove it from the cluster from another healthy node:
ceph orch host rm node4 --offline --force
Re‑initialize the Node
Replace the failed system disk, reinstall the OS, rename the host (e.g., to node1), assign a new IP, and update /etc/hosts on all three Ceph nodes:
192.168.1.1 node1
192.168.1.2 node2
192.168.1.3 node3
Add the Ceph public key to the new host:
ssh-copy-id -f -i /etc/ceph/ceph.pub node1
Install Docker
curl -sSL https://get.daocloud.io/docker | sh
systemctl daemon-reload
systemctl restart docker
systemctl enable docker
Install cephadm and ceph‑common
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/pacific/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release pacific
./cephadm install
./cephadm install ceph-common
Add the New Host to the Cluster
ceph orch host add node1
Verify the host list:
ceph orch host ls
The host will receive MON and crash services automatically, but it cannot be used to manage the cluster until it has the admin keyring. Add the special _admin label to the host so that cephadm distributes ceph.conf and the admin keyring to it:
ceph orch host label add node1 _admin
# or during addition
ceph orch host add node1 --labels=_admin
Create and Activate a New OSD
Create an empty OSD (returns ID 2):
ceph osd create
2
Activate the BlueStore tmpfs directory for the OSD, passing the OSD ID and the OSD's fsid:
ceph-volume lvm activate (osdid) (fsid)
Add authentication and crush map, then start the OSD:
ceph auth add osd.2 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-2/keyring
If the OSD daemon is not managed by cephadm, delete the OSD, format the underlying disk, and re‑add it.
# View OSD container ID (optional)
ceph orch ps --daemon_type osd
# Remove OSD from cluster
ceph osd out 2
ceph osd crush remove osd.2
ceph auth del osd.2
ceph osd rm 2
# Clean up device mappings
dmsetup status
dmsetup remove_all
# Format the disk
mkfs -t ext4 /dev/vdb
Re‑add the OSD to the cluster:
ceph orch daemon add osd node1:/dev/vdb
After these steps, the Ceph cluster returns to normal operation.
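The summary mentions a final verification step, but no commands are shown for it. The following is a minimal sketch, assuming it runs on a node that holds the admin keyring. The helper names `health_is_ok` and `wait_for_health_ok` and the default retry values are hypothetical and not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helpers for a post-recovery check; not part of Ceph itself.

# Succeeds only if the given `ceph health` output starts with HEALTH_OK.
health_is_ok() {
    case "$1" in
        HEALTH_OK*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Polls `ceph health` until it reports HEALTH_OK or the attempts run out.
# $1 = number of attempts (default 30), $2 = seconds between attempts (default 10).
wait_for_health_ok() {
    _attempts="${1:-30}"
    _delay="${2:-10}"
    _i=1
    _status=""
    while [ "$_i" -le "$_attempts" ]; do
        # Treat a failing `ceph health` call as "no output" and keep polling.
        _status="$(ceph health 2>/dev/null)" || _status=""
        if health_is_ok "$_status"; then
            echo "cluster healthy after $_i check(s)"
            return 0
        fi
        sleep "$_delay"
        _i=$((_i + 1))
    done
    echo "cluster still not healthy: ${_status:-no output}" >&2
    return 1
}
```

After the OSD has been re-added, one might call `wait_for_health_ok` and then run `ceph orch host ls`, `ceph orch ps`, and `ceph osd tree` to confirm that node1 and its daemons are listed. Backfill on a large cluster can take far longer than the default polling window, so the attempt count and delay may need to be raised.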
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
