Master Ceph: Essential Operations Guide for Storage Engineers
This free Ceph operations manual outlines common administrative tasks, troubleshooting techniques, and advanced topics, providing storage engineers with a comprehensive reference for managing, monitoring, and optimizing Ceph clusters in production environments.
Ceph Overview
Ceph is an open‑source, self‑healing and self‑managing distributed storage system written in C++. It is widely used as a core storage technology in modern data‑center and cloud environments.
Common Administrative Operations
Start, stop, and restart Ceph daemons (MON, OSD, MDS, RGW) using systemctl or the ceph CLI.
Monitor cluster health and status with ceph health, ceph -s, and the dashboard.
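The health commands above can be combined into a quick status sweep; this sketch assumes a reachable cluster and an admin keyring on the host:

```shell
# Quick health check (requires a running cluster and an admin keyring)
ceph health            # HEALTH_OK / HEALTH_WARN / HEALTH_ERR
ceph health detail     # expanded explanation of any warnings
ceph -s                # one-shot cluster status summary
ceph -w                # follow cluster events live (Ctrl-C to exit)
ceph df                # raw and per-pool capacity usage
```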
Manage users and authentication keys via ceph auth (create, delete, modify caps).
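A typical `ceph auth` workflow looks like the following; the user name `client.app1` and pool name `app-data` are examples, not names from the source:

```shell
# Create a client user limited to one pool and write out its keyring
ceph auth get-or-create client.app1 \
    mon 'allow r' \
    osd 'allow rw pool=app-data' \
    -o /etc/ceph/ceph.client.app1.keyring

ceph auth list                     # list all users and their caps
ceph auth caps client.app1 \
    mon 'allow r' osd 'allow r pool=app-data'   # tighten caps in place
ceph auth del client.app1          # delete the user and its key
```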
Add or remove MON nodes: update the monmap, adjust ceph.conf, and run ceph mon add or ceph mon remove.
Add or remove OSDs: prepare disks with ceph-volume lvm create (ceph-deploy is deprecated in recent releases) and start the OSD daemon; remove OSDs with ceph osd out, then stop the daemon and run ceph osd crush remove, ceph auth del, and ceph osd rm.
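The add/remove sequence above can be sketched as follows; the device path `/dev/sdb` and OSD id `12` are examples:

```shell
# Add an OSD on a raw device
ceph-volume lvm create --data /dev/sdb

# Remove OSD 12: drain it first, then take it out of CRUSH and the cluster
ceph osd out 12                    # trigger data migration off the OSD
ceph osd safe-to-destroy osd.12    # wait until this reports safe (Luminous+)
systemctl stop ceph-osd@12
ceph osd crush remove osd.12
ceph auth del osd.12
ceph osd rm 12
```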
Create, modify, and delete storage pools using ceph osd pool create, ceph osd pool delete, and ceph osd pool set for parameters such as size, min_size, pg_num, and pgp_num.
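A minimal pool lifecycle, assuming a replicated pool named `rbd-pool` with 128 PGs (both are example values, not from the source):

```shell
# Create a replicated pool and set its replication parameters
ceph osd pool create rbd-pool 128 128
ceph osd pool set rbd-pool size 3        # keep three replicas
ceph osd pool set rbd-pool min_size 2    # serve I/O with at least two replicas
ceph osd pool get rbd-pool all           # inspect all pool parameters

# Deletion requires the mon flag plus explicit double confirmation
ceph tell mon.\* injectargs --mon-allow-pool-delete=true
ceph osd pool delete rbd-pool rbd-pool --yes-i-really-really-mean-it
```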
Update cluster configuration (ceph.conf or, on recent releases, the central config database) and apply changes at runtime with ceph tell ... injectargs or ceph config set, or by restarting the affected daemons.
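Runtime configuration changes can be applied without a restart; the option `osd_max_backfills` and its value are illustrative, not from the source:

```shell
# Inject a setting into running daemons (takes effect immediately)
ceph tell osd.\* injectargs '--osd_max_backfills=2'

# Or persist it in the central config database (Mimic and later)
ceph config set osd osd_max_backfills 2
ceph config get osd osd_max_backfills    # verify the stored value
```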
Manage the CRUSH map: export with ceph osd getcrushmap -o crushmap.bin, edit (e.g., with crushtool), and inject the new map using ceph osd setcrushmap -i crushmap.bin.
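The full export/edit/inject round trip described above, including the decompile and recompile steps:

```shell
# Export, decompile, edit, recompile, and inject the CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text
# ... edit crushmap.txt with your editor of choice ...
crushtool -c crushmap.txt -o crushmap.new   # recompile
ceph osd setcrushmap -i crushmap.new
```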
Change monitor IP addresses: edit the monitor’s entry in ceph.conf, update the monmap, and restart the monitor daemon.
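One way to rewrite a monitor's address is via monmaptool; the monitor id `mon-a` and IP address are examples, and the monitor must be stopped before the map is injected:

```shell
# Rewrite a monitor's address in the monmap
ceph mon getmap -o monmap.bin
monmaptool --print monmap.bin              # inspect current entries
monmaptool --rm mon-a monmap.bin
monmaptool --add mon-a 10.0.0.21:6789 monmap.bin

# Stop the monitor, inject the updated map, fix ceph.conf, restart
systemctl stop ceph-mon@mon-a
ceph-mon -i mon-a --inject-monmap monmap.bin
systemctl start ceph-mon@mon-a
```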
Fault Diagnosis and Recovery
The manual groups typical failure scenarios and provides step‑by‑step remediation:
OSD down or out: identify affected OSDs with ceph health detail, bring the OSD back online, or replace failed disks and re‑add the OSD.
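A typical triage sequence for a down OSD; the OSD id `7` is an example:

```shell
# Find which OSDs are down and where they live in the CRUSH tree
ceph health detail | grep -i osd
ceph osd tree | grep -i down

# On the affected host, restart the daemon and check its logs
systemctl restart ceph-osd@7
journalctl -u ceph-osd@7 -n 50     # recent log lines if it fails to start
```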
MON quorum loss: verify network connectivity, ensure each monitor’s monmap is consistent, and restart missing monitors to restore quorum.
PG (placement group) stuck or degraded: use ceph pg dump or ceph pg dump_stuck to locate problematic PGs, repair PGs flagged inconsistent with ceph pg repair, and adjust pg_num / pgp_num if poor PG distribution is the underlying cause.
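PG triage can be sketched as follows; the PG id `3.1f` is an example:

```shell
# Locate problem PGs and inspect one in detail
ceph pg dump_stuck unclean         # or: inactive, stale, undersized, degraded
ceph health detail | grep -i pg
ceph pg 3.1f query                 # detailed state of a single PG

# Repair a PG reported inconsistent after a scrub error
ceph pg repair 3.1f
```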
CRUSH rule errors: validate the rule syntax with crushtool --test and re‑apply a corrected CRUSH map.
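Before injecting an edited map, a rule can be dry-run against it; rule id `0` and replica count `3` are example values:

```shell
# Compile the edited map, then simulate placements against it
crushtool -c crushmap.txt -o crushmap.new
crushtool --test -i crushmap.new --rule 0 --num-rep 3 --show-mappings
crushtool --test -i crushmap.new --rule 0 --num-rep 3 --show-bad-mappings
```

`--show-bad-mappings` prints only inputs the rule failed to map fully, so an empty output is the desired result.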
Authentication failures: check user caps, regenerate keys, and propagate updated keys to client hosts.
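The caps check and key redistribution can be done like this; `client.app1` is an example user name:

```shell
# Verify a client's caps and re-export its keyring for the client host
ceph auth get client.app1                      # show key and caps
ceph auth get-key client.app1                  # key only
ceph auth export client.app1 -o ceph.client.app1.keyring
# copy the keyring to /etc/ceph/ on the client host and retry the mount/map
```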
Advanced Configuration and Cloud‑Native Integration
Beyond basic operations, the guide covers deeper tuning and integration topics:
Performance tuning: adjust osd_journal_size (FileStore only), choose between the FileStore and BlueStore backends, and tune network parameters such as ms_tcp_nodelay for low‑latency environments.
Custom CRUSH rules for heterogeneous hardware (e.g., SSD vs HDD tiers) and for multi‑site replication.
Integration with container orchestration platforms (Kubernetes, OpenShift) using the Ceph CSI driver, Rook operator, and Helm charts.
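A minimal Rook deployment via Helm looks roughly like the following; the chart repository and release names are the upstream defaults, and versions should be verified against the Rook documentation for your Kubernetes release:

```shell
# Add the Rook chart repo and install the operator into its own namespace
helm repo add rook-release https://charts.rook.io/release
helm install rook-ceph rook-release/rook-ceph \
    --namespace rook-ceph --create-namespace

# The operator and CSI driver pods should appear shortly
kubectl -n rook-ceph get pods
```

A CephCluster custom resource is then applied to have the operator provision the actual cluster.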
Service‑mesh compatibility: expose Ceph services via Envoy or Istio sidecars to enable secure, observable traffic between Ceph components and micro‑services.
Automation scripts and Ansible playbooks for repeatable cluster deployment and configuration drift detection.