Master Ceph Cluster Management: Fix Nearfull OSD, PG States & Config Commands
This guide explains how to troubleshoot Ceph near‑full OSD warnings, understand PG fault states, manage OSD and monitor statuses, perform cluster configuration changes without restarts, add or remove OSDs and monitors, adjust pool settings, and handle user permissions using detailed command examples.
Common Issues
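When the health warnings nearfull osd(s) or pool(s) nearfull appear, some OSDs have exceeded the near-full usage threshold. Raising mon_osd_full_ratio and mon_osd_nearfull_ratio in the configuration lifts the thresholds, but the root cause is usually uneven data distribution across OSDs, so check per-OSD usage (for example with ceph osd df) before changing them. On Luminous and later releases the running values are stored in the OSD map and are changed with ceph osd set-nearfull-ratio and ceph osd set-full-ratio. The defaults are:
"mon_osd_full_ratio": "0.95",
"mon_osd_nearfull_ratio": "0.85"
Automatic Handling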
ceph osd reweight-by-utilization
ceph osd reweight-by-pg 105 cephfs_data
Here 105 is the overload threshold in percent and cephfs_data is the pool name.
Manual Handling
ceph osd reweight osd.20 0.8
Global Handling
ceph mgr module ls
ceph mgr module enable balancer
ceph balancer on
ceph balancer mode crush-compat
ceph config-key set "mgr/balancer/max_misplaced" "0.01"
PG Fault States
PGs can be in various states during their lifecycle:
Creating – PG is being created when a pool is defined.
Peering – OSDs establish connections and achieve consistency for objects.
Active – Peering has completed; data is available on the primary and replica PGs for reads and writes.
Clean – All replicas are in sync and no PGs are out‑of‑place.
Degraded – Replicas are missing; caused by OSD down or write failures.
Recovering – OSDs that were down are catching up after coming back up.
Backfilling – New OSDs receive re‑assigned PGs.
Remapped – PGs are migrating to a new acting set.
Stale – Monitor has not received recent reports from the PG’s acting set.
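The states above are usually reported in combination, joined with a plus sign (for example active+clean). As a rough illustration, the sketch below sorts such a combined state string into a coarse category. The function name classify_pg_state and the four categories are assumptions made for this example and are not part of Ceph.

```shell
#!/bin/sh
# classify_pg_state: hypothetical helper, not a Ceph command.
# Takes one combined PG state string such as "active+clean" and
# prints a coarse category for it.
classify_pg_state() {
    # Wrap the string in "+" so every state name is delimited on both sides.
    case "+$1+" in
        "+active+clean+") echo "healthy" ;;      # the normal steady state
        *+stale+*)        echo "stale" ;;        # no recent report to the monitor
        *+degraded+*)     echo "degraded" ;;     # replicas are missing
        *)                echo "transitional" ;; # peering, recovering, backfilling, ...
    esac
}

classify_pg_state "active+clean"
classify_pg_state "active+undersized+degraded"
classify_pg_state "active+remapped+backfilling"
```

On a live cluster the state strings could come from ceph pg stat or ceph pg dump_stuck stale. Anything the sketch does not recognise falls into the transitional bucket, so it is a starting point for triage rather than a full health check.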
OSD Status
OSDs have two independent status groups: in/out (membership in the cluster) and up/down (process health). Combinations include:
in & up – Normal, OSD is part of the cluster and running.
in & down – OSD is still a member but its daemon is not running; if it stays down past the timeout (mon_osd_down_out_interval, 600 seconds by default) it is marked out.
out & up – OSD is running but not yet added to the cluster (e.g., newly added).
out & down – OSD is removed and not running; CRUSH will not place PGs on it.
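To spot OSDs in any combination other than in & up, one option is to filter the per-OSD lines of ceph osd dump. The sketch below assumes those lines start with the OSD name followed by the up/down and in/out fields, which may differ between releases. The function name list_unhealthy_osds and the sample data are made up for illustration.

```shell
#!/bin/sh
# list_unhealthy_osds: hypothetical filter, not a Ceph command.
# Reads lines on stdin shaped like the per-OSD lines of "ceph osd dump":
#   osd.0 up   in  weight 1 ...
# and prints every OSD that is not both "up" and "in".
list_unhealthy_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && ($2 != "up" || $3 != "in") {
        print $1 ": " $3 " & " $2
    }'
}

# Sample input, invented for this example (not real cluster output).
sample='osd.0 up   in  weight 1
osd.1 down in  weight 1
osd.2 up   out weight 0
osd.3 down out weight 0'

printf '%s\n' "$sample" | list_unhealthy_osds
```

Against a running cluster the same filter could be fed with ceph osd dump | list_unhealthy_osds, after confirming that the field order matches the assumption above.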
Cluster Monitoring Management
# ceph -s
  cluster:
    id:     8230a918-a0de-4784-9ab8-cd2a2b8671d0
    health: HEALTH_WARN
  services:
    mon: 3 daemons, quorum cephnode01,cephnode02,cephnode03 (age 27h)
    mgr: cephnode01(active, since 53m), standbys: cephnode03, cephnode02
    osd: 4 osds: 4 up (since 27h), 4 in (since 19h)
    rgw: 1 daemon active (cephnode01)
  data:
    pools:   6 pools, 96 pgs
    objects: 235 objects, 3.6 KiB
    usage:   4.0 GiB used, 56 GiB / 60 GiB avail
    pgs:     96 active+clean
Cluster Configuration Management (Temporary and Global, Without Service Restart)
Use ceph daemon {daemon-type}.{id} config show to view running configuration without restarting services.
# ceph daemon osd.0 config show
tell Command Format
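The tell sub-command sends a setting to a running daemon from any node that holds admin credentials. Using a wildcard (*) in place of the daemon ID applies it to all daemons of that type. Values injected this way are temporary and are lost when the daemon restarts.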
# ceph tell {daemon-type}.{daemon id or *} injectargs --{name}={value} [--{name}={value}]
# ceph tell osd.0 injectargs '--debug-osd 20 --debug-ms 1'
daemon Sub-command
Set configuration on a specific daemon directly through its admin socket. The command must be run on the host where that daemon is running.
# ceph daemon mon.ceph-monitor-1 config set mon_allow_pool_delete false
Cluster Operations
# systemctl start ceph.target
# systemctl start ceph-mgr.target
# systemctl start ceph-osd@<ID>
# systemctl start ceph-mon.target
# systemctl start ceph-mds.target
# systemctl start ceph-radosgw.target
Add and Delete OSD
Adding an OSD:
# ceph-volume lvm zap /dev/sd<id>
# ceph-deploy osd create --data /dev/sd<id> $hostname
Removing an OSD:
# ceph osd crush reweight osd.<ID> 0.0
# systemctl stop ceph-osd@<ID>
# ceph osd out <ID>
# ceph osd purge osd.<ID> --yes-i-really-mean-it
# umount /var/lib/ceph/osd/ceph-<ID>
Expand PG
# ceph osd pool set {pool-name} pg_num 128
# ceph osd pool set {pool-name} pgp_num 128
Note: pg_num and pgp_num should be increased together and chosen as a power of two.
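The note above asks for a power of two but does not say how to arrive at a starting figure. One commonly cited rule of thumb, which is an assumption here and not something this guide states, is to target roughly 100 PGs per OSD, divide by the replica count, and round up to the next power of two. The function names next_pow2 and suggest_pg_num in this sketch are hypothetical.

```shell
#!/bin/sh
# next_pow2: print the smallest power of two that is >= the argument.
next_pow2() {
    n=$1
    p=1
    while [ "$p" -lt "$n" ]; do
        p=$((p * 2))
    done
    echo "$p"
}

# suggest_pg_num: hypothetical helper implementing the rule of thumb
#   (OSD count * target PGs per OSD) / replica count, rounded up to a
#   power of two. The default target of 100 per OSD is an assumption.
suggest_pg_num() {
    osds=$1
    replicas=$2
    per_osd=${3:-100}
    raw=$(( osds * per_osd / replicas ))   # integer division
    next_pow2 "$raw"
}

# Example: 4 OSDs with 3 replicas gives 133, which rounds up to 256.
suggest_pg_num 4 3
```

The result is a cluster-wide budget, so with several pools it would be split between them rather than applied to each pool. Treat it as a first estimate to sanity-check against current usage, not a value to apply blindly.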
Pool Operations
# ceph osd lspools
# ceph osd pool create {pool-name} {pg-num} [{pgp-num}]
# ceph osd pool set-quota {pool-name} max_objects 10000
# ceph osd pool delete {pool-name} {pool-name} --yes-i-really-really-mean-it
# ceph osd pool rename {current-pool-name} {new-pool-name}
# rados df
# ceph osd pool mksnap {pool-name} {snap-name}
# ceph osd pool rmsnap {pool-name} {snap-name}
# ceph osd pool get {pool-name} {key}
# ceph osd pool set {pool-name} {key} {value}
User Management
Ceph users need permissions to access pools and execute management commands.
# ceph auth list
# ceph auth get client.admin
# ceph auth print-key client.admin
# ceph auth add client.john mon 'allow r' osd 'allow rw pool=liverpool'
# ceph auth caps client.john mon 'allow r' osd 'allow rw pool=liverpool'
# ceph auth del client.john
Add and Delete Monitor
# ceph-deploy mon create $hostname
# ceph-deploy mon destroy $hostname
It is recommended to run an odd number of monitors (at least three in production) to maintain quorum.
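The preference for an odd count follows from majority arithmetic: a quorum needs more than half of the monitors, so an even count raises the quorum size without tolerating any additional failure. The sketch below just prints that arithmetic. The function name mon_quorum is hypothetical.

```shell
#!/bin/sh
# mon_quorum: hypothetical helper that prints, for a given monitor count,
# the majority needed for quorum and how many monitors may fail.
mon_quorum() {
    n=$1
    majority=$(( n / 2 + 1 ))      # strict majority of n
    tolerated=$(( n - majority ))  # monitors that can be lost
    echo "monitors=$n quorum=$majority tolerated_failures=$tolerated"
}

for count in 3 4 5; do
    mon_quorum "$count"
done
```

Three and four monitors both tolerate a single failure, while five tolerate two, which is why the fourth monitor adds little.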
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
