How to Expand a Ceph Cluster Without Overloading Your Services
This guide walks through a real‑world Ceph cluster expansion from 500 TB to 1.2 PB. It explains the risks of automatic rebalancing and presents a step‑by‑step, batch‑wise expansion plan: weight‑adjustment tricks, configuration tuning, monitoring, troubleshooting, and rollback procedures that keep business latency under control.
Overview
When a Ceph cluster grows, the automatic data rebalancing triggered by adding or removing OSDs can cause severe latency spikes and even outage risk. A controlled expansion process that limits the impact of rebalancing on front‑end services is essential.
Technical characteristics
CRUSH-driven placement: Adding or removing OSDs forces CRUSH to recompute data placement, causing massive data migration.
Background automatic execution: By default Ceph rebalances as fast as possible, ignoring business impact.
Limited controllability: Recovery speed can be throttled (see the check below), but pausing rebalancing completely leads to uneven data distribution.
Accumulated effect: Adding many OSDs at once multiplies the migration load.
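Before planning any change, it helps to look at the throttles that currently govern recovery and backfill. A minimal check, assuming the cluster uses the centralized config database:
# Inspect the current recovery/backfill throttles (defaults are release-dependent)
ceph config get osd osd_max_backfills
ceph config get osd osd_recovery_max_active
ceph config get osd osd_recovery_sleep_hdd
ceph config get osd osd_recovery_sleep_ssd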
Applicable scenarios
Production‑grade capacity expansion without service disruption.
Hardware refresh where old OSDs are replaced gradually.
Performance optimisation by re‑balancing OSD distribution.
Data‑centre migration or multi‑site deployment involving large‑scale data reshuffling.
Environment requirements
Ceph version: Quincy (17.x) or Reef (18.x) – the guide assumes Quincy.
OS: Ubuntu 22.04 or Rocky Linux 9 with kernel 5.4+ for optimal BlueStore performance.
Network: Minimum 10 GbE, 25 GbE recommended; separate public and cluster networks.
Memory: ≥8 GB per OSD for BlueStore cache.
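The per‑OSD memory guideline maps to BlueStore's osd_memory_target. A quick check, where the 8 GiB value is an assumption that simply mirrors the guideline above:
# Check and, if needed, raise the BlueStore memory target (8 GiB mirrors the guideline)
ceph config get osd osd_memory_target
ceph config set osd osd_memory_target 8589934592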
Step‑by‑step procedure
1. Preparation
1.1 Check cluster health and status
# View overall status
ceph status
ceph health detail
# Verify OSD count and distribution
ceph osd tree
# Show data distribution
ceph osd df tree
# Cluster usage
ceph df detail
# PG status
ceph pg stat
ceph pg dump_stuck
Key indicators:
Cluster health must be HEALTH_OK.
No PG in degraded, recovering or backfilling state.
No OSD marked down.
Overall usage below 70 % to leave buffer for rebalancing.
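These indicators can be checked in one short pass. The snippet below is a sketch; the column layout of ceph df varies slightly between releases:
# Quick pre-flight check of the key indicators (sketch; output layout varies by release)
ceph health | grep -q HEALTH_OK || echo "WARNING: cluster is not HEALTH_OK"
ceph osd stat                                    # all OSDs should be up and in
ceph pg dump_stuck 2>/dev/null                   # should list no stuck PGs
ceph df | awk '/TOTAL/ {print "Raw usage: " $NF " %"}'   # keep below ~70 %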
1.2 Plan expansion batches
# Example: 100 existing OSDs, add 20 new OSDs in 3 batches
# Batch 1: add 8 OSDs
# Batch 2: add 8 OSDs
# Batch 3: add 4 OSDs
# Estimated data moved for first batch (500 TB * 8/108 ≈ 37 TB)
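The same estimate extends to every batch. The sketch below reuses the example figures above (500 TB of data, 100 existing OSDs, batches of 8/8/4):
# Sketch: estimated data movement per batch, using the example figures above
TOTAL_DATA_TB=500
OSDS=100
for BATCH in 8 8 4; do
    NEW_TOTAL=$((OSDS + BATCH))
    MOVED=$(echo "scale=1; $TOTAL_DATA_TB * $BATCH / $NEW_TOTAL" | bc)
    echo "Batch of $BATCH OSDs (cluster grows to $NEW_TOTAL): ~${MOVED} TB expected to move"
    OSDS=$NEW_TOTAL
done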
1.3 Adjust rebalancing parameters before adding OSDs
# Throttle recovery speed
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_max_active_hdd 1
ceph config set osd osd_recovery_max_active_ssd 2
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_sleep_hdd 0.1
ceph config set osd osd_recovery_sleep_ssd 0
ceph config set osd osd_recovery_priority 1
2. Core configuration
2.1 Add OSD with initial weight 0
# Make new OSDs join the CRUSH map with weight 0 (osd_crush_initial_weight), then add them
ceph config set osd osd_crush_initial_weight 0
ceph orch daemon add osd host:device
# Or adjust weight immediately after addition
ceph osd crush reweight osd.ID 0
Weight 0 prevents the new OSD from receiving data until we are ready.
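Before raising any weights, confirm that the new OSDs really joined with weight 0; the IDs 100‑107 below are the example IDs used elsewhere in this guide:
# Confirm the new OSDs are up, in, and carry CRUSH weight 0 (IDs 100-107 are this guide's example)
ceph osd tree | grep -E "osd\.10[0-7]\b"
ceph osd find osd.100        # shows the CRUSH location (host/rack) of one new OSD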
2.2 Gradually increase OSD weight
# Example loop to raise weight in steps of 0.1
for i in {100..107}; do
ceph osd crush reweight osd.$i 0.1
done
# Wait for rebalancing to finish before the next step
watch -n 5 'ceph status | grep -E "recovery|backfill|misplaced"'
# Continue with 0.3, 0.5, … up to 1.0
Increase weight by no more than 0.2 per step to avoid sudden data storms.
Wait for rebalancing to complete before the next increment.
During low‑traffic windows you may use larger steps.
2.3 Use the noout flag during expansion
# Prevent temporary network glitches from marking OSDs out
ceph osd set noout
# After expansion
ceph osd unset noout
# Optionally target a single OSD
ceph osd add-noout osd.100
ceph osd rm-noout osd.100
3. Monitoring
3.1 Rebalancing progress
# Real‑time status
watch -n 2 'ceph status'
# Detailed progress
ceph progress
# PG recovery state
ceph pg stat
ceph pg dump_stuck recovering
3.2 Business impact
# OSD latency
ceph osd perf
# IOPS and throughput
ceph daemon osd.0 perf dump | jq '.osd.op_r, .osd.op_w'
# Optional stress test (use with caution)
# rados bench -p test-pool 60 write --no-cleanup
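A small watch loop can flag OSDs whose commit latency climbs during rebalancing. This is a sketch: the 100 ms threshold and the column positions of ceph osd perf output are assumptions to adapt to your environment:
# Sketch: warn when any OSD's commit latency exceeds 100 ms (threshold is an assumption)
while true; do
    HIGH=$(ceph osd perf | awk 'NR > 1 && $2+0 > 100 {printf "osd.%s (%s ms) ", $1, $2}')
    [ -n "$HIGH" ] && echo "$(date '+%H:%M:%S') high commit latency: $HIGH"
    sleep 10
done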
4. Example scripts
4.1 Pre‑expansion configuration script (ceph_expansion_config.sh)
#!/bin/bash
# Limit recovery concurrency and bandwidth
ceph config set osd osd_recovery_max_active 1
ceph config set osd osd_recovery_max_active_hdd 1
ceph config set osd osd_recovery_max_active_ssd 2
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_priority 1
ceph config set osd osd_recovery_sleep_hdd 0.1
ceph config set osd osd_recovery_sleep_ssd 0.02
ceph config set osd osd_recovery_max_bytes 52428800 # 50 MB/s
ceph osd set noout
ceph config dump | grep -E "recovery|backfill"
ceph status
4.2 Gradual weight‑adjustment script (gradual_weight_adjust.sh)
#!/bin/bash
OSD_IDS="100 101 102 103 104 105 106 107"
TARGET_WEIGHT=1.0
STEP=0.2
CHECK_INTERVAL=30
MAX_WAIT_TIME=3600
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }
check_rebalance_active(){
    status=$(ceph status 2>/dev/null)
    echo "$status" | grep -qE "recovery|backfill|misplaced|degraded"
}
wait_for_rebalance(){
    waited=0
    log "Waiting for rebalancing to finish..."
    while check_rebalance_active; do
        [ $waited -ge $MAX_WAIT_TIME ] && { log "Timeout waiting for rebalance"; return 1; }
        sleep $CHECK_INTERVAL
        waited=$((waited+CHECK_INTERVAL))
        progress=$(ceph progress json 2>/dev/null | jq -r '((.events[0].progress // 0) * 100 | floor)')
        log "Rebalance progress: ${progress}%, waited: ${waited}s"
    done
    log "Rebalance completed"
}
main(){
    log "Starting gradual OSD weight adjustment"
    rounds=$(echo "scale=0; $TARGET_WEIGHT/$STEP" | bc)
    log "Estimated $rounds rounds"
    for round in $(seq 1 $rounds); do
        new_weight=$(echo "scale=1; $STEP*$round" | bc)
        log "=== Round $round: setting weight to $new_weight ==="
        for osd_id in $OSD_IDS; do
            ceph osd crush reweight osd.$osd_id $new_weight
        done
        sleep 10
        wait_for_rebalance || log "Warning: round $round timed out"
        log "Round $round completed"
    done
    ceph osd df tree
}
main
5. Real‑world cases
Case 1: 500 TB → 800 TB
60 existing OSDs (8 TB each) + 36 new OSDs.
Batch‑wise addition with weight steps 0.1 → 1.0.
Total time ~5 days, P99 latency rose from 50 ms to 80 ms (acceptable).
Data migrated ≈240 TB, no alerts or user complaints.
Case 2: Emergency expansion (85 % usage, 48 h window)
Set aggressive recovery limits (max_active = 3, backfills = 2, bandwidth = 100 MB/s); a configuration sketch follows this list.
Add all OSDs at once with initial weight 0, then raise to 0.5 at 02:00 and to 1.0 at 05:00.
Rapid expansion while monitoring latency spikes.
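One possible rendering of that emergency profile is sketched below. The OSD IDs (100‑119) and the use of at(1) for scheduling are illustrative, and the 100 MB/s bandwidth cap is omitted because the matching option depends on the Ceph release:
# Sketch of the Case 2 emergency profile (OSD IDs and scheduling are illustrative)
ceph config set osd osd_recovery_max_active 3
ceph config set osd osd_max_backfills 2
ceph config set osd osd_recovery_sleep_hdd 0
ceph config set osd osd_recovery_sleep_ssd 0
# Schedule the weight raises inside the low-traffic window
echo 'for i in $(seq 100 119); do ceph osd crush reweight osd.$i 0.5; done' | at 02:00
echo 'for i in $(seq 100 119); do ceph osd crush reweight osd.$i 1.0; done' | at 05:00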
6. Best practices & caveats
Keep cluster usage below 70 % before expansion.
Distribute new OSDs across different hosts/racks so they do not share a single failure domain (see the check after this list).
Separate public and cluster networks; run rebalancing traffic on the cluster network.
Perform weight adjustments during low‑traffic windows; consider higher speed at night.
Never start expansion when health is not HEALTH_OK.
Avoid mixing other configuration changes (e.g., CRUSH rule edits) with expansion.
Ensure new OSD hardware matches existing performance characteristics.
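To verify the failure‑domain spread mentioned above, inspect the CRUSH hierarchy and each rule's placement steps; the jq filter is a convenience and assumes jq is installed:
# Check that new OSDs are spread across hosts/racks and review each rule's failure domain
ceph osd crush tree
ceph osd crush rule dump | jq -r '.[] | .rule_name + ": " + (.steps | tostring)'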
7. Troubleshooting
OSD added but no data movement: Verify weight > 0, the OSD is up+in, and the CRUSH map includes the OSD.
Slow rebalancing: Recovery limits may be too conservative; increase them within business tolerance.
Latency spikes: Rebalancing is consuming IO; lower recovery parameters or temporarily set norecover / nobackfill.
PG stuck in remapped: Check CRUSH rules and weight settings; use ceph osd reweight-by-utilization to rebalance.
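For PGs stuck in remapped or backfill states, listing them and querying one directly usually reveals the blocking condition; the PG ID below is a placeholder:
# Inspect stuck PGs (the PG ID 2.1a is a placeholder)
ceph pg ls remapped
ceph pg ls backfill_wait
ceph pg 2.1a query | jq '.recovery_state[0]'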
Emergency stop‑gap:
# Pause all recovery/backfill
ceph osd set norecover
ceph osd set nobackfill
# After investigation, resume
ceph osd unset norecover
ceph osd unset nobackfill
# Optionally set problematic OSD weight to 0
ceph osd crush reweight osd.ID 0
8. Monitoring & alerts
Key metrics to watch (via ceph status, ceph osd perf, or Prometheus); a sketch for reading the latency counters follows this list:
Read latency op_r_latency (<10 ms SSD, <50 ms HDD).
Write latency op_w_latency (<20 ms SSD, <100 ms HDD).
Recovery ops count.
Misplaced objects – should rise during expansion but eventually drop.
OSD apply latency (<50 ms normal, >100 ms warning).
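The read/write latency counters can be pulled from the OSD admin socket; averages come from sum/avgcount. A minimal sketch, assuming jq is available and that the counter paths match your release:
# Sketch: average read/write op latency in ms for one OSD (counter paths may vary by release)
ceph daemon osd.0 perf dump | jq '{
  read_ms:  (.osd.op_r_latency.sum / (.osd.op_r_latency.avgcount + 0.000001) * 1000),
  write_ms: (.osd.op_w_latency.sum / (.osd.op_w_latency.avgcount + 0.000001) * 1000)
}'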
Example Prometheus alert for slow recovery:
groups:
  - name: ceph_expansion_alerts
    rules:
      - alert: CephRecoveryTooSlow
        expr: rate(ceph_pg_recovery_bytes[5m]) < 10485760 and ceph_pg_total - ceph_pg_active > 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Ceph recovery speed too slow"
          description: "Recovery below 10 MB/s for 30 minutes"
9. Pre‑expansion checklist & rollback
Run a script that verifies health, PG state, OSD status, and usage (<70 %). Back up the current CRUSH map before any changes.
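The CRUSH map backup referenced here can be taken with getcrushmap; the /backup path matches the restore command used below:
# Back up the current CRUSH map before any changes (path matches the restore step below)
mkdir -p /backup
ceph osd getcrushmap -o /backup/crushmap.bin
crushtool -d /backup/crushmap.bin -o /backup/crushmap.txt   # optional human-readable copy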
If a severe issue occurs, follow these steps:
Pause recovery: ceph osd set norecover and ceph osd set nobackfill.
Set all newly added OSD weights to 0.
Optionally remove the new OSDs (out, crush remove, rm, auth del).
Restore the previous CRUSH map with ceph osd setcrushmap -i /backup/crushmap.bin.
10. Summary
Ensure cluster health and keep usage below 70 % before starting.
Add OSDs in batches not exceeding 10‑15 % of the total.
Initialize new OSDs with weight 0 and raise the weight gradually.
Continuously monitor cluster status and business latency; be ready to pause recovery with norecover / nobackfill.