Mastering Ceph: Essential Hardware & Software Optimizations for High‑Performance Distributed Storage
This guide outlines key hardware planning, SSD usage, BIOS settings, Linux and Ceph configuration tweaks, PG calculations, and performance tuning commands to optimize a Ceph distributed storage cluster for maximum throughput and reliability.
Optimizing a distributed storage system involves several key aspects:
Hardware layer
Hardware planning
SSD selection
BIOS settings
Software layer
Linux OS
Ceph configurations
PG number adjustment
CRUSH map
Other factors
Hardware layer
1. CPU
Each ceph‑osd process is CPU‑intensive, so bind each one to a dedicated CPU core.
Ceph‑mon processes use little CPU, so they do not need dedicated cores.
Ceph‑mds also consumes significant CPU and should be allocated more cores.
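One way to pin running OSD daemons is taskset from util-linux. The sketch below is a dry run under stated assumptions: it only prints the taskset commands, handing out one core per ceph-osd PID from a chosen first core. The helper name plan_osd_pinning, the starting core and the PIDs are hypothetical and not from the original article.

```shell
# Dry-run sketch: print one "taskset -cp CORE PID" line per ceph-osd PID.
# plan_osd_pinning is a hypothetical helper; it changes nothing on the system.
plan_osd_pinning() {
    local core="$1"   # first CPU core to hand out (assumed reserved for OSDs)
    shift
    local pid
    for pid in "$@"; do
        echo "taskset -cp ${core} ${pid}"
        core=$((core + 1))
    done
}

# Made-up PIDs for illustration; on a real node they could come from:
#   pgrep -x ceph-osd
plan_osd_pinning 2 1201 1202 1203
```

Reviewing the printed commands before running them as root keeps the step safe. Affinity set this way applies to the running process only and is lost when the daemon restarts.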
2. Memory
Ceph‑mon and ceph‑mds each require 2 GB of RAM, while each ceph‑osd process needs 1 GB.
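As a rough sizing aid, the per-daemon figures above can be added up per node. This is a minimal sketch, assuming only those figures (2 GB per ceph-mon, 2 GB per ceph-mds, 1 GB per ceph-osd); the helper name ceph_ram_gb is hypothetical, and the result leaves out the operating system and page cache.

```shell
# Rough RAM estimate in GB from the per-daemon figures quoted in the text.
# ceph_ram_gb is a hypothetical helper: arguments are mon, mds and osd counts.
ceph_ram_gb() {
    local mons="$1"
    local mdss="$2"
    local osds="$3"
    echo $(( mons * 2 + mdss * 2 + osds * 1 ))
}

# A node hosting 1 monitor, no metadata server and 12 OSDs:
ceph_ram_gb 1 0 12   # prints 14
```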
3. Network
10 GbE is essentially required for Ceph; plan separate client and cluster networks and consider bonding for high availability or load balancing.
4. SSD
SSDs can be used in Ceph in several ways:
SSD as journal
SSD as high‑performance pool (requires CRUSH map changes)
SSD as tiered pool
5. BIOS
Enable VT (virtualization support) and HT (hyper‑threading).
Disable power‑saving settings for performance gains.
NUMA: either disable it in the BIOS, or bind each ceph‑osd process to specific CPU cores and to the memory of the same NUMA node. On CentOS, NUMA can also be turned off by adding numa=off to the kernel line in /etc/grub.conf.
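If NUMA stays enabled, the binding option can be sketched with numactl. The example below is a dry run under stated assumptions: it spreads OSD ids round-robin over the NUMA nodes and prints the command that would keep each daemon's CPUs and memory on one node. The helper name numa_bind_plan is hypothetical, and "ceph-osd -i ID" stands in for however the daemons are actually started on the host.

```shell
# Dry-run sketch: print a numactl invocation per OSD id, round-robin over nodes.
# numa_bind_plan is a hypothetical helper; it only prints, it starts nothing.
numa_bind_plan() {
    local nodes="$1"   # number of NUMA nodes (see: numactl --hardware)
    shift
    local id node
    for id in "$@"; do
        node=$(( id % nodes ))
        echo "numactl --cpunodebind=${node} --membind=${node} ceph-osd -i ${id}"
    done
}

# Two NUMA nodes, OSD ids 0 to 3:
numa_bind_plan 2 0 1 2 3
```

Round-robin is only one possible policy; placing each OSD on the node closest to its disk controller or network card may be a better fit on some hardware.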
Software layer
1. Kernel pid max
echo 4194303 > /proc/sys/kernel/pid_max
2. Set MTU (the switch must support it)
MTU=9000
3. read_ahead
echo "8192" > /sys/block/sda/queue/read_ahead_kb
4. swappiness
echo "vm.swappiness = 0" >> /etc/sysctl.conf; sysctl -p
5. I/O scheduler (SSD: noop, SATA/SAS: deadline)
echo "deadline" > /sys/block/sd[x]/queue/scheduler
echo "noop" > /sys/block/sd[x]/queue/scheduler
6. ceph.conf configuration (excerpt)
[global]
fsid = 88caa60a-e6d1-4590-a2b5-bd4e703e46d9
mon host = 10.0.1.21,10.0.1.22,10.0.1.23
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 128
osd pool default pgp num = 128
public network = 10.0.1.0/24
cluster network = 10.0.1.0/24
max open files = 131072
mon initial members = controller1, controller2, compute01
[mon]
mon data = /var/lib/ceph/mon/ceph-$id
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000
osd journal = /var/lib/ceph/osd/$cluster-$id/journal
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 1048576000
filestore queue committing max ops = 50000
filestore queue committing max bytes = 10485760000
filestore split multiple = 8
filestore merge threshold = 40
filestore fd cache size = 1024
journal max write bytes = 1073741824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
[client]
rbd cache = true
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = false
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
7. PG number
PG and PGP counts must be adjusted based on the number of OSDs; the formula is:
Total PGs = (Total_number_of_OSD * 100) / max_replication_count
Example: 100 OSDs, 2 replicas, 5 pools → Total PGs = 100 * 100 / 2 = 5000; each pool gets 1000 PGs, which rounds up to the next power of two, so create the pool with pg=1024.
ceph osd pool create pool_name 1024
8. Modify CRUSH map
The CRUSH map can assign different OSDs to pools and adjust OSD weights.
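Returning to the PG sizing rule in step 7, the arithmetic can be sketched as a small shell function. This is an illustration under stated assumptions rather than an official tool: the helper name pg_per_pool is hypothetical, PGs are split evenly across pools, and the per-pool figure is rounded up to the next power of two as in the 1000 → 1024 example.

```shell
# Sketch of the PG sizing rule: (OSDs * 100) / replicas, divided evenly
# across pools, then rounded up to the next power of two.
# pg_per_pool is a hypothetical helper: arguments are OSDs, replicas, pools.
pg_per_pool() {
    local osds="$1"
    local replicas="$2"
    local pools="$3"
    local target=$(( osds * 100 / replicas / pools ))
    local pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

# The article's example: 100 OSDs, 2 replicas, 5 pools.
pg_per_pool 100 2 5   # prints 1024
```

The result can be passed to the pool creation command, for example: ceph osd pool create pool_name "$(pg_per_pool 100 2 5)".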
9. Other factors
ceph osd perf
Use ceph osd perf to monitor disk latency; OSDs with excessive latency should be removed.
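Picking out slow OSDs from that output can be scripted. The sketch below assumes the output has a one-line header followed by three whitespace-separated columns (OSD id, commit latency in ms, apply latency in ms); column names and layout can differ between Ceph releases, so check your own output first. The helper name slow_osds, the 100 ms threshold and the sample rows are all made up for illustration.

```shell
# Sketch: print the ids of OSDs whose commit or apply latency exceeds a limit.
# slow_osds is a hypothetical helper that reads "ceph osd perf" text on stdin.
# Assumed layout: header line, then "id commit_ms apply_ms" per row.
slow_osds() {
    awk -v limit="$1" 'NR > 1 && ($2 > limit || $3 > limit) { print $1 }'
}

# Made-up sample rows; on a cluster this would be: ceph osd perf | slow_osds 100
printf 'osd commit_latency(ms) apply_latency(ms)\n0 12 15\n1 240 260\n2 9 101\n' \
    | slow_osds 100
```

A single high reading can be a transient spike, so it may be worth sampling several times before deciding to remove an OSD.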
MaGe Linux Operations
