
Mastering Ceph: Essential Hardware & Software Optimizations for High‑Performance Distributed Storage

This guide outlines key hardware planning, SSD usage, BIOS settings, Linux and Ceph configuration tweaks, PG calculations, and performance tuning commands to optimize a Ceph distributed storage cluster for maximum throughput and reliability.

MaGe Linux Operations

Optimizing a distributed storage system involves several key aspects:

Hardware layer: hardware planning, SSD selection, BIOS settings

Software layer: Linux OS, Ceph configuration, PG number adjustment, CRUSH map, other factors

Hardware layer

1. CPU

The ceph-osd process is CPU-intensive, so bind each ceph-osd process to a dedicated CPU core.

Ceph‑mon processes use little CPU, so they do not need dedicated cores.

Ceph‑mds also consumes significant CPU and should be allocated more cores.

2. Memory

Ceph‑mon and ceph‑mds each require 2 GB of RAM, while each ceph‑osd process needs 1 GB.
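
As a rough sizing aid, the per-daemon figures above can be added up for a node. The helper name ceph_ram_gb below is hypothetical (it is not a Ceph tool), and the result is only the baseline from this section, without headroom for the OS or recovery load.

```shell
#!/usr/bin/env bash
# Rough RAM estimate for one node, using the figures quoted above:
# 2 GB per ceph-mon, 2 GB per ceph-mds, 1 GB per ceph-osd.
# ceph_ram_gb is a hypothetical helper name, not part of Ceph.
ceph_ram_gb() {
    local mons=$1 mds=$2 osds=$3
    echo $(( mons * 2 + mds * 2 + osds * 1 ))
}

# Example: a node running 1 monitor, no MDS and 12 OSDs
ceph_ram_gb 1 0 12   # prints 14
```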

3. Network

10 GbE is essentially required for Ceph; plan separate client and cluster networks and consider bonding for high availability or load balancing.

4. SSD

SSDs can be used in Ceph in several ways:

SSD as journal

SSD as high‑performance pool (requires CRUSH map changes)

SSD as tiered pool

5. BIOS

Enable VT and HT for virtualization support and hyper‑threading.

Disable power‑saving settings for performance gains.

NUMA: either disable it in the BIOS, or keep it enabled and bind each ceph-osd process to the CPU cores and memory of a single NUMA node. To disable NUMA at the OS level on CentOS, add numa=off to the kernel line in /etc/grub.conf.
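
If NUMA stays enabled, one way to do the binding is to wrap each OSD launch in numactl. The sketch below only prints the command lines (a dry run) and spreads OSD ids round-robin across nodes. The helper name numa_bind_plan and the round-robin policy are assumptions for illustration; real deployments usually start ceph-osd through their init system, so the wrapper would go there instead.

```shell
#!/usr/bin/env bash
# Print (not run) a numactl-wrapped launch line for each OSD id,
# assigning OSDs round-robin to NUMA nodes 0..nodes-1.
# numa_bind_plan is a hypothetical helper; the policy is an assumption.
numa_bind_plan() {
    local nodes=$1; shift
    local i=0 id node
    for id in "$@"; do
        node=$(( i % nodes ))
        echo "numactl --cpunodebind=${node} --membind=${node} ceph-osd -i ${id}"
        i=$(( i + 1 ))
    done
}

# Example: 2 NUMA nodes, OSD ids 0..3
numa_bind_plan 2 0 1 2 3
```

The number of NUMA nodes on a host can be read with numactl --hardware before choosing the first argument.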

Software layer

1. Kernel pid max

echo 4194303 > /proc/sys/kernel/pid_max

2. Set MTU (switch must support it)

MTU=9000

3. read_ahead

echo "8192" > /sys/block/sda/queue/read_ahead_kb

4. swappiness

echo "vm.swappiness = 0" >> /etc/sysctl.conf; sysctl -p

5. I/O scheduler (SSD: noop, SATA/SAS: deadline)

# SATA/SAS disks (replace sd[x] with the device name)
echo "deadline" > /sys/block/sd[x]/queue/scheduler
# SSDs
echo "noop" > /sys/block/sd[x]/queue/scheduler
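
Rather than setting each disk by hand, the choice can be derived from the kernel's rotational flag (0 for SSD, 1 for spinning disk). The following is a minimal dry-run sketch: it prints the commands instead of executing them, and pick_scheduler is a hypothetical helper name. It assumes a kernel that still offers the noop and deadline schedulers; newer multi-queue kernels name them none and mq-deadline instead.

```shell
#!/usr/bin/env bash
# Map the rotational flag to the scheduler recommended above.
# pick_scheduler is a hypothetical helper, not a system tool.
pick_scheduler() {
    case "$1" in
        0) echo noop ;;       # SSD
        1) echo deadline ;;   # SATA/SAS spinning disk
        *) echo "unknown rotational flag: $1" >&2; return 1 ;;
    esac
}

# Dry run: print the command that would be issued for each sd* disk.
for dev in /sys/block/sd*; do
    [ -r "$dev/queue/rotational" ] || continue
    sched=$(pick_scheduler "$(cat "$dev/queue/rotational")") || continue
    echo "echo $sched > $dev/queue/scheduler"
done
```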

6. ceph.conf configuration (excerpt)

[global]
fsid = 88caa60a-e6d1-4590-a2b5-bd4e703e46d9
mon host = 10.0.1.21,10.0.1.22,10.0.1.23
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 128
osd pool default pgp num = 128
public network = 10.0.1.0/24
cluster network = 10.0.1.0/24
max open files = 131072
mon initial members = controller1, controller2, compute01
[mon]
mon data = /var/lib/ceph/mon/ceph-$id
mon clock drift allowed = 1
mon osd min down reporters = 13
mon osd down out interval = 600
[osd]
osd data = /var/lib/ceph/osd/ceph-$id
osd journal size = 20000
osd journal = /var/lib/ceph/osd/$cluster-$id/journal
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
filestore xattr use omap = true
filestore min sync interval = 10
filestore max sync interval = 15
filestore queue max ops = 25000
filestore queue max bytes = 1048576000
filestore queue committing max ops = 50000
filestore queue committing max bytes = 10485760000
filestore split multiple = 8
filestore merge threshold = 40
filestore fd cache size = 1024
journal max write bytes = 1073741824
journal max write entries = 10000
journal queue max ops = 50000
journal queue max bytes = 10485760000
osd max write size = 512
osd client message size cap = 2147483648
osd deep scrub stride = 131072
osd op threads = 16
osd disk threads = 4
osd map cache size = 1024
osd map cache bl size = 128
osd mount options xfs = "rw,noexec,nodev,noatime,nodiratime,nobarrier"
osd recovery op priority = 2
osd recovery max active = 10
osd max backfills = 4
osd min pg log entries = 30000
osd max pg log entries = 100000
osd mon heartbeat interval = 40
ms dispatch throttle bytes = 1048576000
objecter inflight ops = 819200
osd op log threshold = 50
osd crush chooseleaf type = 0
[client]
rbd cache = true
rbd cache size = 335544320
rbd cache max dirty = 134217728
rbd cache max dirty age = 30
rbd cache writethrough until flush = false
rbd cache max dirty object = 2
rbd cache target dirty = 235544320
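
The Ceph documentation states that rbd cache max dirty must be less than rbd cache size, and rbd cache target dirty must be less than rbd cache max dirty. A quick sanity check before rolling out [client] values could look like the sketch below; check_rbd_cache is a hypothetical helper, not a Ceph command. Run against the excerpt's values, it flags the target dirty setting, so that value is worth reviewing before use.

```shell
#!/usr/bin/env bash
# check_rbd_cache SIZE MAX_DIRTY TARGET_DIRTY   (all in bytes)
# Hypothetical helper: returns non-zero if the documented ordering
# target dirty < max dirty < cache size is violated.
check_rbd_cache() {
    local size=$1 max_dirty=$2 target=$3 rc=0
    # max dirty = 0 means write-through caching; nothing to compare.
    [ "$max_dirty" -eq 0 ] && return 0
    if [ "$max_dirty" -ge "$size" ]; then
        echo "rbd cache max dirty ($max_dirty) should be below rbd cache size ($size)"
        rc=1
    fi
    if [ "$target" -ge "$max_dirty" ]; then
        echo "rbd cache target dirty ($target) should be below rbd cache max dirty ($max_dirty)"
        rc=1
    fi
    return $rc
}

# Values taken from the [client] section above
check_rbd_cache 335544320 134217728 235544320 || echo "review the [client] values"
```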

7. PG number

PG and PGP counts must be adjusted based on the number of OSDs; the formula is:

Total PGs = (Total_number_of_OSD * 100) / max_replication_count
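
The formula can be sketched as a small shell function that also splits the total across pools and rounds up to the next power of two, which is the usual practice for pg_num. The function name pg_per_pool is hypothetical, and the even split across pools is an assumption that only holds when the pools are expected to store similar amounts of data.

```shell
#!/usr/bin/env bash
# pg_per_pool OSDS REPLICAS POOLS
# Applies (OSDs * 100) / replicas, divides evenly across pools,
# then rounds up to the next power of two.
# pg_per_pool is a hypothetical helper, not a Ceph command.
pg_per_pool() {
    local osds=$1 replicas=$2 pools=$3
    local target=$(( osds * 100 / replicas / pools ))
    local pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

pg_per_pool 100 2 5   # prints 1024
```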

Example: 100 OSDs, 2 replicas, 5 pools → Total PGs = 100*100/2 = 5000; each pool gets 1000 PGs, which rounds up to the next power of two, so create each pool with pg=1024.

ceph osd pool create pool_name 1024

8. Modify CRUSH map

The CRUSH map can assign different OSDs to pools and adjust OSD weights.
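
A typical way to make such changes is to export the CRUSH map, decompile it to text, edit it, recompile it, and load it back. The sketch below prints those steps as a dry run by default. The wrapper names run and edit_crush_map, the DRY_RUN switch, and the file names are assumptions for illustration; the ceph and crushtool invocations are the standard ones.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to run them against a real cluster.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# edit_crush_map WORKDIR  (hypothetical helper; file names are arbitrary)
edit_crush_map() {
    local work=$1
    run ceph osd getcrushmap -o "$work/crush.bin"            # export compiled map
    run crushtool -d "$work/crush.bin" -o "$work/crush.txt"  # decompile to text
    run "${EDITOR:-vi}" "$work/crush.txt"                    # adjust rules and weights
    run crushtool -c "$work/crush.txt" -o "$work/crush.new"  # recompile
    run ceph osd setcrushmap -i "$work/crush.new"            # load into the cluster
}

edit_crush_map /tmp/crush
```

For a single weight change, ceph osd crush reweight with the OSD name and new weight avoids the full edit cycle.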

9. Other factors

ceph osd perf

Use ceph osd perf to monitor disk latency; OSDs with excessive latency should be removed.
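
To spot outliers quickly, the ceph osd perf output can be filtered by a latency threshold. The sketch below assumes the three-column layout (OSD id, commit latency, apply latency, both in ms); slow_osds is a hypothetical helper, the 100 ms threshold is arbitrary, and the sample rows are made up purely to show the filter.

```shell
#!/usr/bin/env bash
# slow_osds THRESHOLD_MS
# Reads ceph osd perf output on stdin and prints OSDs whose commit or
# apply latency exceeds the threshold. Header lines are skipped because
# their first field is not a number. Hypothetical helper, not a Ceph tool.
slow_osds() {
    local threshold_ms=$1
    awk -v t="$threshold_ms" \
        '$1 ~ /^[0-9]+$/ && ($2 + 0 > t + 0 || $3 + 0 > t + 0) { print "osd." $1 }'
}

# On a live cluster: ceph osd perf | slow_osds 100
# Illustrative input (made-up numbers):
slow_osds 100 <<'EOF'
osd commit_latency(ms) apply_latency(ms)
  0                  4                 5
  1                250               310
  2                  7                 6
EOF
```

With the sample input only osd.1 is reported. A single high reading can be transient, so it is worth checking repeatedly before taking an OSD out.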
