
How to Supercharge Ceph on Huawei Kunpeng ARM: Deep Performance Tuning Guide

This article examines Ceph’s architecture, identifies performance bottlenecks on Huawei’s Kunpeng ARM platform, and presents practical tuning methods—including NUMA placement, cache tagging, vector acceleration, thread scaling, and monitoring tools—to improve storage efficiency, reduce latency, and lower power consumption.

Architects' Tech Alliance

Background and Motivation

With the rapid growth of IoT, big data, and mobile connectivity, the amount of generated data has surged, driving the storage market toward distributed solutions. Ceph, as a representative distributed storage system, is expected to account for 70% of the market by 2027. Shanyan Data, a software‑defined storage vendor, has partnered with Huawei to adapt its object storage (MOS) and block storage (USP) products for the Kunpeng ARM platform.

Ceph Services Overview

Ceph offers three user‑facing services:

Block storage (RBD) : behaves like an unformatted USB drive that must be formatted before use, providing a block device interface.

Object storage (RGW) : stores massive, irregular files such as cloud‑drive data, addressing each one as a uniquely identified object.

File system (CephFS) : delivers a ready‑to‑use file system that can be mounted directly, similar to a pre‑installed Windows PC.

Key Ceph Components

MON : cluster brain, maintains status and metadata.

MDS : metadata service for CephFS.

OSD : object storage daemons that store user data on the underlying devices; OSD performance largely determines overall system performance. An OSD uses either FileStore (on top of a local filesystem such as XFS) or BlueStore (direct device access).

Current Architectural Issues

Two major problems hinder performance on the existing Ceph architecture:

Separation between Ceph data and kernel cache: BlueStore stores OSD metadata in RocksDB on a simplified filesystem (BlueFS), which the kernel cache cannot differentiate from user data.

Kernel cache cannot distinguish hot (primary replica) from cold (secondary replica) data, leading to wasted cache space and reduced hit rates.

Proposed Solution: Cache Tagging Layer

A tagging layer is added both at the BlueStore I/O submission point and at the kernel cache entry point. I/O is marked with flags indicating its type (metadata, hot data, cold replica, etc.). The kernel cache then applies different write‑back and eviction policies based on these tags.

Example: mark secondary replicas with a NOCACHE tag so the kernel cache skips caching them, while metadata receives a longer residency tag.
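The tagging idea can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the actual BlueStore or kernel code: the IoTag names, tag_io, cache_policy, and the numeric priorities are all hypothetical.

```python
from enum import Flag, auto


class IoTag(Flag):
    """Hypothetical tags attached to an I/O at the submission point."""
    METADATA = auto()      # RocksDB / BlueFS traffic
    HOT_DATA = auto()      # primary-replica user data
    COLD_REPLICA = auto()  # secondary-replica user data
    NOCACHE = auto()       # the cache layer should bypass this I/O


def tag_io(is_metadata: bool, is_primary: bool) -> IoTag:
    """Choose tags where the storage engine submits the I/O."""
    if is_metadata:
        return IoTag.METADATA
    if is_primary:
        return IoTag.HOT_DATA
    # Secondary replicas are rarely read back, so ask the cache to skip them.
    return IoTag.COLD_REPLICA | IoTag.NOCACHE


def cache_policy(tag: IoTag) -> dict:
    """Map tags to a decision at the cache entry point.

    A higher priority stands for longer residency (assumed scale: 0 to 2).
    """
    if IoTag.NOCACHE in tag:
        return {"cache": False, "priority": 0}
    if IoTag.METADATA in tag:
        return {"cache": True, "priority": 2}
    return {"cache": True, "priority": 1}


if __name__ == "__main__":
    for meta, primary in [(True, False), (False, True), (False, False)]:
        tag = tag_io(meta, primary)
        print(tag, cache_policy(tag))
```

Keeping the tag as a bit flag lets one I/O carry both a type (cold replica) and a handling hint (NOCACHE), which is the combination the example above describes.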

ARM‑Specific Challenges on Kunpeng

The Kunpeng processor differs from Intel Xeon in six areas:

Cross‑chip (NUMA) memory access is less efficient.

Vector computation performance is weaker.

A higher physical core count provides more parallelism.

Rich accelerator support (EC, RSA, zlib).

Load‑store micro‑architecture requires CPU involvement for memory‑to‑memory copies.

Better per‑core power efficiency.

Kunpeng‑Focused Ceph Optimizations

NUMA‑aware process placement : Set the osd_numa_node option (or use numactl / taskset) to bind each OSD process to a specific NUMA node, and keep that OSD's network card, SSDs, and memory on the same node.

Accelerator‑assisted vector operations : Leverage Kunpeng 920’s EC/RSA/zlib accelerators for erasure‑coding and other vector‑heavy workloads, following Huawei’s accelerator API documentation.

Increase threads and cores : Exploit the higher core count by adding more OSD threads and splitting busy threads across cores.

Memory‑operation patches : Apply Huawei‑provided kernel patches that optimize load‑store paths.

IRQ and cgroup tuning : Disable irqbalance, bind NIC/SSD interrupts to specific CPUs, and isolate heavy workloads with cgroups to improve cache locality.

Third‑party memory allocators : Use tcmalloc or jemalloc to reduce allocation overhead.
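As a rough sketch of the NUMA placement step, the Python helper below plans the binding offline. It assumes you have already looked up which NUMA node each OSD's SSD and NIC sit on (Linux typically exposes this in sysfs). The function names and the sample topology are hypothetical; only osd_numa_node is a real Ceph option.

```python
def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8,10-11' into CPU ids."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def split_cpus(node_cpus: dict[int, list[int]],
               osd_to_node: dict[int, int]) -> dict[int, list[int]]:
    """Give each OSD an equal slice of the CPUs on its own NUMA node."""
    by_node: dict[int, list[int]] = {}
    for osd_id, node in sorted(osd_to_node.items()):
        by_node.setdefault(node, []).append(osd_id)
    plan: dict[int, list[int]] = {}
    for node, osds in by_node.items():
        cpus = node_cpus[node]
        share = len(cpus) // len(osds)
        for i, osd_id in enumerate(osds):
            plan[osd_id] = cpus[i * share:(i + 1) * share]
    return plan


def render_osd_numa_conf(osd_to_node: dict[int, int]) -> str:
    """Render per-OSD ceph.conf sections that set osd_numa_node."""
    lines: list[str] = []
    for osd_id in sorted(osd_to_node):
        lines.append(f"[osd.{osd_id}]")
        lines.append(f"osd_numa_node = {osd_to_node[osd_id]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative topology: two nodes with four CPUs each, three OSDs.
    node_cpus = {0: parse_cpulist("0-3"), 1: parse_cpulist("4-7")}
    osd_to_node = {0: 0, 1: 0, 2: 1}
    print(render_osd_numa_conf(osd_to_node))
    print(split_cpus(node_cpus, osd_to_node))
```

The per-OSD CPU lists from split_cpus could feed taskset -c or an IRQ affinity setting if finer pinning than a whole node is wanted. On a real cluster the topology should come from the machine itself rather than from hard-coded values.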

Performance Observation Tools

Ceph built‑in counters (ceph osd perf, ceph daemon osd.<id> perf dump) : Built‑in tools that record I/O latency and internal processing times for each OSD.

Linux perf suite : perf top, perf stat, and perf record (combined with FlameGraph) to locate hot functions.

SystemTap : Kernel‑level tracing, especially useful on Red Hat systems, to analyze function‑level bottlenecks.
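To show how the built‑in Ceph counters can be read, here is a minimal Python sketch. It assumes the JSON layout that an OSD perf dump usually has: latency counters appear as objects with avgcount and sum fields, with sum in seconds. The sample values are made up for illustration and the function name is hypothetical.

```python
import json

# Illustrative excerpt in the shape of an OSD perf dump (values are invented).
SAMPLE = """
{"osd": {"op": 300,
         "op_r_latency": {"avgcount": 200, "sum": 0.5},
         "op_w_latency": {"avgcount": 100, "sum": 1.2}}}
"""


def mean_latencies_ms(perf_dump: dict) -> dict[str, float]:
    """Return the mean latency in milliseconds for every avgcount/sum counter."""
    result: dict[str, float] = {}
    for section, counters in perf_dump.items():
        if not isinstance(counters, dict):
            continue
        for name, value in counters.items():
            # Plain integer counters (such as "op") are skipped.
            if isinstance(value, dict) and "avgcount" in value and "sum" in value:
                count = value["avgcount"]
                if count:
                    result[f"{section}.{name}"] = 1000.0 * value["sum"] / count
    return result


if __name__ == "__main__":
    for key, ms in sorted(mean_latencies_ms(json.loads(SAMPLE)).items()):
        print(f"{key}: {ms:.2f} ms")
```

Because these counters are cumulative, taking two dumps some seconds apart and differencing sum and avgcount gives the latency for that interval rather than the average since the daemon started.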

Results After Optimization

Compatibility testing showed a smooth migration of Ceph to the Kunpeng platform without major blockers. Benchmarks demonstrated noticeable latency reduction and throughput gains, although the gains were bounded mainly by the underlying HDDs rather than by the CPU. Power measurements confirmed the expected lower per‑core consumption of ARM, reducing operational costs.

Future Plans

Deploy all‑flash storage on TaiShan servers, leveraging RDMA support to eliminate Ethernet bottlenecks.

Adopt the Seastar framework to mitigate cross‑NUMA penalties on ARM.

Integrate hardware accelerators for encryption, compression, and erasure coding in secure storage products.

These efforts aim to further close the performance gap between ARM and x86 in distributed storage environments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
