
Deep Dive into Ceph BlueStore Architecture and ObjectStore Evolution

This article provides a comprehensive technical overview of Ceph's storage back‑ends, detailing the evolution from FileStore to NewStore and the latest BlueStore implementation, its architecture, raw‑device management, and performance advantages for distributed storage systems.

Architects' Tech Alliance

Ceph is a distributed, strongly consistent, software-defined storage solution. Its stability, reliability, and manageability have improved steadily as more enterprises adopt it, and features such as CephFS, iSCSI, and InfiniBand support have been added along the way. Because the backend storage architecture significantly influences Ceph's performance, this article analyzes the architecture of BlueStore, the latest backend, together with the historical evolution of ObjectStore. SUSE was the first vendor to support BlueStore, in its Enterprise Storage 5 release.

BlueStore is the newest implementation of Ceph's ObjectStore backend; in Ceph, data is first hashed to OSD nodes and then persisted to disk by the ObjectStore, so we first examine the ObjectStore architecture.

ObjectStore Architecture Overview

Ceph supports multiple storage engines as plug-ins; the current options include FileStore (the default), Key-Value Store, MemStore, NewStore, and the newest, BlueStore.

From an architectural perspective, ObjectStore encapsulates all I/O operations of the underlying storage engine and provides object and transaction semantics to the upper layers. MemStore implements the interface in memory, while the Key‑Value Store relies on KV databases such as LevelDB or RocksDB.

Historically, Filestore has been Ceph's default ObjectStore backend. It uses a journal mechanism to provide transaction support; besides consistency and atomicity, the journal merges many small I/O writes into sequential writes to improve performance.

The ObjectStore interface consists of three parts: (1) object read/write operations similar to POSIX, (2) object attribute (xattr) read/write operations, which are KV operations associated with a specific object, and (3) KV operations linked to an object (called omap in Ceph).
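The three interface groups can be illustrated with a minimal in-memory store, similar in spirit to MemStore. This is a hedged Python sketch, not Ceph's actual C++ API; all class and method names are illustrative.

```python
# Hypothetical sketch of the three ObjectStore interface groups,
# modeled loosely on Ceph's MemStore; names are illustrative only.
class MiniObjectStore:
    def __init__(self):
        self.data = {}    # object name -> bytes
        self.xattr = {}   # (object, attr key) -> value
        self.omap = {}    # (object, omap key) -> value

    # (1) object read/write, similar to POSIX
    def write(self, obj, offset, buf):
        cur = bytearray(self.data.get(obj, b""))
        cur[offset:offset + len(buf)] = buf
        self.data[obj] = bytes(cur)

    def read(self, obj, offset, length):
        return self.data.get(obj, b"")[offset:offset + length]

    # (2) object attribute (xattr) operations
    def setattr(self, obj, key, value):
        self.xattr[(obj, key)] = value

    def getattr(self, obj, key):
        return self.xattr[(obj, key)]

    # (3) omap: arbitrary KV pairs attached to an object
    def omap_set(self, obj, key, value):
        self.omap[(obj, key)] = value

    def omap_get(self, obj, key):
        return self.omap[(obj, key)]

store = MiniObjectStore()
store.write("rbd_data.1", 0, b"hello")
store.setattr("rbd_data.1", "_", b"object_info")
store.omap_set("rbd_data.1", "snap_seq", b"42")
```

Note how the xattr and omap groups are both per-object KV operations; in practice xattrs hold small, frequently accessed metadata, while omap holds arbitrarily large key/value sets.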

FileStore Backend

FileStore implements the ObjectStore API using the POSIX file‑system interface. Each object appears as a file; object attributes are stored as file xattrs. Because some file systems (e.g., ext4) limit xattr length, metadata exceeding the limit is stored in DBObjectMap, and the object‑KV relationship also uses DBObjectMap.

FileStore has drawbacks: the journal causes each write request to be split into two writes (synchronous journal write and asynchronous object write); SSDs can mitigate this but not eliminate it. Moreover, each object maps to a physical file on the OSD's local file system, which can overwhelm metadata caches when many small objects are stored, leading to multiple local I/O operations and degraded performance.
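The double-write cost described above can be made concrete with a toy sketch (not FileStore code): every submitted transaction is written once to a sequential journal and once more to the object itself, so device traffic is roughly twice the user payload.

```python
# Illustrative sketch of journal-induced double writes; names are
# hypothetical and do not correspond to real FileStore internals.
class JournalingStore:
    def __init__(self):
        self.journal = []          # sequential journal (synchronous writes)
        self.objects = {}          # object name -> bytes (async apply)
        self.bytes_written = 0     # total device traffic, both copies

    def submit(self, obj, data):
        # 1) synchronous, sequential journal append: durable once this returns
        self.journal.append((obj, data))
        self.bytes_written += len(data)
        # 2) asynchronous apply to the object file (done inline here)
        self.objects[obj] = data
        self.bytes_written += len(data)

store = JournalingStore()
store.submit("obj1", b"x" * 4096)
# 4 KiB of user data cost 8 KiB of device writes: ~2x write amplification
```

Placing the journal on an SSD hides the latency of step 1 but, as the article notes, the second copy of every byte still has to be written.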

NewStore Backend

To address FileStore's limitations, Ceph introduced NewStore (also called KeyFile Store). Its key structure is shown in the diagram below.

NewStore decouples objects from a one‑to‑one relationship with physical files by using an index structure (ONode) to map objects to physical files and stores the index in a KV database. It retains transaction semantics without requiring a journal, builds an ONode cache on top of the KV database for faster reads, and allows a single object to span multiple fragment files, enabling multiple objects to share a fragment file for greater flexibility.
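The ONode indirection can be sketched as follows, under the assumption (illustrative, not NewStore's real on-disk format) that an ONode is a KV entry listing extents inside shared fragment files:

```python
import json

# Hypothetical sketch of NewStore's idea: an ONode index in a KV store maps
# each object to extents inside shared fragment files. Names are illustrative.
kv = {}                                  # stands in for RocksDB
fragments = {"frag_0001": b"AAAABBBB"}   # one fragment file, shared

def put_onode(obj, extents):
    # extents: list of (fragment_file, offset, length)
    kv["onode." + obj] = json.dumps(extents)

def read_object(obj):
    extents = json.loads(kv["onode." + obj])
    return b"".join(fragments[f][off:off + ln] for f, off, ln in extents)

# Two objects index into the same fragment file at different offsets,
# so objects are no longer tied one-to-one to physical files.
put_onode("object_a", [("frag_0001", 0, 4)])
put_onode("object_b", [("frag_0001", 4, 4)])
```

Because the index lives in the KV database, an object can also span several fragment files simply by listing extents from more than one file.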

BlueStore Backend

NewStore moves the log into RocksDB, but the actual data objects still reside in a file system. BlueStore goes further and eliminates the file-system interface entirely: data objects are written directly to the raw block device, which reduces write amplification, and because BlueStore manages the raw disk itself it can apply SSD-specific optimizations.

BlueStore's goal is to improve performance by bypassing file‑system overhead (e.g., ext4, XFS) and managing raw devices via a block‑device layer. The overall BlueStore architecture is illustrated below.

BlueStore directly manages raw devices, discarding the local file system. It includes an Allocator for space management, currently supporting Stupid Allocator and Bitmap Allocator.
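As a hedged illustration of what an Allocator does, here is a minimal first-fit bitmap allocator in the spirit of BlueStore's Bitmap Allocator (the real one is far more sophisticated; this only shows the one-bit-per-block bookkeeping):

```python
# Minimal sketch of a bitmap allocator: one bit per fixed-size block
# on the raw device. Illustrative only, not BlueStore's implementation.
class BitmapAllocator:
    def __init__(self, num_blocks):
        self.free = [True] * num_blocks   # True = block is free

    def allocate(self, count):
        # First-fit scan for `count` contiguous free blocks.
        run = 0
        for i, is_free in enumerate(self.free):
            run = run + 1 if is_free else 0
            if run == count:
                start = i - count + 1
                for j in range(start, start + count):
                    self.free[j] = False
                return start
        raise MemoryError("no contiguous run of %d blocks" % count)

    def release(self, start, count):
        for j in range(start, start + count):
            self.free[j] = True

alloc = BitmapAllocator(num_blocks=8)
a = alloc.allocate(3)   # takes blocks 0..2
b = alloc.allocate(2)   # takes blocks 3..4
alloc.release(a, 3)     # blocks 0..2 become free again
c = alloc.allocate(3)   # first-fit reuses blocks 0..2
```

The Stupid Allocator takes a different approach, tracking free extents in interval sets rather than a per-block bitmap.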

Metadata is stored as KV entries in RocksDB (the default). Although RocksDB itself relies on a file system, it abstracts system interactions via an Env interface; BlueStore implements a BlueRocksEnv to provide this abstraction and introduces a small file system called BlueFS that works with BlueRocksEnv. When the system starts, BlueFS mounts and loads all metadata into memory, while both BlueFS and BlueStore persist data and log files to the block device via BlockDevice.
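The Env indirection is the key trick here: RocksDB only ever talks to an abstract environment, so BlueStore can substitute BlueRocksEnv backed by BlueFS in place of the POSIX environment. A rough Python analogy (all class names below are illustrative stand-ins, not real RocksDB or Ceph APIs):

```python
# Hedged sketch of the Env indirection: a KV engine talks to an abstract
# Env, so the storage backend can swap in a BlueFS-backed environment.
class Env:
    def new_writable_file(self, name): raise NotImplementedError
    def read_file(self, name): raise NotImplementedError

class MiniBlueFS:
    """Tiny in-memory file system standing in for BlueFS."""
    def __init__(self):
        self.files = {}

class BlueRocksEnvSketch(Env):
    def __init__(self, bluefs):
        self.bluefs = bluefs

    def new_writable_file(self, name):
        self.bluefs.files[name] = bytearray()
        return self.bluefs.files[name]

    def read_file(self, name):
        return bytes(self.bluefs.files[name])

def kv_engine_flush(env, name, payload):
    # The KV engine only sees the Env interface, never the device beneath.
    f = env.new_writable_file(name)
    f.extend(payload)

bluefs = MiniBlueFS()
env = BlueRocksEnvSketch(bluefs)
kv_engine_flush(env, "db/000001.sst", b"sorted-table-bytes")
```

In the real system, BlueFS provides just enough file-system semantics (sequential writes, a flat directory of files) for RocksDB's needs, which is why it can stay so small.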

When BlueFS and BlueStore share a device, the device is typically partitioned into two sections: a small partition for BlueFS, which offers file‑system‑like services required by RocksDB, and a larger partition managed directly by BlueStore for actual data storage.


In FileStore, objects are stored as files (default 4 MB size). In contrast, BlueStore manages raw disks directly; object Onodes reside in memory and are persisted as KV entries in RocksDB.

In summary, starting with SUSE Enterprise Storage 5, BlueStore becomes Ceph's new storage backend, delivering better performance than FileStore along with built‑in data verification and compression capabilities.

FileStore writes data to POSIX-compatible file systems such as Btrfs, XFS, or ext4. Using traditional Linux file systems offers some benefits, but it also incurs costs: reduced performance and the difficulty of mapping object attributes onto local file-system attributes.

NewStore decouples objects from physical files using KV databases and indexing techniques, optimizing log operations.

BlueStore allows data objects to be stored directly on raw block devices without any file‑system interface, dramatically improving Ceph storage system performance.

Recommended Reading:

Six Reasons Software‑Defined Storage Increases Enterprise Purchasing Power

Extending HPC Scope with SES Hierarchical Capabilities

How Distributed Storage Architecture Meets Growing Storage Demands


Stay hungry, stay foolish.

Tags: Backend, Distributed Storage, BlueStore, Ceph, RocksDB, ObjectStore
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
