Analysis of Ceph ObjectStore Backend Storage Engines: FileStore, NewStore, and BlueStore
The article provides a detailed technical overview of Ceph's ObjectStore backend storage engines—FileStore, NewStore, and BlueStore—explaining their architectures, transaction mechanisms, performance characteristics, and how the design evolved from POSIX‑based file handling to direct block‑device storage.
Ceph is a highly available, strongly consistent software‑defined storage solution widely used across industries; this article focuses on the implementation methods and evolution of Ceph's backend storage engine, ObjectStore. Ceph supports multiple storage engines in a plug‑in fashion, currently including FileStore, KeyValueStore, MemStore, NewStore, and the latest BlueStore, with FileStore being the default.
The ObjectStore layer encapsulates all I/O operations of the underlying storage engine and provides object and transaction semantics to the upper layers. MemStore is an in‑memory implementation; the Key‑Value Store mainly relies on KV databases such as LevelDB or RocksDB to implement the required interfaces.
FileStore is Ceph's current default storage engine and the most widely used one; its transaction implementation is based on a journal mechanism. Besides supporting transactional properties (consistency, atomicity, etc.), the journal can merge multiple small I/Os into sequential writes to improve performance.
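The journaling idea above can be sketched as follows. This is a minimal illustrative model, not Ceph's implementation: the class and method names are assumptions, and the "durable" journal and "asynchronous" apply step are simulated with in-memory lists.

```python
import json

class JournalSketch:
    """Toy model of a FileStore-style write-ahead journal (names are
    hypothetical). Incoming writes are first appended sequentially to the
    journal; only then are they applied to the object store. Batching
    several small writes into one journal record turns scattered random
    I/O into a single sequential write, which is the performance win the
    article describes."""

    def __init__(self):
        self.log = []       # sequential journal records (written synchronously)
        self.objects = {}   # backing object files (written asynchronously)

    def submit_transaction(self, ops):
        # Step 1: one sequential append covers the whole batch of ops and
        # serves as the atomic commit point for the transaction.
        self.log.append(json.dumps(ops))
        # Step 2: apply each op to the backing store (async in a real OSD).
        for obj, data in ops:
            self.objects[obj] = data

j = JournalSketch()
# Three small writes merged into one sequential journal append.
j.submit_transaction([("obj1", "a"), ("obj2", "b"), ("obj3", "c")])
```

If the OSD crashes after step 1 but before step 2 completes, replaying the journal records restores the batch, which is how the journal provides atomicity.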
The ObjectStore API is divided into three main parts: (1) object read/write operations, similar to POSIX interfaces; (2) object attribute (xattr) read/write operations, which are KV pairs associated with a specific object; (3) object‑related KV operations (called omap in Ceph).
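The three API groups can be sketched as a single interface. This is an illustrative stand-in, not Ceph's actual C++ ObjectStore class; the method names and in-memory dictionaries are assumptions chosen to mirror the three groups above.

```python
class ObjectStoreSketch:
    """Hypothetical sketch of the three ObjectStore API groups:
    (1) POSIX-like object data read/write, (2) object attributes (xattr),
    (3) object-related KV pairs (omap)."""

    def __init__(self):
        self.data = {}    # object name -> bytes
        self.xattr = {}   # (object, attr key) -> value
        self.omap = {}    # (object, omap key) -> value

    # (1) object read/write, similar to POSIX pread/pwrite
    def write(self, obj, offset, buf):
        cur = bytearray(self.data.get(obj, b""))
        cur.extend(b"\x00" * max(0, offset + len(buf) - len(cur)))
        cur[offset:offset + len(buf)] = buf
        self.data[obj] = bytes(cur)

    def read(self, obj, offset, length):
        return self.data.get(obj, b"")[offset:offset + length]

    # (2) object attributes: KV pairs tied to one object
    def setattr(self, obj, key, value):
        self.xattr[(obj, key)] = value

    def getattr(self, obj, key):
        return self.xattr.get((obj, key))

    # (3) omap: object-related KV operations
    def omap_set(self, obj, key, value):
        self.omap[(obj, key)] = value

    def omap_get(self, obj, key):
        return self.omap.get((obj, key))
```

Each backend engine described below (FileStore, NewStore, BlueStore) is a different strategy for persisting these same three kinds of state.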
ObjectStore Backend Storage Engine – FileStore
FileStore implements the ObjectStore API using POSIX file‑system interfaces. Each object appears as a file, and object attributes are stored via the file's xattr. Because some file systems (e.g., ext4) limit xattr length, metadata exceeding the limit is stored in a DBObjectMap, while object‑KV relationships also use DBObjectMap.
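The spillover behavior can be modeled with a size threshold. A minimal sketch, assuming a hypothetical per-attribute size cap and in-memory stand-ins for the file system's xattr area and the DBObjectMap:

```python
XATTR_LIMIT = 255  # illustrative per-attribute cap, like ext4's xattr limits

class FileStoreAttrsSketch:
    """Toy model of FileStore attribute storage: attributes that fit go
    into the file's xattr; oversized ones spill over into a
    DBObjectMap-style KV store (structure here is an assumption)."""

    def __init__(self):
        self.file_xattr = {}      # stand-in for the file system's xattrs
        self.db_object_map = {}   # stand-in for DBObjectMap (KV database)

    def setattr(self, obj, key, value):
        if len(value) <= XATTR_LIMIT:
            self.file_xattr[(obj, key)] = value
        else:
            self.db_object_map[(obj, key)] = value

    def getattr(self, obj, key):
        if (obj, key) in self.file_xattr:
            return self.file_xattr[(obj, key)]
        return self.db_object_map.get((obj, key))
```

Readers see a single attribute namespace; only the storage location differs depending on size, which matches the article's description of the xattr length workaround.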
FileStore has several drawbacks. The journal mechanism turns each write request into two writes on the OSD: a synchronous journal write followed by an asynchronous object write (placing the journal on an SSD mitigates this by decoupling journal writes from object writes). In addition, each object corresponds to a physical file on the OSD, so small‑object workloads put pressure on the metadata cache and trigger multiple local I/Os per request, degrading performance.
ObjectStore Backend Storage Engine – NewStore
To address the above issues, Ceph introduced a new storage engine called NewStore (also known as Key‑File Store). Its key structure is illustrated in the diagram below.
NewStore decouples objects from a one‑to‑one mapping with local physical files by using an index structure (ONode) to map objects to physical files and storing index data in a KV database. It retains transactional guarantees without requiring a journal, builds an ONode cache on top of the KV database for faster reads, and allows a single object to span multiple fragment files, enabling multiple objects to share a fragment file for greater flexibility.
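The ONode indirection can be sketched as a mapping from an object to a list of extents in fragment files, with the mapping persisted in a KV store and cached in memory. All names and the extent layout here are assumptions for illustration, not NewStore's on-disk format:

```python
class ONodeIndexSketch:
    """Toy model of NewStore-style indirection: an ONode maps an object to
    (fragment_file, offset, length) extents. The index lives in a KV
    database (RocksDB in NewStore), with an in-memory ONode cache on the
    read path."""

    def __init__(self):
        self.kv = {}      # stand-in for the persistent KV database
        self.cache = {}   # in-memory ONode cache for faster reads

    def put(self, obj, extents):
        self.kv["onode/" + obj] = extents
        self.cache[obj] = extents

    def get(self, obj):
        if obj in self.cache:                      # fast path: cache hit
            return self.cache[obj]
        extents = self.kv.get("onode/" + obj)      # slow path: KV lookup
        if extents is not None:
            self.cache[obj] = extents
        return extents

idx = ONodeIndexSketch()
# One object spanning two fragment files...
idx.put("obj-A", [("frag0", 0, 4096), ("frag1", 0, 4096)])
# ...and a second object sharing fragment file frag1.
idx.put("obj-B", [("frag1", 4096, 4096)])
```

Because the object-to-file mapping is just KV data, both many-to-one and one-to-many layouts fall out naturally, which is the flexibility the article attributes to NewStore.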
ObjectStore Backend Storage Engine – BlueStore
NewStore uses RocksDB to store its metadata and log, while the actual data objects still live in files on a local file system. BlueStore advances this design by storing data objects directly on raw block devices, with no file‑system interface in between.
BlueStore aims to reduce write amplification and is optimized for SSDs; it manages raw disks directly, bypassing file‑system overhead (e.g., ext4, XFS). It is a brand‑new OSD backend that leverages block‑device hints for performance. The overall BlueStore architecture is shown below.
BlueStore directly manages raw devices, discarding the local file system. BlockDevice operates in user space to perform I/O on the raw device. Because it manages raw devices, it requires a space‑allocation component, currently supporting Stupid Allocator and Bitmap Allocator.
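To make the allocator's role concrete, here is a toy first-fit bitmap allocator in the spirit of BlueStore's Bitmap Allocator: one bit per fixed-size block on the raw device. The scan strategy and interface are assumptions; BlueStore's real allocators are considerably more sophisticated.

```python
class BitmapAllocatorSketch:
    """Toy bitmap allocator: tracks one flag per fixed-size block and
    hands out contiguous runs with a first-fit scan."""

    def __init__(self, num_blocks):
        self.bits = [False] * num_blocks   # False = free, True = allocated

    def allocate(self, count):
        """Return the start index of `count` contiguous free blocks,
        or None if no such run exists."""
        run = 0
        for i, used in enumerate(self.bits):
            run = 0 if used else run + 1
            if run == count:
                start = i - count + 1
                for j in range(start, i + 1):
                    self.bits[j] = True
                return start
        return None

    def release(self, start, count):
        for j in range(start, start + count):
            self.bits[j] = False

alloc = BitmapAllocatorSketch(8)
a = alloc.allocate(3)   # first run of 3 free blocks
b = alloc.allocate(3)   # next run of 3
alloc.release(a, 3)     # freeing creates fragmentation:
c = alloc.allocate(4)   # no contiguous run of 4 remains -> None
```

The example also shows why allocator design matters: once the file system is gone, fragmentation and free-space tracking become BlueStore's own problem.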
Metadata is stored in KV form using RocksDB (the default KV database). Although RocksDB is built on a file system, it abstracts system‑specific handling via an Env interface; Ceph implements a custom BlueRocksEnv to bridge RocksDB to the underlying system. BlueRocksEnv uses a small file system called BlueFS, which is mounted at startup, loading all metadata into memory. Data and log files managed by BlueFS are persisted to the raw device through BlockDevice, and BlueFS can be shared with or separated from BlueStore.
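The Env indirection is the key trick: RocksDB calls abstract file-system hooks rather than the OS directly, so a backend like BlueRocksEnv can redirect those calls to BlueFS on a raw device. A minimal sketch of that pattern, with an in-memory stand-in playing the role of BlueFS (the class names and two-method interface are assumptions, far smaller than RocksDB's real `Env`):

```python
class EnvSketch:
    """Abstract file-system interface in the style of RocksDB's Env:
    the database only ever calls these hooks, never the OS directly."""

    def new_writable_file(self, name):
        raise NotImplementedError

    def read_file(self, name):
        raise NotImplementedError

class InMemoryBlueFSLikeEnv(EnvSketch):
    """Stand-in backend: where BlueRocksEnv would route file operations
    to BlueFS (and ultimately the BlockDevice), this sketch keeps file
    contents in a dictionary."""

    def __init__(self):
        self.files = {}

    def new_writable_file(self, name):
        self.files[name] = bytearray()
        return self.files[name]   # caller appends bytes to this buffer

    def read_file(self, name):
        return bytes(self.files[name])

env = InMemoryBlueFSLikeEnv()
log = env.new_writable_file("db/LOG")
log.extend(b"rocksdb record")
```

Because the database is coded against the interface alone, swapping the backend from a POSIX file system to BlueFS requires no changes to RocksDB itself, which is exactly why BlueStore can reuse RocksDB without a local file system.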
In the earlier Filestore engine, objects were represented as files (default size 4 MB). In the latest BlueStore implementation, there is no traditional file system; instead, objects are managed directly on raw disks, with the Onode structure kept in memory and persisted as KV entries in RocksDB.
In summary, FileStore is the current default engine that relies on POSIX interfaces, mapping objects to files and using xattr for attributes, which imposes length limits. NewStore decouples objects from physical files using KV databases and indexing techniques, eliminating the journal. BlueStore stores objects directly on block devices, achieving higher performance by removing the file‑system layer; it can be viewed as BlueStore = BlockDevice + NewStore.
Related reading:
Analysis of Ceph and 9000 Distributed Storage
Ceph Feature Updates that Make OpenStack Irresistible
Deep Analysis of Ceph Storage Architecture
Further Discussion on Ceph, the Open‑Source Storage Gem