
Design and Optimization of Large‑Scale Small File Storage Using Ceph and Merged Object Techniques

The article analyzes the challenges of storing massive numbers of small files, reviews existing solutions such as TFS and Haystack, and proposes a Ceph‑based approach that merges small files into RADOS objects with metadata stored as extended attributes, detailing the read/write/delete workflow and hash‑based placement strategies.

Architect

Nov 5, 2015


Storing lots of small files (LOSF) is a long‑standing problem in the industry. Many companies have built custom systems (e.g., Taobao's TFS, Facebook's Haystack) or adapted open‑source projects (HBase, FastDFS, MFS) to meet their specific needs.

The article identifies two core challenges for LOSF storage: (1) organizing and managing metadata for billions of files, where even a 100 B metadata record per file would require terabytes of memory; and (2) reducing the number of disk I/Os required to read a file from a traditional Linux file system, which typically needs three I/Os (directory lookup, inode load, data read).
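As a rough worked example of the first challenge (the figure of 10 billion files is an assumption chosen for illustration; the article only says "billions"), with N files and s bytes of metadata per file:

```latex
N \times s = 10^{10}\ \text{files} \times 100\ \text{B/file} = 10^{12}\ \text{B} = 1\ \text{TB}
```

At that scale the metadata alone no longer fits in the memory of a single commodity server, which is why it has to be eliminated, embedded elsewhere, or distributed.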

Existing systems address the first challenge by embedding metadata in file names (TFS, FastDFS) or using distributed metadata architectures with high‑performance SSD servers (e.g., DragonStore). The second challenge is often tackled by merging many small files into larger containers and maintaining an index that maps each small file to its offset and size within the container, thereby allowing a single disk I/O for reads.

Ceph, widely adopted for distributed storage, solves the metadata problem with its CRUSH algorithm and a decentralized architecture, but its default storage engines (Filestore, KeyValueStore) do not handle the I/O pattern of small files efficiently. To overcome this, the authors propose a custom design built on Ceph's Filestore engine: multiple small files are concatenated into a single RADOS object, and for each file a key‑value entry (key: file name; value: offset and size) is stored in that object's extended attributes (or omap).

A write takes two steps: (1) append the file data to the target object; (2) store the file’s name, offset, and size as a KV entry in the object’s extended attributes. A read fetches the KV entry to obtain the offset and size, then reads the corresponding byte range from the object. A delete simply removes the KV entry; the space is reclaimed later. Because an object's data and its metadata reside on the same OSD, each operation is served within that one OSD process, with no extra network hop between the metadata lookup and the data access.
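The write/read/delete workflow above can be sketched with a small in-memory model. This is a minimal illustration, not the authors' implementation and not librados code: `MergedObject`, `Extent`, `WriteFile`, `ReadFile`, and `DeleteFile` are hypothetical names, the `std::string` stands in for the object's byte stream, and the `std::map` stands in for the extended-attribute (or omap) area.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Location of one small file inside the merged object.
struct Extent {
    std::uint64_t offset;
    std::uint64_t size;
};

// In-memory stand-in for a single RADOS object (hypothetical model):
// `data` is the object's byte stream, `xattrs` plays the role of the
// extended-attribute (or omap) key-value area.
struct MergedObject {
    std::string data;
    std::map<std::string, Extent> xattrs;
};

// Write: (1) append the bytes, (2) record name -> (offset, size).
void WriteFile(MergedObject& obj, const std::string& name,
               const std::string& bytes) {
    const std::uint64_t offset = obj.data.size();
    obj.data.append(bytes);
    obj.xattrs[name] = Extent{offset, static_cast<std::uint64_t>(bytes.size())};
}

// Read: look up the KV entry, then copy out only that byte range.
bool ReadFile(const MergedObject& obj, const std::string& name,
              std::string* out) {
    const auto it = obj.xattrs.find(name);
    if (it == obj.xattrs.end()) return false;  // unknown or deleted file
    const Extent& e = it->second;
    assert(e.offset + e.size <= obj.data.size());
    out->assign(obj.data, static_cast<std::size_t>(e.offset),
                static_cast<std::size_t>(e.size));
    return true;
}

// Delete: drop only the KV entry; the bytes stay in `data` until some
// later space-reclamation pass (not modeled here).
bool DeleteFile(MergedObject& obj, const std::string& name) {
    return obj.xattrs.erase(name) > 0;
}
```

Note that in this model a deleted file's bytes remain in the object, so the object size does not shrink on delete; that is the garbage a later reclamation step would have to compact.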

The design also discusses hash‑based placement of files into objects. The simplest method hashes the file key to select an object ID; more sophisticated schemes use directory paths as object IDs, allowing logical grouping of files and easier management via Ceph’s listxattr command. The article notes the need to pre‑plan the number of objects or to support dynamic scaling by marking old objects read‑only.
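The two placement schemes can be sketched as follows. This is an illustration under stated assumptions: the article does not name a hash function, so FNV-1a is used here only because it is simple and deterministic, and the `merged.` object-name prefix and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 64-bit FNV-1a hash (an assumed choice; any stable hash would do).
std::uint64_t Fnv1a64(const std::string& key) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : key) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Scheme 1: hash the file key into one of `num_objects` pre-planned objects.
std::uint64_t ObjectIndexForKey(const std::string& key,
                                std::uint64_t num_objects) {
    assert(num_objects > 0);
    return Fnv1a64(key) % num_objects;
}

// Object ID derived from the index; the "merged." prefix is hypothetical.
std::string ObjectIdForKey(const std::string& key, std::uint64_t num_objects) {
    return "merged." + std::to_string(ObjectIndexForKey(key, num_objects));
}

// Scheme 2: use the file's directory path as the object ID, so all files
// in one directory land in the same object. Paths without a directory
// component fall back to "/".
std::string ObjectIdForPath(const std::string& path) {
    const std::string::size_type slash = path.rfind('/');
    if (slash == std::string::npos || slash == 0) return "/";
    return path.substr(0, slash);
}
```

With the first scheme, changing `num_objects` remaps most keys, which is one reason the object count has to be planned up front or old objects frozen as read-only when the pool grows.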

For files larger than the small‑file threshold (e.g., >1 MB), the solution recommends stripe‑splitting the file into equal‑sized chunks (e.g., 2 MB) and storing each chunk as an independent small file using the same merged‑object approach, with chunk metadata stored at the beginning of each chunk.
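The chunking step can be sketched like this. It is a minimal illustration: the `Chunk` struct, `SplitIntoChunks`, and the `<file>.<index>` chunk-naming convention are assumptions, and the per-chunk metadata header mentioned above is not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One stripe of a large file, to be stored as an independent small file.
struct Chunk {
    std::string name;      // hypothetical naming: "<file>.<index>"
    std::uint64_t offset;  // offset of this chunk within the original file
    std::uint64_t length;  // chunk_size, except possibly for the last chunk
};

// Split a file of `file_size` bytes into chunks of at most `chunk_size` bytes.
std::vector<Chunk> SplitIntoChunks(const std::string& file_name,
                                   std::uint64_t file_size,
                                   std::uint64_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<Chunk> chunks;
    std::uint64_t index = 0;
    for (std::uint64_t off = 0; off < file_size; off += chunk_size, ++index) {
        const std::uint64_t len = std::min(chunk_size, file_size - off);
        chunks.push_back(Chunk{file_name + "." + std::to_string(index), off, len});
    }
    return chunks;
}
```

For example, a 5 MB file with a 2 MB chunk size yields three chunks, the last of which holds the remaining 1 MB; each chunk would then go through the same merged-object write path as any other small file.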

An example C++ API built on the librados interface is provided to expose the functionality to applications:

int WriteFullObj(const std::string& oid, bufferlist& bl, int create_time = GetCurrentTime());
int Write(const std::string& oid, bufferlist& bl, uint64_t off, int create_time = GetCurrentTime());
int WriteFinish(const std::string& oid, uint64_t total_size, int create_time = GetCurrentTime());
int Read(const std::string& oid, bufferlist& bl, size_t len, uint64_t off);
int ReadFullObj(const std::string& oid, bufferlist& bl, int* create_time = NULL);
int Stat(const std::string& oid, uint64_t* psize, time_t* pmtime, MetaInfo* meta = NULL);
int Remove(const std::string& oid);
int BatchWriteFullObj(const String2BufferlistHMap& oid2data, int create_time = GetCurrentTime());

These functions allow direct writing of small files, batch writes for higher throughput, reading whole objects or specific ranges, and metadata queries, fitting the proposed storage model.

