
Understanding Ceph RBD Snapshots, Clones, and Recovery Mechanisms

This article explains the implementation of Ceph RBD snapshots, the associated metadata structures, clone operations, and the detailed recovery processes for both replica and primary OSDs, illustrating how copy‑on‑write and snapshot chains affect data consistency and performance.


Recently I have been studying Ceph RBD snapshots and their fault-recovery logic. Although Ceph's snapshots are based on copy-on-write (COW), its snapshot metadata management and parent-child relationship handling are distinctive. This article details the implementation principles and the crucial role snapshot objects play during recovery.

1. Intuitive Understanding of RBD Snapshots

A Ceph RBD volume can take multiple snapshots. After a snapshot, each write triggers a COW operation: the original data is copied into a snapshot object before the new data is written. The snapshot operation itself is fast because it only updates volume metadata, adding the snapshot ID and parent information to the rbd_header object, whose entries are stored as omap data in LevelDB.
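To make the metadata-only nature of snapshot creation concrete, here is a minimal Python sketch of how a header might record a new snapshot. All class and field names are illustrative stand-ins, not Ceph's actual code; the point is that no object data is touched.

```python
# Hypothetical model of the omap entries of an rbd_header object.
# Creating a snapshot only bumps a sequence number and records a new
# snap_id -> name entry; no data object is copied at this point.

class RbdHeader:
    def __init__(self):
        self.snap_seq = 0          # highest snapshot id issued so far
        self.snaps = {}            # snap_id -> snapshot name

    def create_snapshot(self, name):
        self.snap_seq += 1         # metadata-only update: fast, O(1)
        self.snaps[self.snap_seq] = name
        return self.snap_seq

header = RbdHeader()
sid1 = header.create_snapshot("snap1")
sid2 = header.create_snapshot("snap2")
print(sid1, sid2, header.snaps)    # 1 2 {1: 'snap1', 2: 'snap2'}
```

Because snapshot IDs increase monotonically, comparing an object's last-COWed snapshot against the current snap_seq is enough to decide whether a write must trigger a COW.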

When the original volume is written after a snapshot, the entire data object (size may differ from the default 4 MB) is cloned to create a snap object, while the original object (head) receives the new data. New objects are allocated lazily on first write.
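The write path described above can be sketched as follows. The structures are hypothetical simplifications (real librbd/OSD code operates on 4 MB objects and interval sets), but they show the two behaviors just mentioned: lazy allocation on first write, and whole-object cloning before the head is overwritten.

```python
# Hedged sketch of copy-on-write for a single RBD data object.
# On the first write after a new snapshot, the old head is preserved
# as a snap object (clone) before the head receives the new data.

class DataObject:
    def __init__(self):
        self.head = None           # current data; allocated lazily
        self.clones = {}           # snap_id -> frozen snapshot copy
        self.last_snap_cowed = 0   # newest snapshot already COWed

    def write(self, data, snap_seq):
        # A snapshot was taken since the last COW: clone the head first.
        if self.head is not None and snap_seq > self.last_snap_cowed:
            self.clones[snap_seq] = self.head
        self.last_snap_cowed = snap_seq
        self.head = data           # head receives the new data

obj = DataObject()
obj.write(b"v1", snap_seq=0)       # first write: lazy allocation, no clone
obj.write(b"v2", snap_seq=1)       # after snapshot 1: v1 is cloned first
print(obj.head, obj.clones)        # b'v2' {1: b'v1'}
```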

The article then walks through several snapshot scenarios, showing how objects are cloned and how parent‑child relationships are recorded, with examples of multiple snapshots and subsequent writes.

2. Key Snapshot Data Structures

Important structures include the RBD-side snapshot metadata (seq, snaps) and the librados snapshot information (snap_seq, snap_id). On the OSD side, SnapSet stores seq, head_exists, snaps, clones, clone_overlap, and clone_size, which together describe snapshot ordering, whether the head object exists, and the clone relationships.

Images illustrate these structures and the handling of clone_overlap intervals, which track unwritten regions of the head object after each clone.
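A small model helps show how clone_overlap evolves. The field names below mirror the SnapSet fields listed above, but the interval arithmetic is a simplified stand-in for Ceph's interval_set: each write to the head removes the written range from the newest clone's overlap, since that range no longer matches the clone.

```python
# Illustrative model of an OSD-side SnapSet, focusing on clone_overlap.
# clone_overlap[snap_id] holds the byte ranges of the head that have NOT
# been rewritten since that clone was made, i.e. regions the clone still
# shares with the head.

def subtract(intervals, lo, hi):
    """Remove [lo, hi) from a list of [start, end) intervals."""
    out = []
    for s, e in intervals:
        if hi <= s or lo >= e:      # no overlap with this interval
            out.append((s, e))
            continue
        if s < lo:
            out.append((s, lo))     # keep the left remainder
        if hi < e:
            out.append((hi, e))     # keep the right remainder
    return out

MB = 1024 * 1024
snapset = {
    "seq": 1,
    "head_exists": True,
    "snaps": [1],
    "clones": [1],
    "clone_size": {1: 4 * MB},
    "clone_overlap": {1: [(0, 4 * MB)]},   # clone 1 starts fully shared
}

# A 1 MiB write at offset 1 MiB diverges that range from clone 1.
snapset["clone_overlap"][1] = subtract(snapset["clone_overlap"][1], MB, 2 * MB)
print(snapset["clone_overlap"][1])         # two shared ranges remain
```

These surviving overlap intervals are exactly what recovery later consults to decide which ranges can be cloned locally instead of transferred.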

3. Cloning

Cloning a volume from a snapshot creates a new device that inherits the parent-child chain. Reads traverse this chain until the needed object is found; writes clone the data first and then modify it. Long clone chains degrade read performance, so Ceph provides a flatten operation that copies the parent's data into the child and breaks the chain, albeit at a cost in time and space.

Two scenarios demonstrate how read requests resolve through the clone chain and how clone IDs are selected during object lookup.
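The read-resolution part of those scenarios can be sketched as a walk up the parent chain: an object never written in the child falls through to its parent, and so on, until data is found or the chain is exhausted. The structures are hypothetical; real librbd additionally consults the parent overlap to bound how much of the parent is visible.

```python
# Sketch of read resolution through an RBD clone chain (illustrative).

class Image:
    def __init__(self, parent=None):
        self.objects = {}          # object_no -> data written in this image
        self.parent = parent       # parent image, set by a snapshot clone

    def read(self, object_no):
        img = self
        while img is not None:
            if object_no in img.objects:   # found at this level
                return img.objects[object_no]
            img = img.parent               # fall through to the parent
        return b"\0"                       # never written anywhere: zeros

base = Image()
base.objects[0] = b"base-data"
child = Image(parent=base)           # clone from a snapshot of base

child.objects[1] = b"child-data"     # COW: object 1 now lives in the child
print(child.read(1))                 # served by the child itself
print(child.read(0))                 # falls through to the parent
```

The loop also makes the performance problem visible: a chain of N parents costs up to N lookups per read, which is why flatten exists.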

4. Data Recovery with Snapshot Objects

Ceph's PG-log-based recovery makes use of snapshot objects, and the process differs for replica and primary OSDs. For replicas, older snapshot objects are restored first, then newer ones, with each object split into data_subsets (ranges that must be transferred) and clone_subsets (ranges that can be cloned from objects already present locally). For primaries, the head object is restored first and the snapshots afterwards, because the primary may have taken new snapshots after the failure.

Detailed examples show how data_subsets and clone_subsets are calculated based on clone_overlap intervals, and how these subsets guide network transfers and local cloning during recovery.
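The subset calculation can be sketched with interval arithmetic. Under the article's description, ranges of a missing object that still overlap an adjacent object already present locally (per clone_overlap) go into the clone_subset and are recovered by a local clone; everything else goes into the data_subset and is pulled over the network. The helpers below are simplified stand-ins for Ceph's interval_set.

```python
# Hedged sketch of splitting a missing clone object into clone_subsets
# (recover by local clone) and data_subsets (fetch over the network).

def intersect(xs, ys):
    """Intersect two lists of [start, end) intervals."""
    out = []
    for s1, e1 in xs:
        for s2, e2 in ys:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo < hi:
                out.append((lo, hi))
    return out

def subtract(xs, ys):
    """Remove every interval in ys from xs."""
    for s2, e2 in ys:
        nxt = []
        for s1, e1 in xs:
            if e2 <= s1 or s2 >= e1:
                nxt.append((s1, e1))     # disjoint: keep as-is
                continue
            if s1 < s2:
                nxt.append((s1, s2))     # left remainder
            if e2 < e1:
                nxt.append((e2, e1))     # right remainder
        xs = nxt
    return xs

MB = 1024 * 1024
object_extent = [(0, 4 * MB)]            # full extent of the missing object
overlap_with_neighbor = [(0, MB), (2 * MB, 4 * MB)]   # from clone_overlap

clone_subset = intersect(object_extent, overlap_with_neighbor)  # clone locally
data_subset = subtract(object_extent, clone_subset)             # fetch remotely
print(clone_subset)   # shared ranges: no network transfer needed
print(data_subset)    # only the diverged middle range goes over the wire
```

This is why clone_overlap matters for recovery cost: the larger the surviving overlap, the less data crosses the network.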

The article concludes with observations about the differing recovery order for replicas and primaries and raises open questions about the design choices.

Tags: Data Recovery, distributed-storage, Snapshots, Ceph, COW, RBD, Clones
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
