
Understanding Ceph PGLog: Format, Storage Mechanism, and Recovery Process

This article explains Ceph's PGLog structure, how it is serialized into transactions and stored in journals and LevelDB, and how the log participates in the recovery of degraded or failed placement groups during the peering process.


Ceph's PGLog is maintained by each Placement Group (PG) and records every object update in the PG, much like a database undo log. By default it keeps up to 3,000 entries (osd_min_pg_log_entries), expanding to 10,000 (osd_max_pg_log_entries) when the PG is degraded, so that a failed PG can catch up via log-based recovery when it comes back online.

1. PGLog Format

Ceph uses a version‑control scheme where each update inside a PG is identified by an (epoch, version) pair; epoch is the OSD map version and increments on OSD topology changes, while version is a monotonically increasing counter assigned by the primary OSD.
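The ordering this implies can be modeled with a minimal sketch. Python is used here purely for illustration; the real type is Ceph's C++ eversion_t, and the class name EVersion is our own:

```python
from functools import total_ordering

@total_ordering
class EVersion:
    """Toy model of Ceph's eversion_t: an (epoch, version) pair.

    Ordering compares epoch first, then version, so any update made
    under a newer OSD map epoch sorts after all updates from older
    epochs, regardless of the per-PG version counter."""
    def __init__(self, epoch, version):
        self.epoch = epoch
        self.version = version

    def _key(self):
        return (self.epoch, self.version)

    def __eq__(self, other):
        return self._key() == other._key()

    def __lt__(self, other):
        return self._key() < other._key()

    def __repr__(self):
        # Ceph prints eversion_t as epoch'version, e.g. 5'100
        return f"{self.epoch}'{self.version}"
```

For example, EVersion(5, 100) < EVersion(6, 1): an update from epoch 6 is newer than any update from epoch 5, even one with a larger version counter.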

Three main data structures implement the PGLog in code: pg_info_t, pg_log_t, and pg_log_entry_t. The log stores only object-update metadata (no actual data or offsets); recovery therefore works at the object level (default object size is 4 MiB).

The key fields in the log are:

last_complete (in pg_info_t) – all updates up to and including this version have been fully applied on every OSD in the PG.

last_update (in pg_info_t) – the newest version the PG knows about; updates between last_complete and last_update may not yet be applied on all OSDs.

log_tail (in pg_info_t) – the version of the oldest entry in the PGLog.

head (in pg_log_t) – the version of the newest log entry.

tail (in pg_log_t) – the version just before the oldest log entry.

log (in pg_log_t) – the list that actually stores the pg_log_entry_t records.
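The relationship between these fields can be sketched as follows. This is a simplified Python model, not Ceph code; field names follow the structures above, versions are plain (epoch, version) tuples, and pg_log_entry_t is reduced to the three fields relevant here:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class PGLogEntry:            # model of pg_log_entry_t
    op: str                  # operation type, e.g. "modify" or "delete"
    oid: str                 # object affected (metadata only, no data)
    version: Tuple[int, int] # (epoch, version) of this update

@dataclass
class PGLog:                 # model of pg_log_t
    head: Tuple[int, int] = (0, 0)  # version of the newest entry
    tail: Tuple[int, int] = (0, 0)  # version just before the oldest entry
    log: List[PGLogEntry] = field(default_factory=list)

    def append(self, entry: PGLogEntry) -> None:
        # Versions are assigned monotonically by the primary OSD,
        # so each appended entry must sort after the current head.
        assert entry.version > self.head, "versions must increase monotonically"
        self.log.append(entry)
        self.head = entry.version
```

Appending an entry advances head; trimming (section 2.1.4) advances tail from the other end.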

2. PGLog Storage Mechanism

When a write I/O is issued, Ceph first packages it into an ObjectStore::Transaction. The transaction is written to the journal; after the journal write completes, a callback chain eventually writes the data to the buffer cache. The PGLog is serialized into the same transaction and persisted atomically with the I/O.

During the journal write, the transaction’s buffer list is encoded and later flushed to disk asynchronously.

2.1 PGLog Update to Journal

2.1.1 Serialize Write I/O into Transaction

In ReplicatedPG::do_osd_ops, a write operation (CEPH_OSD_OP_WRITE) is encoded into ObjectStore::Transaction::tbl, a bufferlist that holds the operation.

ReplicatedPG::OpContext::op_t → PGBackend::PGTransaction::write (t->write) → RPGTransaction::write → ObjectStore::Transaction::write

2.1.2 Serialize PGLog into Transaction

In ReplicatedPG::prepare_transaction, ctx->log.push_back creates a pg_log_entry_t and appends it to the log vector.

ReplicatedBackend::submit_transaction calls parent->log_operation, which serializes the PGLog into the transaction via PG::append_log.

The serialized objects are stored as map<string, bufferlist> entries where the key is "epoch.version" and the value is the encoded pg_info_t or pg_log_entry_t. These maps are written as OMAP entries (key/value metadata attached to the object) in the underlying ObjectStore.
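A minimal sketch of this key scheme, with the caveat that the exact formatting here is illustrative: the real key builder (eversion_t::get_key_name) zero-pads both fields so that LevelDB's plain lexicographic key order matches numeric version order.

```python
def log_key(epoch: int, version: int) -> str:
    """Build an omap key for one PGLog entry.

    Sketch of the "epoch.version" key scheme described above. The
    zero-padding widths are assumptions for illustration; the point is
    that fixed-width keys make string order equal numeric order."""
    return f"{epoch:010d}.{version:020d}"
```

Without the padding, key "10.1" would sort before key "2.5" lexicographically; with it, sorting the raw strings recovers the true version order.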

2.1.3 Transaction Contents

The transaction buffer list contains different payloads depending on the operation type (e.g., OP_WRITE for data, OP_OMAP_SETKEYS for OMAP metadata).

2.1.4 Trim Log

When the number of entries exceeds the configured limit (osd_min_pg_log_entries, default 3000; osd_max_pg_log_entries, default 10000 for degraded PGs), Ceph trims the oldest entries. The number of entries to trim is the current log length (log.head - log.tail) minus max_entries, but trimming never goes past min_last_complete_ondisk, the smallest last_complete version that is safely persisted on all replicas.
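The trimming rule can be sketched as follows. This is a simplified model in which versions are plain integers (real Ceph uses eversion_t) and the function name is our own:

```python
def trim_target(head: int, tail: int, max_entries: int,
                min_last_complete_ondisk: int) -> int:
    """Return the version up to which log entries may be trimmed.

    Entries with version <= the returned value can be dropped.
    Two constraints from the text above:
      1. keep at most max_entries entries in the log;
      2. never trim past min_last_complete_ondisk, i.e. never drop
         entries that some replica may still need to apply."""
    if head - tail <= max_entries:
        return tail                       # under the limit: nothing to trim
    target = head - max_entries           # oldest version we want to keep - 1
    return min(target, min_last_complete_ondisk)
```

For example, with head=3100, tail=0, max_entries=3000 and min_last_complete_ondisk=50, only versions up to 50 may be trimmed even though 100 would be needed to get back under the limit; the log temporarily stays oversized rather than losing entries a lagging replica still needs.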

During trimming, the keys to be removed are added to PGLog::trimmed and later serialized as omap_rmkeys operations in the transaction.

2.1.5 Write PGLog to Journal Disk

ReplicatedBackend::submit_transaction calls log_operation to serialize the PGLog, then queue_transaction passes the transaction to the journal.

FileStore::queue_transactions wraps the list of Transaction* into a FileStore::Op.

JournalingObjectStore::_op_journal_transactions encodes each transaction into a bufferlist.

FileJournal::submit_entry creates a write item and pushes it onto the write queue.

FileJournal::write_thread_entry dequeues the item, builds another bufferlist, and finally do_aio_write writes it asynchronously to the journal disk.
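The last two steps form a classic producer/consumer pair around the write queue. The sketch below models that shape only; it is not Ceph code, and an in-memory list stands in for the asynchronous disk write done by do_aio_write:

```python
import queue
import threading

journal_q: "queue.Queue[bytes]" = queue.Queue()

def submit_entry(encoded: bytes) -> None:
    """Producer side: model of FileJournal::submit_entry, which pushes
    an encoded transaction onto the write queue."""
    journal_q.put(encoded)

def write_thread_entry(journal: list, stop: threading.Event) -> None:
    """Consumer side: model of FileJournal::write_thread_entry, which
    dequeues items and hands them to asynchronous disk writes; here an
    append to `journal` stands in for do_aio_write."""
    while not stop.is_set() or not journal_q.empty():
        try:
            item = journal_q.get(timeout=0.05)
        except queue.Empty:
            continue
        journal.append(item)
        journal_q.task_done()
```

The key property this decoupling buys is that the I/O path only pays the cost of an enqueue; the actual journal write proceeds on the dedicated write thread.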

2.2 Write PGLog to LevelDB

When the OSD processes the transaction in FileStore::_do_op, it dispatches to FileStore::_do_transactions. Depending on the operation type, it either calls FileStore::_omap_setkeys (for OP_OMAP_SETKEYS) or FileStore::_omap_rmkeys (for OP_OMAP_RMKEYS), which ultimately invoke LevelDB's set or rm_keys methods to persist or delete the PGLog entries.

Storing PGLog together with the journal ensures that if an OSD crashes after the journal is written but before LevelDB is updated, the journal replay will reconstruct the missing PGLog entries during OSD restart.

3. How PGLog Participates in Recovery

During the peering phase after a failed OSD rejoins, the primary PG builds a "missing" list by comparing its own pg_info and pg_log with those of its replicas. The steps are:

GetInfo: The primary OSD requests pg_info from all replicas and merges the received histories, updating fields such as last_epoch_started and last_epoch_clean.

GetLog: The primary selects the OSD that holds the authoritative log (highest last_update, then smallest log_tail, then preferring the current primary). If the primary itself is not authoritative, it fetches the log from the chosen OSD and merges it via proc_master_log, adding any missing pg_log_entry_t OIDs to the missing list.

GetMissing: The primary fetches the logs of the remaining replicas to compute each replica's missing set; the actual objects are then pushed or pulled during the subsequent recovery phase.
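The authoritative-log selection in the GetLog step can be sketched with a small comparator. This is a simplified model (Ceph's real logic lives in find_best_info and considers more fields); versions are plain integers here and the function name is our own:

```python
from typing import Dict, Tuple

def choose_authoritative(infos: Dict[int, Tuple[int, int, bool]]) -> int:
    """Pick the OSD holding the authoritative log.

    `infos` maps osd id -> (last_update, log_tail, is_primary).
    Tie-breaking order from the text above:
      1. highest last_update (most recent writes);
      2. smallest log_tail (longest log, so more peers can recover
         from it by log replay instead of backfill);
      3. prefer the current primary."""
    def rank(item):
        osd, (last_update, log_tail, is_primary) = item
        # max() picks the largest tuple: higher last_update first,
        # then smaller log_tail (hence the negation), then primary.
        return (last_update, -log_tail, is_primary)
    return max(infos.items(), key=rank)[0]
```

For example, if two replicas share the highest last_update, the one whose log reaches further back (smaller log_tail) is chosen as authoritative.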

In summary, PGLog is a crucial component of Ceph’s consistency and recovery mechanisms, providing an undo‑log‑like record of object updates, being persisted atomically with write transactions, and serving as the source of truth during peering and data repair.

Tags: backend, LevelDB, distributed storage, Ceph, recovery, PGLog
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
