How Pulsar Stores Messages and How BookKeeper’s GC Keeps Them Clean
This article explains Apache Pulsar’s message storage architecture in BookKeeper, details the ledger and entry lifecycle, describes the multi‑layer caching read path, and outlines BookKeeper’s garbage‑collection process along with practical operational tips for avoiding disk‑heavy scenarios.
Pulsar Message Storage Overview
Apache Pulsar is a multi‑tenant, high‑performance messaging platform that supports low latency, read/write separation, cross‑region replication, rapid scaling, and flexible fault tolerance. Tencent’s data platform team has deeply investigated Pulsar and optimized its performance and stability, deploying it in TDbank. This article focuses on how Pulsar stores messages in BookKeeper and how BookKeeper cleans up its storage files.
Pulsar brokers act as BookKeeper clients. Each topic partition corresponds to a series of ledgers, and at any time only one ledger per partition is open for writes. When a producer sends a message, the broker selects the current ledger, generates an entry ID, and writes the message as an entry (a single‑message entry in non‑batch mode, or multiple messages per entry in batch mode). The resulting msgID consists of four parts: ledgerID, entryID, partition-index (‑1 for non‑partitioned topics), and batch-index (‑1 for non‑batch messages).
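The four‑part ID can be pictured as a small value class. The sketch below is purely illustrative – Pulsar's real implementation is the MessageId hierarchy in the client library – but it shows how the parts compose:

```java
// Illustrative sketch of Pulsar's four-part message ID. Field names
// mirror the description above; the real type is
// org.apache.pulsar.client.api.MessageId in the Pulsar client.
public class MsgId {
    final long ledgerId;
    final long entryId;
    final int partitionIndex; // -1 for non-partitioned topics
    final int batchIndex;     // -1 for non-batch messages

    MsgId(long ledgerId, long entryId, int partitionIndex, int batchIndex) {
        this.ledgerId = ledgerId;
        this.entryId = entryId;
        this.partitionIndex = partitionIndex;
        this.batchIndex = batchIndex;
    }

    @Override
    public String toString() {
        return ledgerId + ":" + entryId + ":" + partitionIndex + ":" + batchIndex;
    }

    public static void main(String[] args) {
        // A non-batch message on a non-partitioned topic
        MsgId id = new MsgId(1234, 5, -1, -1);
        System.out.println(id); // 1234:5:-1:-1
    }
}
```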
BookKeeper Storage Layout on Each Bookie
Each bookie stores incoming entries in three file types:
Journal files – write‑ahead logs (WAL) that are flushed to disk in real time. When a journal reaches its size limit it rolls over to a new file. It is recommended to place journal directories on separate SSDs to avoid I/O interference.
EntryLog files – contain the actual entry data. Entries from different ledgers arrive interleaved, so writes are effectively random; the write cache sorts them so that each flush lands in the entry log in a locally ordered fashion.
Index files – built on top of RocksDB (via SingleDirectoryDbLedgerStorage) to map entries to their physical location in entry logs.
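As a concrete illustration, the directory split above maps to standard settings in a bookie's bookkeeper.conf; the paths here are placeholders, not recommendations:

```properties
# Journal (WAL) on a dedicated SSD to isolate its sequential fsync traffic
journalDirectories=/mnt/ssd-journal/bookkeeper/journal
# Entry logs and the RocksDB index live on the data disks
ledgerDirectories=/mnt/data1/bookkeeper/ledgers,/mnt/data2/bookkeeper/ledgers
# DbLedgerStorage is the RocksDB-backed storage described above
ledgerStorageClass=org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage
```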
The write cache is split into two alternating parts: one currently being written and one being flushed to disk. An in‑memory index (implemented with a ConcurrentLongLongPairHashMap) allows fast lookup of entries within the cache. When entries are flushed, they are written to entry logs in a locally ordered fashion, which greatly improves subsequent read performance.
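The swap‑and‑flush idea can be sketched minimally as follows. The real write cache stores entries off‑heap and indexes them with a ConcurrentLongLongPairHashMap; here a sorted map stands in for both, and the key packing assumes entry IDs fit in 32 bits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Minimal sketch of a double-buffered write cache: one half accepts
// writes while the other is flushed. Keys sort by (ledgerId, entryId),
// so a flush emits entries in locally ordered fashion, which is what
// makes subsequent sequential reads from entry logs cheap.
public class WriteCacheSketch {
    private ConcurrentSkipListMap<Long, byte[]> active = new ConcurrentSkipListMap<>();
    private ConcurrentSkipListMap<Long, byte[]> flushing = new ConcurrentSkipListMap<>();

    // Pack (ledgerId, entryId) into one sortable key; a sketch-level
    // simplification that assumes entryId < 2^32.
    static long key(long ledgerId, long entryId) {
        return (ledgerId << 32) | entryId;
    }

    public void put(long ledgerId, long entryId, byte[] data) {
        active.put(key(ledgerId, entryId), data);
    }

    // The read path consults both halves of the cache.
    public byte[] get(long ledgerId, long entryId) {
        long k = key(ledgerId, entryId);
        byte[] v = active.get(k);
        return v != null ? v : flushing.get(k);
    }

    // Swap the buffers, then hand back the frozen half in key order.
    // A real bookie would append these entries to the current entry log.
    public synchronized List<Map.Entry<Long, byte[]>> swapAndFlush() {
        ConcurrentSkipListMap<Long, byte[]> frozen = active;
        active = flushing;
        active.clear();
        flushing = frozen;
        return new ArrayList<>(flushing.entrySet());
    }
}
```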
Consumer Read Path and Multi‑Layer Caching
When a Pulsar consumer reads data, the system checks several caches in order, returning as soon as the data is found:
Broker‑side entry cache.
Bookie write cache (the part currently being written).
Bookie write cache (the part currently being flushed).
Bookie read cache.
Disk‑based entry log files accessed via the RocksDB index.
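The lookup order above amounts to probing a list of layers and returning on the first hit. A sketch, with each layer modeled as a function (the layer contents are supplied by the caller; nothing here is Pulsar's actual API):

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Sketch of the tiered read path described above: probe each cache
// layer in order and stop at the first hit. The final layer in the
// list would be the disk read through the RocksDB index.
public class TieredRead {
    public static Optional<byte[]> read(long entryKey,
            List<Function<Long, Optional<byte[]>>> layers) {
        for (Function<Long, Optional<byte[]>> layer : layers) {
            Optional<byte[]> hit = layer.apply(entryKey);
            if (hit.isPresent()) {
                return hit; // found: do not probe lower layers
            }
        }
        return Optional.empty(); // missed everywhere, including disk
    }
}
```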
If the data is retrieved from disk, it is stored in the read cache for faster future access, and because entry logs are locally ordered, adjacent entries are likely to be fetched together, further boosting efficiency.
BookKeeper Garbage‑Collection (GC) Mechanism
Each bookie runs a periodic GC task (default every 15 minutes) that performs the following steps:
Compare the ledger IDs stored on the bookie with those recorded in ZooKeeper; any ledger whose metadata is absent from ZooKeeper has been deleted, so its local data can be reclaimed.
Scan each entry log's metadata and compute its live‑entry ratio – the fraction of its data belonging to ledgers that still exist.
Delete outright any entry log whose associated ledgers have all been removed (a live‑entry ratio of zero).
Compact entry logs:
Major GC (default daily) – compact entry logs whose live‑entry ratio has fallen below 0.5: copy the surviving entries into a new entry log, then delete the old file.
Minor GC (default hourly) – the same process run more frequently with a stricter threshold, compacting only logs whose live‑entry ratio is below 0.2.
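The per‑log decision can be sketched as below, using the thresholds cited above; whether the pass is a major or minor run determines which threshold applies. This is an illustration of the policy, not BookKeeper's actual compaction code:

```java
// Sketch of the GC decision for one entry log, using the default
// thresholds described above (0.2 for minor runs, 0.5 for major runs).
public class CompactionPolicy {
    static final double MINOR_THRESHOLD = 0.2; // minor GC, default hourly
    static final double MAJOR_THRESHOLD = 0.5; // major GC, default daily

    enum Action { DELETE, COMPACT, KEEP }

    // liveRatio = bytes of still-live entries / total bytes in the log
    static Action decide(double liveRatio, boolean majorRun) {
        if (liveRatio == 0.0) {
            return Action.DELETE; // no live ledgers left: drop the file
        }
        double threshold = majorRun ? MAJOR_THRESHOLD : MINOR_THRESHOLD;
        // Compact when the live ratio is BELOW the threshold: little
        // enough live data remains that rewriting it pays off.
        return liveRatio < threshold ? Action.COMPACT : Action.KEEP;
    }
}
```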
Ledger deletion is triggered by the broker, not by the bookie. A broker‑side task (default every 2 minutes) examines the cursor positions of each topic's subscriptions, identifies ledgers that every subscription has fully consumed, removes their metadata from ZooKeeper, and thereby lets the owning bookies reclaim the data in their next GC round.
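That check can be sketched for a single topic as follows. The names are illustrative rather than Pulsar's actual ManagedLedger API; the rule is that a ledger is deletable once the slowest subscription's mark‑delete position has moved past it and it is not the ledger currently open for writes:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the broker-side trimming check described above: a ledger
// may be deleted once every cursor (subscription) has consumed past it.
public class LedgerTrimmer {
    // slowestAckedLedgerId: the ledger that the slowest subscription's
    // mark-delete position still points into (it must be kept).
    static List<Long> fullyConsumed(List<Long> ledgerIds,
                                    long currentLedgerId,
                                    long slowestAckedLedgerId) {
        List<Long> deletable = new ArrayList<>();
        for (long id : ledgerIds) {
            // Never delete the ledger still open for writes, nor any
            // ledger at or after the slowest cursor's position.
            if (id != currentLedgerId && id < slowestAckedLedgerId) {
                deletable.add(id);
            }
        }
        return deletable;
    }
}
```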
Operational Issues and Mitigation
In practice, disk‑space exhaustion on bookies often occurs because:
A massive number of topics, each carrying only a few messages, keeps many ledgers active; they never reach the rollover size or age thresholds, so the entry logs containing them can never be reclaimed.
GC cycles become long when many entry logs simultaneously meet minor or major GC thresholds, delaying cleanup.
Mitigation strategies include forcing ledger rollovers by restarting bookies (after confirming that the active ledgers are not the ones still being consumed) and optimizing the GC workflow so that each cleanup round finishes faster.
Key Takeaways
The article first described Pulsar’s message storage model, the flow of writing messages, and the multi‑layer caching read path. It then detailed the GC process of a single bookie. During operation, avoid consuming very old data that forces disk reads, and plan storage capacity carefully because the number of storage directories cannot be changed at runtime; scaling bookies up or down is required for capacity adjustments.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.