Backend Development 33 min read

Inside Pulsar’s Bookie: A Deep Dive into Storage Architecture and Write/Read Paths

This article dissects Pulsar’s storage layer by examining the Bookie component, detailing its architecture, the sequential write‑ahead log, ledger management, journal handling, caching mechanisms, and the complete read/write call chains with concrete code examples and performance considerations.

Architect

May 30, 2024

Overview of Pulsar Storage Architecture

Pulsar follows a storage‑compute separation model: the broker handles traffic routing, aggregation and computation, while Bookie (also called Bookeeper) is a dedicated storage engine. Brokers embed a Bookie client and communicate with Bookie servers over TCP using protobuf.

Bookie Architecture Design

Bookie consists of several layers:

BookieRequestProcessor : request routing component.

WriteEntryProcessorV3 / ReadEntryProcessorV3 : core thread‑pool tasks for write and read operations.

Bookie : storage engine interface.

LedgerDescriptor : manager for a specific ledger.

LedgerStorage : abstract interface for ledger‑level storage.

Journal : sequential write‑ahead log (WAL).

Write Path – From Client to Disk

The client creates a LedgerHandle and calls addEntry. The call chain is:

ManagedLedgerImpl#asyncAddEntry()

ManagedLedgerImpl#internalAsyncAddEntry()

LedgerHandle#asyncAddEntry()

OpAddEntry#initiate()

LedgerHandle#doAsyncAddEntry()

BookieClient#addEntry()

On the server side, BookieRequestProcessor#processRequest dispatches the request to processAddRequestV3 . The request is placed into a thread‑pool that uses an orderingKey (hash of LedgerId) to map the request to a specific single‑threaded executor, reducing queue contention.

WriteEntryProcessorV3#run calls getAddResponse(), which decides between recoveryAddEntry (used during node recovery) and addEntry. addEntry is implemented in BookieImpl and synchronizes on the LedgerDescriptor to guarantee sequential writes per ledger.

Inside BookieImpl#addEntryInternal the data is first written to LedgerStorage (e.g., DbLedgerStorage) and then to the Journal . The journal uses a write‑cache (multiple parallel journal directories). The entry is appended to the journal buffer; if synchronous flush is enabled the request blocks until the data is persisted.

The DbLedgerStorage implementation stores the mapping (ledgerId, entryId) → location in RocksDB. It delegates the actual write to SingleDirectoryDbLedgerStorage , which buffers entries in writeCache. When the cache reaches a threshold or a timeout expires, triggerFlushAndAddEntry swaps the cache, sorts entries for local ordering, writes them to EntryLog and EntryIndex , and finally flushes both to disk.

ClientConfiguration conf = new ClientConfiguration();
conf.setThrottleValue(bkthrottle);
conf.setMetadataServiceUri("zk://" + zkservers + "/ledgers");
BookKeeper bkc = new BookKeeper(conf);
LedgerHandle ledger = bkc.createLedger(3, 2, 2, DigestType.CRC32, new byte[]{'a','b'});
long entryId = ledger.addEntry("ABC".getBytes(UTF_8));

Read Path – From Disk to Client

Read requests are processed by BookieRequestProcessor#processReadRequestV3 , which also uses a high‑priority thread‑pool. The flow proceeds to ReadEntryProcessorV3#run and ultimately calls BookieImpl#readEntry . The method obtains the appropriate LedgerDescriptor , then calls LedgerDescriptor#readEntry , which forwards to LedgerStorage#getEntry . The concrete implementation is SingleDirectoryDbLedgerStorage#doGetEntry .

The read algorithm checks three cache levels in order:

writeCache (most recent writes).

writeCacheBeingFlushed (entries pending flush).

readCache (cached entries from previous reads).

If all caches miss, the system looks up the physical offset via EntryLocationIndex (RocksDB mapping) and reads the entry from the EntryLog using DefaultEntryLogger#internalReadEntry . The method parses the length field, allocates a buffer of the appropriate size, and reads the entry data. After a successful read, the entry is placed into the read cache and a pre‑fetch for the next entry is triggered.

Component Module Analysis

Key modules and their responsibilities:

BufferedChannel / BufferedReadChannel : in‑memory buffers (default 64 KB write buffer, 512 B read buffer) that decouple I/O from the application thread.

EntryLogger : abstracts actual entry persistence; two implementations exist – DefaultEntryLogger (uses RandomAccessFile) and DirectEntryLogger (uses direct I/O).

LedgerStorage implementations : DbLedgerStorage (RocksDB KV), SortedLedgerStorage (Java skip‑list with sorting), InterleavedLedgerStorage (delegates to EntryLogger).

Journal : background thread that drains a queue of QueueEntry objects, writes them to a buffered channel, and periodically triggers a ForceWriteThread to force data to disk based on size, time, or explicit sync settings.

SyncThread : periodically calls LedgerStorage#checkpoint to flush the write‑cache to entry log and index.

Write/Read Call Chain Summary

Write chain: client → BookKeeper → LedgerHandle → BookieClient → BookieRequestProcessor → WriteEntryProcessorV3 → BookieImpl#addEntry → LedgerDescriptor → LedgerStorage → EntryLogger → BufferedChannel → flush → ForceWriteThread.

Read chain: client → BookKeeper → BookieClient → BookieRequestProcessor → ReadEntryProcessorV3 → BookieImpl#readEntry → LedgerDescriptor → LedgerStorage → EntryLocationIndex → EntryLogger → BufferedReadChannel → return entry.

Architectural Conclusions

Bookie achieves high write throughput and low latency by:

Using a sequential WAL (Journal) with optional parallel directories.

Buffering writes in writeCache and flushing in batches, which reduces disk‑seek overhead.

Separating entry data from index data (RocksDB) to allow local ordering and efficient range reads.

For reads, a three‑level cache hierarchy minimizes disk I/O, and the local ordering of entries in the entry log enables near‑sequential reads, improving consumer performance.

Overall, the design balances write‑heavy workloads typical of messaging systems with the need for ordered reads, while providing configurable sync/async modes to trade latency for durability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Java performance Pulsar storage architecture Ledger bookie

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.