
An In‑Depth Overview of Apache BookKeeper: Architecture, Features, and Use Cases

This article provides a comprehensive technical overview of Apache BookKeeper, covering its role as a distributed append‑only log service, core concepts, high‑availability mechanisms, storage‑media evolution, comparisons with Raft, and community resources, while illustrating its use in Pulsar and large‑scale data platforms.

DataFunSummit

BookKeeper is an Apache project that provides a distributed, append‑only log service, serving as the underlying storage layer for Apache Pulsar and many other large‑scale systems.

The article first explains the business needs that BookKeeper addresses with a single storage layer, spanning online transactional workloads and big-data streaming workloads, and then shows how Pulsar builds on BookKeeper to achieve cloud-native scalability, data migration, and high availability.

It then introduces BookKeeper’s core concepts such as Ledger, Fragment, Ensemble, Write Quorum, and Ack Quorum, describing how these parameters allow flexible tuning of bandwidth, consistency, and latency.
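As a rough illustration of how these three parameters interact, the sketch below stripes entries round-robin across an ensemble and treats a write as successful once enough bookies have acknowledged it. The function names and bookie labels are hypothetical, and this is a simplified model under those assumptions, not the BookKeeper client API.

```python
# Simplified model of Ensemble / Write Quorum / Ack Quorum.
# All names here are illustrative assumptions, not BookKeeper APIs.

def write_set(entry_id, ensemble, write_quorum):
    """Return the bookies that should store this entry.

    Entries are striped round-robin: entry N starts at bookie N mod E
    and is written to the next `write_quorum` bookies in the ensemble.
    """
    size = len(ensemble)
    if not 1 <= write_quorum <= size:
        raise ValueError("write quorum must be between 1 and the ensemble size")
    start = entry_id % size
    return [ensemble[(start + i) % size] for i in range(write_quorum)]


def is_acknowledged(acks, ack_quorum):
    """A write counts as successful once `ack_quorum` distinct bookies acked it."""
    return len(set(acks)) >= ack_quorum


ensemble = ["bookie-a", "bookie-b", "bookie-c"]  # ensemble size E = 3
first = write_set(0, ensemble, write_quorum=2)
third = write_set(2, ensemble, write_quorum=2)
```

With an ensemble of 3 and a write quorum of 2, each bookie stores only two thirds of the entries, which spreads bandwidth across nodes. Lowering the ack quorum below the write quorum lets the writer return without waiting for the slowest replica, which reduces latency.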

High‑availability mechanisms for reads and writes are detailed, including the peer‑to‑peer node design, writer‑driven coordination, and index tracking (LastAddPushed, LastAddConfirmed).
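The two indexes can be pictured with a small model of the writer's bookkeeping: LastAddPushed moves forward as soon as an entry is sent, while LastAddConfirmed moves forward only when every entry up to that point has reached the ack quorum. The class and method names below are hypothetical, and the sketch assumes in-order confirmation as described above.

```python
# Toy model of writer-side index tracking.
# Class and method names are illustrative assumptions, not BookKeeper APIs.

class WriterState:
    def __init__(self, ack_quorum):
        self.ack_quorum = ack_quorum
        self.last_add_pushed = -1     # highest entry id sent to bookies
        self.last_add_confirmed = -1  # highest entry id confirmed with no gaps
        self._acks = {}               # entry id -> set of bookies that acked

    def push(self):
        """Assign the next entry id and mark the entry as in flight."""
        self.last_add_pushed += 1
        self._acks[self.last_add_pushed] = set()
        return self.last_add_pushed

    def on_ack(self, entry_id, bookie):
        """Record an ack, then advance LastAddConfirmed over any
        contiguous run of entries that have reached the ack quorum."""
        if entry_id in self._acks:
            self._acks[entry_id].add(bookie)
        nxt = self.last_add_confirmed + 1
        while nxt in self._acks and len(self._acks[nxt]) >= self.ack_quorum:
            del self._acks[nxt]
            self.last_add_confirmed = nxt
            nxt += 1


writer = WriterState(ack_quorum=2)
e0 = writer.push()
e1 = writer.push()
writer.on_ack(e1, "bookie-a")
writer.on_ack(e1, "bookie-b")
lac_with_gap = writer.last_add_confirmed  # entry 0 is still unconfirmed
writer.on_ack(e0, "bookie-a")
writer.on_ack(e0, "bookie-b")
```

Because confirmation never skips a gap, a reader that trusts LastAddConfirmed only sees entries the writer has durably acknowledged in order.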

A comparison with Raft highlights similarities in term/segment handling and leader versus writer coordination.

The I/O separation architecture is explained: each add is appended to the Journal and buffered in memory before being flushed to ledger storage on disk, so the read and write paths can scale independently and the Journal can be placed on SSD or PMem for performance gains.
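A toy model can make the separation concrete. In the sketch below the journal, the in-memory buffer, and the ledger storage are plain Python containers standing in for separate devices; the class name, method names, and the choice to trim the journal on flush are assumptions made for illustration, not BookKeeper's implementation.

```python
# Toy model of a bookie with separated write and read paths.
# Everything here is an illustrative assumption, not BookKeeper code.

class ToyBookie:
    def __init__(self):
        self.journal = []      # stands in for the sequential journal device
        self.write_cache = {}  # in-memory buffer of recently added entries
        self.entry_log = {}    # stands in for ledger storage on another device

    def add_entry(self, ledger_id, entry_id, payload):
        """Write path: append to the journal, then buffer in memory."""
        self.journal.append((ledger_id, entry_id, payload))
        self.write_cache[(ledger_id, entry_id)] = payload

    def flush(self):
        """Move buffered entries to ledger storage in one batch.

        Journal records covered by the flush are no longer needed
        for recovery, so this toy model drops them.
        """
        flushed = len(self.write_cache)
        self.entry_log.update(self.write_cache)
        self.write_cache.clear()
        self.journal.clear()
        return flushed

    def read_entry(self, ledger_id, entry_id):
        """Read path: check the memory buffer first, then ledger storage.
        The journal is never read when serving entries."""
        key = (ledger_id, entry_id)
        if key in self.write_cache:
            return self.write_cache[key]
        return self.entry_log.get(key)


bookie = ToyBookie()
bookie.add_entry(1, 0, b"hello")
bookie.add_entry(1, 1, b"world")
before_flush = bookie.read_entry(1, 0)
flushed = bookie.flush()
after_flush = bookie.read_entry(1, 1)
```

Since reads never touch the journal, a faster device for the journal helps write latency without competing with read traffic on ledger storage.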

The evolution of storage media is discussed, from HDD to SSD, NVMe SSD, and finally persistent memory (PMem), with performance numbers from Yahoo's deployment, which achieved a five-fold throughput increase at modest cost.

Finally, the article lists community resources, team composition, milestones, and commercial offerings from StreamNative, and links to further downloadable materials.

High Availability, Pulsar, Storage, Data Infrastructure, Apache BookKeeper, Distributed Log
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
