
An Introduction to Apache Pulsar: Core Concepts, Architecture, and Key Features

Apache Pulsar is a cloud‑native distributed messaging platform that combines messaging, storage, and lightweight compute, with multi‑tenancy, geo‑replication, and high throughput. This article introduces its core concepts, its architecture components (brokers, BookKeeper, and ZooKeeper), and its key design features.

Big Data Technology & Architecture

Jul 27, 2021

Core Concepts

Apache Pulsar is an Apache Software Foundation top‑level project that positions itself as a next‑generation cloud‑native distributed messaging and streaming platform, integrating messaging, storage, and lightweight function‑as‑a‑service computing. It adopts a compute‑and‑storage separation architecture, supports multi‑tenant isolation, persistent storage, cross‑region replication, and offers strong consistency, high throughput, low latency, and high scalability.

Pulsar is a server‑to‑server messaging system originally developed at Yahoo and now governed by the Apache Software Foundation.

The Pulsar community is evolving rapidly, and the latest release at the time of writing is 2.8. This article is an introductory overview; more detailed explorations will follow.

Key Concepts

Many concepts are similar to Kafka:

Topic : a named channel to which producers publish messages and from which consumers read them.

Bookie : storage nodes provided by Apache BookKeeper; messages are persisted in BookKeeper servers.

Ledger : the basic storage unit in BookKeeper, composed of ordered entries.

Journal : write‑ahead log for BookKeeper transactions.

Entry log : files that store entries from multiple ledgers.

Broker : a stateless server that handles load balancing and read/write operations, interacting with ZooKeeper for coordination.

Metadata storage : the ZooKeeper‑backed store that holds BookKeeper metadata, such as the list of ledgers.

Index file : indexes each ledger within the entry log for fast lookup.

Ledger cache : caches index files to accelerate searches.

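Data compaction : reclaims disk space by copying the still‑valid entries out of sparsely used entry logs so that the old files can be deleted. Minor compaction targets entry logs with less than 20% valid data; major compaction targets those with less than 80% valid data.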
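The compaction thresholds above can be illustrated with a small sketch. This is not BookKeeper's implementation: the `EntryLog` class and the `select_for_compaction` function are hypothetical names chosen for illustration, and only the 20% and 80% thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text above; everything else is illustrative.
MINOR_THRESHOLD = 0.20
MAJOR_THRESHOLD = 0.80


@dataclass
class EntryLog:
    """Hypothetical stand-in for one entry log file on a bookie."""
    log_id: int
    total_bytes: int
    valid_bytes: int  # bytes that still belong to ledgers that have not been deleted

    @property
    def valid_ratio(self) -> float:
        return self.valid_bytes / self.total_bytes if self.total_bytes else 0.0


def select_for_compaction(logs, threshold):
    """Return the ids of entry logs whose share of valid data is below the threshold."""
    return [log.log_id for log in logs if log.valid_ratio < threshold]


logs = [
    EntryLog(log_id=1, total_bytes=1000, valid_bytes=100),  # 10% valid
    EntryLog(log_id=2, total_bytes=1000, valid_bytes=500),  # 50% valid
    EntryLog(log_id=3, total_bytes=1000, valid_bytes=900),  # 90% valid
]

minor = select_for_compaction(logs, MINOR_THRESHOLD)
major = select_for_compaction(logs, MAJOR_THRESHOLD)
print("minor compaction candidates:", minor)  # [1]
print("major compaction candidates:", major)  # [1, 2]
```

Under these assumptions, a minor pass only touches files that are almost entirely garbage, while a major pass also rewrites files that are half empty, which costs more I/O but reclaims more space.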

Key Features

Geo‑replication across clusters.

Extremely low publish and end‑to‑end latency.

Scales to over one million topics.

Simple client APIs for Java, Go, Python, and C++.

Multiple subscription modes: exclusive, shared, failover, key‑shared.

Persistent storage via Apache BookKeeper.

Serverless compute with Pulsar Functions.

Serverless connectors via Pulsar IO.

Tiered storage that offloads old data to cold storage (S3, GCS, filesystem).
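To make the four subscription modes in the list above more concrete, here is a broker‑free simulation of how messages might be spread over consumers in each mode. It is a simplified sketch, not the Pulsar client API: the `dispatch` function is hypothetical, failover is reduced to "the first consumer in the list is active", and CRC32 stands in for whatever key hashing the broker really uses.

```python
from collections import defaultdict
from zlib import crc32


def dispatch(mode, consumers, messages):
    """Assign (key, payload) messages to consumers under a simplified model of each mode."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment = defaultdict(list)
    if mode == "exclusive":
        # Only a single consumer may be attached to the subscription.
        if len(consumers) != 1:
            raise ValueError("an exclusive subscription allows a single consumer")
        assignment[consumers[0]].extend(messages)
    elif mode == "failover":
        # One consumer is active; the others are standbys that receive nothing
        # until the caller removes the failed consumer from the list.
        assignment[consumers[0]].extend(messages)
    elif mode == "shared":
        # Messages are distributed round-robin, so per-key ordering is not kept.
        for i, msg in enumerate(messages):
            assignment[consumers[i % len(consumers)]].append(msg)
    elif mode == "key_shared":
        # All messages with the same key go to the same consumer.
        for msg in messages:
            key = msg[0]
            index = crc32(key.encode("utf-8")) % len(consumers)
            assignment[consumers[index]].append(msg)
    else:
        raise ValueError(f"unknown subscription mode: {mode}")
    return dict(assignment)


messages = [("user-a", "m0"), ("user-b", "m1"), ("user-a", "m2"), ("user-b", "m3")]

for mode in ("failover", "shared", "key_shared"):
    print(mode, dispatch(mode, ["c1", "c2"], messages))
```

In this model only the shared mode interleaves one key's messages across consumers, which is why ordering guarantees depend on the subscription mode chosen.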

Architecture Design

A Pulsar instance consists of one or more clusters; each cluster comprises:

One or more brokers that receive and load‑balance messages from producers, dispatch them to consumers, communicate with the configuration store, and persist messages in BookKeeper.

A BookKeeper cluster of multiple bookies for durable storage.

A ZooKeeper ensemble for metadata, configuration, and coordination.

Clusters can replicate data via geo‑replication.

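ZooKeeper stores metadata and coordinates cluster tasks. A local ZooKeeper handles configuration and coordination within a single cluster, while a global ZooKeeper (the configuration store) holds instance‑wide settings, such as tenants and namespaces, that inter‑cluster replication relies on.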

BookKeeper provides durable storage; brokers are stateless; global replicators handle cross‑cluster data copying.

Apache BookKeeper Overview

BookKeeper offers:

Multiple ledgers for independent logs.

Efficient storage for ordered replicated entries.

Consistency guarantees across failures.

Even I/O distribution across bookies.

Horizontal scalability of capacity and throughput.

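Support for thousands of ledgers per bookie with concurrent reads and writes; placing the journal and the general ledger storage on separate disks keeps read traffic from slowing down ongoing writes.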

Persistent cursors for consumer positions.

Ledgers

Ledgers are append‑only structures with a single writer; entries are replicated across bookies. Brokers can create, write to, and close ledgers. Closed ledgers become read‑only, and can be deleted when no longer needed.
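The ledger lifecycle described above (create, append, close, read) can be sketched as a tiny in‑memory model. This is an illustration only: the `Ledger` class and `LedgerClosedError` are hypothetical names, not the BookKeeper client API, and replication across bookies is deliberately left out.

```python
class LedgerClosedError(Exception):
    """Raised when a writer tries to append to a closed ledger."""


class Ledger:
    """Toy append-only ledger with a single writer (no replication modelled)."""

    def __init__(self, ledger_id):
        self.ledger_id = ledger_id
        self._entries = []
        self._closed = False

    def add_entry(self, data: bytes) -> int:
        """Append an entry and return its entry id (its position in the ledger)."""
        if self._closed:
            raise LedgerClosedError(f"ledger {self.ledger_id} is read-only")
        self._entries.append(data)
        return len(self._entries) - 1

    def close(self):
        """Seal the ledger; from now on it can only be read."""
        self._closed = True

    def read(self, first: int, last: int):
        """Return the entries with ids first..last (inclusive)."""
        return self._entries[first:last + 1]


ledger = Ledger(ledger_id=1)
first_id = ledger.add_entry(b"hello")
second_id = ledger.add_entry(b"world")
ledger.close()

print(ledger.read(0, 1))  # reading still works after close
try:
    ledger.add_entry(b"late write")
except LedgerClosedError as err:
    print("rejected:", err)
```

Because entries are only ever appended and a closed ledger never changes, readers need no coordination with the writer once the ledger is sealed.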

Pulsar Geo‑Replication

Multiple brokers form a Pulsar cluster; multiple clusters form an instance. Geo‑replication synchronizes messages across clusters, enabling consumers in different regions to access the same data.

When a producer publishes to a geo‑replicated topic in one cluster, the message is persisted locally first and then replicated asynchronously to the other clusters, where local consumers can read it.
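The replication flow can be pictured with a toy model. It assumes, as a simplification, that each cluster persists a message locally first and that a separate replicator step later copies it to peer clusters; the `Cluster` class and the cluster names `us-west` and `us-east` are hypothetical and exist only for this example.

```python
from collections import deque


class Cluster:
    """Toy model of one Pulsar cluster holding a single geo-replicated topic."""

    def __init__(self, name):
        self.name = name
        self.topic_log = []      # messages visible to consumers in this cluster
        self.outbound = deque()  # locally published messages not yet copied to peers
        self.peers = []

    def publish(self, payload):
        """Persist locally, then queue the message for replication."""
        message = (self.name, payload)  # remember the origin cluster
        self.topic_log.append(message)
        self.outbound.append(message)

    def run_replicator(self):
        """Copy queued messages to every peer. Replicated messages are not
        re-queued on the peer, so they never loop back to their origin."""
        while self.outbound:
            message = self.outbound.popleft()
            for peer in self.peers:
                peer.topic_log.append(message)


west, east = Cluster("us-west"), Cluster("us-east")
west.peers.append(east)
east.peers.append(west)

west.publish("order-1")
print(east.topic_log)  # [] because replication has not run yet
west.run_replicator()
print(east.topic_log)  # [('us-west', 'order-1')]
```

The gap between `publish` and `run_replicator` is the point of the sketch: consumers in the remote cluster see the message only after the replicator has delivered it.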

Tiered Storage

Tiered storage moves old backlog messages from BookKeeper to cheaper storage (S3, GCS, filesystem) without affecting client access, reducing storage costs.
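A minimal sketch of the idea, assuming that data is offloaded one whole ledger at a time and that reads fall back to cold storage when a ledger is no longer in BookKeeper. The `TieredTopic` class and its two dictionaries are hypothetical stand‑ins for BookKeeper and an object store; no real offloader API is used.

```python
class TieredTopic:
    """Toy topic whose ledgers live either in BookKeeper (hot) or in cold storage."""

    def __init__(self):
        self.bookkeeper = {}    # ledger_id -> list of messages (fast, expensive)
        self.cold_storage = {}  # ledger_id -> list of messages (slow, cheap)

    def add_ledger(self, ledger_id, messages):
        self.bookkeeper[ledger_id] = list(messages)

    def offload(self, ledger_id):
        """Move one ledger from BookKeeper to cold storage."""
        self.cold_storage[ledger_id] = self.bookkeeper.pop(ledger_id)

    def read(self, ledger_id):
        """Readers use the same call wherever the ledger currently lives."""
        if ledger_id in self.bookkeeper:
            return self.bookkeeper[ledger_id]
        return self.cold_storage[ledger_id]


topic = TieredTopic()
topic.add_ledger(1, ["old-1", "old-2"])
topic.add_ledger(2, ["recent-1"])

before = topic.read(1)
topic.offload(1)
after = topic.read(1)

print(before == after)           # True: the client-facing result is unchanged
print(sorted(topic.bookkeeper))  # [2]: only the recent ledger stays in BookKeeper
```

The single `read` method is what "without affecting client access" amounts to in this model: the storage tier is an internal detail that callers never select.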

Core Design Principles

Message loss prevention: brokers are stateless; persistence is handled by BookKeeper.

Strong ordering guarantees in specific subscription modes.

Low read/write latency achieved via write‑ahead logs and in‑memory buffering.
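The combination of a write‑ahead log and in‑memory buffering can be sketched as follows. This is a simplified model rather than BookKeeper's actual code: the `Bookie` class and its attributes are hypothetical, a Python list stands in for the sequential journal on disk, and a manual `flush` call stands in for the background flush to entry logs.

```python
class Bookie:
    """Toy write path: journal first, buffer in memory, flush to entry logs later."""

    def __init__(self):
        self.journal = []      # stand-in for the sequential write-ahead log on disk
        self.write_cache = {}  # in-memory buffer: (ledger_id, entry_id) -> data
        self.entry_log = {}    # stand-in for entry log files written in the background

    def add_entry(self, ledger_id, entry_id, data):
        """Acknowledge as soon as the entry is journaled and buffered."""
        self.journal.append((ledger_id, entry_id, data))
        self.write_cache[(ledger_id, entry_id)] = data
        return True  # the acknowledgement sent back to the writer

    def flush(self):
        """Move buffered entries into the entry log (done asynchronously in practice)."""
        self.entry_log.update(self.write_cache)
        self.write_cache.clear()

    def read(self, ledger_id, entry_id):
        """Serve recent entries from memory and older ones from the entry log."""
        key = (ledger_id, entry_id)
        if key in self.write_cache:
            return self.write_cache[key]
        return self.entry_log[key]


bookie = Bookie()
acked = bookie.add_entry(ledger_id=7, entry_id=0, data=b"payload")
from_memory = bookie.read(7, 0)  # served from the in-memory buffer
bookie.flush()
from_disk = bookie.read(7, 0)    # served from the entry log after the flush
print(acked, from_memory == from_disk)
```

In this model the writer waits only for a sequential journal append, and consumers that are caught up read from memory, which is the intuition behind the low‑latency claim above.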

For detailed latency comparisons between Pulsar and Kafka, see the referenced blog post: https://blog.csdn.net/zhaijia03/article/details/111602732 .


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
