How Apache Pulsar Stores, Acknowledges, and Retains Messages
This article explains Apache Pulsar’s core architecture—including its message model, producer and consumer behavior, subscription types, storage hierarchy, acknowledgment mechanisms, cursor handling, backlog concepts, retention policies, and storage size calculations—helping readers understand why storage size can exceed backlog size.
Background
Apache Pulsar is an Apache Software Foundation top-level project that provides a cloud-native distributed messaging platform. It separates compute from storage and supports multi-tenant deployments, persistent storage, and cross-region replication, while offering strong consistency, high throughput, low latency, and horizontal scalability.
Message model
Pulsar follows the classic publish‑subscribe model: producers append messages to topics and consumers read from subscriptions.
Producer
When a producer sends a message to a partitioned topic, the producer client chooses the partition to write to:
If a message key is set, the key is hashed and the message is routed to the partition selected by that hash, so messages with the same key land in the same partition.
If no key is set, the client's default routing mode distributes messages round-robin across the partitions, similar to Kafka.
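The two routing rules above can be sketched as a toy router. This is a minimal illustration, not Pulsar's implementation: the class name PartitionRouter is hypothetical, and CRC32 is used only as a deterministic stand-in for whatever hash function the real client is configured with.

```python
import zlib
from itertools import count
from typing import Optional


class PartitionRouter:
    """Toy router: key hash when a key is present, round-robin otherwise."""

    def __init__(self, num_partitions: int) -> None:
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        self._next = count()  # round-robin counter used for keyless messages

    def choose(self, key: Optional[str] = None) -> int:
        if key is not None:
            # Assumption: CRC32 stands in for the client's real hash function.
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        return next(self._next) % self.num_partitions


router = PartitionRouter(num_partitions=4)
keyed = [router.choose("order-42") for _ in range(3)]  # same partition each time
keyless = [router.choose() for _ in range(6)]          # 0, 1, 2, 3, 0, 1
```

The point of the sketch is the ordering guarantee: all messages that share a key end up in one partition, while keyless messages spread evenly.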
Consumer & subscription layer
Consumers attach to a subscription, which abstracts access to a topic. Each subscription sees every message in the topic and tracks its own consumption progress, enabling both queue-style (point-to-point) and streaming-style (publish-subscribe) consumption. Pulsar defines four subscription types:
Exclusive – only a single consumer can attach.
Failover – multiple consumers, but only the highest‑priority consumer receives messages; others take over on failure.
Shared – messages are distributed among all attached consumers (load‑balanced).
Key_Shared – messages with the same key are delivered to the same consumer while still allowing multiple consumers.
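The difference between the four types is easiest to see in how a single subscription hands messages to its consumers. The following is a simplified model under stated assumptions: the dispatch function is hypothetical, Failover always treats the first consumer as the active one, and Key_Shared uses a plain CRC32 modulo rather than the hash-range assignment a real broker performs.

```python
import zlib
from typing import List, Optional, Tuple

Message = Tuple[Optional[str], str]  # (key, payload)


def dispatch(sub_type: str, consumers: List[str],
             messages: List[Message]) -> List[Tuple[str, str]]:
    """Return (consumer, payload) pairs for one subscription."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    if sub_type == "Exclusive" and len(consumers) > 1:
        raise ValueError("Exclusive allows only one consumer")
    deliveries = []
    for index, (key, payload) in enumerate(messages):
        if sub_type in ("Exclusive", "Failover"):
            target = consumers[0]  # single active consumer
        elif sub_type == "Shared":
            target = consumers[index % len(consumers)]  # load-balanced
        elif sub_type == "Key_Shared":
            # Assumption: CRC32 modulo stands in for hash-range assignment.
            digest = zlib.crc32((key or "").encode("utf-8"))
            target = consumers[digest % len(consumers)]
        else:
            raise ValueError(f"unknown subscription type: {sub_type}")
        deliveries.append((target, payload))
    return deliveries


messages = [("k1", "a"), ("k2", "b"), ("k1", "c"), ("k2", "d")]
shared = dispatch("Shared", ["c1", "c2"], messages)
key_shared = dispatch("Key_Shared", ["c1", "c2"], messages)
```

In the Shared result the payloads alternate between c1 and c2; in the Key_Shared result "a" and "c" (both keyed k1) always go to the same consumer.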
Storage model
Messages are stored exactly once in the distributed log of each partitioned topic, regardless of how many subscriptions exist. The storage hierarchy is:
Topic (partition)
Ledger (BookKeeper ledger)
Segment (BookKeeper segment)
Entry – a single message or a batch of messages.
Key points:
A batch is treated as a single entry on the broker; batch parsing happens on the consumer side.
The smallest unit that can be deleted from BookKeeper is a segment, not an individual entry.
Acknowledgment (Ack) mechanisms
Pulsar supports two Ack modes:
AckIndividual – acknowledge a specific message by its MessageId.
AckCumulative – acknowledge all messages up to a given MessageId in one operation.
Cursor management
When a subscription is created, a cursor is initialized (default position is Latest). The cursor records the highest acknowledged position for that subscription. After an Ack, the cursor moves forward, but the underlying data is not deleted immediately.
Cursor position can be expressed as:
Cursor = Offset + IndividualDeletes
In AckIndividual scenarios, gaps (holes) may appear because later messages can be acknowledged while earlier ones remain unacknowledged.
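A minimal sketch of the "offset plus individual deletes" idea follows. It is a toy model, not Pulsar's cursor code: positions are plain integers (a real position is a ledger ID plus an entry ID), and the names Cursor, mark_delete, and individual_deletes are chosen for illustration.

```python
class Cursor:
    """Toy cursor: a contiguous offset plus a set of out-of-order acks."""

    def __init__(self, start: int = -1) -> None:
        self.mark_delete = start        # every position <= mark_delete is acked
        self.individual_deletes = set()  # acked positions beyond mark_delete

    def ack_individual(self, pos: int) -> None:
        if pos <= self.mark_delete:
            return  # already covered by the offset
        self.individual_deletes.add(pos)
        self._advance()

    def ack_cumulative(self, pos: int) -> None:
        if pos > self.mark_delete:
            self.mark_delete = pos
        # Drop individual acks that the new offset now covers.
        self.individual_deletes = {
            p for p in self.individual_deletes if p > self.mark_delete
        }
        self._advance()

    def _advance(self) -> None:
        # Close holes: move the offset while the next position is acked.
        while self.mark_delete + 1 in self.individual_deletes:
            self.mark_delete += 1
            self.individual_deletes.remove(self.mark_delete)


cursor = Cursor()
cursor.ack_individual(0)   # offset moves to 0
cursor.ack_individual(2)   # hole at position 1
cursor.ack_individual(3)
hole_state = (cursor.mark_delete, sorted(cursor.individual_deletes))  # (0, [2, 3])
cursor.ack_individual(1)   # hole closes, offset jumps to 3
closed_state = (cursor.mark_delete, sorted(cursor.individual_deletes))  # (3, [])
```

While position 1 is unacknowledged the offset cannot pass it, even though positions 2 and 3 are done. That stalled offset is what keeps older data pinned in storage.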
Backlog
Backlog represents unconsumed data:
Topic backlog – the backlog of the slowest subscription, that is, the largest backlog among all subscriptions on the topic.
Subscription backlog – unconsumed data for a specific subscription.
Two metrics are exposed:
msgBacklog – count of unacknowledged entries (batches are counted as one entry).
backlogSize – total size in bytes of unacknowledged messages.
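As a worked example of the two metrics, the sketch below computes an entry count and a byte total for each subscription and then takes the slowest one as the topic backlog. It is a simplified model: the function name subscription_backlog is hypothetical, positions are list indexes, and how a real broker accounts for individually acknowledged entries in its size estimate may differ.

```python
from typing import Dict, FrozenSet, List, Tuple


def subscription_backlog(entry_sizes: List[int], mark_delete: int,
                         individual_deletes: FrozenSet[int] = frozenset()
                         ) -> Tuple[int, int]:
    """Return (unacknowledged entry count, unacknowledged bytes)."""
    pending = [
        i for i in range(mark_delete + 1, len(entry_sizes))
        if i not in individual_deletes
    ]
    return len(pending), sum(entry_sizes[i] for i in pending)


entry_sizes = [100, 200, 300, 400, 500]  # bytes per entry, oldest first

backlogs: Dict[str, Tuple[int, int]] = {
    # fast subscription: everything up to position 3 is acknowledged
    "fast": subscription_backlog(entry_sizes, mark_delete=3),
    # slow subscription: offset at 0, position 2 acknowledged out of order
    "slow": subscription_backlog(entry_sizes, 0, frozenset({2})),
}

# Topic backlog: the slowest subscription, measured here by bytes.
topic_backlog = max(backlogs.values(), key=lambda pair: pair[1])
```

Here "fast" has a backlog of (1, 500) and "slow" has (3, 1100), so the topic backlog is (3, 1100).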
Retention policy
Retention controls how long acknowledged messages are kept before they become eligible for deletion. It is configured per namespace or per topic with two parameters:
size – maximum storage size (0 = disabled, -1 = unlimited).
time – maximum retention time (0 = disabled, -1 = unlimited).
When either threshold is exceeded, messages enter a “ready‑to‑delete” state. Actual deletion occurs only at the segment level, because BookKeeper can only delete whole segments.
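The interaction between acknowledgment, the two retention limits, and segment-level deletion can be sketched as follows. This is an illustrative model only: Segment and deletable_segments are hypothetical names, time is in seconds and size in bytes for simplicity, and the real trimming logic in the broker has more conditions than shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    last_entry: int    # position of the final entry in the segment
    size_bytes: int
    closed_at: float   # time the segment was closed, in seconds


def deletable_segments(segments: List[Segment], slowest_mark_delete: int,
                       now: float, retention_seconds: float,
                       retention_bytes: int) -> List[Segment]:
    """Segments fully acknowledged by every subscription and past retention.

    A limit of -1 means unlimited; 0 means no retention on that dimension.
    """
    # Only segments whose last entry is acknowledged by the slowest cursor.
    acked = [s for s in segments if s.last_entry <= slowest_mark_delete]
    retained = sum(s.size_bytes for s in acked)
    result = []
    for seg in acked:  # oldest first
        too_old = retention_seconds >= 0 and now - seg.closed_at > retention_seconds
        too_big = retention_bytes >= 0 and retained > retention_bytes
        if not (too_old or too_big):
            break  # newer segments are still protected by retention
        result.append(seg)
        retained -= seg.size_bytes
    return result


segments = [Segment(9, 1000, 0.0), Segment(19, 1000, 600.0), Segment(29, 1000, 1200.0)]
by_time = deletable_segments(segments, slowest_mark_delete=24, now=1500.0,
                             retention_seconds=1000, retention_bytes=-1)
```

With these numbers only the first segment is returned: the second is fully acknowledged but still inside the 1000-second window, and the third is only partly acknowledged (up to position 24), so it cannot be removed at all.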
Time‑to‑live (TTL) for cursors
Message TTL is configured per namespace or per topic and applies to every subscription's cursor. If a message stays unacknowledged for longer than the configured TTL, Pulsar automatically moves the cursor past it, as if the message had been acknowledged. TTL affects cursor advancement only; it does not delete messages.
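The TTL behaviour can be illustrated with a small function that advances a stalled cursor past entries older than the TTL. This is a sketch under simple assumptions: apply_ttl is a hypothetical name, positions are list indexes, and publish times are given in seconds in ascending order.

```python
from typing import List


def apply_ttl(publish_times: List[float], mark_delete: int,
              now: float, ttl_seconds: float) -> int:
    """Return the new cursor offset after expiring entries older than the TTL."""
    pos = mark_delete
    # Skip forward while the next unacknowledged entry has outlived the TTL.
    while pos + 1 < len(publish_times) and now - publish_times[pos + 1] > ttl_seconds:
        pos += 1
    return pos


publish_times = [0.0, 10.0, 20.0, 30.0, 40.0]
new_offset = apply_ttl(publish_times, mark_delete=-1, now=45.0, ttl_seconds=20.0)
# Entries 0, 1, 2 are older than 20 s, so the offset moves to 2.
```

Nothing is removed from the list of entries; only the offset changes, which matches the statement that TTL does not delete messages by itself.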
Clearing backlog
Administrators can forcibly clear a subscription’s backlog using the Pulsar admin CLI:
pulsar-admin topics clear-backlog -f my-tenant/my-namespace/my-topic -s my-subscription
The -f flag suppresses the confirmation prompt.
Storage size vs. backlog size
Because individual acknowledgments can leave holes that stall a cursor, retention keeps acknowledged messages for a while, and deletion works on whole segments, storageSize (total persisted bytes) can be larger than backlogSize (bytes of unacknowledged data). A message is deleted only when all subscriptions have acknowledged it and the retention thresholds no longer protect it, at which point the containing segment can be removed.
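A small arithmetic example makes the gap concrete. The sketch assumes equally sized segments and entries, retention disabled (fully acknowledged segments are removed immediately), and integer positions; storage_and_backlog is a hypothetical helper, not a Pulsar API.

```python
from typing import Dict, Tuple


def storage_and_backlog(num_segments: int, entries_per_segment: int,
                        entry_size: int,
                        mark_deletes: Dict[str, int]) -> Tuple[int, int]:
    """Return (storage bytes, backlog bytes of the slowest subscription)."""
    slowest = min(mark_deletes.values())
    total_entries = num_segments * entries_per_segment
    # Segment i holds entries [i * n, (i + 1) * n - 1]; it can be removed
    # only when its last entry is acknowledged by the slowest subscription.
    removed = sum(
        1 for i in range(num_segments)
        if (i + 1) * entries_per_segment - 1 <= slowest
    )
    storage = (num_segments - removed) * entries_per_segment * entry_size
    backlog = (total_entries - 1 - slowest) * entry_size
    return storage, backlog


# 3 segments of 10 entries, 100 bytes each; sub-b has acknowledged up to 14.
storage_size, backlog_size = storage_and_backlog(
    num_segments=3, entries_per_segment=10, entry_size=100,
    mark_deletes={"sub-a": 29, "sub-b": 14},
)
```

The first segment (entries 0 to 9) is gone, but the second must be kept whole because entries 15 to 19 are still unacknowledged. Storage is therefore 2000 bytes while the backlog is 1500 bytes; the 500-byte difference is the acknowledged half of the second segment. Any retention window would widen the gap further.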
Summary of key concepts
Each message is stored once per partitioned topic.
Cursors track consumption state for each subscription.
Cursor = offset (Kafka‑like) + individual deletes.
Acks update the cursor position; cumulative acks move it forward in bulk.
A message becomes deletable only after all subscriptions have acknowledged it.
Unacknowledged messages remain in the subscription backlog.
TTL can automatically advance a cursor when it stalls.
Retention determines how long acknowledged messages are retained before they become eligible for segment‑level deletion.
Deletion operates at the segment granularity, not per entry.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.