How Apache Pulsar Achieves Sub‑millisecond Write Latency on NVMe

This article explains Apache Pulsar's architecture, client‑to‑broker and broker‑to‑bookie latency components, data storage model, write path, journal flush strategies, and presents detailed benchmark results showing sub‑millisecond write latency and up to 1.5 million TPS on NVMe storage.

Apache Pulsar is a high‑performance message queue with a compute‑storage separation architecture, offering advanced features such as delayed messages and cross‑cluster geo‑replication, making it suitable for service decoupling in serverless designs.

Client Write Latency

The client write latency consists of two stages: (1) the client sends data to the broker, and (2) the broker concurrently writes to multiple Bookie replicas for persistence.

Broker Latency

Analysis of the pulsar_broker_publish_latency metric together with thread‑level flame graphs shows that most broker‑side latency comes from a small number of network I/O operations.
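
For a quick look at this metric without a full Prometheus setup, the broker's metrics endpoint can be scraped directly; the address below reuses the article's placeholder and should be replaced with a real broker host:

# Dump the broker's publish-latency metric samples
curl -s http://192.0.0.1:8080/metrics/ | grep pulsar_broker_publish_latency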

Data Storage Model

Pulsar's ledger files play a role similar to RocksDB's commit logs, but perform better because entries are sorted in batches before being flushed, so each topic's data is written out in locally ordered runs.

The bookie uses multi‑level caching: writes go to the WAL (journal) and to the WriteCache; reads are served from the WriteCache first, fall back to the RocksDB‑indexed ledger storage when needed, and the fetched entries then populate the ReadCache.

Only one ledger file is open for writing per disk at any time, much like RocksDB's single active log file.
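
Both caches are sized in bookkeeper.conf; when dbStorage_writeCacheMaxSizeMb is left unset, the write cache defaults to a quarter of the direct memory. The values below are illustrative, not the article's tuning:

# Write cache, flushed to ledger storage in sorted batches
dbStorage_writeCacheMaxSizeMb=512
# Read-ahead cache populated when entries are read back from ledger storage
dbStorage_readAheadCacheMaxSizeMb=256
# Number of entries read ahead on a cache miss
dbStorage_readAheadCacheBatchSize=1000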

Write Process

The write path proceeds as follows:

1. The broker selects an ensemble of bookies for the ledger and issues the write to all replicas in parallel.

2. Each bookie places the entry in its WriteCache (by default ¼ of the direct, off‑heap memory), which is flushed to ledger storage asynchronously in sorted batches.

3. The entry is also written to the journal (WAL) and flushed to disk; journal and ledger storage can live on separate devices, as shown in the sketch after these steps.

4. The journal thread copies entries into an in‑memory buffer, which is written to the page cache when it fills.

5. The force‑write thread fsyncs the journal data from the page cache to disk.

6. Once the required bookies have acknowledged, the broker responds to the client and the send callback completes.
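
As a deployment‑level note (not part of the article's benchmark configuration), bookkeeper.conf lets the journal and ledger storage sit on separate NVMe devices so that sequential WAL writes do not contend with background entry‑log flushes; the paths here are illustrative:

# Journal (WAL) on its own device
journalDirectories=/mnt/nvme0/bookkeeper/journal
# Entry logs and index on other devices
ledgerDirectories=/mnt/nvme1/bookkeeper/ledgers,/mnt/nvme2/bookkeeper/ledgers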

Journal Flush Strategy

BookKeeper's journal uses group commit with three flush conditions: the maximum group‑wait time elapses, the buffered journal data reaches a size threshold, or the journal queue becomes empty. Meeting any one of them triggers a flush. The settings below enable synchronous journaling and flush‑on‑empty‑queue:

# Enable forced flush after data reaches PageCache
journalSyncData=true
# Enable journal writing
journalWriteData=true
# Flush when the journal queue is empty
journalFlushWhenQueueEmpty=true
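
The time‑ and size‑based triggers mentioned above are controlled by two further bookkeeper.conf settings; the values here are illustrative, not the article's tuning:

# Maximum time to wait before flushing a group of journal writes (milliseconds)
journalMaxGroupWaitMSec=1
# Flush once this many bytes of journal data have been buffered
journalBufferedWritesThreshold=524288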

Additionally, aligning journal block size with SSD sector size (4 KB) improves performance:

# Align journal and read buffer sizes to 4096 bytes
journalAlignmentSize=4096
readBufferSizeBytes=4096

Flame Graph Analysis

After tuning the above parameters, flame graphs show no abnormal blocking points in the write path.
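
The article does not show the profiling commands; one common way to capture such per‑thread flame graphs is async-profiler in wall‑clock mode against the broker or bookie JVM (the profiler path and PID are placeholders):

# Sample wall-clock time per thread for 60 seconds and emit a flame graph
./profiler.sh -e wall -t -d 60 -f pulsar-write-path.html <broker-or-bookie-pid>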

Performance Test Setup

Test environment: two broker nodes (4 CPUs, 16 GB RAM, 25 Gbps network each) and three bookie nodes (4 × 4 TB NVMe, 25 Gbps network each). Tests cover synchronous, asynchronous, and compressed workloads with varying message sizes.

# Create a single‑partition topic
bin/pulsar-admin --admin-url http://192.0.0.1:8080 topics create-partitioned-topic persistent://public/default/test_qps -p 1
# Set replication factors
bin/pulsar-admin --admin-url http://192.0.0.1:8080 namespaces set-persistence public/default \
  --bookkeeper-ensemble 2 \
  --bookkeeper-write-quorum 2 \
  --bookkeeper-ack-quorum 2
# Synchronous test (single thread, no batching)
bin/pulsar-perf produce persistent://public/default/test_qps \
  -u pulsar://192.0.0.1:6650 \
  --disable-batching --batch-max-messages 1 --max-outstanding 1 \
  --rate 500000 --test-duration 120 --busy-wait --size 1024 > 1024.log &
# Asynchronous test with compression
export OPTS="-Xms10g -Xmx10g -XX:MaxDirectMemorySize=10g"
bin/pulsar-perf produce persistent://public/default/test_qps_async \
  -u pulsar://192.0.0.1:6650 \
  --batch-max-messages 10000 --memory-limit 2G --rate 2000000 \
  --busy-wait --compression LZ4 --size 1024 > 1024.log &

Test Results

In synchronous mode, a single message round‑trip (client → broker → Bookies → client) takes about 0.3 ms. In asynchronous batch mode, a single client can achieve 1 million writes per second; with compression, throughput rises to 1.5 million TPS while preserving order.

Sync latency is dominated by network round‑trip.

Async mode batches many messages into each broker request and disk flush, amortizing the per‑message cost and greatly increasing throughput.

A single producer on a single partition is served by one client I/O thread, so adding more client threads does not materially change the observed latency.
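
The benchmarks above use a single producer on a single partition. As an illustrative variation that is not part of the article's tests, throughput can be spread across more I/O threads by adding partitions, producers, and test threads:

# Hypothetical scale-out run: 4 partitions, multiple producers and client threads
bin/pulsar-admin --admin-url http://192.0.0.1:8080 topics create-partitioned-topic persistent://public/default/test_qps_multi -p 4
bin/pulsar-perf produce persistent://public/default/test_qps_multi \
  -u pulsar://192.0.0.1:6650 \
  -n 2 -threads 4 --batch-max-messages 10000 --rate 2000000 --size 1024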

Disk Performance

Using fio to simulate single‑threaded page‑cache writes with synchronous flush on NVMe yields ~18 µs for page‑cache write and ~26 µs for fsync, totaling ~44 µs. Adding thread‑switch and queue overhead, a full write‑and‑flush in Bookie costs roughly 100 µs. Network latency within the same AZ adds ~0.05 ms.

Consequently, on NVMe storage, a Pulsar client write that succeeds on two replicas and receives the acknowledgment averages 0.3 ms.
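
The exact fio invocation is not given in the article; a minimal command in the same spirit, issuing single‑threaded 4 KB sequential writes with a data sync after each write (file path and size are illustrative), is:

# Single-threaded 4 KB sequential writes, fdatasync after every write
fio --name=journal-sim --filename=/mnt/nvme0/fio.test --rw=write --bs=4k \
    --size=1G --ioengine=psync --fdatasync=1 --numjobs=1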

Written by

360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
