
How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for kop lag reporting issues.

vivo Internet Technology

Background

Vivo adopted Apache Pulsar as its next‑generation messaging middleware because it offers a superior architecture and feature set compared with Kafka. To keep performance reliable through exercises such as load and fault testing, a cloud‑native observability system was essential.

Pulsar Monitoring and Alerting Architecture

Prometheus is used to scrape Pulsar metrics, but its built‑in TSDB is not distributed and consumes significant memory and CPU. To offload storage and query load, a remote storage layer (Pulsar or Kafka) is added. The overall architecture is illustrated below.

Pulsar monitoring architecture diagram

The components function as follows:

Pulsar, BookKeeper and other services expose metric endpoints.

Prometheus scrapes these endpoints.

An adaptor discovers services, deserializes Prometheus‑format metrics, and forwards them to remote storage (Pulsar or Kafka).

Druid consumes the metric topics and provides analytical capabilities.

Vivo's internal alert platform dynamically configures detection rules.

Key Issues and Solutions

1. Adaptor Development for Prometheus Integration

The adaptor needed to ingest all online Pulsar services and support dynamic Prometheus configuration. It implements service discovery and auto‑loading of scrape configs, referencing the official Prometheus documentation for dynamic configuration and custom service discovery.
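The deserialization step at the heart of such an adaptor — turning scraped Prometheus text into structured records that can be forwarded to a topic — can be sketched as follows. This is an illustrative stdlib-only sketch, not Vivo's actual adaptor code; the metric name and labels in the sample are made up.

```python
import re

# Minimal parser for the Prometheus text exposition format (illustrative
# sketch only; the real adaptor also handles service discovery and
# forwarding the records to Pulsar/Kafka).
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                     r'(\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)')

def parse_metrics(text):
    """Turn scraped Prometheus text into (name, labels, value) records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip HELP/TYPE/comment lines
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group('labels'):
            for key, val in re.findall(r'(\w+)="([^"]*)"', m.group('labels')):
                labels[key] = val
        records.append((m.group('name'), labels, float(m.group('value'))))
    return records

sample = '''# TYPE pulsar_msg_backlog gauge
pulsar_msg_backlog{cluster="test",topic="persistent://t/ns/biz-log"} 42
'''
print(parse_metrics(sample))
```

Each record can then be serialized (e.g. as JSON) and produced to the remote-storage topic that Druid consumes.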

2. Druid Plugin Enhancement for Counter Metrics

Grafana’s Druid plugin originally could not meaningfully render Counter‑type metrics, whose raw values only ever increase. Vivo’s Druid team added a rate‑calculation aggregation function so these metrics can be visualized as per‑second rates.
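The internal aggregation function itself is not published; the kind of rate calculation it performs over counter samples can be sketched like this, including the usual handling for counter resets after a process restart (an assumption about the behavior, in the style of Prometheus's own rate functions):

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    Counters only increase; a drop between consecutive samples means the
    process restarted, so the delta after a reset is taken from zero.
    Illustrative sketch of a rate aggregation over counter metrics.
    """
    if len(samples) < 2:
        return 0.0
    total = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        total += v1 - v0 if v1 >= v0 else v1  # counter-reset handling
    elapsed = samples[-1][0] - samples[0][0]
    return total / elapsed if elapsed > 0 else 0.0

# 100 messages over 20 seconds -> 5 msg/s
print(counter_rate([(0, 0), (10, 40), (20, 100)]))  # 5.0
```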

3. Prometheus Parameter Tuning

--storage.tsdb.retention=30m : keeps data for 30 minutes only.

--storage.tsdb.min-block-duration=5m : sets the minimum block size to 5 minutes, limiting memory usage.

--storage.tsdb.max-block-duration=5m : caps block size at 5 minutes to avoid large blocks.

--enable-feature=memory-snapshot-on-shutdown : writes an in‑memory snapshot to disk on shutdown for faster restarts.
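Put together, a launch command combining these flags might look like the following; the config-file path is a placeholder, not Vivo's actual deployment:

```shell
# Illustrative Prometheus launch command with the tuning flags above.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention=30m \
  --storage.tsdb.min-block-duration=5m \
  --storage.tsdb.max-block-duration=5m \
  --enable-feature=memory-snapshot-on-shutdown
```

With remote storage taking over long-term retention, Prometheus only needs to hold a short local window, which keeps its memory footprint small.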

Pulsar Metric Optimizations

3.1 Subscription Backlog Metric

The native Pulsar backlog metric reports entries, which confuses users migrating from Kafka who expect message counts. Enabling

brokerEntryMetadataInterceptors=org.apache.pulsar.common.intercept.AppendIndexMetadataInterceptor

makes Pulsar attach a monotonically increasing index to each entry's metadata, allowing the backlog to be calculated in messages per partition.

biz-log-partition-1 -l 24622961 -e 6
Batch Message ID: 24622961:6:0
Publish time: 1676917007607
Event time: 0
Broker entry metadata index: 157398560244
Properties:
"X-Pulsar-batch-size    2431"
"X-Pulsar-num-batch-message    50"

Visualization of the new backlog metric (in messages) versus the native metric (in entries) shows a clearer increase when sending large batches.

Backlog metric comparison
Backlog metric chart

3.2 Bundle Load Metrics

Pulsar’s automatic load‑balancing operates at the bundle level. To aid tuning, Vivo collects bundle traffic distribution and producer/consumer connection counts, also marking idle (unassigned) bundles.
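Aggregating per-topic throughput up to the bundle level, while still surfacing unassigned bundles at zero load, can be sketched as below. The bundle ranges and rates are made-up illustrative values, not Vivo's metric schema:

```python
from collections import defaultdict

def bundle_load(topic_stats, all_bundles):
    """Aggregate per-topic rates into per-bundle load; mark idle bundles.

    topic_stats: iterable of (bundle, msg_rate_in) pairs, as could be
    assembled from broker topic stats (field names are illustrative).
    all_bundles: every bundle of the namespace, so unassigned/idle
    bundles still appear, with a load of zero.
    """
    load = defaultdict(float)
    for bundle, rate in topic_stats:
        load[bundle] += rate
    return {b: load.get(b, 0.0) for b in all_bundles}

stats = [("0x0_0x4", 120.0), ("0x0_0x4", 30.0), ("0x4_0x8", 10.0)]
bundles = ["0x0_0x4", "0x4_0x8", "0x8_0xc"]
print(bundle_load(stats, bundles))
# {'0x0_0x4': 150.0, '0x4_0x8': 10.0, '0x8_0xc': 0.0}
```

Skewed or idle bundles in this view are candidates for manual splitting or for tuning the automatic load-balancing thresholds.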

Bundle load metrics

3.3 kop Consumption Lag Reporting Issue

After restarting the kop Coordinator, consumption‑lag metrics sometimes drop to zero because the Kafka‑lag‑exporter filters out group members whose assignment field is missing. The root cause is that the new broker does not deserialize the assignment when the group metadata is transferred to it.

Fix: modify GroupMetadataConstants#readGroupMessageValue() to correctly read and set the assignment field during deserialization.
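The deserialization that must happen when group metadata moves to a new broker can be sketched as follows. The byte layout mirrors Kafka's ConsumerProtocolAssignment (int16 version, then an int32-counted array of topic/partitions entries); this is a simplified illustration that omits the trailing user-data field, not the actual kop patch, which is written in Java.

```python
import struct

def read_assignment(buf):
    """Deserialize a Kafka consumer assignment blob.

    Layout (big-endian, per Kafka's ConsumerProtocolAssignment):
    int16 version; int32 topic count; per topic an int16-length UTF-8
    string and an int32-counted array of int32 partitions. The trailing
    user-data bytes are omitted in this sketch.
    """
    offset = 0
    version, topic_count = struct.unpack_from('>hi', buf, offset)
    offset += 6
    assignment = {}
    for _ in range(topic_count):
        (name_len,) = struct.unpack_from('>h', buf, offset)
        offset += 2
        topic = buf[offset:offset + name_len].decode('utf-8')
        offset += name_len
        (part_count,) = struct.unpack_from('>i', buf, offset)
        offset += 4
        parts = list(struct.unpack_from(f'>{part_count}i', buf, offset))
        offset += 4 * part_count
        assignment[topic] = parts
    return version, assignment

# Round-trip a hand-built blob: version 1, topic "biz-log", partitions [0, 1].
blob = (struct.pack('>hi', 1, 1) + struct.pack('>h', 7) + b'biz-log'
        + struct.pack('>iii', 2, 0, 1))
print(read_assignment(blob))  # (1, {'biz-log': [0, 1]})
```

Once the assignment bytes are read back into the member metadata, the lag exporter no longer drops the member, and consumption lag is reported correctly after a Coordinator restart.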

Kafka lag exporter flow
GroupMetadata deserialization
Assignment field fix

Conclusion

Through the construction of a cloud‑native Pulsar observability system, Vivo addressed user‑experience, operational efficiency, and availability challenges, accelerating Pulsar adoption. Nevertheless, single‑point bottlenecks remain, indicating ongoing challenges for future scaling.
