
How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for kop lag reporting issues.

vivo Internet Technology

Background

Vivo adopted Apache Pulsar as its next‑generation messaging middleware because it offers a superior architecture and feature set compared with Kafka. To keep performance reliable through exercises such as load and fault testing, a cloud‑native observability system was essential.

Pulsar Monitoring and Alerting Architecture

Prometheus is used to scrape Pulsar metrics, but its built‑in TSDB is not distributed and consumes significant memory and CPU. To offload storage and query load, a remote storage layer (Pulsar or Kafka) is added. The overall architecture is illustrated below.

Pulsar monitoring architecture diagram

The components function as follows:

Pulsar, BookKeeper and other services expose metric endpoints.

Prometheus scrapes these endpoints.

An adaptor discovers services, deserializes Prometheus‑format metrics, and forwards them to remote storage (Pulsar or Kafka).

Druid consumes the metric topics and provides analytical capabilities.

Vivo's internal alert platform dynamically configures detection rules.

Key Issues and Solutions

1. Adaptor Development for Prometheus Integration

The adaptor needed to ingest all online Pulsar services and support dynamic Prometheus configuration. It implements service discovery and auto‑loading of scrape configs, referencing the official Prometheus documentation for dynamic configuration and custom service discovery.
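The deserialization step at the heart of such an adaptor — turning scraped Prometheus text into structured records that can be forwarded to a topic — can be sketched as follows. This is an illustrative stdlib-only sketch, not Vivo's actual adaptor code; the metric name and labels in the sample are made up.

```python
import re

# Minimal parser for the Prometheus text exposition format (illustrative
# sketch only; the real adaptor also handles service discovery and
# forwarding the records to Pulsar/Kafka).
LINE_RE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                     r'(\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)')

def parse_metrics(text):
    """Turn scraped Prometheus text into (name, labels, value) records."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):  # skip HELP/TYPE/comment lines
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        labels = {}
        if m.group('labels'):
            for key, val in re.findall(r'(\w+)="([^"]*)"', m.group('labels')):
                labels[key] = val
        records.append((m.group('name'), labels, float(m.group('value'))))
    return records

sample = '''# TYPE pulsar_msg_backlog gauge
pulsar_msg_backlog{cluster="test",topic="persistent://t/ns/biz-log"} 42
'''
print(parse_metrics(sample))
```

Each record can then be serialized (e.g. as JSON) and produced to the remote-storage topic that Druid consumes.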

2. Druid Plugin Enhancement for Counter Metrics

Grafana’s Druid plugin originally could not meaningfully render Counter‑type metrics, whose raw values only ever increase. Vivo’s Druid team added a rate‑calculation aggregation function so these metrics can be visualized as per‑second rates.
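The internal aggregation function itself is not published; the kind of rate calculation it performs over counter samples can be sketched like this, including the usual handling for counter resets after a process restart (an assumption about the behavior, in the style of Prometheus's own rate functions):

```python
def counter_rate(samples):
    """Per-second rate from (timestamp_seconds, counter_value) samples.

    Counters only increase; a drop between consecutive samples means the
    process restarted, so the delta after a reset is taken from zero.
    Illustrative sketch of a rate aggregation over counter metrics.
    """
    if len(samples) < 2:
        return 0.0
    total = 0.0
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        total += v1 - v0 if v1 >= v0 else v1  # counter-reset handling
    elapsed = samples[-1][0] - samples[0][0]
    return total / elapsed if elapsed > 0 else 0.0

# 100 messages over 20 seconds -> 5 msg/s
print(counter_rate([(0, 0), (10, 40), (20, 100)]))  # 5.0
```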

3. Prometheus Parameter Tuning

--storage.tsdb.retention=30m : keeps data for 30 minutes only.

--storage.tsdb.min-block-duration=5m : sets the minimum block size to 5 minutes, limiting memory usage.

--storage.tsdb.max-block-duration=5m : caps block size at 5 minutes to avoid large blocks.

--enable-feature=memory-snapshot-on-shutdown : writes an in‑memory snapshot to disk on shutdown for faster restarts.
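Put together, a launch command combining these flags might look like the following; the config-file path is a placeholder, not Vivo's actual deployment:

```shell
# Illustrative Prometheus launch command with the tuning flags above.
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention=30m \
  --storage.tsdb.min-block-duration=5m \
  --storage.tsdb.max-block-duration=5m \
  --enable-feature=memory-snapshot-on-shutdown
```

With remote storage taking over long-term retention, Prometheus only needs to hold a short local window, which keeps its memory footprint small.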

Pulsar Metric Optimizations

3.1 Subscription Backlog Metric

The native Pulsar backlog metric reports entries, which confuses users migrating from Kafka who expect message counts. Enabling

brokerEntryMetadataInterceptors=org.apache.pulsar.common.intercept.AppendIndexMetadataInterceptor

makes Pulsar attach a monotonically increasing index to each entry's metadata, allowing the backlog to be calculated in messages per partition.

biz-log-partition-1 -l 24622961 -e 6
Batch Message ID: 24622961:6:0
Publish time: 1676917007607
Event time: 0
Broker entry metadata index: 157398560244
Properties:
"X-Pulsar-batch-size    2431"
"X-Pulsar-num-batch-message    50"

Visualization of the new backlog metric (in messages) versus the native metric (in entries) shows a clearer increase when sending large batches.

Backlog metric comparison
Backlog metric chart

3.2 Bundle Load Metrics

Pulsar’s automatic load‑balancing operates at the bundle level. To aid tuning, Vivo collects bundle traffic distribution and producer/consumer connection counts, also marking idle (unassigned) bundles.
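Aggregating per-topic throughput up to the bundle level, while still surfacing unassigned bundles at zero load, can be sketched as below. The bundle ranges and rates are made-up illustrative values, not Vivo's metric schema:

```python
from collections import defaultdict

def bundle_load(topic_stats, all_bundles):
    """Aggregate per-topic rates into per-bundle load; mark idle bundles.

    topic_stats: iterable of (bundle, msg_rate_in) pairs, as could be
    assembled from broker topic stats (field names are illustrative).
    all_bundles: every bundle of the namespace, so unassigned/idle
    bundles still appear, with a load of zero.
    """
    load = defaultdict(float)
    for bundle, rate in topic_stats:
        load[bundle] += rate
    return {b: load.get(b, 0.0) for b in all_bundles}

stats = [("0x0_0x4", 120.0), ("0x0_0x4", 30.0), ("0x4_0x8", 10.0)]
bundles = ["0x0_0x4", "0x4_0x8", "0x8_0xc"]
print(bundle_load(stats, bundles))
# {'0x0_0x4': 150.0, '0x4_0x8': 10.0, '0x8_0xc': 0.0}
```

Skewed or idle bundles in this view are candidates for manual splitting or for tuning the automatic load-balancing thresholds.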

Bundle load metrics

3.3 kop Consumption Lag Reporting Issue

After restarting the kop Coordinator, consumption‑lag metrics sometimes drop to zero because the Kafka‑lag‑exporter filters out group members whose assignment field is missing. The root cause is that the new broker does not deserialize the assignment when the group metadata is transferred to it.

Fix: modify GroupMetadataConstants#readGroupMessageValue() to correctly read and set the assignment field during deserialization.
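The deserialization that must happen when group metadata moves to a new broker can be sketched as follows. The byte layout mirrors Kafka's ConsumerProtocolAssignment (int16 version, then an int32-counted array of topic/partitions entries); this is a simplified illustration that omits the trailing user-data field, not the actual kop patch, which is written in Java.

```python
import struct

def read_assignment(buf):
    """Deserialize a Kafka consumer assignment blob.

    Layout (big-endian, per Kafka's ConsumerProtocolAssignment):
    int16 version; int32 topic count; per topic an int16-length UTF-8
    string and an int32-counted array of int32 partitions. The trailing
    user-data bytes are omitted in this sketch.
    """
    offset = 0
    version, topic_count = struct.unpack_from('>hi', buf, offset)
    offset += 6
    assignment = {}
    for _ in range(topic_count):
        (name_len,) = struct.unpack_from('>h', buf, offset)
        offset += 2
        topic = buf[offset:offset + name_len].decode('utf-8')
        offset += name_len
        (part_count,) = struct.unpack_from('>i', buf, offset)
        offset += 4
        parts = list(struct.unpack_from(f'>{part_count}i', buf, offset))
        offset += 4 * part_count
        assignment[topic] = parts
    return version, assignment

# Round-trip a hand-built blob: version 1, topic "biz-log", partitions [0, 1].
blob = (struct.pack('>hi', 1, 1) + struct.pack('>h', 7) + b'biz-log'
        + struct.pack('>iii', 2, 0, 1))
print(read_assignment(blob))  # (1, {'biz-log': [0, 1]})
```

Once the assignment bytes are read back into the member metadata, the lag exporter no longer drops the member, and consumption lag is reported correctly after a Coordinator restart.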

Kafka lag exporter flow
GroupMetadata deserialization
Assignment field fix

Conclusion

Through the construction of a cloud‑native Pulsar observability system, Vivo addressed user‑experience, operational efficiency, and availability challenges, accelerating Pulsar adoption. Nevertheless, single‑point bottlenecks remain, indicating ongoing challenges for future scaling.
