How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads
This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for kop lag reporting issues.
Background
Vivo adopted Apache Pulsar as its next‑generation messaging middleware because it offers a superior architecture and feature set compared with Kafka. To verify that Pulsar performs reliably during evaluation work such as load and fault testing, a cloud‑native observability system was essential.
Pulsar Monitoring and Alerting Architecture
Prometheus is used to scrape Pulsar metrics, but its built‑in TSDB is not distributed and consumes significant memory and CPU. To offload storage and query load, a remote storage layer (Pulsar or Kafka) is added. The overall architecture is illustrated below.
The components function as follows:
Pulsar, BookKeeper and other services expose metric endpoints.
Prometheus scrapes these endpoints.
An adaptor discovers services, deserializes Prometheus‑format metrics, and forwards them to remote storage (Pulsar or Kafka).
Druid consumes the metric topics and provides analytical capabilities.
Vivo's internal alert platform dynamically configures detection rules.
Key Issues and Solutions
1. Adaptor Development for Prometheus Integration
The adaptor needed to ingest all online Pulsar services and support dynamic Prometheus configuration. It implements service discovery and auto‑loading of scrape configs, referencing the official Prometheus documentation for dynamic configuration and custom service discovery.
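The article does not show the adaptor's implementation. One common way to feed Prometheus dynamic targets is file‑based service discovery (file_sd_configs), where a sidecar rewrites a JSON target file that Prometheus re-reads automatically. A minimal sketch of such a writer; the target addresses, labels, and file path are illustrative, not Vivo's actual setup:

```python
import json
import os

def write_file_sd(targets, labels, path):
    """Write a Prometheus file_sd target group so scrape targets
    update dynamically without restarting Prometheus."""
    group = [{"targets": targets, "labels": labels}]
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(group, f, indent=2)
    # Atomic rename so Prometheus never reads a half-written file.
    os.replace(tmp, path)
    return group

group = write_file_sd(
    ["broker-1:8080", "broker-2:8080"],          # hypothetical brokers
    {"cluster": "pulsar-prod", "component": "broker"},
    "/tmp/pulsar_targets.json",
)
```

Prometheus picks up changes to the referenced file on its own; only the `file_sd_configs` entry in the scrape config needs to point at the path.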
2. Druid Plugin Enhancement for Counter Metrics
Grafana’s Druid plugin originally struggled with Counter‑type metrics. Vivo’s Druid team added a rate‑calculation aggregation function to improve visualization of such metrics.
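The rate function Vivo's Druid team added is not shown in the article; the core logic any such counter aggregation needs is an interval delta with counter‑reset handling. A minimal sketch under that assumption:

```python
def counter_rate(samples):
    """Per-interval rate for a monotonically increasing counter.

    samples: list of (timestamp_seconds, value) pairs, ordered by time.
    On a counter reset (value drops, e.g. after a broker restart),
    treat the post-reset value as the delta, as typical rate()
    implementations do.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0 if v1 >= v0 else v1  # reset: counter restarted near 0
        rates.append(delta / (t1 - t0))
    return rates

# A counter scraped every 10 s that resets between the 2nd and 3rd scrape:
print(counter_rate([(0, 100), (10, 400), (20, 50)]))  # [30.0, 5.0]
```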
3. Prometheus Parameter Tuning
--storage.tsdb.retention=30m : retains data for only 30 minutes, since long‑term storage and querying are handled downstream.
--storage.tsdb.min-block-duration=5m : sets the minimum TSDB block duration to 5 minutes, limiting how long samples accumulate in memory before being flushed.
--storage.tsdb.max-block-duration=5m : caps block duration at 5 minutes as well; with min and max equal, compaction into large blocks is effectively disabled.
--enable-feature=memory-snapshot-on-shutdown : writes an in‑memory snapshot to disk on shutdown for faster restarts.
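Put together, a launch command with these tuning flags might look as follows (config path is a placeholder):

```shell
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention=30m \
  --storage.tsdb.min-block-duration=5m \
  --storage.tsdb.max-block-duration=5m \
  --enable-feature=memory-snapshot-on-shutdown
```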
Pulsar Metric Optimizations
3.1 Subscription Backlog Metric
The native Pulsar backlog metric reports entries, which is confusing for users migrating from Kafka who expect message counts. Enabling brokerEntryMetadataInterceptors=org.apache.pulsar.common.intercept.AppendIndexMetadataInterceptor makes Pulsar attach monotonically increasing index metadata to each entry, so backlog can be calculated in message counts per partition. Examining a message on one partition shows the new index field:
<code>biz-log-partition-1 -l 24622961 -e 6
Batch Message ID: 24622961:6:0
Publish time: 1676917007607
Event time: 0
Broker entry metadata index: 157398560244
Properties:
"X-Pulsar-batch-size 2431"
"X-Pulsar-num-batch-message 50"</code>

Visualizing the new backlog metric (in messages) against the native metric (in entries) shows a much clearer increase when large batches are sent.
3.2 Bundle Load Metrics
Pulsar’s automatic load‑balancing operates at the bundle level. To aid tuning, Vivo collects bundle traffic distribution and producer/consumer connection counts, also marking idle (unassigned) bundles.
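Vivo's collector is not shown; conceptually, it rolls per‑topic rates up to the bundle that owns each topic and flags bundles with no traffic. A minimal sketch with illustrative topic and bundle names:

```python
def bundle_load(topic_rates, topic_to_bundle, all_bundles):
    """Aggregate per-topic message rates into per-bundle load and list
    idle bundles, mirroring the bundle-level view Pulsar's load
    balancer operates on."""
    load = {b: 0.0 for b in all_bundles}
    for topic, rate in topic_rates.items():
        load[topic_to_bundle[topic]] += rate
    idle = sorted(b for b, r in load.items() if r == 0.0)
    return load, idle

load, idle = bundle_load(
    {"persistent://t/ns/a-0": 120.0, "persistent://t/ns/b-0": 30.0},
    {"persistent://t/ns/a-0": "0x00_0x40", "persistent://t/ns/b-0": "0x40_0x80"},
    ["0x00_0x40", "0x40_0x80", "0x80_0xc0"],
)
print(load)  # {'0x00_0x40': 120.0, '0x40_0x80': 30.0, '0x80_0xc0': 0.0}
print(idle)  # ['0x80_0xc0'] -- the unassigned/idle bundle
```

Charting this distribution makes skewed bundle assignments visible before they trigger load‑balancer churn.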
3.3 kop Consumption Lag Reporting Issue
After the kop Coordinator restarts, consumption‑lag metrics sometimes drop to zero because the Kafka‑lag‑exporter filters out group members whose assignment field is missing. The root cause is that the broker taking over the group does not deserialize the assignment field when the group metadata is transferred.
Fix: modify GroupMetadataConstants#readGroupMessageValue() to correctly read and set the assignment field during deserialization.
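The exporter‑side symptom can be sketched as a filter that drops members lacking an assignment; this is an illustration of the failure mode, not the exporter's real code:

```python
def exportable_members(members):
    """Lag-exporter-style filter: members whose assignment was not
    deserialized are dropped, so their partitions report no lag and
    dashboards show it as zero."""
    return [m for m in members if m.get("assignment")]

before_restart = [{"member_id": "c1", "assignment": [("biz-log", 1)]}]
# After the Coordinator moves and assignment is not deserialized:
after_restart = [{"member_id": "c1", "assignment": None}]

print(len(exportable_members(before_restart)))  # 1
print(len(exportable_members(after_restart)))   # 0 -> lag appears to vanish
```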
Conclusion
Through the construction of a cloud‑native Pulsar observability system, Vivo addressed user‑experience, operational efficiency, and availability challenges, accelerating Pulsar adoption. Nevertheless, single‑point bottlenecks remain, indicating ongoing challenges for future scaling.
vivo Internet Technology
Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.