Vivo Server Monitoring System Architecture and Evolution: A Comprehensive Technical Guide
Vivo's vmonitor system replaces a legacy RabbitMQ-based pipeline with an HTTP-driven collector and gateway. It stores minute-level JVM, system, and business metrics in a customized OpenTSDB on HBase, adds precise floating-point handling and null-aware aggregation, buffers incoming data in Redis, and provides multi-dimensional alerting. The article closes by comparing vmonitor with Zabbix, Open-Falcon, and Prometheus.
This article provides an in-depth exploration of Vivo's server-side monitoring system (vmonitor), covering its design principles, architecture evolution, and practical implementation. The monitoring system aims to deliver comprehensive data monitoring for server-side applications, including system monitoring, JVM monitoring, and custom business metric monitoring, complemented by real-time, multi-dimensional alerting services.
1. Monitoring System Basic Flow
The fundamental workflow consists of five key stages: data collection (JVM metrics like GC counts, thread counts, memory region sizes; system metrics like disk usage, network traffic, TCP connections; business metrics like error logs, access logs, video playback counts), data transmission (via message queues or HTTP protocols), data storage (using time-series databases like OpenTSDB or InfluxDB), data visualization (line charts, bar charts, pie charts), and monitoring alerts (supporting email, SMS, IM notifications).
2. OpenTSDB Time-Series Database
OpenTSDB is a distributed, scalable time-series database built on HBase and designed for monitoring workloads. It supports second-level data collection, stores data permanently, and integrates easily with existing monitoring systems. Its unit of storage is the data point, which consists of a Metric (the name of the monitored indicator), Tags (key-value labels that add dimensions, such as the machine name), a Value (the numeric reading), and a Timestamp. Vivo runs OpenTSDB with three specific strategies: connecting to HBase directly through the client, disabling the compaction thread, and buffering incoming data in Redis so that it is written to OpenTSDB in batches every 10 seconds.
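To make the data point structure concrete, here is a minimal Java sketch. The class name DataPointSketch and the sample metric and tag names are hypothetical and do not come from vmonitor; the output follows the line format of OpenTSDB's telnet-style put command (metric, timestamp, value, then tag pairs).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical value class mirroring the four parts of an OpenTSDB data point. */
final class DataPointSketch {
    final String metric;            // monitoring indicator name
    final long timestampSeconds;    // Unix epoch time in seconds
    final double value;             // numeric reading
    final Map<String, String> tags; // dimensions such as the machine name

    DataPointSketch(String metric, long timestampSeconds, double value,
                    Map<String, String> tags) {
        if (tags.isEmpty()) {
            // OpenTSDB rejects data points that carry no tags.
            throw new IllegalArgumentException("at least one tag is required");
        }
        this.metric = metric;
        this.timestampSeconds = timestampSeconds;
        this.value = value;
        this.tags = new TreeMap<>(tags); // sorted copy gives deterministic output
    }

    /** Renders the point as: put <metric> <timestamp> <value> <tagk=tagv> ... */
    String toPutLine() {
        StringBuilder sb = new StringBuilder("put ")
                .append(metric).append(' ')
                .append(timestampSeconds).append(' ')
                .append(value);
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            sb.append(' ').append(tag.getKey()).append('=').append(tag.getValue());
        }
        return sb.toString();
    }
}
```

Because every tag combination forms its own series, tags are what allow the same metric to be aggregated by host, by application, or by any other dimension at query time.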
3. Key Technical Challenges
The first challenge is floating-point precision: a value stored as "0.51" may be read back as "0.5099999904632568", which is what the 32-bit float nearest to 0.51 looks like once it is widened to a double. The second is aggregation: most OpenTSDB aggregation functions (sum, avg, max, min) use linear interpolation (LERP) to fill in missing values, which makes them unsuitable when a missing point must be reported as null rather than estimated. Vivo addressed this by modifying the OpenTSDB source code to add a nimavg function, which works together with the built-in zimsum function so that gaps can be handled as null values instead of being interpolated.
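Both problems can be reproduced in a few lines of Java. The sketch below is illustrative only: the class and method names are hypothetical, rounding on read is one common workaround rather than Vivo's confirmed fix, and averageSkippingNulls shows the general idea of a null-aware average, not the actual nimavg implementation.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.List;

final class ValueHandlingSketch {
    /** Parses the text as a 32-bit float, then widens it, exposing the rounding error. */
    static double viaFloat(String decimalText) {
        float stored = Float.parseFloat(decimalText);
        return (double) stored;
    }

    /** Rounds a read-back value to a fixed number of decimal places. */
    static BigDecimal roundForDisplay(double readBack, int scale) {
        return BigDecimal.valueOf(readBack).setScale(scale, RoundingMode.HALF_UP);
    }

    /**
     * Averages only the points that exist at a timestamp.
     * Returns null when every point is missing, instead of interpolating a value.
     */
    static Double averageSkippingNulls(List<Double> points) {
        double sum = 0.0;
        int count = 0;
        for (Double point : points) {
            if (point != null) {
                sum += point;
                count++;
            }
        }
        return count == 0 ? null : sum / count;
    }
}
```

With interpolation, a series that reported nothing at a timestamp still contributes an estimated value to the aggregate; the null-aware version leaves the gap visible, which matters when the absence of data is itself a signal.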
4. Architecture Evolution
The old version used RabbitMQ for data transmission and a CDN for configuration synchronization. A failure in either component could affect the entire monitoring system. The new version (vmonitor) introduces vmonitor-collector for data collection and vmonitor-gateway as the monitoring data gateway. Both data reporting and configuration retrieval now go over HTTP, which removes the dependencies on RabbitMQ and CDN synchronization.
5. Data Collection Strategy
The collector (vmonitor-collector) gathers data every minute, compresses it, stores it in a local queue that holds at most 100 minutes of data, and reports it to the gateway over HTTP. The gateway authenticates the request, validates the configuration version, places the data in Redis queues, then decompresses and aggregates it before persisting it to OpenTSDB/HBase. Because data is queued on both the collector side and the gateway side, a failure in one layer loses as little data as possible.
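The collector-side buffering described above can be sketched as a bounded queue of compressed payloads. This is a minimal illustration under stated assumptions: the class name, the drop-oldest policy when the queue is full, the use of gzip, and the peek-then-acknowledge reporting pattern are all hypothetical, since the article only states that data is compressed and queued locally for up to 100 minutes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class CollectorBufferSketch {
    static final int MAX_MINUTES = 100; // one queued payload per collection minute
    private final Deque<byte[]> queue = new ArrayDeque<>();

    static byte[] compress(String payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(payload.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static String decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Compresses one minute of data and queues it, dropping the oldest entry when full. */
    synchronized void enqueue(String minutePayload) {
        queue.addLast(compress(minutePayload));
        if (queue.size() > MAX_MINUTES) {
            queue.removeFirst(); // assumption: the oldest minute is sacrificed first
        }
    }

    /** Returns the oldest payload without removing it, or null when the queue is empty. */
    synchronized byte[] peekOldest() {
        return queue.peekFirst();
    }

    /** Removes the oldest payload; call only after the gateway has accepted it. */
    synchronized void ackOldest() {
        queue.pollFirst();
    }

    synchronized int size() {
        return queue.size();
    }
}
```

Removing an entry only after a successful report means that a gateway outage shorter than the queue's capacity loses no data; the collector simply retries the oldest payload on the next cycle.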
6. Alert Types and Calculation Formulas
The system supports multiple alert types: Maximum (triggers when a value exceeds the threshold), Minimum (triggers when a value falls below the threshold), Fluctuation (compares the maximum or minimum with the 15-minute average), Daily comparison (compares with the same time yesterday), Weekly comparison (compares with the same time last week), and Hourly daily comparison (compares the sum for the current hour with the sum for the same hour yesterday). The fluctuation and hourly comparison rates are calculated as follows:
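float upwardRate = (float) (max - avg) / (float) avg; // Upward fluctuation: how far the max rises above the 15-minute average
float downwardRate = (float) (avg - min) / (float) avg; // Downward fluctuation: how far the min falls below the 15-minute average
float hourlyRate = (float) (anHourTodaySum - anHourYesterdaySum) / (float) anHourYesterdaySum; // Hourly comparison: change against the same hour yesterday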
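The three formulas above can be wrapped into small helper methods, as in the sketch below. The class and method names are hypothetical, and two choices are assumptions rather than documented vmonitor behavior: a zero baseline yields NaN instead of dividing by zero, and an alert fires when the absolute rate exceeds the configured threshold.

```java
final class AlertRateSketch {
    /** (max - avg) / avg, or NaN when the average is zero. */
    static double upwardFluctuation(double max, double avg) {
        return avg == 0.0 ? Double.NaN : (max - avg) / avg;
    }

    /** (avg - min) / avg, or NaN when the average is zero. */
    static double downwardFluctuation(double min, double avg) {
        return avg == 0.0 ? Double.NaN : (avg - min) / avg;
    }

    /** (todaySum - yesterdaySum) / yesterdaySum, or NaN when yesterday's sum is zero. */
    static double hourlyComparison(double anHourTodaySum, double anHourYesterdaySum) {
        return anHourYesterdaySum == 0.0
                ? Double.NaN
                : (anHourTodaySum - anHourYesterdaySum) / anHourYesterdaySum;
    }

    /** Assumed trigger rule: fire when the rate is defined and its magnitude exceeds the threshold. */
    static boolean shouldAlert(double rate, double threshold) {
        return !Double.isNaN(rate) && Math.abs(rate) > threshold;
    }
}
```

For example, with a 15-minute average of 100 and a maximum of 150, the upward fluctuation rate is 0.5, so a threshold of 0.3 would trigger an alert. Guarding the zero baseline matters in practice because a metric that was idle yesterday would otherwise produce an infinite or undefined rate.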
7. Comparison with Mainstream Monitoring Tools
Zabbix is mature, but its MySQL storage limits it at large data volumes, and it lacks Tag support for multi-dimensional aggregation. Open-Falcon, Xiaomi's open-source solution, integrates easily through its proxy-gateway. Prometheus, an open-source system modeled on Google's BorgMon, uses a local time-series database and has a simple architecture. Compared with these tools, Vivo's vmonitor offers a customized OpenTSDB, comprehensive JVM, system, and business monitoring, and robust alerting with multi-channel notifications.
vivo Internet Technology