How Vivo’s Server‑Side Monitoring Evolved: Architecture, Data Flow, and Alert Strategies
This article provides a comprehensive overview of Vivo's server‑side monitoring system, detailing its architecture evolution, data collection pipelines, OpenTSDB storage design, alerting mechanisms, and comparisons with other mainstream monitoring solutions, offering practical guidance for technology selection and implementation.
In an era of massive information flow, increasing system complexity demands effective monitoring to ensure user experience and operational stability. The article systematically outlines the principles and architectural evolution of Vivo's server‑side monitoring, which integrates system, JVM, and custom business metrics into a unified platform offering real‑time, multi‑dimensional alerts.
1. Basic Monitoring Workflow
The monitoring process, whether using open‑source or proprietary solutions, follows five core steps:
Data collection: gathers JVM metrics (GC count, thread count, heap sizes), system metrics (disk usage, network traffic, TCP connections), and business metrics (error logs, PV/UV, video playback counts).
Data transmission: reports collected data via message queues or HTTP.
Data storage: persists data in relational databases (MySQL, Oracle), time‑series stores (OpenTSDB, InfluxDB), or wide‑column stores such as HBase.
Data visualization: renders metrics as line, bar, or pie charts.
Alerting: delivers configurable alerts via email, SMS, IM, etc.
2. Proper Use of the Monitoring System
Before deployment, users must understand the fundamentals of the monitored object (e.g., JVM memory layout and GC mechanisms), define each metric clearly, set appropriate alert thresholds, and establish a fault‑handling workflow so that alerts are acted on promptly.
3. OpenTSDB as the Core Time‑Series Store
OpenTSDB was chosen for its simplicity, scalability, and Java‑based HTTP API. Key reasons include:
Each metric data point is identified by its timestamp and has no complex relationships to other data.
Metrics evolve over time, fitting a time‑series model.
It is built on HBase, so it offers high throughput and horizontal scalability.
It is open source and implemented in Java, which makes it easy to modify.
Each data point consists of a Metric, Tags (e.g., host name), a Value, and a Timestamp. Two main tables store the data: tsdb (raw metric points) and tsdb‑uid (metadata mappings between names and UIDs). Row keys combine the metric, the hour‑aligned timestamp, and the tag key/value pairs; within a row, each column qualifier stores the offset in seconds from that hour, and the cell holds the metric value.
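The timestamp split described above can be sketched in a few lines. This is only an illustration of the hour alignment, not OpenTSDB's actual key encoding: real row keys are binary and use fixed‑width UIDs, and the metric name `jvm.gc.count`, the tag `host=web01`, and the class and method names are assumptions made up for the example.

```java
/** Sketch of how a timestamp splits into an hourly row and an in-row offset. */
final class RowKeySketch {
    // One row holds an hour of points for a given metric and tag combination.
    static final int SECONDS_PER_ROW = 3600;

    /** Start of the hour containing the timestamp (epoch seconds); part of the row key. */
    static long baseTime(long epochSeconds) {
        return epochSeconds - (epochSeconds % SECONDS_PER_ROW);
    }

    /** Seconds elapsed since the start of the hour; carried by the column qualifier. */
    static int offsetSeconds(long epochSeconds) {
        return (int) (epochSeconds % SECONDS_PER_ROW);
    }

    /** Readable stand-in for the binary row key (hypothetical text format). */
    static String describeRowKey(String metric, long epochSeconds, String tagKey, String tagValue) {
        return metric + "|" + baseTime(epochSeconds) + "|" + tagKey + "=" + tagValue;
    }

    public static void main(String[] args) {
        long ts = 1_600_000_123L; // hypothetical sample timestamp
        System.out.println(describeRowKey("jvm.gc.count", ts, "host", "web01"));
        System.out.println("offset=" + offsetSeconds(ts));
    }
}
```

Because all points from the same hour, metric, and tag set share one row key, a range scan over an hour touches a single row rather than one row per point.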
4. OpenTSDB Practical Considerations
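Floating‑point values lose precision when stored: for example, "0.51" is read back as "0.5099999904632568", which is the nearest single‑precision value widened to a double. The default aggregation functions fill missing data points by linear interpolation, so aggregated results can include values that were never reported. Vivo's customized OpenTSDB adds a nimavg function and uses zimsum (a sum that treats missing points as zero) to handle null values.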
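The precision effect is easy to reproduce in plain Java. The sketch below only demonstrates where the extra digits come from and shows one common workaround (widening through the float's decimal string). The source does not say how Vivo handles this, so the workaround and the class and method names are assumptions.

```java
/** Shows how widening a float to a double exposes its binary rounding error. */
final class FloatPrecisionSketch {
    /** Direct widening keeps the float's exact binary value, so extra digits appear. */
    static double widenDirectly(float value) {
        return (double) value;
    }

    /** Going through the float's shortest decimal string keeps the digits as written. */
    static double widenViaString(float value) {
        return Double.parseDouble(Float.toString(value));
    }

    public static void main(String[] args) {
        float reported = 0.51f;
        System.out.println(widenDirectly(reported));  // carries the float's rounding error
        System.out.println(widenViaString(reported)); // same decimal digits as the float
    }
}
```

The string round trip costs an allocation per value, so rounding to a fixed number of decimals at display time may be a better fit when throughput matters.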
5. Vivo Monitoring Collectors
The collector suite includes three agents:
OS collector and JVM collector: run every minute, aggregate data, and push to RabbitMQ.
Business metric collector: captures logs via Log4j filters or intrusive code hooks, aggregates per minute, and also pushes to RabbitMQ.
Configuration is refreshed from a CDN every five minutes, and multiple aggregators (count, sum, avg, max, min) are available.
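A per‑window aggregator covering the five functions listed above might look like the following. It is a minimal sketch, not Vivo's collector code: the class name `WindowAggregator` is hypothetical, the window boundary is left to the caller, and the class is not thread‑safe.

```java
/** Accumulates raw samples for one metric within a reporting window. */
final class WindowAggregator {
    private long count;
    private double sum;
    private double max = Double.NEGATIVE_INFINITY;
    private double min = Double.POSITIVE_INFINITY;

    /** Adds one raw sample to the current window. */
    void record(double sample) {
        count++;
        sum += sample;
        max = Math.max(max, sample);
        min = Math.min(min, sample);
    }

    long count() { return count; }

    double sum() { return sum; }

    /** Average of the window, or NaN when no samples were recorded. */
    double avg() { return count == 0 ? Double.NaN : sum / count; }

    double max() { return max; }

    double min() { return min; }

    /** Clears the window after its values have been reported. */
    void reset() {
        count = 0;
        sum = 0;
        max = Double.NEGATIVE_INFINITY;
        min = Double.POSITIVE_INFINITY;
    }

    public static void main(String[] args) {
        WindowAggregator latency = new WindowAggregator();
        for (double sample : new double[] {120, 80, 100}) {
            latency.record(sample);
        }
        System.out.println("count=" + latency.count() + " avg=" + latency.avg()
                + " max=" + latency.max() + " min=" + latency.min());
        latency.reset(); // start the next one-minute window
    }
}
```

Aggregating on the agent means only one value per function leaves the host each minute, regardless of how many raw samples were recorded.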
6. Legacy Architecture (vmonitor‑agent → RabbitMQ → OpenTSDB)
Data flow: agents send metrics to RabbitMQ; the backend consumes them and stores them in OpenTSDB (on HBase), while MySQL holds alert and configuration data. ZooKeeper and Redis coordinate distributed tasks.
7. New Architecture (vmonitor‑collector → HTTP → vmonitor‑gateway → OpenTSDB)
The newer design replaces RabbitMQ and the CDN with an HTTP gateway, reducing single points of failure. The gateway authenticates incoming data, applies circuit‑breaker logic, buffers the data in Redis, and finally writes it to OpenTSDB.
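The source does not describe how the gateway's circuit breaker works, so the following is only a generic sketch of the pattern: after a number of consecutive failures the breaker rejects requests until a cooldown has passed, then lets trial requests through. The class name, the threshold, and the cooldown are assumptions, and time is passed in by the caller to keep the example deterministic.

```java
/** Minimal consecutive-failure circuit breaker with a cooldown period. */
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMillis;
    private int consecutiveFailures;
    private long openedAtMillis = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    /** Returns true when a request may be attempted at the given time. */
    boolean allowRequest(long nowMillis) {
        if (openedAtMillis < 0) {
            return true; // closed: traffic flows normally
        }
        // Open: reject until the cooldown has elapsed, then allow trial requests.
        return nowMillis - openedAtMillis >= cooldownMillis;
    }

    /** A successful call closes the circuit and clears the failure count. */
    void recordSuccess() {
        consecutiveFailures = 0;
        openedAtMillis = -1;
    }

    /** A failed call may open the circuit, or re-open it after a failed trial. */
    void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAtMillis = nowMillis;
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 1000);
        for (long t = 0; t < 3; t++) {
            breaker.recordFailure(t); // three failed writes in a row
        }
        System.out.println("allowed at t=3: " + breaker.allowRequest(3));
        System.out.println("allowed at t=1002: " + breaker.allowRequest(1002));
    }
}
```

A caller would check `allowRequest` before each downstream write and report the outcome with `recordSuccess` or `recordFailure`; a production breaker would also need thread safety and a limit on concurrent trial requests.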
8. Alert Types and Formulas
Supported alert calculations include max/min thresholds, fluctuation percentages (upward, downward, range), daily/weekly/hourly comparisons, and custom formulas such as:
float rate = (float) (max - avg) / avg; // upward fluctuation
float rate = (float) (avg - min) / avg; // downward fluctuation
float rate = (float) (max - min) / max; // range fluctuation
9. Demonstration UI
The UI allows querying business and system metrics, supports auto‑refresh, visual cues for missing data, and detailed drill‑down on JVM/system charts.
10. Comparison with Other Monitoring Solutions
Zabbix: mature, C/PHP stack, uses MySQL, lacks tag‑based multi‑dimensional aggregation.
Open‑Falcon: Go/Python, high availability, supports custom instrumentation via proxy‑gateway.
Prometheus: Go‑based, built‑in TSDB, tag support, single‑node simplicity, handles millions of metrics.
vmonitor (Vivo’s solution): Java stack, OpenTSDB backend with custom extensions, SDK for business metric integration, multi‑layer collector‑gateway architecture.
11. Conclusion
The article concludes that Vivo's monitoring platform provides a real‑time, scalable solution for JVM, system, and business metrics, while also offering insights into industry‑standard tools to aid technology selection.
