Design and Evolution of Vivo Server‑Side Monitoring System
This article systematically outlines the design, components, data flow, and evolution of Vivo’s server‑side monitoring system, covering data collection, transmission, storage with OpenTSDB, visualization, alerting mechanisms, and comparisons with other monitoring solutions.
Business Background
In the era of massive information flow, complex platforms and services increase system complexity, making timely detection of core business issues and server resource problems essential. A robust monitoring system is required to provide real‑time alerts and diagnostics.
Basic Monitoring Workflow
1) Data collection – includes JVM metrics (GC count, thread count, heap sizes), system metrics (disk usage, network traffic, TCP connections), and business metrics (error logs, PV/UV, video playback).
2) Data transmission – data is reported to the monitoring platform via messages or HTTP.
3) Data storage – stored in relational databases (MySQL, Oracle) or time‑series databases such as OpenTSDB, InfluxDB, HBase.
4) Data visualization – metrics are displayed as line, bar, or pie charts.
5) Alerting – flexible rules trigger notifications via email, SMS, IM, etc.
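Steps 2 and 3 can be made concrete with a minimal sketch of reporting one data point to OpenTSDB's documented /api/put HTTP endpoint. The host `opentsdb.example.com`, the metric name, and the tag are hypothetical; only the payload shape and endpoint path come from OpenTSDB's API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class MetricReporter {

    // Builds the JSON body accepted by OpenTSDB's /api/put endpoint.
    static String buildPutPayload(String metric, long tsSeconds, double value,
                                  String tagKey, String tagValue) {
        return String.format(
            "{\"metric\":\"%s\",\"timestamp\":%d,\"value\":%s,\"tags\":{\"%s\":\"%s\"}}",
            metric, tsSeconds, value, tagKey, tagValue);
    }

    public static void main(String[] args) {
        String body = buildPutPayload("app.request.error.count",
                1672531500L, 3.0, "host", "web-01");
        System.out.println(body);

        // The request would be sent with java.net.http.HttpClient; the host
        // is hypothetical, so the actual send() call is omitted here.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://opentsdb.example.com:4242/api/put"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(request.method() + " " + request.uri());
    }
}
```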
Proper Use of Monitoring
Understanding the JVM memory structure and GC mechanisms, defining metrics precisely, setting reasonable thresholds, and establishing a fault-handling process are crucial for effective monitoring.
Architecture and Evolution
The article describes the evolution from the early Vivo monitoring architecture to the current design, emphasizing the adoption of OpenTSDB as the core time‑series store due to its scalability, tag support, and Java‑based HTTP API.
OpenTSDB Overview
OpenTSDB stores data points consisting of metric name, tags, value, and timestamp. It uses two main tables (tsdb and tsdb-uid) on HBase. The row key is composed of the metric UID, an hour-aligned base timestamp, and the tag key/value UIDs (Metric + HourBaseTimestamp + TagKey + TagValue); the column qualifier stores the offset in seconds from that hourly base.
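The hourly bucketing in the row key can be sketched in a few lines of Java; the sample timestamp is arbitrary.

```java
public class RowKeyTime {
    // OpenTSDB aligns the row-key timestamp to the start of the hour;
    // the seconds past the hour go into the column qualifier.
    static long hourBase(long tsSeconds)   { return tsSeconds - (tsSeconds % 3600); }
    static int  offsetSecs(long tsSeconds) { return (int) (tsSeconds % 3600); }

    public static void main(String[] args) {
        long ts = 1672531500L;  // 2023-01-01 00:05:00 UTC
        System.out.println("row-key base: " + hourBase(ts));        // 1672531200
        System.out.println("qualifier offset: " + offsetSecs(ts));  // 300
    }
}
```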
Precision Issue Example
import net.opentsdb.uid.UniqueId;   // OpenTSDB helper, used here to hex-print bytes
import org.hbase.async.Bytes;       // asynchbase byte utilities used by OpenTSDB

String value = "0.51";
float f = Float.parseFloat(value);
int raw = Float.floatToRawIntBits(f);           // IEEE 754 bit pattern of the float
byte[] float_bytes = Bytes.fromInt(raw);        // bytes as written to the HBase cell
int raw_back = Bytes.getInt(float_bytes, 0);    // bytes read back from the cell
double decode = Float.intBitsToFloat(raw_back); // widened to double on the query path

System.out.println("Parsed Float: " + f);
System.out.println("Encode Raw: " + raw);
System.out.println("Encode Bytes: " + UniqueId.uidToString(float_bytes));
System.out.println("Decode Raw: " + raw_back);
System.out.println("Decoded Float: " + decode);

/*
 * Printed output:
 * Parsed Float: 0.51
 * Encode Raw: 1057132380
 * Encode Bytes: 3F028F5C
 * Decode Raw: 1057132380
 * Decoded Float: 0.5099999904632568
 */

This code shows that 0.51 cannot be represented exactly as a float: the stored bits survive the round trip unchanged, but once they are decoded and widened to double on the query path, the value read back from OpenTSDB prints as 0.5099999904632568 rather than 0.51.
Aggregation Function Issue
Most OpenTSDB aggregation functions (sum, avg, max, min) use linear interpolation, which can fill missing points undesirably. Vivo’s vmonitor adds a custom nimavg function together with zimsum to handle empty values.
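A toy calculation makes the difference concrete (the numbers are invented): suppose series A has points 10 and 30 at t=0 and t=120 but nothing at t=60, while series B has 5 at t=60. With sum, OpenTSDB linearly interpolates A's missing point to 20, so the aggregate at t=60 becomes 25; with zimsum the missing point counts as zero and the aggregate is 5.

```java
public class InterpolationDemo {
    public static void main(String[] args) {
        double aPrev = 10, aNext = 30;  // series A at t=0 and t=120; t=60 is missing
        double bAt60 = 5;               // series B at t=60

        // "sum": the missing A point is linearly interpolated between neighbours.
        double interpolatedA = aPrev + (aNext - aPrev) * (60.0 - 0) / (120.0 - 0);
        System.out.println("sum at t=60: " + (interpolatedA + bAt60));  // 25.0

        // "zimsum": missing points are treated as zero instead of interpolated.
        System.out.println("zimsum at t=60: " + (0 + bAt60));           // 5.0
    }
}
```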
Vivo Collector Principles
The collector includes three agents: OS collector, JVM collector (both run every minute), and business metric collector (real‑time, aggregated per minute). Data is packaged and sent asynchronously to RabbitMQ.
Business metrics can be collected via log4j filter (non‑intrusive) or code instrumentation (intrusive) using a provided SDK.
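A minimal sketch of what the intrusive (SDK) path can look like: business code increments a named counter, and the collector drains the per-minute aggregate for reporting. The class and method names here are hypothetical, not vmonitor's actual SDK API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class MetricCounter {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Called from business code, e.g. on every failed request.
    public void incr(String metric) {
        counters.computeIfAbsent(metric, k -> new LongAdder()).increment();
    }

    // Called once a minute by the collector: drains and returns the aggregates.
    public Map<String, Long> drain() {
        Map<String, Long> snapshot = new ConcurrentHashMap<>();
        counters.forEach((metric, adder) -> snapshot.put(metric, adder.sumThenReset()));
        return snapshot;
    }

    public static void main(String[] args) {
        MetricCounter mc = new MetricCounter();
        for (int i = 0; i < 3; i++) mc.incr("order.create.error");
        System.out.println(mc.drain());  // {order.create.error=3}
        System.out.println(mc.drain());  // {order.create.error=0} after reset
    }
}
```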
Old Architecture
Data was collected by vmonitor-agent, sent to RabbitMQ, processed by backend services, and stored in OpenTSDB (HBase); configuration and alert data lived in MySQL, with coordination handled by Zookeeper and Redis.
New Architecture
The new design replaces RabbitMQ and CDN with vmonitor‑gateway , which receives HTTP reports, validates them, stores data in Redis queues, aggregates, and finally writes to OpenTSDB, reducing single points of failure.
Data Collection and Reporting Strategy
Collectors buffer up to 100 minutes of data locally, then push via HTTP to the gateway. The gateway authenticates, checks for circuit‑breaker status, stores data in a Redis queue (max length 10 000), aggregates, and writes to OpenTSDB. It also returns configuration updates when needed.
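The capped Redis queue in the gateway can be approximated locally with a bounded deque. This mirrors the LPUSH + LTRIM pattern (newest reports win, oldest are shed under back-pressure); the cap of 10 000 matches the limit above, but the class name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedReportQueue {
    private final Deque<String> queue = new ArrayDeque<>();
    private final int maxLength;

    public BoundedReportQueue(int maxLength) { this.maxLength = maxLength; }

    public synchronized void push(String report) {
        queue.addLast(report);
        while (queue.size() > maxLength) {
            queue.removeFirst();  // drop the oldest report when over capacity
        }
    }

    public synchronized int size() { return queue.size(); }

    public static void main(String[] args) {
        BoundedReportQueue q = new BoundedReportQueue(10_000);
        for (int i = 0; i < 10_500; i++) q.push("report-" + i);
        System.out.println(q.size());  // 10000
    }
}
```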
Core Metrics and Alert Types
Metrics include system, JVM, and business indicators. Alert types cover max/min thresholds, fluctuation percentages, and day-over-day, week-over-week, and hour-over-hour comparisons, each with an explicit calculation formula.
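The fluctuation-percentage rule, for instance, reduces to a one-line formula. The article does not spell out vmonitor's exact formula, so this sketch uses the conventional definition, and the 15% threshold is hypothetical.

```java
public class FluctuationAlert {
    // Percentage change of the current value versus the comparison baseline
    // (previous minute, same time yesterday, same time last week, ...).
    static double fluctuationPercent(double current, double baseline) {
        if (baseline == 0) {
            throw new IllegalArgumentException("baseline must be non-zero");
        }
        return (current - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double change = fluctuationPercent(120, 100);
        System.out.println(change + "%");          // 20.0%
        boolean alert = Math.abs(change) > 15.0;   // hypothetical 15% threshold
        System.out.println("alert: " + alert);     // alert: true
    }
}
```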
Demo Effects
The article shows UI screenshots of business metric queries, system/JVM dashboards, and configuration panels, illustrating real‑time refresh, color‑coded health status, and detailed drill‑down.
Comparison with Other Solutions
Zabbix – mature but relies on MySQL, lacks tag‑based multi‑dimensional aggregation.
Open‑Falcon – Go/Python based, easy to extend with custom probes.
Prometheus – Go based, built‑in TSDB, supports tags, but stores data locally.
Vivo vmonitor – Java stack, OpenTSDB backend, custom aggregation functions, multi‑channel alerting, and SDK for easy integration.
Conclusion
The article presents the design and evolution of Vivo’s server‑side monitoring system, built on Java and OpenTSDB, and provides a comparative view of mainstream monitoring tools to aid technology selection.