Design and Evolution of Vivo Server‑Side Monitoring System
This article systematically outlines the design, components, data flow, and evolution of Vivo’s server‑side monitoring system, covering data collection, transmission, storage with OpenTSDB, visualization, alerting mechanisms, and comparisons with other monitoring solutions.
Business Background
In the era of massive information flow, complex platforms and services increase system complexity, making timely detection of core business issues and server resource problems essential. A robust monitoring system is required to provide real‑time alerts and diagnostics.
Basic Monitoring Workflow
1) Data collection – includes JVM metrics (GC count, thread count, heap sizes), system metrics (disk usage, network traffic, TCP connections), and business metrics (error logs, PV/UV, video playback).
2) Data transmission – data is reported to the monitoring platform via messages or HTTP.
3) Data storage – stored in relational databases (MySQL, Oracle) or time‑series databases such as OpenTSDB, InfluxDB, HBase.
4) Data visualization – metrics are displayed as line, bar, or pie charts.
5) Alerting – flexible rules trigger notifications via email, SMS, IM, etc.
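Steps 2 and 3 can be made concrete with a minimal sketch of reporting one data point to OpenTSDB's documented /api/put HTTP endpoint. The host `opentsdb.example.com`, the metric name, and the tag are hypothetical; only the payload shape and endpoint path come from OpenTSDB's API.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class MetricReporter {

    // Builds the JSON body accepted by OpenTSDB's /api/put endpoint.
    static String buildPutPayload(String metric, long tsSeconds, double value,
                                  String tagKey, String tagValue) {
        return String.format(
            "{\"metric\":\"%s\",\"timestamp\":%d,\"value\":%s,\"tags\":{\"%s\":\"%s\"}}",
            metric, tsSeconds, value, tagKey, tagValue);
    }

    public static void main(String[] args) {
        String body = buildPutPayload("app.request.error.count",
                1672531500L, 3.0, "host", "web-01");
        System.out.println(body);

        // The request would be sent with java.net.http.HttpClient; the host
        // is hypothetical, so the actual send() call is omitted here.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://opentsdb.example.com:4242/api/put"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        System.out.println(request.method() + " " + request.uri());
    }
}
```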
Proper Use of Monitoring
Understanding the JVM memory structure and GC mechanisms, defining metrics precisely, setting reasonable thresholds, and establishing a fault-handling process are crucial for effective monitoring.
Architecture and Evolution
The article describes the evolution from the early Vivo monitoring architecture to the current design, emphasizing the adoption of OpenTSDB as the core time‑series store due to its scalability, tag support, and Java‑based HTTP API.
OpenTSDB Overview
OpenTSDB stores data points consisting of metric name, tags, value, and timestamp. It uses two main tables (tsdb and tsdb-uid) on HBase. The row key is composed of the metric UID, an hour-aligned base timestamp, and the tag key/value UIDs (Metric + HourBaseTimestamp + TagKey + TagValue); the column qualifier stores the offset in seconds from that hourly base.
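The hourly bucketing in the row key can be sketched in a few lines of Java; the sample timestamp is arbitrary.

```java
public class RowKeyTime {
    // OpenTSDB aligns the row-key timestamp to the start of the hour;
    // the seconds past the hour go into the column qualifier.
    static long hourBase(long tsSeconds)   { return tsSeconds - (tsSeconds % 3600); }
    static int  offsetSecs(long tsSeconds) { return (int) (tsSeconds % 3600); }

    public static void main(String[] args) {
        long ts = 1672531500L;  // 2023-01-01 00:05:00 UTC
        System.out.println("row-key base: " + hourBase(ts));        // 1672531200
        System.out.println("qualifier offset: " + offsetSecs(ts));  // 300
    }
}
```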
Precision Issue Example
import net.opentsdb.uid.UniqueId;   // OpenTSDB helper, used here to hex-print bytes
import org.hbase.async.Bytes;       // asynchbase byte utilities used by OpenTSDB

String value = "0.51";
float f = Float.parseFloat(value);
int raw = Float.floatToRawIntBits(f);           // IEEE 754 bit pattern of the float
byte[] float_bytes = Bytes.fromInt(raw);        // bytes as written to the HBase cell
int raw_back = Bytes.getInt(float_bytes, 0);    // bytes read back from the cell
double decode = Float.intBitsToFloat(raw_back); // widened to double on the query path

System.out.println("Parsed Float: " + f);
System.out.println("Encode Raw: " + raw);
System.out.println("Encode Bytes: " + UniqueId.uidToString(float_bytes));
System.out.println("Decode Raw: " + raw_back);
System.out.println("Decoded Float: " + decode);

/*
 * Printed output:
 * Parsed Float: 0.51
 * Encode Raw: 1057132380
 * Encode Bytes: 3F028F5C
 * Decode Raw: 1057132380
 * Decoded Float: 0.5099999904632568
 */

This code shows that 0.51 cannot be represented exactly as a float: the stored bits survive the round trip unchanged, but once they are decoded and widened to double on the query path, the value read back from OpenTSDB prints as 0.5099999904632568 rather than 0.51.
Aggregation Function Issue
Most OpenTSDB aggregation functions (sum, avg, max, min) use linear interpolation, which can fill missing points undesirably. Vivo’s vmonitor adds a custom nimavg function together with zimsum to handle empty values.
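A toy calculation makes the difference concrete (the numbers are invented): suppose series A has points 10 and 30 at t=0 and t=120 but nothing at t=60, while series B has 5 at t=60. With sum, OpenTSDB linearly interpolates A's missing point to 20, so the aggregate at t=60 becomes 25; with zimsum the missing point counts as zero and the aggregate is 5.

```java
public class InterpolationDemo {
    public static void main(String[] args) {
        double aPrev = 10, aNext = 30;  // series A at t=0 and t=120; t=60 is missing
        double bAt60 = 5;               // series B at t=60

        // "sum": the missing A point is linearly interpolated between neighbours.
        double interpolatedA = aPrev + (aNext - aPrev) * (60.0 - 0) / (120.0 - 0);
        System.out.println("sum at t=60: " + (interpolatedA + bAt60));  // 25.0

        // "zimsum": missing points are treated as zero instead of interpolated.
        System.out.println("zimsum at t=60: " + (0 + bAt60));           // 5.0
    }
}
```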
Vivo Collector Principles
The collector includes three agents: OS collector, JVM collector (both run every minute), and business metric collector (real‑time, aggregated per minute). Data is packaged and sent asynchronously to RabbitMQ.
Business metrics can be collected via log4j filter (non‑intrusive) or code instrumentation (intrusive) using a provided SDK.
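A minimal sketch of what the intrusive (SDK) path can look like: business code increments a named counter, and the collector drains the per-minute aggregate for reporting. The class and method names here are hypothetical, not vmonitor's actual SDK API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class MetricCounter {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Called from business code, e.g. on every failed request.
    public void incr(String metric) {
        counters.computeIfAbsent(metric, k -> new LongAdder()).increment();
    }

    // Called once a minute by the collector: drains and returns the aggregates.
    public Map<String, Long> drain() {
        Map<String, Long> snapshot = new ConcurrentHashMap<>();
        counters.forEach((metric, adder) -> snapshot.put(metric, adder.sumThenReset()));
        return snapshot;
    }

    public static void main(String[] args) {
        MetricCounter mc = new MetricCounter();
        for (int i = 0; i < 3; i++) mc.incr("order.create.error");
        System.out.println(mc.drain());  // {order.create.error=3}
        System.out.println(mc.drain());  // {order.create.error=0} after reset
    }
}
```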
Old Architecture
Data was collected by vmonitor-agent, sent to RabbitMQ, processed by backend services, and stored in OpenTSDB (HBase); configuration and alert data lived in MySQL, with coordination handled by Zookeeper and Redis.
New Architecture
The new design replaces RabbitMQ and CDN with vmonitor‑gateway , which receives HTTP reports, validates them, stores data in Redis queues, aggregates, and finally writes to OpenTSDB, reducing single points of failure.
Data Collection and Reporting Strategy
Collectors buffer up to 100 minutes of data locally, then push via HTTP to the gateway. The gateway authenticates, checks for circuit‑breaker status, stores data in a Redis queue (max length 10 000), aggregates, and writes to OpenTSDB. It also returns configuration updates when needed.
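The capped Redis queue in the gateway can be approximated locally with a bounded deque. This mirrors the LPUSH + LTRIM pattern (newest reports win, oldest are shed under back-pressure); the cap of 10 000 matches the limit above, but the class name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class BoundedReportQueue {
    private final Deque<String> queue = new ArrayDeque<>();
    private final int maxLength;

    public BoundedReportQueue(int maxLength) { this.maxLength = maxLength; }

    public synchronized void push(String report) {
        queue.addLast(report);
        while (queue.size() > maxLength) {
            queue.removeFirst();  // drop the oldest report when over capacity
        }
    }

    public synchronized int size() { return queue.size(); }

    public static void main(String[] args) {
        BoundedReportQueue q = new BoundedReportQueue(10_000);
        for (int i = 0; i < 10_500; i++) q.push("report-" + i);
        System.out.println(q.size());  // 10000
    }
}
```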
Core Metrics and Alert Types
Metrics include system, JVM, and business indicators. Alert types cover max/min thresholds, fluctuation percentages, and day-over-day, week-over-week, and hour-over-hour comparisons, each with an explicit calculation formula.
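The fluctuation-percentage rule, for instance, reduces to a one-line formula. The article does not spell out vmonitor's exact formula, so this sketch uses the conventional definition, and the 15% threshold is hypothetical.

```java
public class FluctuationAlert {
    // Percentage change of the current value versus the comparison baseline
    // (previous minute, same time yesterday, same time last week, ...).
    static double fluctuationPercent(double current, double baseline) {
        if (baseline == 0) {
            throw new IllegalArgumentException("baseline must be non-zero");
        }
        return (current - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        double change = fluctuationPercent(120, 100);
        System.out.println(change + "%");          // 20.0%
        boolean alert = Math.abs(change) > 15.0;   // hypothetical 15% threshold
        System.out.println("alert: " + alert);     // alert: true
    }
}
```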
Demo Effects
The article shows UI screenshots of business metric queries, system/JVM dashboards, and configuration panels, illustrating real‑time refresh, color‑coded health status, and detailed drill‑down.
Comparison with Other Solutions
Zabbix – mature but relies on MySQL, lacks tag‑based multi‑dimensional aggregation.
Open‑Falcon – Go/Python based, easy to extend with custom probes.
Prometheus – Go based, built‑in TSDB, supports tags, but stores data locally.
Vivo vmonitor – Java stack, OpenTSDB backend, custom aggregation functions, multi‑channel alerting, and SDK for easy integration.
Conclusion
The article presents the design and evolution of Vivo’s server‑side monitoring system, built on Java and OpenTSDB, and provides a comparative view of mainstream monitoring tools to aid technology selection.