How Vivo Built a Scalable, Cloud‑Native Monitoring Platform for Millions of Services
This article outlines Vivo's multi‑year journey of designing, evolving, and operating a cloud‑native, AIOps‑enabled monitoring platform that supports tens of thousands of hosts, databases, containers, and services, detailing its architecture, challenges, and future directions for observability and reliability.
Monitoring System Evolution
Initially Vivo used Zabbix with an alert centre for basic monitoring. Rapid growth in services and data volume made this insufficient, prompting a self‑built platform in 2018. By 2019 the platform covered application, log and synthetic monitoring; 2020 added foundational and custom monitoring; 2021 introduced fault‑location and unified alert services; and 2022 merged basic, application and custom monitoring into a unified configuration and detection service. The platform now spans IaaS, PaaS, DaaS and CaaS, reflecting a shift from DevOps to AIOps.
Monitoring Capability Matrix
The monitoring scope is organised into five layers:
Infrastructure layer : network devices, servers and storage, monitored via VGW (Vivo Gateway) and custom checks.
Host layer : physical/virtual machines and containers, collected by agents.
System service layer : databases and big‑data components, monitored by custom checks and alerts.
Business application layer : application services, monitored for link health.
Client‑experience layer : mobile access quality, monitored by the Zeus platform.
Monitoring Object Scope and Data Pipeline
Metrics are collected via SDK/API for custom scenarios and via agents for host‑level data. Collected time‑series data is pre‑aggregated, cleaned and stored in a TSDB. To handle massive volumes, data reduction, wide‑table design and multi‑dimensional indexing are applied. Detection algorithms cover constant‑value, spike, year‑over‑year, no‑data and multi‑metric combination checks. Detected anomalies generate problems, which can be merged, claimed and escalated, and which are surfaced on customizable dashboards. Alert channels support the same merge, claim and escalation operations.
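As a rough illustration of the detection types listed above, the following Python sketch implements one simple version of each check. The function names, default thresholds and the standard-deviation rule used for spikes are assumptions made for this example; the article does not describe how Vivo's detectors are actually implemented.

```python
from statistics import mean, pstdev
from typing import Optional, Sequence


def is_constant(values: Sequence[float], tolerance: float = 0.0) -> bool:
    """Constant-value check: the series did not move within the window."""
    return len(values) > 1 and max(values) - min(values) <= tolerance


def is_spike(values: Sequence[float], threshold: float = 3.0) -> bool:
    """Spike check: the latest point deviates from the earlier points by
    more than `threshold` standard deviations (an assumed rule)."""
    history, latest = values[:-1], values[-1]
    if len(history) < 2:
        return False
    spread = pstdev(history)
    if spread == 0:
        # A flat history means any change at all counts as a spike.
        return latest != history[0]
    return abs(latest - mean(history)) / spread > threshold


def period_over_period_change(current: float, previous: float) -> Optional[float]:
    """Year-over-year style check: relative change against the value at the
    same point in an earlier period. Returns None when the baseline is zero."""
    if previous == 0:
        return None
    return (current - previous) / previous


def is_no_data(last_seen_ts: float, now_ts: float, max_gap_s: float = 120.0) -> bool:
    """No-data check: nothing has been reported for longer than `max_gap_s`."""
    return now_ts - last_seen_ts > max_gap_s


def all_of(*conditions: bool) -> bool:
    """Multi-metric combination: fire only when every condition holds."""
    return all(conditions)
```

A combined rule could then be expressed as, for example, `all_of(is_spike(latency), is_spike(error_rate))`, so that an alert fires only when two metrics misbehave at the same time.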
System Scale and Challenges
The platform currently monitors >10,000 host instances and >10,000 database instances, processes billions of metric and log records daily, and handles >100,000 alerts per day with sub‑second latency and high recall. Major challenges are:
Complex deployment environments : tens of thousands of hosts and containers across multiple data centres, with many dependencies.
Fragmented platform components : siloed user experience and data hinder unified analysis.
Emerging technology : container‑native monitoring, Prometheus remote storage and rapid data growth require costly experimentation.
Architecture Overview
Product Architecture
The product stack consists of three layers:
Capability service layer : defines collection, detection and alerting capabilities.
Functional layer : implements nine scenario types (host, container, DB, etc.).
Presentation layer : dashboards, log centre and mobile alert handling.
Technical Architecture
Data collection uses agents and SDKs; the data then flows through Bees‑Bus (a self‑developed high‑availability data bus) into Kafka. Processing, previously split across Spark, Flink and Kafka Streams, has been consolidated on Kafka Streams. Storage comprises a 190‑node Druid cluster for metric data, VictoriaMetrics as Prometheus remote storage, and a 250‑node Elasticsearch cluster for logs. Unified metadata, configuration and alert services manage rules, and Grafana is used to monitor the platform itself.
Interaction Flow
Unified metadata defines collection rules, which are pushed to VCS‑Master. Agents pull the rules, collect metrics and send them via Bees‑Bus to dual Kafka clusters. ETL jobs consume the streams, clean and compute the data, then write to storage. Detection rules are distributed to the anomaly detection service; detected anomalies are enriched and forwarded through a Kafka‑based pipeline to the unified alert service. All queries pass through a gateway before reaching the storage layer.
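The enrichment and merging steps in this flow can be sketched in Python as below. This is only an illustration: the `Anomaly` fields, the metadata lookup by service name and the five-minute merge window are hypothetical choices, not details taken from Vivo's unified alert service.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Anomaly:
    service: str
    rule_id: str
    timestamp: float  # seconds since epoch
    value: float
    labels: Dict[str, str] = field(default_factory=dict)


def enrich(anomaly: Anomaly, metadata: Dict[str, Dict[str, str]]) -> Anomaly:
    """Return a copy of the anomaly with metadata (owner, tier, ...) attached.
    The lookup key (service name) is an assumption of this sketch."""
    extra = metadata.get(anomaly.service, {})
    return Anomaly(anomaly.service, anomaly.rule_id, anomaly.timestamp,
                   anomaly.value, {**anomaly.labels, **extra})


def merge(anomalies: List[Anomaly], window_s: float = 300.0) -> List[List[Anomaly]]:
    """Group anomalies that share (service, rule_id) and occur within
    `window_s` seconds of the first anomaly in their group."""
    groups: List[List[Anomaly]] = []
    open_groups: Dict[Tuple[str, str], List[Anomaly]] = {}
    for anomaly in sorted(anomalies, key=lambda a: a.timestamp):
        key = (anomaly.service, anomaly.rule_id)
        current = open_groups.get(key)
        if current is not None and anomaly.timestamp - current[0].timestamp <= window_s:
            current.append(anomaly)
        else:
            # Start a new group; it replaces any expired group for this key.
            current = [anomaly]
            open_groups[key] = current
            groups.append(current)
    return groups
```

Merging on a key plus a time window is one common way to keep a single noisy rule from producing a separate notification for every evaluation cycle.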
Availability Engineering
Availability is improved by reducing mean time to detect (MTTD) and mean time to repair (MTTR), through fast alerting, fault‑location services and collaborative incident response between the operations, development and monitoring teams. The platform monitors its own agents, writes data to two data centres for resilience, and uses Grafana for internal health checks.
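To make the two availability metrics concrete, here is a minimal Python sketch that computes them from incident timestamps. The `Incident` record is hypothetical, and the sketch assumes MTTR is measured from the start of the fault to its resolution; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Incident:
    started_at: float   # when the fault began (seconds since epoch)
    detected_at: float  # when the first alert fired
    resolved_at: float  # when service was restored


def mttd_mttr(incidents: Iterable[Incident]) -> Tuple[float, float]:
    """Return (MTTD, MTTR) in seconds, both measured from fault start."""
    items = list(incidents)
    if not items:
        raise ValueError("at least one incident is required")
    mttd = mean(i.detected_at - i.started_at for i in items)
    mttr = mean(i.resolved_at - i.started_at for i in items)
    return mttd, mttr
```

Faster alerting shortens the first number directly, while fault location and coordinated response mainly shorten the gap between the two.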
Future Directions
Cloud‑Native Monitoring (Prometheus)
Each production Kubernetes cluster runs dedicated monitoring nodes. Prometheus instances scrape targets and remote‑write metrics to multi‑replica VictoriaMetrics. Synthetic checks monitor Prometheus health. Container‑level metrics are collected by agents from cAdvisor, sent through Bees‑Bus to dual Kafka clusters, processed and stored in VictoriaMetrics. Dashboards query the data via PromQL.
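As an illustration of the PromQL query path, the Python sketch below builds an instant-query URL for the Prometheus HTTP API endpoint `/api/v1/query`, which VictoriaMetrics also serves. The base URL and the `namespace` label value in the usage note are placeholders, not Vivo's actual addresses or labels.

```python
from typing import Optional
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str,
                      time_s: Optional[float] = None) -> str:
    """Build a Prometheus-compatible instant query URL.

    `base_url` is the query endpoint root (hypothetical in this sketch);
    `time_s` is an optional evaluation timestamp in Unix seconds.
    """
    params = {"query": promql}
    if time_s is not None:
        params["time"] = str(time_s)
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"
```

A dashboard panel for per-pod CPU usage might use a query such as `sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="payments"}[5m]))`, passed to `instant_query_url("http://vm.example.internal:8428", ...)`; `container_cpu_usage_seconds_total` is a standard cAdvisor metric, while the host name and namespace here are made up.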
AIOps – Fault Location
Using CMDB node trees, the system selects analysis nodes and time windows, drills down by component/service, computes variance, applies K‑Means clustering to filter low‑variance groups, and produces a cause‑chain graph highlighting likely abnormal services. The UI shows downstream calls, interfaces and latency.
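The variance and clustering step can be sketched as follows. This Python example computes the latency variance of each service in the analysis window, splits the variances into two clusters with a small one-dimensional K‑Means, and keeps the high-variance cluster as suspects. The choice of two clusters, latency as the input signal and all names are assumptions for illustration; the article does not give the actual feature set or parameters.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def kmeans_1d(points: Sequence[float], iterations: int = 20) -> List[int]:
    """Two-cluster K-Means on scalars.

    Returns one label per point: 0 for the low cluster, 1 for the high one.
    Centroids are initialised at the minimum and maximum of the data.
    """
    if not points:
        return []
    low, high = min(points), max(points)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [0 if abs(p - low) <= abs(p - high) else 1 for p in points]
        lows = [p for p, label in zip(points, labels) if label == 0]
        highs = [p for p, label in zip(points, labels) if label == 1]
        new_low = sum(lows) / len(lows) if lows else low
        new_high = sum(highs) / len(highs) if highs else high
        if (new_low, new_high) == (low, high):
            break  # centroids stopped moving
        low, high = new_low, new_high
    return labels


def suspect_services(latency_by_service: Dict[str, Sequence[float]]) -> List[str]:
    """Filter out the low-variance cluster and return the remaining services."""
    names = list(latency_by_service)
    variances = [pvariance(latency_by_service[name]) for name in names]
    labels = kmeans_1d(variances)
    return [name for name, label in zip(names, labels) if label == 1]
```

When every service has the same variance, both centroids coincide and no service is flagged, which is the desired behaviour for a window in which nothing stands out.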
Observability Dashboard
Metrics, logs and traces are integrated into a unified dashboard. Users can drill down from health overviews to service dependencies, domain health and backend distribution, and can jump directly to the related logs and metrics, enabling rapid root‑cause analysis.
Unified Observability Platform
The goal is a single pane of glass where metrics, logs and traces can be viewed together, transformed and correlated, supporting cross‑dimensional fault detection and analysis.
Capability‑as‑a‑Service
Monitoring capabilities (metrics, charts, alerts) will be exposed via APIs or independent services, enabling integration into CI/CD pipelines and other systems without manual navigation.
dbaplus Community
