Vivo Monitoring Platform: Architecture, Evolution, and Future Directions
The article details the evolution, architecture, capabilities, challenges, and future plans of Vivo's comprehensive monitoring platform, covering its transition from simple Zabbix setups to a cloud‑native, AI‑ops enabled system that ensures service availability across massive infrastructure.
This document summarizes Chen Ningning's talk at the 2022 Vivo Developer Conference. It describes the development and current state of Vivo's self‑built monitoring platform, which provides availability assurance across all scenarios for a massive user base.
1. Monitoring System Evolution – Vivo started with Zabbix and an alarm center, expanded to custom application, log, and probing monitors in 2018, added foundational and custom monitors in 2019, and introduced fault‑location and unified alarm platforms in 2021. Since 2022, a unified monitoring platform has consolidated configuration and detection services, covering the IaaS, PaaS, DaaS, and CaaS layers and moving from DevOps toward AIOps.
2. Capability Matrix & Object Scope – Monitoring objects are divided into five layers: infrastructure (network devices, servers, storage), hosts (physical/virtual machines, containers), system services (databases, big‑data components), business applications, and client experience (access quality via the Zeus platform). Custom monitors report through an SDK or API, and agents collect host metrics; the data is then preprocessed, aggregated, and stored in a time‑series database (TSDB).
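The aggregation step mentioned above can be illustrated with a minimal sketch. It assumes samples arrive as (timestamp, metric, value) tuples and are averaged per fixed time window before storage; the tuple layout, the `aggregate` function, and the 60-second window are hypothetical choices for illustration, not details from the talk.

```python
from collections import defaultdict
from statistics import mean


def aggregate(samples, window_s=60):
    """Average (timestamp, metric, value) samples per fixed time window.

    The sample layout and window size are illustrative assumptions.
    """
    buckets = defaultdict(list)
    for ts, metric, value in samples:
        window_start = ts - ts % window_s  # align to the start of the window
        buckets[(metric, window_start)].append(value)
    # One averaged point per (metric, window), ready to write to a TSDB.
    return {key: mean(values) for key, values in sorted(buckets.items())}


samples = [
    (0, "cpu.util", 40.0),
    (30, "cpu.util", 60.0),
    (65, "cpu.util", 90.0),
]
rollup = aggregate(samples)
```

Here the first two samples fall into the window starting at 0 and the third into the window starting at 60, so the rollup holds two points instead of three.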
3. System Scale & Challenges – The platform currently monitors hundreds of thousands of hosts and DB instances, processes trillions of metric and log records daily, and handles hundreds of thousands of alerts with sub‑second latency. Challenges include complex deployment environments, fragmented platforms, and emerging technologies such as Prometheus storage and container monitoring.
4. Architecture – The product architecture defines collection, detection, and alarm capabilities across nine scenario categories (host, container, DB, etc.), together with dashboards and a log center. The technical architecture consists of collection, computation, storage, and visualization layers; data flows through the self‑developed Bees‑Bus into Kafka and finally into Druid and VictoriaMetrics for storage, while Grafana provides self‑monitoring.
5. Interaction Flow – A unified metadata service distributes collection rules to VCS‑Master, which dispatches tasks to agents. Collected metrics are double‑written via Bees‑Bus to Kafka, then processed by ETL, cleaned, and stored. A unified configuration service applies the detection rules, and a unified alarm service enriches and routes the resulting alarms.
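The flow above can be sketched in a few lines. This is a simplified illustration, not the platform's implementation: it assumes that "double‑writing" means sending each record to two independent sinks, that detection is a static threshold check, and that enrichment attaches an owner to the alarm. `DetectionRule`, `double_write`, `detect`, and the in-memory lists standing in for Kafka topics are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DetectionRule:
    """Hypothetical threshold rule as a configuration service might store it."""
    metric: str
    threshold: float
    owner: str


def double_write(record, sinks):
    """Write the same record to every sink (lists stand in for Kafka topics)."""
    for sink in sinks:
        sink.append(record)


def detect(record, rules):
    """Return one enriched alarm per rule that the record violates."""
    alarms = []
    for rule in rules:
        if record["metric"] == rule.metric and record["value"] > rule.threshold:
            # Enrichment: attach the threshold and the owning team for routing.
            alarms.append({**record, "threshold": rule.threshold, "owner": rule.owner})
    return alarms


primary, backup = [], []
rules = [DetectionRule("disk.used_pct", 90.0, "storage-team")]
record = {"host": "host-01", "metric": "disk.used_pct", "value": 95.5}

double_write(record, [primary, backup])
alarms = detect(primary[0], rules)
```

Keeping detection as a pure function of a record and a rule list makes it easy to evaluate the same rules against either copy of the data.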
6. Availability System – The platform focuses on MTTD and MTTR, providing tools for fault prevention, detection, and post‑mortem analysis. It integrates alert merging, assignment, and escalation, and plans to incorporate intelligent detection for zero‑configuration anomaly identification.
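As a rough illustration of the two metrics and of alert merging, the sketch below computes MTTD and MTTR from incident timestamps and collapses repeated alerts for the same service and rule. It assumes both metrics are measured from the start of the fault (MTTR is sometimes measured from detection instead) and that merging uses a fixed 300-second window; the field names and both functions are hypothetical.

```python
from statistics import mean


def mttd_mttr(incidents):
    """Return (MTTD, MTTR) in seconds, both measured from fault start (an assumption)."""
    mttd = mean([i["detected"] - i["started"] for i in incidents])
    mttr = mean([i["resolved"] - i["started"] for i in incidents])
    return mttd, mttr


def merge_alerts(alerts, window_s=300):
    """Merge alerts with the same (service, rule) that fire within window_s
    seconds of the first alert in their group."""
    merged = []
    open_groups = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["service"], alert["rule"])
        group = open_groups.get(key)
        if group is not None and alert["ts"] - group["first_ts"] <= window_s:
            group["count"] += 1  # fold the repeat into the existing notification
        else:
            group = {"service": alert["service"], "rule": alert["rule"],
                     "first_ts": alert["ts"], "count": 1}
            open_groups[key] = group
            merged.append(group)
    return merged


incidents = [
    {"started": 0, "detected": 60, "resolved": 600},
    {"started": 1000, "detected": 1120, "resolved": 1900},
]
mttd, mttr = mttd_mttr(incidents)

alerts = [
    {"ts": 0, "service": "svc-a", "rule": "latency"},
    {"ts": 50, "service": "svc-b", "rule": "errors"},
    {"ts": 100, "service": "svc-a", "rule": "latency"},
    {"ts": 200, "service": "svc-a", "rule": "latency"},
    {"ts": 1000, "service": "svc-a", "rule": "latency"},
]
merged = merge_alerts(alerts)
```

Five raw alerts become three notifications here: the three early `svc-a` latency alerts merge into one, while the alert at `ts=1000` falls outside the window and opens a new group.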
7. Cloud‑Native & AIOps Practices – Container monitoring uses Prometheus with VictoriaMetrics as remote storage; agents pull metrics from cAdvisor and send them through Bees‑Bus to Kafka, where detection services evaluate them. AIOps fault location combines CMDB topology, variance analysis, and K‑Means clustering to generate cause‑chain graphs that pinpoint abnormal services.
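The talk summary does not say how these techniques are combined, so the following is only one plausible sketch. It scores each service by how far its current latency has moved from its baseline (in baseline standard deviations), splits the scores into two groups with a tiny one-dimensional K‑Means, and then uses a dependency map to keep only abnormal services that have no abnormal dependency. The service names, latency figures, dependency map, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def deviation_score(baseline, current):
    """Shift of the current mean from the baseline mean, in baseline std devs."""
    spread = pstdev(baseline) or 1.0  # avoid dividing by zero on a flat baseline
    return abs(mean(current) - mean(baseline)) / spread


def kmeans_1d(values, iterations=20):
    """Two-cluster K-Means on scalars, seeded with the min and max value.
    Label 1 is the cluster around the larger center."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for c in (0, 1):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                centers[c] = mean(members)
    return labels


def root_causes(abnormal, depends_on):
    """Abnormal services none of whose dependencies are abnormal."""
    return sorted(s for s in abnormal
                  if not any(d in abnormal for d in depends_on.get(s, [])))


# Hypothetical latency samples (ms): (baseline, current) per service.
latency = {
    "gateway": ([100, 102, 98, 101, 99], [103, 101, 100]),
    "order": ([50, 52, 48, 51, 49], [90, 95, 92]),
    "user": ([20, 21, 19, 20, 20], [20, 21, 20]),
    "payment": ([80, 82, 78, 81, 79], [130, 128, 135]),
}
# Hypothetical CMDB topology: service -> services it calls.
depends_on = {"gateway": ["order", "user"], "order": ["payment"], "payment": []}

names = list(latency)
scores = [deviation_score(*latency[name]) for name in names]
labels = kmeans_1d(scores)
abnormal = {name for name, label in zip(names, labels) if label == 1}
causes = root_causes(abnormal, depends_on)
```

In this toy data both `order` and `payment` land in the abnormal cluster, but `order` depends on `payment`, so the chain `order` to `payment` points at `payment` as the likely origin.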
8. Observability & Unified Platform – Combining metrics, logs, and traces, Vivo builds an observability dashboard that shows service health, dependencies, and detailed logs/metrics. Future work aims to tightly integrate alerting with fault‑location, automate event creation, and connect with CMDB for richer data.
9. Capability Serviceization – Monitoring capabilities (metrics, charts, alerts) will be exposed as APIs or independent services, enabling downstream systems (e.g., CI/CD pipelines) to consume real‑time monitoring data without leaving their workflow.
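A CI/CD pipeline consuming monitoring data might look like the sketch below: a deployment gate that asks a monitoring API for a service's current error rate and blocks the release when it exceeds a limit. The real API is not described in the source, so the lookup is passed in as a callable; `deployment_gate`, `fake_monitoring_api`, the service names, and the 1% limit are all hypothetical.

```python
from typing import Callable


def deployment_gate(fetch_error_rate: Callable[[str], float],
                    service: str,
                    max_error_rate: float = 0.01) -> bool:
    """Return True when the service's current error rate allows a rollout.

    fetch_error_rate stands in for a call to a monitoring API.
    """
    return fetch_error_rate(service) <= max_error_rate


def fake_monitoring_api(service: str) -> float:
    """Stub that returns canned error rates instead of calling a real endpoint."""
    return {"checkout": 0.002, "search": 0.05}[service]


can_deploy_checkout = deployment_gate(fake_monitoring_api, "checkout")
can_deploy_search = deployment_gate(fake_monitoring_api, "search")
```

Injecting the lookup keeps the gate testable without network access and lets a pipeline swap in whichever client the monitoring service eventually exposes.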
Overall, Vivo's monitoring platform illustrates a transition from simple, fragmented tools to a unified, cloud‑native, AI‑enhanced observability system that underpins large‑scale service reliability.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
