
Vivo Container Cluster Monitoring Architecture and Cloud‑Native Observability Practices

Vivo’s cloud‑native monitoring solution combines highly available Prometheus clusters, VictoriaMetrics storage, Grafana visualization, and a custom leader‑election Adapter that deduplicates metrics forwarded to Kafka and OLAP systems. The design addresses large‑scale performance, scalability, and integration challenges and paves the way toward AI‑driven AIOps.

vivo Internet Technology

With the spread of container technology and the adoption of Kubernetes as the de facto standard for container orchestration, cloud‑native concepts and architectures are increasingly applied in production environments. Traditional monitoring systems can no longer cope with the elasticity, dynamic lifecycles, and microservice architectures of cloud‑native workloads, prompting the emergence of a new generation of cloud‑native monitoring solutions.

Prometheus has become the de facto standard for cloud‑native monitoring. While it offers powerful query capabilities, easy operation, efficient storage, and straightforward configuration, a single Prometheus instance cannot satisfy the diverse monitoring needs of large‑scale production environments; a tailored monitoring architecture is therefore required.

This article is based on Vivo’s experience with container‑cluster monitoring and discusses how to construct a cloud‑native monitoring system, the challenges encountered, and the corresponding countermeasures.

2. Cloud‑Native Monitoring System

2.1 Features and Value of Cloud‑Native Monitoring

Monitoring system deployed in a cloud‑native way: standardized deployment and upgrades, unified orchestration and scheduling, and elastic scaling.

Unified cloud‑native monitoring standards: standard monitoring interfaces and standard data formats.

Minimal intrusion on business services: cloud‑native applications expose built‑in metrics, legacy applications use exporters, and integration complexity stays low.

Integrated cloud‑native design: fits dynamic container lifecycles and handles massive monitoring data at container scale.

Rich community ecosystem: abundant open‑source projects with continuous evolution, strong community support, and production experience from top vendors worldwide.

2.2 Overview of the Cloud‑Native Monitoring Ecosystem

The CNCF landscape lists many monitoring projects (see https://landscape.cncf.io/). The most relevant ones are:

Prometheus (Graduated): A powerful monitoring system and time‑series database with a rich query language (PromQL). It supports pull‑based metric collection, but a single instance is limited by local storage and lacks high availability.

Cortex (Incubating): Extends Prometheus with multi‑tenant support and persistent storage, enabling horizontal scaling and unified queries across multiple Prometheus instances.

Thanos (Incubating): Stores Prometheus data in object storage for low‑cost long‑term retention, provides a global query view, and uses down‑sampling to accelerate historic queries.

Grafana: Open‑source visualization suite that connects to many data sources (including Prometheus, VictoriaMetrics, etc.) and offers flexible dashboards and alerting.

VictoriaMetrics: High‑performance, cost‑effective time‑series database that can serve as remote storage for Prometheus, offering high compression, deduplication, and horizontal scalability.

Alertmanager: Handles alerts from Prometheus, applying grouping, silencing, and routing to various receivers (email, Slack, webhook, etc.).
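Alertmanager's grouping behavior can be illustrated with a small sketch: alerts that share the same values for the configured grouping labels (here, hypothetically, alertname and cluster) are batched into a single notification before routing. This is an illustration of the concept, not Alertmanager's actual implementation.

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "cluster")):
    """Batch alerts sharing the same values for the group_by labels,
    mimicking how Alertmanager groups firing alerts before routing."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(lbl, "") for lbl in group_by)
        groups[key].append(alert)
    return dict(groups)

alerts = [
    {"labels": {"alertname": "HighCPU", "cluster": "bj-1", "pod": "api-0"}},
    {"labels": {"alertname": "HighCPU", "cluster": "bj-1", "pod": "api-1"}},
    {"labels": {"alertname": "DiskFull", "cluster": "sh-2", "pod": "db-0"}},
]
groups = group_alerts(alerts)
# Two groups: both HighCPU alerts from cluster bj-1 collapse into one notification.
```

Grouping like this is what keeps a node failure from producing hundreds of per-pod pages.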

2.3 Building a Simple Cloud‑Native Monitoring Stack

The official Prometheus architecture diagram (source: Prometheus community) illustrates a basic stack:

The components are deployed as containers managed by Kubernetes. Prometheus collects metrics via file‑based configuration or service discovery, applications expose metrics directly or via exporters, short‑lived custom metrics can be pushed through Pushgateway, alerts are sent to Alertmanager, and Grafana visualizes the data.
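An application exposing metrics directly can be sketched with only the standard library: serve a counter in the Prometheus text exposition format on a `/metrics` endpoint (production services would typically use an official client library such as prometheus_client instead; the metric name here is illustrative).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = {"value": 0}  # a single counter, for illustration

def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total HTTP requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT['value']}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            REQUEST_COUNT["value"] += 1
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

# To serve for real, Prometheus would scrape this endpoint:
# HTTPServer(("", 8000), MetricsHandler).serve_forever()
```

Prometheus then discovers the pod (via service discovery) and pulls this endpoint on its scrape interval.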

2.4 Designing a Capable, Stable, and Efficient Cloud‑Native Monitoring System

While the community‑recommended architecture works, it faces several production‑level issues:

Single‑node Prometheus cannot store large amounts of historic data.

Lacks high availability.

No horizontal scaling.

Insufficient multi‑dimensional analysis capabilities.

To address these, Vivo combines its own container‑cluster monitoring experience with industry best practices and proposes a layered architecture:

Deploy monitoring components natively on Kubernetes.

Use a Prometheus cluster with high‑availability configurations for metric collection.

Route alerts through Alertmanager and a custom webhook to the corporate alert center.

Store long‑term data in a highly available time‑series database (VictoriaMetrics) and also in a data‑warehouse for richer analytics.

Perform advanced data analysis, including machine‑learning‑based fault prediction and self‑healing.

3. Vivo Container‑Cluster Monitoring Architecture

3.1 Overall High‑Availability Design

The design considers production scale (clusters of 1,000–2,000 nodes) and requirements such as high availability, comprehensive metric coverage (cluster, host, and business metrics), visual alert configuration, and extensive reporting.

Each production cluster has dedicated monitoring nodes; Prometheus instances are deployed in pairs for HA.

VictoriaMetrics clusters are deployed per data center; Prometheus remote‑writes to VM, which provides multi‑replica storage.

Grafana runs stateless with MySQL backend and uses VictoriaMetrics as a data source.

Prometheus health is monitored via synthetic checks; failures trigger alerts.

Cluster‑level alerts are configured in Grafana and forwarded via a custom webhook with hierarchical routing.

Business‑level metrics are forwarded by a custom Adapter to Kafka, then stored in the corporate monitoring platform and Druid for multi‑dimensional reports.
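The synthetic health checks on Prometheus mentioned above can be reduced to a simple idea: flag any instance whose last successful scrape is older than a threshold. The sketch below is an assumed simplification, not vivo's actual probe.

```python
import time

def stale_instances(last_scrape, now=None, max_age=120.0):
    """Return instances whose most recent successful scrape is older than
    max_age seconds -- a minimal synthetic health check."""
    now = time.time() if now is None else now
    return sorted(inst for inst, ts in last_scrape.items() if now - ts > max_age)

now = 1_700_000_000.0
last_scrape = {
    "prometheus-0": now - 30,    # healthy: scraped 30 s ago
    "prometheus-1": now - 600,   # stale: likely down, should trigger an alert
}
# stale_instances(last_scrape, now=now) -> ["prometheus-1"]
```

Any instance returned by such a check would feed the alerting path described above.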

Why not adopt Cortex or Thanos? Cortex lacks sufficient public documentation, and Thanos requires object storage and a side‑car deployment that complicates the existing Operator‑based Prometheus setup. Moreover, Thanos introduces CPU and network spikes during compaction. VictoriaMetrics, on the other hand, offers simple deployment, excellent compression, deduplication, and low operational overhead.
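VictoriaMetrics deduplicates samples from the HA Prometheus pair by keeping a single sample per discrete interval (controlled by its -dedup.minScrapeInterval flag, which keeps the sample with the largest timestamp in each interval). A simplified sketch of that idea:

```python
def deduplicate(samples, interval):
    """Keep one (timestamp, value) sample per time bucket of `interval` ms,
    preferring the latest timestamp in each bucket -- a simplified version
    of interval-based deduplication for HA replica pairs."""
    buckets = {}
    for ts, value in samples:
        bucket = ts // interval
        if bucket not in buckets or ts >= buckets[bucket][0]:
            buckets[bucket] = (ts, value)
    return [buckets[b] for b in sorted(buckets)]

# Two replicas scraping every 30 s produce near-duplicate samples:
samples = [(0, 1.0), (5, 1.0), (30_000, 2.0), (30_010, 2.0)]
# deduplicate(samples, 30_000) -> [(5, 1.0), (30_010, 2.0)]
```

The real implementation operates on merged sorted streams per series, but the invariant is the same: one sample survives per interval.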

3.2 High‑Availability Design of the Data‑Forwarding Layer

Prometheus double‑replica deployment introduces potential duplicate data when forwarding to downstream systems. VictoriaMetrics handles deduplication on the storage side, but data sent to Kafka and OLAP layers also needs deduplication. Vivo’s solution uses a custom Adapter with a “group leader election” mechanism: each Prometheus replica has a pair of Adapters; only the leader group forwards data, ensuring deduplication while leveraging Kubernetes services for load balancing. If the leader Prometheus fails, the leader Adapter steps down and the standby group takes over, preserving the “dual‑replica + deduplication” guarantee.
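Vivo's Adapter implementation is not public, but the group-leader idea can be illustrated in miniature: adapters compete for a shared lease, and only the group holding the lease forwards samples downstream; if the leader group stops renewing, the standby group acquires the lease. The LeaseStore class, group names, and TTL below are all illustrative assumptions.

```python
class LeaseStore:
    """Stand-in for a shared coordination backend (in a real deployment this
    role would be played by something like etcd or a Kubernetes Lease)."""
    def __init__(self):
        self.holder, self.expires = None, 0.0

    def try_acquire(self, group, now, ttl=15.0):
        # Acquire if free, expired, or already held by us (renewal).
        if self.holder is None or now >= self.expires or self.holder == group:
            self.holder, self.expires = group, now + ttl
            return True
        return False

class Adapter:
    def __init__(self, group, store):
        self.group, self.store = group, store
        self.forwarded = []

    def handle(self, sample, now):
        """Forward only while our group holds the lease, so downstream
        systems (Kafka, OLAP) see each sample exactly once."""
        if self.store.try_acquire(self.group, now):
            self.forwarded.append(sample)

store = LeaseStore()
primary, standby = Adapter("group-a", store), Adapter("group-b", store)
primary.handle("m1", now=0.0)   # group-a acquires the lease and forwards
standby.handle("m1", now=1.0)   # duplicate from the other replica is dropped
standby.handle("m2", now=20.0)  # lease expired: the standby group takes over
```

In production each group would hold a pair of Adapters behind a Kubernetes Service for load balancing, and the leader would renew its lease continuously; the failover behavior is the same.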

4. Challenges and Countermeasures in Container‑Cluster Monitoring

The practice revealed several challenges:

Large‑scale performance. Challenge: uneven load caused by manual sharding of Prometheus instances, with occasional data loss under high pressure. Countermeasure: adopt community projects for automatic sharding and target load balancing, upgrade Prometheus, and reduce OLAP sampling precision.

Time‑series database performance and scaling. Challenge: the storage layer must support higher throughput and capacity. Countermeasure: leverage VictoriaMetrics’ horizontal scalability and prune unnecessary metrics.

Adoption of cloud‑native monitoring. Challenge: integration with the existing corporate monitoring platform, and richer metric dimensions after full containerization. Countermeasure: ensure compatibility of data formats, continuously follow ecosystem evolution, and educate business teams on monitoring principles and exporter development.
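The automatic-sharding countermeasure above follows the same principle as Prometheus's hashmod relabel action: each of N Prometheus shards keeps only the targets whose address hashes to its shard index. A sketch of the partitioning logic (Prometheus applies an MD5-based hash during relabeling; the addresses here are made up):

```python
import hashlib

def shard_of(target, num_shards):
    """Stable shard assignment, analogous to Prometheus's `hashmod`
    relabel action applied to the target address."""
    digest = hashlib.md5(target.encode()).hexdigest()
    return int(digest, 16) % num_shards

def targets_for_shard(targets, shard, num_shards):
    return [t for t in targets if shard_of(t, num_shards) == shard]

targets = [f"10.0.0.{i}:9100" for i in range(8)]
shards = [targets_for_shard(targets, s, 3) for s in range(3)]
# Every target lands on exactly one shard; the three lists partition `targets`.
```

Because the assignment depends only on the target address, adding a shard requires re-partitioning, which is why operator-style controllers are usually layered on top for rebalancing.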

5. Future Outlook

The ultimate goal of monitoring is efficient, reliable operations, progressing toward automated and intelligent (AIOps) operations. Future enhancements may include:

Automated Prometheus sharding and target load balancing.

AI‑driven fault prediction.

Self‑healing mechanisms.

Data‑driven alert threshold tuning.

Optimized alert governance strategies.
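Data-driven threshold tuning can start very simply: derive the alert threshold from recent history, for example the rolling mean plus k standard deviations (an assumed baseline method for illustration, not vivo's production algorithm).

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """Alert threshold = mean + k * stdev of the recent observation window."""
    return statistics.fmean(history) + k * statistics.pstdev(history)

def should_alert(value, history, k=3.0):
    return value > dynamic_threshold(history, k)

cpu_history = [40, 42, 41, 43, 39, 41, 40, 42]  # percent, last 8 intervals
should_alert(55, cpu_history)  # a spike well above the learned baseline fires
should_alert(43, cpu_history)  # normal fluctuation stays quiet
```

Even this crude baseline removes the need to hand-tune a static threshold per workload; production systems typically add seasonality handling on top.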

Monitoring architectures must evolve continuously with changing workloads and emerging technologies.
