
Building a Scalable Container Monitoring System with Prometheus and VictoriaMetrics at vivo

The vivo Internet Container Team built a scalable, highly available container monitoring platform: dual‑replica Prometheus clusters deduplicated by a custom HA adapter, remoteWrite to VictoriaMetrics for persistence, and a Kafka forwarder for unified monitoring. By cutting metric cardinality, tuning cAdvisor, and upgrading VictoriaMetrics, the team eliminated data loss and storage spikes, achieving stable, efficient monitoring.

vivo Internet Technology

This article, authored by the vivo Internet Container Team (Han Rucheng), introduces the container‑cluster monitoring system built on the cloud‑native monitoring ecosystem (Prometheus, VictoriaMetrics, Kafka, etc.). It describes the background, architecture, encountered problems, and the solutions applied in vivo’s large‑scale container environment.

Background

As vivo migrated its services onto a container platform, the volume of monitoring metrics grew rapidly. The team had to absorb this surge in metric volume while also addressing data loss and rising load on the backend storage.

Monitoring Architecture

The architecture consists of a dual‑replica Prometheus cluster collecting exporter data, an adapter layer for high‑availability, remoteWrite to VictoriaMetrics for persistence, and a Kafka‑adapter that forwards data to Kafka for unified monitoring.

Key features:

High availability: dual‑replica Prometheus with multi‑instance adapters performing leader election.

Data persistence: remoteWrite to VictoriaMetrics.

Unified monitoring: data forwarded to Kafka for consumption by other monitoring services.

Native Prometheus offers no standard HA solution, so the team implemented a custom adapter that performs group-based leader election to deduplicate the dual‑replica data and provide high availability.
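A minimal sketch of the adapter's deduplication step. The sample structure and the prometheus_replica label are illustrative assumptions, not vivo's actual implementation; the real adapter also elects a leader among its own instances.

```python
# Sketch: drop the duplicate copy produced by a dual-replica Prometheus pair.
# Assumes (hypothetically) that each replica tags its samples with a
# "prometheus_replica" label before remote-writing to the adapter.

def dedup_samples(samples, leader_replica):
    """Keep only samples that came from the elected leader replica."""
    return [s for s in samples
            if s["labels"].get("prometheus_replica") == leader_replica]

# Both replicas scrape the same targets, so every series arrives twice.
samples = [
    {"labels": {"__name__": "up", "prometheus_replica": "replica-0"}, "value": 1.0},
    {"labels": {"__name__": "up", "prometheus_replica": "replica-1"}, "value": 1.0},
]
deduped = dedup_samples(samples, leader_replica="replica-0")
print(len(deduped))  # 1
```

On leader failure, the adapter group re-elects and simply switches which replica's samples pass through.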

Problems Observed

1. Rapid increase of monitoring component load

The total series count is the product of Pod count, metrics per Pod, and label-value combinations per metric: TotalSeries = PodNum * PerPodMetrics * PerLabelCount. As the container platform grew, this multiplicative growth drove high CPU and memory usage in both Prometheus and VictoriaMetrics.
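A quick back-of-the-envelope calculation shows how fast this multiplies (the figures below are illustrative, not vivo's actual numbers):

```python
# Worked example of the series-count formula from the text:
# TotalSeries = PodNum * PerPodMetrics * PerLabelCount
pod_num = 10_000         # Pods in the cluster
per_pod_metrics = 100    # distinct metric names emitted per Pod
per_label_count = 10     # label-value combinations per metric
total_series = pod_num * per_pod_metrics * per_label_count
print(total_series)  # 10000000
```

Ten thousand Pods with modest per-Pod cardinality already means ten million active series, which is why cardinality governance comes first in the solutions below.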

2. Monitoring data loss (missing points)

Metrics such as container_cpu_user_seconds_total were expected every 10 s (6 points per minute), but only 4 values were recorded in some minutes, indicating dropped points.

3. Backend storage load spike

VictoriaMetrics exhibited occasional load spikes during indexdb merge operations, leading to increased latency and alerting issues.

Solutions

4.1 Metric Governance

4.1.1 Metric Filtering

• Filter unused metrics: Use scrape_samples_scraped to identify high‑volume targets and apply regex drop rules in the ServiceMonitor. Example to drop metrics starting with container_threads:

# Drop metrics whose names start with container_threads
- action: drop
  regex: container_threads(.*)
  sourceLabels:
  - __name__

Result: the sample count fell from 10M to 2.5M, CPU usage dropped by 70%, and memory usage by 55%.
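To decide which targets to filter first, Prometheus's built-in scrape_samples_scraped metric can be queried directly; a query along these lines (illustrative) surfaces the heaviest scrape targets:

```promql
# Top 10 scrape targets by number of samples collected per scrape
topk(10, scrape_samples_scraped)
```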

• Filter low‑priority Pod metrics: Exclude Pods of DaemonSets that only need liveness checks. Example to drop Pod metrics of the telegraf DaemonSet in the monitoring namespace:

# Drop Pod metrics of the telegraf DaemonSet in the monitoring namespace
- action: drop
  regex: monitoring@telegraf(.*)
  separator: '@'
  sourceLabels:
  - namespace
  - pod

After filtering, cAdvisor data volume dropped by 70%.

4.1.2 Performance Optimization

• Balance Prometheus load: Separate monitoring targets by type (Container, Component, Host, Resource) and run each as a dual‑replica Prometheus. Shift container‑heavy workloads to a less‑loaded Prometheus, reducing container‑Prometheus load by ~40 %.

• Reduce Prometheus local retention: Change retention from 2 weeks to 2 days (since long‑term storage is handled by VictoriaMetrics). This cut memory consumption by ~40 %.
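Under the Prometheus Operator, the two measures above can be expressed in the Prometheus custom resource. The manifest below is an illustrative sketch (the resource name, label key, and remoteWrite URL are assumptions, not vivo's actual configuration): one dual‑replica Prometheus per target type, selected by a label on its ServiceMonitors, with local retention shortened to 2 days since VictoriaMetrics keeps the long-term copy.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-container   # one such resource per target type
  namespace: monitoring
spec:
  replicas: 2                  # dual-replica HA pair
  retention: 2d                # short local retention; VM holds history
  serviceMonitorSelector:
    matchLabels:
      monitor-type: container  # only container-type ServiceMonitors
  remoteWrite:
  - url: http://vminsert.monitoring.svc:8480/insert/0/prometheus
```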

4.2 Fixing Data‑Loss (Missing Points)

Root cause: kubelet‑embedded cAdvisor’s default housekeeping interval (10 s) combined with dynamic housekeeping meant consecutive scrapes could return cached samples carrying the same stale timestamp; Prometheus deduplicates identical timestamps, so points went missing.

Solution: Deploy a dedicated cAdvisor DaemonSet with the flags -allow_dynamic_housekeeping=false and -housekeeping_interval=1s, and configure the ServiceMonitor to ignore cAdvisor’s timestamps (honorTimestamps: false, interval: 10s). After deployment, the missing‑point issue was resolved, and a full set of 6 data points per minute was observed.

# cAdvisor DaemonSet container args
- -allow_dynamic_housekeeping=false
- -housekeeping_interval=1s

# ServiceMonitor endpoint configuration
spec:
  endpoints:
  - honorLabels: true
    honorTimestamps: false
    interval: 10s

4.3 Backend Database Load Spike

Problem: VictoriaMetrics indexdb merge caused CPU/memory spikes, leading to remote_write latency.

Solution: Upgrade VictoriaMetrics to version 1.73.0, which optimizes indexdb merge performance. After upgrade, no noticeable load spikes were observed.

Conclusion

The team achieved a more stable and efficient monitoring system by reducing metric cardinality, balancing Prometheus workloads, shortening retention, deploying a tuned cAdvisor, and upgrading VictoriaMetrics. Future work will continue to improve data collection, query performance, and exporter efficiency.

Tags: Monitoring, cloud-native, kubernetes, container, Prometheus, VictoriaMetrics, Metrics Optimization
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
