How Vivo Scaled Container Monitoring with Prometheus, Kafka, and VictoriaMetrics

This article details how Vivo's container platform faced exploding metric volumes, component overload, data gaps, and storage spikes, and explains the step‑by‑step architectural redesign, metric governance, performance tuning, cAdvisor redeployment, and VictoriaMetrics upgrade that restored high‑availability, low‑latency monitoring across a large Kubernetes fleet.

Architect

Sep 7, 2023

Background

Vivo migrated its services to a container platform, causing the number of monitored metrics to increase by several orders of magnitude. The existing Prometheus‑based monitoring stack could not keep up with the rapid growth in time‑series data.

Monitoring Architecture

The architecture has four parts: a dual‑replica Prometheus layer that scrapes exporters; an adapter layer that performs group‑based leader election for high availability; remote_write to VictoriaMetrics for persistent storage; and a Kafka adapter that forwards the same data to internal monitoring services.

High Availability: Two Prometheus instances per cluster, each paired with an adapter group; only the leader group forwards data.

Data Persistence: RemoteWrite sends samples to VictoriaMetrics, which serves as the Grafana data source.

Unified Monitoring: RemoteWrite also pushes data to Kafka; downstream services consume the same stream for alerts and dashboards.
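
As a rough illustration of the fan‑out described above, a Prometheus custom resource managed by Prometheus Operator could declare two replicas and two remote‑write destinations. This is only a sketch: the resource name, namespace, and both URLs are hypothetical, and the leader‑election adapter is left out because the article does not describe its interface.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: container-prometheus   # hypothetical name
  namespace: monitoring        # hypothetical namespace
spec:
  replicas: 2                  # dual-replica layer for HA
  remoteWrite:
  # Persistent storage: VictoriaMetrics cluster ingest endpoint (host name is assumed)
  - url: http://vminsert.monitoring.svc:8480/insert/0/prometheus/api/v1/write
  # Unified monitoring: hypothetical Kafka adapter that accepts remote_write
  - url: http://kafka-adapter.monitoring.svc:8080/receive
```

Without the adapter's leader election, both replicas would write the same samples, so deduplication would have to happen downstream.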

Observed Problems

Three major symptoms emerged as the platform grew:

Rapid load increase on monitoring components: the number of time series grows as the product TotalSeries = PodNum * PerPodMetrics * PerLabelCount, so pod growth is amplified by the per‑pod metric count and the label cardinality, exhausting CPU and memory on Prometheus and VictoriaMetrics.

Data gaps: with a 10 s scrape interval, each series should have six points per minute, but only four were observed for container_cpu_user_seconds_total, which at first looked like dropped samples.

Sudden load spikes in the backend database: VictoriaMetrics v1.59.1‑cluster showed intermittent CPU and memory spikes during indexdb merge operations, which delayed remote_write.
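
The data‑gap symptom can be watched for automatically. The following is a minimal sketch of a PrometheusRule that fires when a cAdvisor CPU series has fewer than six stored samples in a one‑minute window; the rule name, namespace, alert name, severity, and the 5m hold time are all assumptions chosen for illustration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cadvisor-sample-gaps   # hypothetical name
  namespace: monitoring        # hypothetical namespace
spec:
  groups:
  - name: cadvisor-gaps
    rules:
    - alert: CadvisorSamplesMissing
      # A 10s scrape interval should give 6 samples per 1m window;
      # fire when fewer are stored for a series.
      expr: count_over_time(container_cpu_user_seconds_total[1m]) < 6
      for: 5m                  # tolerate short-lived boundary effects
      labels:
        severity: warning
      annotations:
        summary: Fewer than six samples per minute for a cAdvisor CPU series
```

On a large fleet this expression yields one alert per affected series, so an aggregated form (for example, counting affected series per node) may be more practical.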

Solution Overview

The remediation strategy addressed each symptom through metric governance, performance optimization, cAdvisor redeployment, and a VictoriaMetrics version upgrade.

4.1 Metric Governance

4.1.1 Filter Unused Metrics

Using scrape_samples_scraped to identify high‑volume targets, the team wrote regular‑expression drop rules (metricRelabelings entries) in the ServiceMonitor for the cAdvisor target, removing metrics whose names start with container_threads:

# Drop metrics starting with container_threads
- action: drop
  regex: container_threads(.*)
  sourceLabels:
  - __name__

After applying the filters, sample count dropped from 10 M to 2.5 M per scrape, reducing Prometheus CPU by 70 % and memory by 55 %.
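
For context, here is a sketch of where such a drop rule sits inside a complete ServiceMonitor. The resource name, namespace, Service label, port name, and authentication settings are assumptions modeled on a typical kubelet scrape setup, not details from the original configuration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet-cadvisor       # hypothetical name
  namespace: monitoring        # hypothetical namespace
spec:
  selector:
    matchLabels:
      k8s-app: kubelet         # assumed label on the kubelet Service
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: https-metrics        # assumed port name
    scheme: https
    path: /metrics/cadvisor
    interval: 10s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true
    # metricRelabelings run after the scrape and before ingestion
    metricRelabelings:
    - action: drop
      regex: container_threads(.*)
      sourceLabels:
      - __name__
```

A query such as topk(10, scrape_samples_scraped) is one way to list the heaviest targets before deciding which metric families to drop.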

4.1.2 Filter Low‑Priority DaemonSet Pods

Since DaemonSet pods contributed ~70 % of pod‑level metrics, the team excluded memory and CPU metrics for those pods by matching namespace and pod name patterns:

# Drop metrics from telegraf DaemonSet in monitoring namespace
- action: drop
  regex: monitoring@telegraf(.*)
  separator: '@'
  sourceLabels:
  - namespace
  - pod

This reduced cAdvisor’s per‑scrape data volume by 70 %.
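
The rule above matches on namespace and pod only, so it drops every series for the matching pods. If the intent is to drop just the CPU and memory series, one possible variant adds __name__ as the first source label. The metric‑name pattern below is an assumption, not taken from the original configuration.

```yaml
# Drop only CPU and memory series from the telegraf DaemonSet pods
- action: drop
  # Joined value looks like: container_cpu_usage_seconds_total@monitoring@telegraf-x7k2p
  regex: container_(cpu|memory)_.*@monitoring@telegraf.*
  separator: '@'
  sourceLabels:
  - __name__
  - namespace
  - pod
```

Prometheus anchors relabeling regexes at both ends, which is why the pattern has to cover the whole joined string, including the trailing pod suffix.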

4.2 Performance Optimization

4.2.1 Balance Prometheus Load

Prometheus‑Operator was used to manage Prometheus instances. The team split targets by type, assigning container‑heavy workloads to dedicated Prometheus instances and moving some kubelet and kube‑state‑metrics targets to a “resource‑Prometheus”. This rebalancing cut the container‑Prometheus load by ~40 % and eliminated frequent restarts.
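
One way to express this split with Prometheus Operator is to give each Prometheus resource its own serviceMonitorSelector, so that each instance picks up only the ServiceMonitors labeled for it. The sketch below assumes a hypothetical `prometheus` label on the ServiceMonitors; the resource names and namespace are likewise illustrative.

```yaml
# Instance dedicated to container (cAdvisor) targets
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: container-prometheus   # hypothetical name
  namespace: monitoring
spec:
  replicas: 2
  serviceMonitorNamespaceSelector: {}   # empty selector: look in all namespaces
  serviceMonitorSelector:
    matchLabels:
      prometheus: container    # hypothetical label on container ServiceMonitors
---
# Instance for kubelet and kube-state-metrics targets
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: resource-prometheus    # hypothetical name
  namespace: monitoring
spec:
  replicas: 2
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      prometheus: resource     # hypothetical label on resource ServiceMonitors
```

Moving a target from one instance to the other then only requires changing the label on its ServiceMonitor.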

4.2.2 Reduce Prometheus Storage Retention

The default 2‑week retention caused high memory usage. The team shortened local storage to 2 days (while still remote‑writing to VictoriaMetrics). This cut Prometheus memory consumption by 40 %.
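
With Prometheus Operator, local retention is a single field on the Prometheus resource. A minimal fragment, assuming a hypothetical resource name and namespace:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: container-prometheus   # hypothetical name
  namespace: monitoring        # hypothetical namespace
spec:
  retention: 2d                # local TSDB retention; long-term data stays in VictoriaMetrics
```

For a Prometheus started directly rather than through the operator, the equivalent setting is the --storage.tsdb.retention.time=2d flag.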

4.3 Fixing Data Gaps (cAdvisor Issue)

Investigation showed that the cAdvisor built into kubelet refreshes metrics every 10 s by default and sometimes had not refreshed between two Prometheus scrapes. Because Prometheus honored cAdvisor's own timestamps, consecutive scrapes could return samples with identical timestamps, which appear as a single point and therefore look like missing data. The team deployed a dedicated cAdvisor DaemonSet with modified flags:

# Disable dynamic housekeeping and set interval to 1s
- -allow_dynamic_housekeeping=false
- -housekeeping_interval=1s

They also configured the ServiceMonitor to ignore cAdvisor’s own timestamps:

spec:
  endpoints:
  - honorLabels: true
    honorTimestamps: false
    interval: 10s

After deployment, each one‑minute window of container_cpu_user_seconds_total consistently contained six points, confirming that the gap was resolved.
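
A standalone cAdvisor DaemonSet carrying these flags might look like the sketch below. The image version, resource name, namespace, labels, and host paths are assumptions; in particular, the /var/lib/docker mount presumes a Docker runtime and would differ for containerd.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor               # hypothetical name
  namespace: monitoring        # hypothetical namespace
  labels:
    app: cadvisor
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
      - name: cadvisor
        image: gcr.io/cadvisor/cadvisor:v0.47.2   # assumed version; pin one you have validated
        args:
        - -allow_dynamic_housekeeping=false       # keep a fixed refresh interval
        - -housekeeping_interval=1s               # refresh well within the 10s scrape interval
        ports:
        - name: http
          containerPort: 8080
        volumeMounts:
        - name: rootfs
          mountPath: /rootfs
          readOnly: true
        - name: var-run
          mountPath: /var/run
          readOnly: true
        - name: sys
          mountPath: /sys
          readOnly: true
        - name: docker
          mountPath: /var/lib/docker
          readOnly: true
      volumes:
      - name: rootfs
        hostPath:
          path: /
      - name: var-run
        hostPath:
          path: /var/run
      - name: sys
        hostPath:
          path: /sys
      - name: docker
        hostPath:
          path: /var/lib/docker
```

A 1 s housekeeping interval raises cAdvisor's own CPU cost on every node, so it is worth measuring that overhead before rolling it out fleet‑wide.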

4.4 Backend Database Load Spike Mitigation

Analysis of VictoriaMetrics showed that indexdb merge operations caused the periodic CPU and memory spikes, which in turn delayed remote_write. Upgrading from v1.59.1‑cluster to v1.73.0, which optimizes merge behavior, eliminated the spikes.
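
Remote‑write delay of this kind can be observed from the Prometheus side. The sketch below compares the newest timestamp in the local TSDB with the newest timestamp successfully sent to the remote endpoint. The expression follows a commonly used alerting pattern; the rule name, namespace, 120‑second threshold, and 15m hold time are illustrative assumptions, and the metric names may differ on older Prometheus versions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: remote-write-lag       # hypothetical name
  namespace: monitoring        # hypothetical namespace
spec:
  groups:
  - name: remote-write
    rules:
    - alert: RemoteWriteBehind
      # Seconds between the newest local sample and the newest sample sent remotely
      expr: |
        (
          max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
          - ignoring(remote_name, url) group_right
          max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
        ) > 120
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: remote_write is more than two minutes behind the local TSDB
```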

Conclusion

By filtering unused metrics, dropping low‑priority DaemonSet data, rebalancing Prometheus instances, shortening storage retention, redeploying cAdvisor with tuned parameters, and upgrading VictoriaMetrics, Vivo achieved a stable, high‑availability monitoring stack capable of handling the massive metric volume generated by its containerized workloads. Future work will continue to improve data collection, query performance, and data provision layers.
