How Vivo Scaled Container Monitoring with Prometheus, Kafka, and VictoriaMetrics
This article details how Vivo's container platform faced exploding metric volumes, component overload, data gaps, and storage spikes, and explains the step‑by‑step architectural redesign, metric governance, performance tuning, cAdvisor redeployment, and VictoriaMetrics upgrade that restored high‑availability, low‑latency monitoring across a large Kubernetes fleet.
Background
Vivo migrated its services to a container platform, causing the number of monitored metrics to increase by several orders of magnitude. The existing Prometheus‑based monitoring stack could not keep up with the rapid growth in time‑series data.
Monitoring Architecture
The architecture consists of a dual‑replica Prometheus layer that scrapes exporters, an adapter that performs group‑based leader election to achieve HA, remoteWrite to VictoriaMetrics for persistent storage, and a Kafka‑adapter that forwards data to internal monitoring services.
High Availability: Two Prometheus instances per cluster, each paired with an adapter group; only the leader group forwards data.
Data Persistence: RemoteWrite sends samples to VictoriaMetrics, which serves as the Grafana data source.
Unified Monitoring: RemoteWrite also pushes data to Kafka; downstream services consume the same stream for alerts and dashboards.
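The "only the leader group forwards" rule can be illustrated with a minimal Python sketch. The names `AdapterGroup`, `elect_leader`, and `forward` are hypothetical, and the election is reduced to "first healthy group wins"; the article does not describe the adapter's actual election mechanism.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AdapterGroup:
    """Hypothetical adapter group paired with one Prometheus replica."""
    name: str
    healthy: bool = True


def elect_leader(groups: List[AdapterGroup]) -> Optional[AdapterGroup]:
    # Assumption: the first healthy group wins. A real system would
    # typically hold a lease or lock in a coordination service instead.
    for group in groups:
        if group.healthy:
            return group
    return None


def forward(groups: List[AdapterGroup], samples_by_group: Dict[str, list]) -> list:
    """Return the samples that reach storage: only the leader's copy."""
    leader = elect_leader(groups)
    if leader is None:
        return []
    return list(samples_by_group.get(leader.name, []))


groups = [AdapterGroup("group-a"), AdapterGroup("group-b")]
# Both replicas scrape the same targets, so both hold the same samples.
samples = {"group-a": ["s1", "s2"], "group-b": ["s1", "s2"]}

stored = forward(groups, samples)                 # group-a leads
groups[0].healthy = False                         # simulate leader failure
stored_after_failover = forward(groups, samples)  # group-b takes over
```

Because both replicas scrape identical targets, a failover changes which group forwards but not what is forwarded, and downstream storage receives each sample once rather than twice.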
Observed Problems
Three major symptoms emerged as the platform grew:
Rapid load increase on monitoring components – the number of time series grows as TotalSeries = PodNum * PerPodMetrics * PerLabelCount, so it scales linearly with the pod count and eventually exhausted CPU and memory on Prometheus and VictoriaMetrics.
Data gaps (missing points) – a 10 s scrape interval should yield six points per minute, but only four were observed for container_cpu_user_seconds_total, indicating dropped samples.
Sudden load spikes in the backend database – VictoriaMetrics v1.59.1‑cluster showed intermittent CPU/memory spikes during indexdb merge operations, causing remote_write latency.
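The arithmetic behind the first two symptoms can be written out as a short Python sketch. The fleet sizes below are hypothetical and chosen only to show the linear growth; only the formula and the 10 s interval come from the text above.

```python
def total_series(pod_num: int, per_pod_metrics: int, per_label_count: int) -> int:
    """TotalSeries = PodNum * PerPodMetrics * PerLabelCount."""
    return pod_num * per_pod_metrics * per_label_count


def expected_points_per_minute(scrape_interval_s: int) -> int:
    """Number of samples one series should receive in a 60 s window."""
    return 60 // scrape_interval_s


# Hypothetical fleet sizes: ten times the pods gives ten times the series.
small = total_series(pod_num=1_000, per_pod_metrics=100, per_label_count=5)
large = total_series(pod_num=10_000, per_pod_metrics=100, per_label_count=5)

# A 10 s scrape interval should produce six points per minute.
points = expected_points_per_minute(10)
```

Any window that holds fewer samples than `expected_points_per_minute` returns is a candidate data gap.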
Solution Overview
The remediation strategy addressed each symptom through metric governance, performance optimization, cAdvisor redeployment, and a VictoriaMetrics version upgrade.
4.1 Metric Governance
4.1.1 Filter Unused Metrics
Using scrape_samples_scraped to identify high‑volume targets, the team wrote regular‑expression drop rules in the ServiceMonitor for the cAdvisor target, removing metrics prefixed with container_threads:
# Drop metrics starting with container_threads
- action: drop
  regex: container_threads(.*)
  sourceLabels:
  - __name__
After applying the filters, sample count dropped from 10 M to 2.5 M per scrape, reducing Prometheus CPU by 70 % and memory by 55 %.
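The step of finding high-volume targets with scrape_samples_scraped can be sketched in Python as a ranking over a query result. The payload below is made up; it only follows the shape of a Prometheus instant-query response, and the job names, instances, and values are assumptions for illustration.

```python
import json

# Made-up payload in the shape of a Prometheus instant-query response
# for the query `scrape_samples_scraped`.
RESPONSE = json.loads("""
{"status": "success",
 "data": {"resultType": "vector", "result": [
   {"metric": {"job": "cadvisor", "instance": "node-1"},
    "value": [1700000000, "98000"]},
   {"metric": {"job": "kubelet", "instance": "node-1"},
    "value": [1700000000, "4200"]},
   {"metric": {"job": "kube-state-metrics", "instance": "ksm-0"},
    "value": [1700000000, "31000"]}
 ]}}
""")


def top_targets(response: dict, n: int = 10) -> list:
    """Return (job, instance, samples) tuples, largest sample count first."""
    rows = [
        (
            r["metric"].get("job", ""),
            r["metric"].get("instance", ""),
            float(r["value"][1]),  # sample values arrive as strings
        )
        for r in response["data"]["result"]
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)[:n]


ranked = top_targets(RESPONSE, n=2)
```

Targets at the top of such a ranking are the natural place to start writing drop rules.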
4.1.2 Filter Low‑Priority DaemonSet Pods
Since DaemonSet pods contributed ~70 % of pod‑level metrics, the team excluded memory and CPU metrics for those pods by matching namespace and pod name patterns:
# Drop metrics from telegraf DaemonSet in monitoring namespace
- action: drop
  regex: monitoring@telegraf(.*)
  separator: '@'
  sourceLabels:
  - namespace
  - pod
This reduced cAdvisor’s per‑scrape data volume by 70 %.
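How the two drop rules select series can be checked with a small Python model of the `drop` action. This is a sketch, not Prometheus code: it assumes the documented behavior that source label values are joined with the separator (default `;`) and that the regex must match the whole joined string. Python's `re` stands in for Prometheus's RE2 engine, which behaves the same for these simple patterns. The label values are hypothetical.

```python
import re
from typing import Dict, List


def is_dropped(labels: Dict[str, str], source_labels: List[str],
               regex: str, separator: str = ";") -> bool:
    """Model a `drop` relabel rule: join the source label values with the
    separator and drop the series if the regex matches the whole string."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    return re.fullmatch(regex, joined) is not None


# The two rules from this section.
name_rule = dict(source_labels=["__name__"], regex=r"container_threads(.*)")
pod_rule = dict(source_labels=["namespace", "pod"],
                regex=r"monitoring@telegraf(.*)", separator="@")

# Hypothetical series.
threads = {"__name__": "container_threads_max",
           "namespace": "default", "pod": "web-0"}
telegraf_cpu = {"__name__": "container_cpu_user_seconds_total",
                "namespace": "monitoring", "pod": "telegraf-x7k2p"}
app_cpu = {"__name__": "container_cpu_user_seconds_total",
           "namespace": "default", "pod": "web-0"}

results = {
    "threads/name_rule": is_dropped(threads, **name_rule),
    "telegraf_cpu/name_rule": is_dropped(telegraf_cpu, **name_rule),
    "telegraf_cpu/pod_rule": is_dropped(telegraf_cpu, **pod_rule),
    "app_cpu/pod_rule": is_dropped(app_cpu, **pod_rule),
}
```

Because the match is anchored, a name such as `my_container_threads` would not be dropped by the first rule, and a telegraf pod in any namespace other than `monitoring` would survive the second.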
4.2 Performance Optimization
4.2.1 Balance Prometheus Load
Prometheus‑Operator was used to manage Prometheus instances. The team split targets by type, assigning container‑heavy workloads to dedicated Prometheus instances and moving some kubelet and kube‑state‑metrics targets to a “resource‑Prometheus”. This rebalancing cut the container‑Prometheus load by ~40 % and eliminated frequent restarts.
4.2.2 Reduce Prometheus Storage Retention
The default 2‑week retention caused high memory usage. The team shortened local storage to 2 days (while still remote‑writing to VictoriaMetrics). This cut Prometheus memory consumption by 40 %.
4.3 Fixing Data Gaps (cAdvisor Issue)
Investigation showed that the built‑in cAdvisor in kubelet refreshed metrics every 10 s (default) and sometimes did not update between Prometheus scrapes, causing duplicate timestamps and apparent missing points. The team deployed a dedicated cAdvisor DaemonSet with modified flags:
# Disable dynamic housekeeping and set interval to 1s
- -allow_dynamic_housekeeping=false
- -housekeeping_interval=1s
They also configured the ServiceMonitor to ignore cAdvisor’s own timestamps:
spec:
  endpoints:
  - honorLabels: true
    honorTimestamps: false
    interval: 10s
After deployment, every one‑minute window of container_cpu_user_seconds_total consistently showed six points, confirming that the gap was resolved.
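The mechanism behind the missing points can be reproduced with a toy Python model. It assumes that when exporter timestamps are honored, each scraped sample carries the time of the exporter's latest refresh, and that two scrapes seeing the same refresh collapse into a single stored point. The 15 s refresh spacing is an assumption chosen to illustrate stretched housekeeping, not a value measured by the team.

```python
from bisect import bisect_right
from typing import List


def points_in_window(scrape_times: List[int], refresh_times: List[int],
                     honor_timestamps: bool) -> int:
    """Count distinct sample timestamps stored for one series."""
    if not honor_timestamps:
        # Prometheus stamps each sample with its own scrape time.
        return len(set(scrape_times))
    seen = set()
    for t in scrape_times:
        idx = bisect_right(refresh_times, t) - 1  # latest refresh at or before t
        if idx >= 0:
            seen.add(refresh_times[idx])  # exporter-supplied timestamp
    return len(seen)


scrapes = list(range(0, 60, 10))   # 10 s scrape interval, one minute
slow_refresh = [0, 15, 30, 45]     # assumed stretched housekeeping
fast_refresh = list(range(0, 60))  # housekeeping_interval=1s

before = points_in_window(scrapes, slow_refresh, honor_timestamps=True)
after_fix = points_in_window(scrapes, fast_refresh, honor_timestamps=True)
ignoring = points_in_window(scrapes, slow_refresh, honor_timestamps=False)
```

Under these assumed timings the model yields four points per minute before the change and six after it, matching the counts reported above. Either fix alone restores six points in this model; the team applied both.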
4.4 Backend Database Load Spike Mitigation
Analysis of VictoriaMetrics revealed that indexdb merge operations caused periodic CPU/memory spikes, leading to remote_write delays. Upgrading to VictoriaMetrics 1.73.0, which optimizes merge behavior, eliminated the spikes.
Conclusion
By filtering unused metrics, dropping low‑priority DaemonSet data, rebalancing Prometheus instances, shortening storage retention, redeploying cAdvisor with tuned parameters, and upgrading VictoriaMetrics, Vivo achieved a stable, high‑availability monitoring stack capable of handling the massive metric volume generated by its containerized workloads. Future work will continue to improve data collection, query performance, and data provision layers.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
