How Qunar Scaled Container Monitoring with VictoriaMetrics: Lessons from Replacing Prometheus

This article details Qunar's migration from Prometheus to VictoriaMetrics for large‑scale container monitoring, covering the shortcomings of Prometheus at massive data volumes, the architectural choices made, performance improvements achieved, and future optimization plans.

dbaplus Community

Background

In cloud-native environments characterized by high elasticity, fine-grained microservices, and short-lived workloads, traditional monitoring systems (e.g., Cacti, Nagios, Zabbix) cannot keep up. Prometheus became the de facto standard, but at Qunar's scale it showed critical limitations.

Prometheus limitations

Data ingestion reaches ~100 million samples per minute per cluster, creating severe bottlenecks.

All‑in‑One design prevents horizontal scaling; a single node cannot be split.

High CPU, memory and storage consumption; unsuitable for long‑term retention.

Queries load data from disk into memory, causing OOM on large range queries.

Mitigation attempts

Sharding Prometheus by service dimension.

Running two Prometheus instances behind a load balancer for high availability.

Using Promscale as remote storage for long‑term data.

These measures reduced some pressure but introduced data loss on node failure and still required excessive resources.
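
Sharding by service dimension amounts to a fixed mapping from each service to the Prometheus instance that scrapes it. The following is a minimal sketch of one way to build such a mapping with a stable hash; it is not Qunar's actual scheme, and the function names, shard count, and service names are hypothetical.

```python
import hashlib


def shard_for(service: str, num_shards: int) -> int:
    """Map a service name to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() so the result is the
    same across processes and restarts.
    """
    digest = hashlib.sha256(service.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def assign(services, num_shards):
    """Group service names by the shard that would scrape them."""
    shards = {i: [] for i in range(num_shards)}
    for name in services:
        shards[shard_for(name, num_shards)].append(name)
    return shards


if __name__ == "__main__":
    # Hypothetical service names, for illustration only.
    print(assign(["flight-search", "hotel-order", "pay-gateway"], 2))
```

A mapping like this spreads scrape load, but each shard is still a single Prometheus process with the limitations listed above.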

Why VictoriaMetrics

VictoriaMetrics (VM) was selected because it offers full Prometheus‑compatible APIs, dramatically lower memory and disk footprints, superior query performance, and a cluster mode that scales horizontally.

Key features of VictoriaMetrics

Compatible with PromQL and adds an enhanced MetricsQL.

Works directly with Grafana via the Prometheus data source.

Query speed outperforms Prometheus.

Memory usage ~5× lower than Prometheus and ~28× lower than Promscale.

Disk usage ~7× lower than Prometheus and ~92× lower than Promscale.

Cluster version provides horizontal scaling, multi‑replica storage, and multi‑tenant support.
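
Because VM exposes the Prometheus HTTP query API, an existing client can usually be pointed at it unchanged. The sketch below sends an instant query to the standard `/api/v1/query` path and parses the standard Prometheus response for a vector result. The host name is a placeholder; port 8428 is the default for single-node VM.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for a Prometheus-style instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(body: str):
    """Extract (labels, value) pairs from an instant-query response.

    Assumes the result type is "vector", where each entry carries a
    "metric" label map and a [timestamp, "value"] pair.
    """
    doc = json.loads(body)
    if doc.get("status") != "success":
        raise RuntimeError(doc.get("error", "query failed"))
    return [(r["metric"], float(r["value"][1])) for r in doc["data"]["result"]]


def instant_query(base_url: str, promql: str, timeout: float = 10.0):
    """Run an instant query against a Prometheus-compatible endpoint."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(resp.read().decode("utf-8"))


# Example call (placeholder host):
# instant_query("http://victoria-metrics.example:8428", "up")
```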

VM deployment modes

Single‑server “All‑in‑One” Docker image; a single VM instance can handle up to 1 million data points per second.

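Cluster mode composed of three services that can be scaled independently: vminsert and vmselect, which are stateless, and vmstorage, which is stateful and holds the data. vmstorage nodes follow a shared-nothing design (they do not communicate with one another or share data), which supports high availability.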
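
In cluster mode, writes and reads go to different services, and the tenant (account ID) is part of the URL path. The helpers below build the two endpoints a Prometheus-compatible writer and reader would use. This is a sketch: the host names are placeholders, 8480 and 8481 are the default vminsert and vmselect ports, and tenant 0 is assumed.

```python
def remote_write_url(vminsert_host: str, account_id: int = 0, port: int = 8480) -> str:
    """URL that vmagent or Prometheus would use as its remote_write target."""
    return f"http://{vminsert_host}:{port}/insert/{account_id}/prometheus/api/v1/write"


def query_url(vmselect_host: str, account_id: int = 0, port: int = 8481) -> str:
    """URL for Prometheus-style instant queries through vmselect."""
    return f"http://{vmselect_host}:{port}/select/{account_id}/prometheus/api/v1/query"


if __name__ == "__main__":
    # Placeholder host names, for illustration only.
    print(remote_write_url("vminsert.example"))
    print(query_url("vmselect.example"))
```

Keeping the write and read paths on separate services is what lets ingestion capacity and query capacity be scaled independently.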

Supporting components

vmagent: lightweight collector that scrapes exporters, applies relabeling, supports Kafka integration, and writes to VM via remote_write. It can be deployed in HA with duplicate instances.

vmalert: evaluates alerting and recording rules, forwards alerts to Alertmanager, and can remote-write recording results to VM.

VM Operator CRDs: VMCluster, VMAgent, VMServiceScrape, VMPodScrape, VMRule, and VMProbe enable declarative management of the entire stack without modifying existing Prometheus resources.
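
To make the declarative model concrete, the sketch below builds a minimal VMCluster manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The name, namespace, replica counts, replication factor, and retention value are all hypothetical and are not Qunar's settings.

```python
import json


def vmcluster_manifest(name, namespace, storage_replicas, replication_factor, retention):
    """Return a minimal VMCluster custom resource as a plain dict.

    Only a handful of spec fields are set; a real deployment would also
    configure storage volumes, resources, and so on.
    """
    return {
        "apiVersion": "operator.victoriametrics.com/v1beta1",
        "kind": "VMCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "retentionPeriod": retention,             # a bare number means months
            "replicationFactor": replication_factor,  # copies of each sample
            "vmstorage": {"replicaCount": storage_replicas},
            "vmselect": {"replicaCount": 2},
            "vminsert": {"replicaCount": 2},
        },
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(json.dumps(vmcluster_manifest("dc1", "monitoring", 3, 2, "3"), indent=2))
```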

Qunar production architecture

Qunar runs one VMCluster per data center. vmagent instances are sharded by service and duplicated for HA; each instance remote‑writes to the same VMCluster, which is configured with multi‑replica storage for resilience. Promxy aggregates queries across clusters, providing a single query endpoint. vmalert together with custom Rule/Prometheus managers handles alert evaluation and forwarding.
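
Duplicated vmagent instances scrape the same targets, so the cluster receives near-identical samples twice. VM can collapse these with its `-dedup.minScrapeInterval` setting, which keeps a single sample per series in each interval. The function below is a simplified model of that idea for one series, not VM's implementation: the bucket boundaries and the tie-break on equal timestamps are assumptions made for illustration.

```python
def deduplicate(samples, interval_ms):
    """Keep one (timestamp_ms, value) sample per interval bucket.

    Within a bucket the sample with the largest timestamp wins; on equal
    timestamps the larger value wins so the result is deterministic.
    """
    kept = {}
    for ts, value in samples:
        bucket = ts // interval_ms
        best = kept.get(bucket)
        if best is None or (ts, value) > best:
            kept[bucket] = (ts, value)
    return [kept[b] for b in sorted(kept)]


if __name__ == "__main__":
    # Two agents scraping the same target a few milliseconds apart.
    merged = [(1000, 1.0), (1005, 1.0), (16000, 2.0), (16003, 2.0)]
    print(deduplicate(merged, 15000))
```

Without a step like this, an HA pair would roughly double the stored samples for every series it scrapes.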

Observed metrics after migration

Active series: ~28 million

Total datapoints: ~17 trillion

Ingestion rate: ~1.6 million samples per second

Disk usage: ~8 TB

Average query rate: ~450 queries/s

Query latency: median ~300 ms, p99 ~200 ms
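
A few derived figures help put these numbers in context. The calculation below uses only the values reported above and assumes decimal terabytes; whether the disk figure covers one replica or all replicas is not stated in the source, so the bytes-per-sample result is a rough indication only.

```python
ACTIVE_SERIES = 28e6       # ~28 million active series
TOTAL_DATAPOINTS = 17e12   # ~17 trillion datapoints
INGEST_PER_SEC = 1.6e6     # ~1.6 million samples per second
DISK_BYTES = 8e12          # ~8 TB, assuming 1 TB = 10**12 bytes


def derived_figures():
    """Compute back-of-the-envelope figures from the reported metrics."""
    return {
        # Samples per minute, comparable to the ~100 million cited earlier.
        "samples_per_minute": INGEST_PER_SEC * 60,
        # Average on-disk bytes per stored datapoint.
        "bytes_per_sample": DISK_BYTES / TOTAL_DATAPOINTS,
        # Average seconds between samples of one series.
        "avg_sample_interval_s": ACTIVE_SERIES / INGEST_PER_SEC,
        # Days of data at the current ingestion rate.
        "days_at_current_rate": TOTAL_DATAPOINTS / INGEST_PER_SEC / 86400,
    }


if __name__ == "__main__":
    for key, value in derived_figures().items():
        print(f"{key}: {value:,.2f}")
```

Under these assumptions the figures come to about 96 million samples per minute, roughly 0.47 bytes per sample, an average sample interval of 17.5 seconds, and about 123 days of data at the current rate.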

Planned optimizations

Implement down-sampling via vmalert recording rules, since the open-source edition of VM has no native down-sampling.
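
Down-sampling with a recording rule means periodically evaluating an aggregate over a window (for example `avg_over_time(some_metric[5m])`, an illustrative expression not taken from the source) and storing the result as a new, sparser series. The function below sketches the effect on one series; the window alignment is simplified and does not exactly match PromQL's lookbehind semantics.

```python
from collections import defaultdict


def downsample_avg(samples, window_s):
    """Average raw (timestamp_s, value) samples into fixed windows.

    Each output point is stamped with the end of its window, mimicking a
    rule evaluated once per window.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // window_s].append(value)
    return [
        ((bucket + 1) * window_s, sum(values) / len(values))
        for bucket, values in sorted(buckets.items())
    ]


if __name__ == "__main__":
    raw = [(0, 1.0), (15, 3.0), (300, 10.0)]
    print(downsample_avg(raw, 300))
```

With a 15-second scrape interval and a 5-minute window, each stored point replaces about 20 raw samples, at the cost of losing detail inside the window.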

Perform metric governance to drop unused metrics and further reduce resource consumption.
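
One simple starting point for metric governance is to compare the metric names held in storage with the names that appear in dashboard and rule expressions. The sketch below does this with a plain token scan rather than a real PromQL parser, so it is only a heuristic: metrics selected through regex matchers on `__name__` would be wrongly reported as unused. All names in the example are hypothetical.

```python
import re

# Matches PromQL-style identifiers (metric names, functions, label names).
_IDENT = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def unused_metrics(stored_names, expressions):
    """Return stored metric names that appear in none of the expressions."""
    referenced = set()
    for expr in expressions:
        referenced.update(_IDENT.findall(expr))
    return sorted(set(stored_names) - referenced)


if __name__ == "__main__":
    stored = ["up", "http_requests_total", "go_gc_duration_seconds"]
    queries = ["sum(rate(http_requests_total[5m])) by (job)", "up == 0"]
    print(unused_metrics(stored, queries))
```

Candidates found this way should be reviewed before being dropped, for example through relabeling rules in the collector.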

Conclusion

VictoriaMetrics proved to be a cost‑effective, high‑performance replacement for Prometheus at massive scale, offering better resource efficiency, horizontal scalability, and seamless migration with minimal operational changes.
