How Qunar Scaled Container Monitoring with VictoriaMetrics: Lessons from Replacing Prometheus
This article details Qunar's migration from Prometheus to VictoriaMetrics for large‑scale container monitoring, covering the shortcomings of Prometheus at massive data volumes, the architectural choices made, performance improvements achieved, and future optimization plans.
Background
In cloud‑native environments with high elasticity, micro‑service splitting and dynamic lifecycles, traditional monitoring systems (e.g., Cacti, Nagios, Zabbix) cannot meet the requirements. Prometheus became the de‑facto standard, but at Qunar’s scale it exhibited critical limitations.
Prometheus limitations
Data ingestion reaches ~100 million samples per minute per cluster, creating severe bottlenecks.
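The all‑in‑one design (scraping, storage and querying in a single process) prevents horizontal scaling; the load of one instance cannot be spread across nodes without manual sharding.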
High CPU, memory and storage consumption; unsuitable for long‑term retention.
Queries load data from disk into memory, causing OOM on large range queries.
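To make the last point concrete, here is a rough back-of-the-envelope sketch of how many raw samples a wide range query may have to touch. The series count, scrape interval and the 16 bytes per decoded sample (8-byte timestamp plus 8-byte float) are illustrative assumptions, not figures from Qunar, and the result ignores any optimization the query engine applies.

```python
def samples_in_range(series: int, range_seconds: int, scrape_interval_seconds: int) -> int:
    """Raw samples covered by a range query: one sample per series per scrape."""
    return series * (range_seconds // scrape_interval_seconds)


def decoded_bytes(samples: int, bytes_per_sample: int = 16) -> int:
    """Approximate decoded size, assuming an 8-byte timestamp and an 8-byte value."""
    return samples * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical query: 100,000 series over 7 days at a 30 s scrape interval.
    n = samples_in_range(series=100_000, range_seconds=7 * 24 * 3600,
                         scrape_interval_seconds=30)
    print(f"{n:,} samples, roughly {decoded_bytes(n) / 1e9:.1f} GB if fully decoded")
```

Even with generous assumptions, a query of this shape reaches tens of gigabytes, which is why wide range queries are a common source of out-of-memory failures.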
Mitigation attempts
Sharding Prometheus by service dimension.
Running two Prometheus instances behind a load balancer for high availability.
Using Promscale as remote storage for long‑term data.
These measures reduced some pressure but introduced data loss on node failure and still required excessive resources.
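The source does not say how services were assigned to Prometheus shards. One common approach is a stable hash of the service name modulo the shard count; the sketch below assumes that scheme, and the service names are hypothetical.

```python
import hashlib


def shard_for(service: str, num_shards: int) -> int:
    """Map a service name to a shard index using a stable (non-random) hash."""
    digest = hashlib.md5(service.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign(services, num_shards):
    """Group services by the shard that would scrape them."""
    shards = {i: [] for i in range(num_shards)}
    for service in services:
        shards[shard_for(service, num_shards)].append(service)
    return shards


if __name__ == "__main__":
    # Hypothetical service names, for illustration only.
    print(assign(["flight", "hotel", "train", "pay"], num_shards=3))
```

Each service lives on exactly one shard in this scheme, so losing a shard without a replica loses that service's data, which matches the failure mode described above.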
Why VictoriaMetrics
VictoriaMetrics (VM) was selected because it offers full Prometheus‑compatible APIs, dramatically lower memory and disk footprints, superior query performance, and a cluster mode that scales horizontally.
Key features of VictoriaMetrics
Supports PromQL and extends it with MetricsQL, a largely backward‑compatible query language with additional functions.
Works directly with Grafana via the Prometheus data source.
Query performance is reported to be better than Prometheus.
Memory usage ~5× lower than Prometheus and ~28× lower than Promscale.
Disk usage ~7× lower than Prometheus and ~92× lower than Promscale, thanks to better compression.
Cluster version provides horizontal scaling, multi‑replica storage, and multi‑tenant support.
VM deployment modes
Single‑server “All‑in‑One” Docker image; a single VM instance can handle up to 1 million data points per second.
Cluster mode composed of three services that can be scaled independently: vminsert and vmselect, which are stateless, and vmstorage, which is stateful. vmstorage nodes use a shared‑nothing design (they do not communicate with each other), which supports high availability.
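In cluster mode, writes go to vminsert and reads go to vmselect, with the tenant (account ID) encoded in the URL path. The sketch below builds both URLs, assuming the default ports 8480 (vminsert) and 8481 (vmselect); the host names and the query are placeholders.

```python
from urllib.parse import urlencode


def insert_url(vminsert_host: str, account_id: int = 0) -> str:
    """Prometheus remote_write endpoint exposed by vminsert for one tenant."""
    return f"http://{vminsert_host}:8480/insert/{account_id}/prometheus/api/v1/write"


def query_range_url(vmselect_host: str, query: str, start: int, end: int,
                    step: str, account_id: int = 0) -> str:
    """Prometheus-compatible range query endpoint exposed by vmselect."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return (f"http://{vmselect_host}:8481/select/{account_id}"
            f"/prometheus/api/v1/query_range?{params}")


if __name__ == "__main__":
    # Placeholder hosts; no request is sent, the URLs are only printed.
    print(insert_url("vminsert.example"))
    print(query_range_url("vmselect.example", "up", 0, 60, "15s"))
```

The vmselect base URL (up to `/prometheus`) is what would be entered in a Grafana Prometheus data source.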
Supporting components
vmagent: lightweight collector that scrapes exporters, applies relabeling, supports Kafka integration, and writes to VM via remote_write. It can be deployed in HA with duplicate instances.
vmalert: evaluates alerting and recording rules, forwards alerts to Alertmanager, and can remote‑write recording results to VM.
VM Operator CRDs: VMCluster, VMAgent, VMServiceScrape, VMPodScrape, VMRule and VMProbe enable declarative management of the entire stack without modifying existing Prometheus resources.
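Running duplicate vmagent instances means every sample arrives twice, so the storage side has to deduplicate. VictoriaMetrics offers the `-dedup.minScrapeInterval` flag for this, keeping a single sample per series per interval. The following is a simplified model of that idea, not the actual implementation; the exact bucket boundaries and tie-breaking in VictoriaMetrics may differ.

```python
def deduplicate(samples, interval_ms):
    """Keep the sample with the largest timestamp in each interval.

    samples: iterable of (timestamp_ms, value) pairs for ONE series,
    possibly containing near-duplicates written by two HA collectors.
    """
    kept = {}
    for ts, value in sorted(samples):
        bucket = ts // interval_ms      # index of the interval this sample falls in
        kept[bucket] = (ts, value)      # later samples overwrite earlier ones
    return [kept[b] for b in sorted(kept)]


if __name__ == "__main__":
    # Two hypothetical agents scraping the same target 200 ms apart, every 30 s.
    raw = [(1_000, 1.0), (1_200, 1.0), (31_000, 2.0), (31_200, 2.0)]
    print(deduplicate(raw, interval_ms=30_000))
```

For this to work, both agents must attach identical labels to the series; otherwise the copies are stored as distinct series and nothing is deduplicated.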
Qunar production architecture
Qunar runs one VMCluster per data center. vmagent instances are sharded by service and duplicated for HA; each instance remote‑writes to the same VMCluster, which is configured with multi‑replica storage for resilience. Promxy aggregates queries across clusters, providing a single query endpoint. vmalert together with custom Rule/Prometheus managers handles alert evaluation and forwarding.
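A per-data-center cluster with multi-replica storage could be declared through the operator's VMCluster resource. The sketch below generates such a manifest as JSON (which `kubectl apply -f` accepts). The name, replica counts and retention value are illustrative assumptions, not Qunar's actual settings.

```python
import json


def vmcluster_manifest(name: str, replication_factor: int, storage_replicas: int,
                       select_replicas: int, insert_replicas: int,
                       retention: str = "1") -> dict:
    """Build a minimal VMCluster manifest for the VictoriaMetrics operator."""
    if storage_replicas < replication_factor:
        # Each copy of a sample must land on a different vmstorage node.
        raise ValueError("need at least as many vmstorage nodes as replicas")
    return {
        "apiVersion": "operator.victoriametrics.com/v1beta1",
        "kind": "VMCluster",
        "metadata": {"name": name},
        "spec": {
            "retentionPeriod": retention,            # assumed value; months by default
            "replicationFactor": replication_factor,
            "vmstorage": {"replicaCount": storage_replicas},
            "vmselect": {"replicaCount": select_replicas},
            "vminsert": {"replicaCount": insert_replicas},
        },
    }


if __name__ == "__main__":
    # "dc1" is a hypothetical data-center name.
    print(json.dumps(vmcluster_manifest("dc1", 2, 3, 2, 2), indent=2))
```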
Observed metrics after migration
Active series: ~28 million
Total datapoints: ~17 trillion
Ingestion rate: ~1.6 million samples per second
Disk usage: ~8 TB
Average query rate: ~450 queries/s
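Query latency: median ~300 ms, p99 ~200 ms (as reported; a p99 cannot be lower than the median of the same distribution, so the two figures were probably swapped or measured over different query sets)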
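A few derived numbers help sanity-check the reported figures. The sketch below uses only the values listed above and assumes decimal terabytes (10^12 bytes); the derived results are estimates, not measurements from Qunar.

```python
# Figures reported in the list above.
ingestion_per_second = 1.6e6   # samples per second
total_datapoints = 17e12       # datapoints stored
disk_bytes = 8e12              # ~8 TB, assuming 1 TB = 10**12 bytes

# Samples per minute, for comparison with the ~100 million/minute quoted earlier.
ingestion_per_minute = ingestion_per_second * 60

# Average on-disk cost of one datapoint.
bytes_per_datapoint = disk_bytes / total_datapoints

# How long it would take to accumulate the stored data at the current rate.
days_of_data = total_datapoints / ingestion_per_second / 86_400

if __name__ == "__main__":
    print(f"{ingestion_per_minute:.3g} samples/min")
    print(f"{bytes_per_datapoint:.2f} bytes per datapoint")
    print(f"{days_of_data:.0f} days of data at the current rate")
```

The per-minute rate (about 96 million) agrees with the roughly 100 million samples per minute cited for Prometheus, and the storage cost works out to under half a byte per datapoint.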
Planned optimizations
Implement down‑sampling via vmalert recording rules (the open‑source edition of VM lacks native down‑sampling).
Perform metric governance to drop unused metrics and further reduce resource consumption.
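Down-sampling with recording rules amounts to periodically evaluating an aggregate over a window and storing the result as a new, coarser series. The sketch below generates a rule group in the Prometheus-compatible rule format that vmalert reads. The metric name, window and naming scheme are illustrative assumptions, and `avg_over_time` suits gauges only (counters would need a different expression).

```python
import json


def downsample_rule(metric: str, window: str = "5m") -> dict:
    """Recording rule that stores the windowed average of a gauge metric."""
    return {
        "record": f"{metric}:avg_{window}",
        "expr": f"avg_over_time({metric}[{window}])",
    }


def rule_group(metrics, window: str = "5m") -> dict:
    """Rule group evaluated once per window, one rule per metric."""
    return {
        "groups": [{
            "name": f"downsample_{window}",
            "interval": window,
            "rules": [downsample_rule(m, window) for m in metrics],
        }]
    }


if __name__ == "__main__":
    # Example gauge metric; the choice of metric is an assumption.
    print(json.dumps(rule_group(["container_memory_working_set_bytes"]), indent=2))
```

The printed JSON can usually be saved as a rules file as-is, since YAML parsers generally accept JSON. Dashboards that cover long time ranges would then query the `:avg_5m` series instead of the raw one.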
Conclusion
VictoriaMetrics proved to be a cost‑effective, high‑performance replacement for Prometheus at massive scale, offering better resource efficiency, horizontal scalability, and seamless migration with minimal operational changes.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
