How Qunar Scaled Container Monitoring with VictoriaMetrics: A Cloud‑Native Case Study
This article details Qunar's migration from a Prometheus‑based monitoring stack to VictoriaMetrics, describing the limitations they faced, the architectural redesign using vmagent, vmcluster, and vmalert, and the resulting performance improvements and operational benefits for large‑scale Kubernetes environments.
Overview
In cloud‑native environments with high elasticity, micro‑service fragmentation, and dynamic application lifecycles, traditional monitoring tools such as Cacti, Nagios, and Zabbix lose their effectiveness. Qunar initially built a one‑stop monitoring platform based on Graphite + Whisper for storage and Grafana for the UI, but later evaluated Prometheus as an all‑in‑one solution.
Problems with Prometheus in Qunar
Data ingestion volume reaches nearly 100 million container metrics per minute per cluster.
Prometheus does not support horizontal scaling.
Only single‑node deployment is possible; clustering is not supported.
It is unsuitable for long‑term data storage.
High resource consumption.
Query performance degrades with large‑range or inefficient queries, often causing OOM.
Attempted Mitigations
Sharding Prometheus by service dimension, which complicates alert rules because each shard sees a different set of targets.
Running two Prometheus instances behind a load balancer for high availability; however, an instance that goes down has a gap in its data for that period, so queries routed to it can return incomplete results.
Using Promscale as remote storage for long‑term data, which consumes excessive CPU, memory, and disk resources at Qunar's scale.
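The sharding problem can be illustrated with a small sketch. The service names, instance addresses, and shard names below are hypothetical and not taken from Qunar's setup; the point is only that a rule evaluated inside one shard sees a subset of targets, so any "global" rule has to be rewritten per shard or evaluated somewhere that sees all shards.

```python
from collections import defaultdict

# Hypothetical targets as (service, instance) pairs.
TARGETS = [
    ("flight-search", "10.0.0.1:9100"),
    ("flight-search", "10.0.0.2:9100"),
    ("hotel-booking", "10.0.1.1:9100"),
    ("payment", "10.0.2.1:9100"),
]

# Hypothetical static mapping from service to Prometheus shard.
SHARD_OF_SERVICE = {
    "flight-search": "prom-a",
    "hotel-booking": "prom-b",
    "payment": "prom-b",
}


def targets_by_shard(targets, shard_of_service):
    """Group targets by the shard that scrapes their service."""
    shards = defaultdict(list)
    for service, instance in targets:
        shards[shard_of_service[service]].append((service, instance))
    return dict(shards)


def count_targets(visible_targets):
    """Stand-in for a global rule such as count(up): counts only what the evaluator sees."""
    return len(visible_targets)


shards = targets_by_shard(TARGETS, SHARD_OF_SERVICE)
per_shard_counts = {name: count_targets(ts) for name, ts in shards.items()}
global_count = count_targets(TARGETS)

print(per_shard_counts)  # each shard reports only its own targets
print(global_count)      # the true fleet-wide count
```

No single shard reproduces the fleet-wide number, which is why thresholds written against the whole fleet stop working once rules are evaluated per shard.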
Introducing VictoriaMetrics
VictoriaMetrics (VM) is a fast, cost‑effective, and scalable time‑series database that can serve as a remote write target for Prometheus or replace Prometheus entirely. It is compatible with Prometheus configuration files, the Prometheus HTTP API, PromQL, exporters, and service discovery, and ranks among the top 15 time‑series databases on DB‑Engines.
Key Features of VictoriaMetrics
Compatible with PromQL and offers an enhanced MetricsQL.
Works with Grafana's Prometheus datasource because it implements the Prometheus API.
Higher query performance than Prometheus.
Uses 5× less memory than Prometheus and 28× less than Promscale.
Uses 7× less disk space than Prometheus and 92× less than Promscale.
Cluster edition supports horizontal scaling, multi‑replica storage, and multi‑tenant isolation.
Architecture
VM can be deployed in two ways:
Single‑server (All‑in‑One) mode with a Docker image, capable of handling up to 1 million data points per second.
Cluster mode, which splits the system into three services: the stateless vminsert (ingestion) and vmselect (query), and the stateful vmstorage (storage). Each service can be scaled horizontally and independently.
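As a rough illustration of how clients address the cluster components, the sketch below builds the write and query URLs. It assumes the default ports (8480 for vminsert, 8481 for vmselect) and the tenant-prefixed URL layout of the open-source cluster version; the host names and the tenant ID 0 are placeholders, not Qunar's actual values.

```python
from urllib.parse import urlencode


def vminsert_write_url(host, account_id, port=8480):
    """Prometheus remote-write endpoint exposed by vminsert for one tenant."""
    return f"http://{host}:{port}/insert/{account_id}/prometheus/api/v1/write"


def vmselect_query_url(host, account_id, promql, port=8481):
    """Prometheus-compatible instant-query endpoint exposed by vmselect for one tenant."""
    query = urlencode({"query": promql})
    return f"http://{host}:{port}/select/{account_id}/prometheus/api/v1/query?{query}"


# Placeholder host names; tenant 0 is used when multi-tenancy is not needed.
print(vminsert_write_url("vminsert.example", 0))
print(vmselect_query_url("vmselect.example", 0, "up"))
```

Because the query path follows the Prometheus API, the vmselect URL up to `/prometheus` is what a Grafana Prometheus datasource would point at.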
Qunar's single cluster stores about 17 trillion data points in total and uses the VMCluster solution.
Deployment at Qunar
Metrics collection is performed by vmagent, partitioned by service dimension with double‑replica deployment for HA.
Data is stored in a VMCluster; each cluster runs its own set of components, isolated by labels, tolerations, and podAntiAffinity.
Promxy aggregates all clusters and serves as the unified query entry point.
Alerting is handled by vmalert, integrated with Qunar's existing alert center via custom Rule Manager and Prometheus Manager modules.
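The double-replica collection scheme means the same targets are scraped twice, so the storage side has to collapse duplicates while still filling the gap when one replica is down. The following is a simplified sketch of that idea, not VictoriaMetrics' actual deduplication algorithm; the series name and sample values are made up for illustration.

```python
def merge_replica_samples(*replica_streams):
    """Merge samples from replica collectors, keeping one value per (series, timestamp).

    Each stream is an iterable of (series, timestamp, value) tuples.
    The first replica to report a given (series, timestamp) wins.
    """
    merged = {}
    for stream in replica_streams:
        for series, ts, value in stream:
            merged.setdefault((series, ts), value)
    return sorted((s, t, v) for (s, t), v in merged.items())


# Hypothetical samples: replica A misses the scrape at t=30 (e.g. it was restarting).
replica_a = [("cpu_usage", 0, 0.40), ("cpu_usage", 15, 0.42)]
replica_b = [("cpu_usage", 0, 0.40), ("cpu_usage", 15, 0.42), ("cpu_usage", 30, 0.45)]

merged = merge_replica_samples(replica_a, replica_b)
print(merged)
```

The merged stream contains each sample once and has no gap at t=30, which is the property the double-replica deployment is meant to provide.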
Performance After Migration
Active time series: ~28 million.
Datapoints stored: ~17 trillion.
Ingestion rate: ~1.6 million samples/s.
Disk usage: ~8 TB.
Average query rate: ~450 queries/s.
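Query latency: ~300 ms median and ~200 ms p99 as reported; since a p99 cannot be lower than the median, the two figures are probably transposed.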
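A few derived figures help put these numbers in context. The calculation below is a back-of-the-envelope sketch using only the values listed above; it assumes decimal terabytes and a constant ingestion rate, so the results are rough estimates rather than measured values.

```python
ACTIVE_SERIES = 28e6      # ~28 million active time series
DATAPOINTS = 17e12        # ~17 trillion stored datapoints
INGEST_PER_SEC = 1.6e6    # ~1.6 million samples per second
DISK_BYTES = 8e12         # ~8 TB, assuming 1 TB = 10^12 bytes

# Average on-disk cost of one datapoint after compression.
bytes_per_datapoint = DISK_BYTES / DATAPOINTS

# Average interval between samples of a single series.
avg_sample_interval_s = ACTIVE_SERIES / INGEST_PER_SEC

# How many days of data the stored datapoints represent at this ingestion rate.
implied_retention_days = DATAPOINTS / INGEST_PER_SEC / 86_400

# Samples per minute, for comparison with the per-minute figure quoted earlier.
samples_per_minute = INGEST_PER_SEC * 60

print(f"{bytes_per_datapoint:.2f} bytes per datapoint")
print(f"{avg_sample_interval_s:.1f} s average sample interval")
print(f"{implied_retention_days:.0f} days of data at the current rate")
print(f"{samples_per_minute / 1e6:.0f} million samples per minute")
```

Under these assumptions the figures work out to roughly 0.47 bytes per datapoint, a 17.5 s average sample interval, about 123 days of data, and 96 million samples per minute, which is consistent with the "nearly 100 million container metrics per minute" cited for a single cluster.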
Future Optimizations
The open‑source edition of VM lacks downsampling; Qunar plans to use vmalert recording rules to emulate it.
Implement metric governance to drop unused metrics (e.g., from etcd, node‑exporter) and reduce monitoring overhead.
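What a recording rule does when it emulates downsampling can be sketched in a few lines: raw samples are aggregated over a fixed window and the aggregate is written back as a new, lower-resolution series. The code below is a minimal illustration of that aggregation, assuming an average over 5-minute buckets; the sample data is hypothetical, and a real rule would express the same idea in PromQL/MetricsQL rather than Python.

```python
from collections import defaultdict


def downsample_avg(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width buckets keyed by bucket start."""
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - ts % bucket_seconds
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical raw samples scraped every 15 s (timestamps in seconds).
raw = [(0, 1.0), (15, 3.0), (300, 10.0), (315, 20.0)]

downsampled = downsample_avg(raw, bucket_seconds=300)
print(downsampled)
```

The trade-off is that every downsampled series has to be declared explicitly as a rule, and dashboards must query the new series name for long time ranges.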
Conclusion
VictoriaMetrics addresses Prometheus' scalability and storage shortcomings, offering a drop‑in compatible, high‑performance alternative that Qunar successfully deployed at massive scale. Organizations with similar large‑scale monitoring needs can consider VM as a “Prometheus Enterprise” or “Prometheus Plus” solution, while remembering that any architecture must evolve with changing requirements.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
