
Qunar’s Experience Replacing Prometheus with VictoriaMetrics for Cloud‑Native Container Monitoring

This article details Qunar’s migration from a traditional Prometheus‑based monitoring stack to VictoriaMetrics, describing the challenges of large‑scale container metrics collection, the architectural redesign using VM‑Cluster, vmagent, and vmalert, and the performance improvements achieved after full replacement.

Qunar Tech Salon

Author: Wang Kun, senior system operations engineer at Qunar (joined 2020), member of the Qunar Cloud‑Native SIG and Infrastructure SIG, responsible for Kubernetes and container metrics monitoring.

Overview: In cloud‑native environments with high elasticity, micro‑service decomposition, and dynamic application lifecycles, traditional monitoring systems such as Cacti, Nagios, and Zabbix lose their advantages. Prometheus has become the de‑facto standard for cloud‑native monitoring, but it also has limitations at massive scale.

Qunar previously built a one‑stop monitoring platform, Watcher, backed by Graphite+Whisper for storage and Grafana for the UI. Although Grafana supports multiple data sources, including Prometheus, the initial plan to adopt Prometheus during containerization ran into scalability issues.

Problems with Prometheus (All‑in‑One design):

Qunar's ingestion volume reaches ~100 million samples per minute per cluster, far beyond what a single Prometheus instance handles comfortably.

No native horizontal scaling.

Only single‑node deployment; cannot split components into separate services.

Unsuitable for long‑term data storage.

High resource consumption.

Low query efficiency; large or range queries cause high memory usage and may trigger OOM.

These issues are critical for Qunar’s massive data scale, prompting several mitigation attempts:

Sharding Prometheus by service dimension, which complicates alert rules.

Running two Prometheus instances behind a load balancer for HA.

Using Promscale (based on TimescaleDB) as remote storage for long‑term data.

Even after these measures, limitations remained, such as data loss during instance failures, insufficient resource savings, and excessive storage requirements (e.g., 30 billion samples per day consuming ~16 CPU cores, 40 GB RAM, 150 GB disk).
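A quick back-of-the-envelope check makes the scale of those numbers concrete (assuming the ~150 GB disk figure is the daily growth for that workload):

```python
# Sanity-check the Promscale-era numbers quoted above.
samples_per_day = 30_000_000_000          # ~30 billion samples/day
seconds_per_day = 24 * 60 * 60

ingest_rate = samples_per_day / seconds_per_day        # sustained samples/s
bytes_per_sample = 150 * 1024**3 / samples_per_day     # from ~150 GB of disk

print(f"~{ingest_rate:,.0f} samples/s, ~{bytes_per_sample:.1f} bytes/sample")
# roughly 347,000 samples/s sustained, at ~5.4 bytes stored per sample
```

Even at ~5 bytes per stored sample, 30 billion daily samples add up quickly, which is why the later VM compression numbers matter so much.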

Introducing VictoriaMetrics (VM): VM is a fast, cost‑effective, and scalable time‑series database that can serve as Prometheus remote‑write storage or fully replace Prometheus. It is compatible with Prometheus Config, API, PromQL, exporters, and service discovery, offering low migration cost.
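Because VM speaks the standard Prometheus remote‑write protocol, pointing an existing Prometheus at it is a one‑line configuration change. A minimal sketch (the hostnames and the tenant ID `0` are placeholders for illustration):

```yaml
# prometheus.yml — forward a copy of all scraped samples to VictoriaMetrics.
remote_write:
  # Single-node VM:
  - url: http://vm-single:8428/api/v1/write
  # Or the cluster edition, via vminsert (tenant "0"):
  # - url: http://vminsert:8480/insert/0/prometheus/api/v1/write
```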

Key VM features:

Compatible with PromQL and provides an enhanced MetricsQL.

Works directly with Grafana’s Prometheus data source.

Higher query performance than Prometheus.

Memory usage is 5× lower than Prometheus and 28× lower than Promscale.

Disk space usage is 7× lower than Prometheus and 92× lower than Promscale, thanks to stronger compression.

Cluster edition supports horizontal scaling, multi‑replica storage, and multi‑tenant isolation.

VM offers two deployment modes:

VM‑Single (All‑in‑One): a single‑node instance shipped as a single binary or Docker image, capable of handling up to 1 million data points per second.

VM‑Cluster: Consists of three services—vmstorage (stateful storage), vminsert (writes data using consistent hashing), and vmselect (queries data). All services are stateless except vmstorage, enabling independent horizontal scaling.
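The routing idea behind vminsert can be illustrated with a toy sketch: hash the full series identity (metric name plus sorted labels) onto one of N vmstorage nodes, so the same series always lands on the same node. The real vminsert uses a more sophisticated consistent-hashing scheme; this plain hash-mod-N version is only an illustration of the concept:

```python
import hashlib

def shard_for_series(metric_name: str, labels: dict, num_storage_nodes: int) -> int:
    """Pick a storage node for a time series (simplified illustration of
    vminsert-style routing: identical series always map to the same node)."""
    # Sort labels so the key is independent of label ordering.
    key = metric_name + "".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_storage_nodes

# The same series is always routed to the same vmstorage node:
a = shard_for_series("http_requests_total", {"job": "api", "instance": "10.0.0.1"}, 4)
b = shard_for_series("http_requests_total", {"instance": "10.0.0.1", "job": "api"}, 4)
assert a == b and 0 <= a < 4
```

Deterministic routing is what lets vmselect know which storage nodes hold a given series, while the stateless vminsert/vmselect layers scale out freely.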

A single Qunar cluster holds roughly 17 trillion data points in total, so Qunar adopted the VM‑Cluster mode.

Additional VM components:

vmagent: Lightweight metric collector that can replace Prometheus for scraping exporters and supports remote_write to any Prometheus‑compatible storage.
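A vmagent deployment can typically reuse an existing Prometheus scrape config unchanged and forward everything via remote_write. A hedged sketch (the flags are real vmagent options; the paths and URLs are placeholders):

```shell
vmagent \
  -promscrape.config=/etc/prometheus/prometheus.yml \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write
```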

vmalert: Handles alert and recording rules, compatible with Prometheus rule format, and integrates with Alertmanager.
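Because vmalert consumes rules in the exact Prometheus format, existing rule files can be reused as-is. A minimal illustrative alert rule (the name and threshold here are hypothetical, not Qunar's actual rules):

```yaml
groups:
  - name: container-alerts
    rules:
      - alert: HighPodMemory
        expr: container_memory_working_set_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} memory above 90% of its limit"
```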

Both vmagent and vmalert are managed via the VM‑Operator CRDs (VMCluster, VMAgent, VMServiceScrape, VMPodScrape, VMRule, VMProbe), allowing seamless migration from existing Prometheus resources.
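With the operator installed, scrape targets are declared as Kubernetes CRDs instead of raw scrape configs. An illustrative VMServiceScrape (the label selector and port name are placeholders):

```yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: http-metrics
      path: /metrics
```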

Qunar’s VM Architecture:

Metrics are collected by vmagent instances grouped by service dimension, each deployed with dual replicas for HA.

Data is stored in VM‑Cluster; each cluster runs a dedicated set of pods with anti‑affinity to ensure node isolation.

Promxy aggregates all clusters, providing a unified query endpoint.

Watcher’s Prometheus data source points to Promxy.

Alerting uses vmalert, with custom Rule Manager and Prometheus Manager modules for rule synchronization and alert state handling.
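In this layout, Promxy fans each PromQL query out to every VM cluster and merges the results, so Grafana only needs one data source. A hedged sketch of a Promxy server-group config (hostnames and the tenant prefix are placeholders):

```yaml
promxy:
  server_groups:
    - static_configs:
        - targets:
            - vmselect-cluster-a:8481
      path_prefix: /select/0/prometheus
    - static_configs:
        - targets:
            - vmselect-cluster-b:8481
      path_prefix: /select/0/prometheus
```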

After fully replacing Prometheus, Qunar observed the following performance metrics (per cluster):

Active time series: ~28 million

Total data points: ~17 trillion

Ingestion rate: ~1.6 million samples/s

Disk usage: ~8 TB

Average query rate: ~450 queries/s

Query duration (median / p99): ~300 ms / ~200 ms

Planned future optimizations include:

Implementing down‑sampling via vmalert recording rules (open‑source VM lacks native down‑sampling).

Metric governance to prune unused metrics (e.g., from etcd, node‑exporter) and reduce monitoring overhead.
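Down‑sampling via recording rules works by periodically materializing a coarser series from the raw one, after which the raw data can be kept with a shorter retention. An illustrative recording rule (the metric and interval are hypothetical):

```yaml
groups:
  - name: downsample-5m
    interval: 5m
    rules:
      - record: node_cpu_seconds_total:avg_5m
        expr: avg_over_time(node_cpu_seconds_total[5m])
```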

Conclusion: VictoriaMetrics addresses Prometheus’s scalability and resource challenges, offering a viable replacement for large‑scale container monitoring. Qunar’s migration demonstrates significant improvements in storage efficiency, query performance, and operational simplicity, and the approach can be considered by other organizations with similar monitoring demands.

Tags: monitoring, cloud native, Kubernetes, Prometheus, time-series database, VictoriaMetrics
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
