How to Keep VictoriaMetrics Stable During Sudden Metric Surges
This article outlines practical strategies for protecting VictoriaMetrics storage under bursty metric traffic, covering communication with business teams, splitting deployments, choosing single‑node versus cluster setups, key monitoring metrics, separate storage for self‑monitoring, the VMUI Explore UI, and techniques for discarding high‑cardinality metrics.
1. Communicate with Business
Frequent, unexpected spikes in metric volume often stem from testing environments or ad‑hoc deployments. Business teams should tell the platform team in advance when they plan to generate large numbers of new metrics; otherwise, the storage layer has to implement its own safeguards.
2. Split VictoriaMetrics by Business
If the total ingest rate stays below 50 k samples per second, a single‑node VictoriaMetrics instance is sufficient. When ingest exceeds several hundred thousand samples per second, partitioning into multiple instances reduces the failure domain.
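The sizing thresholds above are easier to apply once an ingest rate is translated into a host count. The following is a minimal sketch of that arithmetic, assuming every series yields exactly one sample per scrape; the function names and the figures of 200 series per host and a 15-second scrape interval are illustrative, not part of VictoriaMetrics.

```python
def samples_per_second(hosts: int, series_per_host: int, scrape_interval_s: float) -> float:
    """Ingest rate when each series produces one sample per scrape."""
    return hosts * series_per_host / scrape_interval_s


def max_hosts(target_rate: int, series_per_host: int, scrape_interval_s: int) -> int:
    """Largest host count whose combined ingest rate stays at or below target_rate."""
    return (target_rate * scrape_interval_s) // series_per_host


if __name__ == "__main__":
    # Illustrative inputs: 200 series per host, 15 s scrape interval.
    print(max_hosts(50_000, 200, 15))           # 3750
    print(samples_per_second(3_750, 200, 15))   # 50000.0
```

Real fleets rarely emit a uniform number of series per host, so treat the result as a rough planning figure rather than a capacity guarantee.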
For example, with a 15‑second scrape interval, 50 000 samples per second correspond to monitoring roughly 3 750 machines (200 standard OS metrics per host):
50000 / (200 / 15) = 3750
Official benchmark data from VictoriaMetrics shows a single node handling up to 10 M active series, 800 K samples/sec ingestion, and more than 2 trillion datapoints, consuming about 50 GB of RAM.
Here are some numbers from our single-node VictoriaMetrics setup:
active time series: 10M
ingestion rate: 800K samples/sec
total number of datapoints: > 2 trillion
total number of entries in inverted index: > 1 billion
daily time series churn rate: 2.6M
data size on disk: 1.5 TB
index size on disk: 27 GB
average datapoint size on disk: 0.75 bytes
range query rate: 16 rps
instant query rate: 25 rps
range query duration: max 0.5s; median 0.05s; 97th percentile 0.29s
instant query duration: max 2.1s; median 0.04s; 97th percentile 0.15s
VictoriaMetrics consumes about 50GB of RAM.
3. Single‑Node vs Cluster Deployment
When ingest stays under 800 k samples/sec, a high‑spec single node (e.g., 128 CPU / 256 GB RAM) is often enough. To improve data reliability, store the data on durable cloud disks. If durable disks are unavailable, switch to the clustered version, which adds replicas for fault tolerance but incurs extra network overhead and slightly lower performance.
4. Key Metrics to Watch
Fundamental host metrics (CPU, memory, disk I/O) are always essential. In addition, monitor VictoriaMetrics‑specific metrics such as process CPU/memory, write latency, and the rate of newly created series.
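As a starting point for watching these signals, the sketch below builds instant-query URLs for a few of them. It assumes the Prometheus-compatible /api/v1/query endpoint and the metric names vm_rows_inserted_total and vm_new_timeseries_created_total; verify these names against the /metrics output of your own version. The base URL and the dictionary keys are hypothetical.

```python
from urllib.parse import urlencode

# Assumed metric names; check them against your instance's /metrics page.
WATCH_QUERIES = {
    "cpu_cores_used": "rate(process_cpu_seconds_total[5m])",
    "resident_memory_bytes": "process_resident_memory_bytes",
    "ingested_samples_per_sec": "sum(rate(vm_rows_inserted_total[5m]))",
    "new_series_per_sec": "sum(rate(vm_new_timeseries_created_total[5m]))",
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus-style instant query URL for the given expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical address; 8428 is the usual single-node port.
    for name, promql in WATCH_QUERIES.items():
        print(name, instant_query_url("http://localhost:8428", promql))
```

A sudden jump in the new-series rate is usually the earliest sign of a cardinality surge, so it is the query most worth alerting on.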
Official single‑node Grafana dashboard: https://grafana.com/grafana/dashboards/10229
Official cluster dashboard: https://grafana.com/grafana/dashboards/11176
Community cluster dashboard: https://grafana.com/grafana/dashboards/11831
Alerting rules are provided in the repository:
https://github.com/VictoriaMetrics/VictoriaMetrics/tree/master/deployment/docker/rules
5. Store Monitoring Data Separately
Run a lightweight Prometheus or a dedicated single‑node VictoriaMetrics instance to store the metrics of the monitoring system itself. This isolation prevents a failure in the production time‑series database from affecting the observability of the monitoring stack.
6. VMUI Explore
VictoriaMetrics includes a simple UI (VMUI Explore) that allows ad‑hoc queries, high‑cardinality analysis, and detection of heavy queries.
7. Discard Unnecessary Metrics
VictoriaMetrics offers a Bloom‑filter based high‑cardinality limiter. The -storage.maxHourlySeries flag caps the number of new unique series that can be added in the last hour; excess series are logged and dropped.
-storage.maxHourlySeries int
The maximum number of unique series that can be added to storage during the last hour. Excess series are logged and dropped. Useful for limiting series cardinality. See https://docs.victoriametrics.com/#cardinality-limiter. Also see -storage.maxDailySeries.
Another useful flag is -dedup.minScrapeInterval, which keeps only the last sample per series within the specified interval, helping to eliminate duplicate points from HA scrapers.
-dedup.minScrapeInterval duration
Keep only the last sample in each time series per interval equal to -dedup.minScrapeInterval > 0. See https://docs.victoriametrics.com/#deduplication and https://docs.victoriametrics.com/#downsampling.
Adjust these settings based on your scrape interval (e.g., set -dedup.minScrapeInterval to 15s for a 15‑second scrape cadence) to reduce storage pressure while preserving essential data.
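To make the behaviour of the two flags concrete, here is a simplified simulation, not VictoriaMetrics' actual implementation. It swaps the Bloom filter for an exact Python set, and it assumes that the hourly window resets on the hour boundary and that deduplication buckets are aligned to multiples of the interval. The class and function names are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple


class HourlySeriesLimiter:
    """Admit at most max_series distinct series per hour (exact set, not a Bloom filter)."""

    def __init__(self, max_series: int) -> None:
        self.max_series = max_series
        self.current_hour = -1
        self.seen: Set[Hashable] = set()

    def allow(self, series: Hashable, unix_ts: int) -> bool:
        hour = unix_ts // 3600
        if hour != self.current_hour:
            # Assumed behaviour: a new hour starts with an empty tracker.
            self.current_hour = hour
            self.seen.clear()
        if series in self.seen:
            return True  # already-registered series are always accepted
        if len(self.seen) >= self.max_series:
            return False  # over the cap: the real system logs and drops the sample
        self.seen.add(series)
        return True


def dedup(samples: List[Tuple[int, float]], interval_s: int) -> List[Tuple[int, float]]:
    """Keep only the sample with the highest timestamp in each interval_s bucket."""
    last: Dict[int, Tuple[int, float]] = {}
    for ts, value in sorted(samples, key=lambda s: s[0]):
        last[ts // interval_s] = (ts, value)  # later samples overwrite earlier ones
    return [last[bucket] for bucket in sorted(last)]


if __name__ == "__main__":
    limiter = HourlySeriesLimiter(max_series=2)
    print([limiter.allow(s, 0) for s in ("a", "b", "a", "c")])  # [True, True, True, False]
    print(dedup([(0, 1.0), (5, 2.0), (15, 3.0), (20, 4.0)], 15))  # [(5, 2.0), (20, 4.0)]
```

The limiter example shows why existing series keep flowing during a surge: only series that would exceed the cap are rejected, so established dashboards remain intact while the burst is shed.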
Feel free to share feedback or additional tips in the comments.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
