How to Deploy VictoriaMetrics for High‑Performance Prometheus Remote Storage

This article walks through the challenges of scaling Prometheus storage, compares Thanos, Cortex, and VictoriaMetrics, and provides a complete step‑by‑step guide—including hardware requirements, configuration, deployment, tuning, multi‑tenant setup, and troubleshooting—to replace Prometheus local TSDB with VictoriaMetrics for long‑term, high‑performance monitoring.

Raymond Ops

Overview

Prometheus stores data locally and is designed for short‑term retention (15‑30 days). In large‑scale environments (200+ nodes, 50 k+ active series, ~500 k samples/s) a single Prometheus instance runs out of disk and memory, and native high‑availability is missing because each instance writes its own data. The article evaluates three alternatives—Thanos, Cortex/Mimir, and VictoriaMetrics—showing that VictoriaMetrics offers the simplest deployment, 7‑10× higher compression, millisecond query latency, and lower resource consumption.

In the author’s production cluster, VictoriaMetrics reduced storage cost by ~60 % and query latency from seconds to <200 ms.

Technical Characteristics

Compression : custom algorithm achieves 10:1–15:1 compression (e.g., 1 M series with 15 s interval for one month uses 20‑30 GB).

PromQL Compatibility : the built-in query language, MetricsQL, is backwards-compatible with PromQL apart from a few documented differences, and adds extensions such as range_median() and an optimized histogram_quantile().

Deployment : single binary for the standalone version; cluster mode consists of three components with no external dependencies: vminsert and vmselect are stateless, while vmstorage is stateful and holds the data.

Multi-tenant support : in cluster mode the tenant (account) ID is encoded in the URL path (e.g., /insert/0/… for the default tenant 0 used by single-tenant setups, /insert/1/… for tenant 1).
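As a rough illustration of the path scheme, the sketch below builds write and query URLs for one tenant. The helper name tenant_url is hypothetical, and the hosts and ports are borrowed from the cluster example later in this article; only the /insert/<id>/prometheus/… and /select/<id>/prometheus/… layout comes from VictoriaMetrics itself.

```python
def tenant_url(base: str, component: str, account_id: int, suffix: str) -> str:
    """Build <base>/<insert|select>/<accountID>/prometheus/<suffix> (hypothetical helper)."""
    if component not in ("insert", "select"):
        raise ValueError("component must be 'insert' or 'select'")
    if account_id < 0:
        raise ValueError("account_id must be non-negative")
    return f"{base.rstrip('/')}/{component}/{account_id}/prometheus/{suffix.lstrip('/')}"


# Hosts and ports are the vminsert/vmselect addresses used in this article's example.
write_url = tenant_url("http://10.0.1.21:8480", "insert", 1, "api/v1/write")
query_url = tenant_url("http://10.0.1.22:8481", "select", 1, "api/v1/query")
print(write_url)
print(query_url)
```

Writes go through vminsert and reads through vmselect, so the same tenant ID appears in both paths.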

Deployment Steps

1. System Preparation

# Verify OS version
cat /etc/os-release
# Check CPU, memory, and disk
nproc
free -h
df -h
# Ensure time synchronization
timedatectl status
# Install chrony if needed
sudo apt install -y chrony
sudo systemctl enable --now chrony
chronyc tracking

2. Create User and Directories

# Create a non‑login user
sudo useradd -r -s /sbin/nologin victoriametrics
# Create data and config directories (/data/vmstorage is only needed for cluster mode)
sudo mkdir -p /data/victoriametrics /data/vmstorage /etc/victoriametrics /var/log/victoriametrics
sudo chown -R victoriametrics:victoriametrics /data/victoriametrics /data/vmstorage /var/log/victoriametrics

3. Install Binary

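# Set version
VM_VERSION="v1.102.0"
BASE_URL="https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}"
# Download and extract the single-node binary
wget "${BASE_URL}/victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz"
tar xzf "victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz"
sudo install -m 0755 victoria-metrics-prod /usr/local/bin/victoria-metrics
# Cluster mode only: the -cluster archive ships vminsert-prod, vmselect-prod, and vmstorage-prod
wget "${BASE_URL}/victoria-metrics-linux-amd64-${VM_VERSION}-cluster.tar.gz"
tar xzf "victoria-metrics-linux-amd64-${VM_VERSION}-cluster.tar.gz"
for c in vminsert vmselect vmstorage; do sudo install -m 0755 "${c}-prod" "/usr/local/bin/${c}"; done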

4. Service Files

Standalone (single binary):

sudo tee /etc/systemd/system/victoriametrics.service > /dev/null << 'EOF'
[Unit]
Description=VictoriaMetrics single‑node
After=network.target

[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/victoria-metrics \
    -storageDataPath=/data/victoriametrics \
    -retentionPeriod=6 \
    -httpListenAddr=:8428 \
    -maxLabelsPerTimeseries=40 \
    -search.maxUniqueTimeseries=5000000 \
    -search.maxQueryDuration=60s \
    -memory.allowedPercent=60 \
    -dedup.minScrapeInterval=15s \
    -loggerTimezone=Asia/Shanghai \
    -loggerOutput=stderr
ExecStop=/bin/kill -s SIGTERM $MAINPID
Restart=always
RestartSec=5
LimitNOFILE=65536
LimitNPROC=32000

[Install]
WantedBy=multi-user.target
EOF

Cluster mode requires three unit files (vmstorage, vminsert, vmselect). Example for vmstorage.service:

sudo tee /etc/systemd/system/vmstorage.service > /dev/null << 'EOF'
[Unit]
Description=VictoriaMetrics vmstorage
After=network.target

[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/vmstorage \
    -storageDataPath=/data/vmstorage \
    -retentionPeriod=6 \
    -httpListenAddr=:8482 \
    -vminsertAddr=:8400 \
    -vmselectAddr=:8401 \
    -dedup.minScrapeInterval=15s \
    -memory.allowedPercent=60 \
    -search.maxUniqueTimeseries=5000000 \
    -loggerTimezone=Asia/Shanghai \
    -loggerOutput=stderr
Restart=always
RestartSec=5
LimitNOFILE=131072

[Install]
WantedBy=multi-user.target
EOF

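Similar unit files are created for vminsert and vmselect, setting -storageNode and -replicationFactor as described later. After writing the unit files, run sudo systemctl daemon-reload and start each service with sudo systemctl enable --now followed by the unit name.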

5. Configure Prometheus remote_write

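global:
  scrape_interval: 15s
  external_labels:
    replica: prom-a
    cluster: prod
remote_write:
  # Cluster URL via vminsert; a single-node instance listens on :8428 at /api/v1/write
  - url: "http://10.0.1.21:8480/insert/0/prometheus/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      max_shards: 30
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
      # Drop the per-replica label so both HA replicas send identical series
      - regex: replica
        action: labeldrop

Two identically configured Prometheus instances can write to the same URL. VictoriaMetrics deduplicates their samples when -dedup.minScrapeInterval matches the scrape interval (15 s), but only if both replicas send identical label sets, so a label that differs per replica (such as replica above) has to be dropped before the write.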
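To make the deduplication behaviour concrete, here is a simplified sketch of the idea: within each -dedup.minScrapeInterval window, only the sample with the largest timestamp survives for a given series. The function name and the bucket boundaries are assumptions for illustration; the real implementation aligns intervals slightly differently and applies only to samples that share an identical label set.

```python
def deduplicate(samples, interval_ms):
    """Keep one (timestamp_ms, value) pair per interval: the one with the largest timestamp.

    `samples` must all belong to ONE series (identical labels).
    Bucketing by integer division is a simplification of the real alignment.
    """
    kept = {}
    for ts, value in samples:
        bucket = ts // interval_ms
        best = kept.get(bucket)
        if best is None or ts > best[0]:
            kept[bucket] = (ts, value)
    return [kept[b] for b in sorted(kept)]


# Two replicas scrape the same target every 15 s, offset by 3 s (made-up values).
replica_a = [(1_000, 10.0), (16_000, 11.0), (31_000, 12.0)]
replica_b = [(4_000, 10.2), (19_000, 11.1), (34_000, 12.3)]

merged = deduplicate(replica_a + replica_b, interval_ms=15_000)
print(merged)
```

Six incoming samples collapse to three, one per 15 s window, which is why the flag should not be set lower than the scrape interval.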

6. Cluster Example (3 storage nodes, 2 insert, 2 select)

IP layout:

vmstorage‑1: 10.0.1.11 (ports 8482/8400/8401)

vmstorage‑2: 10.0.1.12 (same ports)

vmstorage‑3: 10.0.1.13 (same ports)

vminsert: 10.0.1.21 (port 8480)

vmselect: 10.0.1.22 (port 8481)

Key parameters:

-replicationFactor=2 : each sample is stored on two storage nodes.

-search.maxUniqueTimeseries=10000000 : raises the series limit for large queries.

-search.maxSamplesPerQuery=500000000 : protects against OOM on full-scan queries.
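The vminsert and vmselect command lines for this layout can be sketched as follows. The binary paths mirror the vmstorage unit and are assumptions, as are the function names; the flags are the ones named in this article, with vminsert pointing at port 8400 and vmselect at port 8401 on each storage node.

```python
# Storage node IPs from the layout above.
STORAGE_NODES = ["10.0.1.11", "10.0.1.12", "10.0.1.13"]


def vminsert_args(nodes, replication_factor=2):
    """Command line for vminsert; it talks to vmstorage on the -vminsertAddr port (8400)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of storage nodes")
    storage = ",".join(f"{ip}:8400" for ip in nodes)
    return [
        "/usr/local/bin/vminsert",  # assumed install path
        "-httpListenAddr=:8480",
        f"-storageNode={storage}",
        f"-replicationFactor={replication_factor}",
    ]


def vmselect_args(nodes, replication_factor=2):
    """Command line for vmselect; it talks to vmstorage on the -vmselectAddr port (8401)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor cannot exceed the number of storage nodes")
    storage = ",".join(f"{ip}:8401" for ip in nodes)
    return [
        "/usr/local/bin/vmselect",  # assumed install path
        "-httpListenAddr=:8481",
        f"-storageNode={storage}",
        f"-replicationFactor={replication_factor}",
        "-dedup.minScrapeInterval=15s",
        "-search.maxUniqueTimeseries=10000000",
        "-search.maxSamplesPerQuery=500000000",
    ]


insert_cmd = vminsert_args(STORAGE_NODES)
select_cmd = vmselect_args(STORAGE_NODES)
print(" \\\n    ".join(insert_cmd))
print(" \\\n    ".join(select_cmd))
```

The printed output can be pasted into the ExecStart= line of the corresponding unit file.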

Best Practices & Pitfalls

Storage Capacity Planning

Formula: Disk = ActiveSeries × CompressedSampleSize × SamplesPerDay × RetentionDays × ReplicationFactor. Example: 5 M series, 15 s interval (5 760 samples/day), 1 byte per compressed sample, 180 days, replication 2 → 5 000 000 × 1 B × 5 760 × 180 × 2 ≈ 10.4 TB (decimal), before free-space headroom.

Empirical compression rates:

Counters ≈ 0.4 B/sample

Gauges ≈ 1.0‑1.5 B/sample

Histograms ≈ 0.8 B/sample

Reserve at least 30 % free space because merge operations need temporary space.
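The formula and the 30 % headroom rule can be combined in a small calculator. This is a sketch with a hypothetical function name; the inputs are the example figures from this section, and the headroom step simply divides by 0.70 so that the data fills at most 70 % of the volume.

```python
def disk_bytes(active_series, scrape_interval_s, bytes_per_sample,
               retention_days, replication_factor):
    """Estimated on-disk size in bytes, following the formula in this section."""
    samples_per_day = 86_400 / scrape_interval_s
    return (active_series * bytes_per_sample * samples_per_day
            * retention_days * replication_factor)


raw = disk_bytes(
    active_series=5_000_000,
    scrape_interval_s=15,
    bytes_per_sample=1.0,   # use ~0.4 for counter-heavy workloads, 1.0-1.5 for gauges
    retention_days=180,
    replication_factor=2,
)
provisioned = raw / 0.70    # keep at least 30 % of the volume free for merges

print(f"data: {raw / 1e12:.2f} TB, provision: {provisioned / 1e12:.2f} TB")
```

With the example inputs the data itself comes to about 10.4 TB, and roughly 14.8 TB should be provisioned across the cluster to keep the recommended free space.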

Deduplication Settings

Set -dedup.minScrapeInterval to the Prometheus scrape_interval (or larger). The same flag must be configured on both vmstorage and vmselect. Mismatched values cause duplicate data or query glitches.

Memory Limits

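Set -memory.allowedPercent explicitly (the default is 60) and lower it when other processes share the host. The flag bounds the memory VictoriaMetrics uses for its caches rather than the whole process, so heavy queries can still push usage higher and trigger the OOM killer; combine it with the -search.* limits.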

Remote Write Tuning

Adjust Prometheus queue_config:

max_samples_per_send : raise to 10 000 or higher for high-throughput environments (the default is 2 000 in current Prometheus releases and was 500 in older ones).

max_shards : raise from 30 to 50+ when write latency appears.

Use write_relabel_configs to drop high‑cardinality or unnecessary metrics (e.g., go_.*).
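A back-of-envelope way to reason about these two settings: each shard sends one batch at a time, so its throughput is roughly the batch size divided by the request round-trip time. The sketch below applies that simplified model; the function name and the 100 ms latency are assumptions, and real queues also depend on capacity, retries, and backoff.

```python
import math


def shards_needed(samples_per_second, max_samples_per_send, request_latency_ms):
    """Rough lower bound on shards, assuming one in-flight batch per shard."""
    per_shard = max_samples_per_send * 1000 / request_latency_ms  # samples/s per shard
    return math.ceil(samples_per_second / per_shard)


# ~500 k samples/s, as in the overview; 100 ms round trip is an assumed figure.
large_batches = shards_needed(500_000, 10_000, 100)
small_batches = shards_needed(500_000, 500, 100)
print(large_batches)  # 5
print(small_batches)  # 100
```

Under this model, larger batches reduce the shard count far more than raising max_shards alone, which is why the batch size is usually tuned first.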

Performance Tuning

Increase file descriptor limit: LimitNOFILE=131072 in systemd units and nofile limits in /etc/security/limits.conf.

Set kernel parameters for networking and mmap: vm.max_map_count=262144, net.core.somaxconn=65535, etc.

Use SSD; set I/O scheduler to none and increase nr_requests for better throughput.

Enable slow‑query logging with -search.logSlowQueryDuration=5s and increase -search.maxConcurrentRequests when query load grows.

Security Hardening

Typical measures:

Terminate HTTP endpoints behind Nginx with basic auth for read paths and IP whitelist for write paths.

Bind services to internal IPs only (e.g., -httpListenAddr=10.0.1.11:8482).

Set auth keys for the delete, snapshot, and reset-cache endpoints (-deleteAuthKey, -snapshotAuthKey, -search.resetCacheAuthKey).

Restrict firewall to allow only monitoring hosts.

Troubleshooting & Monitoring

Common Issues

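OOM : usually caused by large full-scan queries, or by -memory.allowedPercent being too high for a host shared with other processes. Fix by lowering the flag and capping queries with -search.maxSamplesPerQuery and -search.maxUniqueTimeseries.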

Too many open files : raise LimitNOFILE in the service file and system limits.

Remote_write timeout : increase Prometheus max_shards and VM -maxConcurrentInserts, verify network latency.

Query returns “unique timeseries exceeds” : raise -search.maxUniqueTimeseries or narrow the PromQL range.

Cluster data inconsistency : ensure -replicationFactor is identical on vminsert and vmselect.

Disk space not freeing : clean old snapshots (/snapshot/delete_all) and verify the -snapshotsMaxAge setting.

Health Checks

# VM single‑node health
curl -s http://localhost:8428/health
# Cluster component health
curl -s http://10.0.1.21:8480/health   # vminsert
curl -s http://10.0.1.22:8481/health   # vmselect
curl -s http://10.0.1.11:8482/health   # vmstorage
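The same checks can be scripted, for example from a cron job. The sketch below treats an HTTP 200 from /health as healthy and anything else, including connection errors, as unhealthy. Function names are hypothetical, and the demonstration at the end uses a stub instead of real network calls so that it runs anywhere.

```python
import urllib.request

# Endpoints taken from this article's example layout.
ENDPOINTS = {
    "vminsert": "http://10.0.1.21:8480/health",
    "vmselect": "http://10.0.1.22:8481/health",
    "vmstorage": "http://10.0.1.11:8482/health",
}


def http_get_status(url, timeout):
    """Return the HTTP status code of a GET request (performs a real network call)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_all(endpoints, get_status=http_get_status, timeout=3.0):
    """Map each component name to True (HTTP 200) or False (error or other status)."""
    results = {}
    for name, url in endpoints.items():
        try:
            results[name] = get_status(url, timeout) == 200
        except OSError:  # URLError, HTTPError, and socket timeouts derive from OSError
            results[name] = False
    return results


def _stub_status(url, timeout):
    """Offline stand-in: pretends vmstorage is unreachable."""
    if ":8482" in url:
        raise OSError("connection refused")
    return 200


demo = check_all(ENDPOINTS, get_status=_stub_status)
print(demo)
```

Calling check_all(ENDPOINTS) without the get_status argument probes the real hosts.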

Key Metrics to Monitor (Prometheus scrape)

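vm_rows_inserted_total : write rate.

vm_http_requests_total : query rate.

process_resident_memory_bytes : memory usage; compare it against total system memory, since -memory.allowedPercent only bounds cache size.

vm_free_disk_space_bytes / vm_available_disk_space_bytes : alert when < 15 %.

vm_http_request_duration_seconds p99 : keep < 1 s, alert > 5 s.

Metric names differ between components and releases, so confirm each one against the component's /metrics output before building dashboards or alerts on it.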

Prometheus Alert Rules (example)

groups:
- name: victoriametrics_alerts
  interval: 30s
  rules:
  - alert: VMWriteRateDrop
    expr: rate(vm_rows_inserted_total[5m]) < 0.5 * rate(vm_rows_inserted_total[5m] offset 1h)
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "VictoriaMetrics write rate dropped >50%"
  - alert: VMDiskSpaceLow
    expr: vm_free_disk_space_bytes / vm_available_disk_space_bytes < 0.15
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "VictoriaMetrics disk free space <15%"
  - alert: VMSlowQueries
    expr: histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) > 5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "VictoriaMetrics P99 query latency >5s"

Backup & Restore

Backup Procedure

Use VM’s snapshot API, then rsync the snapshot directory to a backup location. Example script (simplified):

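#!/bin/bash
set -euo pipefail
VM_URL="http://localhost:8428"
DATA_DIR="/data/victoriametrics"
BACKUP_DIR="/backup/victoriametrics"
DATE=$(date +%Y%m%d_%H%M%S)
# Create a snapshot; the API answers with JSON such as {"status":"ok","snapshot":"<name>"}
SNAPSHOT=$(curl -fsS "$VM_URL/snapshot/create" | python3 -c "import sys,json;print(json.load(sys.stdin)['snapshot'])")
mkdir -p "$BACKUP_DIR/$DATE"
# -L copies the files behind the snapshot's symlinks instead of the links themselves
rsync -aL "$DATA_DIR/snapshots/$SNAPSHOT/" "$BACKUP_DIR/$DATE/"
# Delete the snapshot so it stops pinning disk space
curl -fsS "$VM_URL/snapshot/delete?snapshot=$SNAPSHOT"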
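The inline python3 one-liner fails with a bare KeyError when snapshot creation fails. A slightly more defensive parser, sketched below, checks the status field first. The function name and the snapshot name in the example are hypothetical; the response shape with status and snapshot keys is what /snapshot/create returns.

```python
import json


def snapshot_name(response_text):
    """Extract the snapshot name from a /snapshot/create response, or raise on failure."""
    payload = json.loads(response_text)
    if payload.get("status") != "ok" or not payload.get("snapshot"):
        raise RuntimeError(f"snapshot creation failed: {response_text!r}")
    return payload["snapshot"]


# Example response body; the snapshot name here is made up.
ok_body = '{"status":"ok","snapshot":"20250115020000-EXAMPLE"}'
name = snapshot_name(ok_body)
print(name)
```

Raising on a non-ok status keeps the backup script from running rsync against an empty or wrong path.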

Restore Procedure

# Stop services
sudo systemctl stop victoriametrics   # or vmstorage/vminsert/vmselect
# Backup current data directory
sudo mv /data/victoriametrics /data/victoriametrics.bak.$(date +%Y%m%d)
# Restore backup
sudo rsync -a /backup/victoriametrics/20250115_020000/ /data/victoriametrics/
sudo chown -R victoriametrics:victoriametrics /data/victoriametrics
# Start services
sudo systemctl start victoriametrics   # or cluster units
# Verify data
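# -G with --data-urlencode keeps curl from interpreting {} as URL globbing and encodes "+" correctly
curl -s -G 'http://localhost:8428/api/v1/query' --data-urlencode 'query=count({__name__=~".+"})' | python3 -m json.tool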
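To turn the verification query into a pass/fail check, the JSON response can be reduced to a single series count. The sketch below assumes the standard Prometheus instant-query response format; the function name and the sample body are illustrative only.

```python
import json


def series_count(response_text):
    """Return the value of a count(...) instant query, or 0 when no series match."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {response_text!r}")
    result = payload["data"]["result"]
    if not result:
        return 0
    # Instant vectors carry [timestamp, "value"]; the value is a string.
    return int(float(result[0]["value"][1]))


# Made-up response body in the Prometheus API format.
body = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1736906400,"4821"]}]}}'
)
restored = series_count(body)
print(restored)
```

Comparing this number with the count recorded before the restore gives a quick sanity check that the data directory was picked up.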

Conclusion

The guide demonstrates that VictoriaMetrics can replace Prometheus local storage for long‑term monitoring, delivering up to tenfold storage savings, millisecond‑level query latency, and a straightforward deployment model. Key takeaways include configuring deduplication correctly, limiting memory usage, sizing storage based on realistic compression rates, and tuning remote_write parameters to avoid bottlenecks.

Further learning paths include exploring VMAlert for alert evaluation, using the VictoriaMetrics Operator for Kubernetes‑native management, and testing VictoriaLogs for unified metrics‑and‑logs observability.
