How to Deploy VictoriaMetrics for High‑Performance Prometheus Remote Storage
This article walks through the challenges of scaling Prometheus storage, compares Thanos, Cortex, and VictoriaMetrics, and provides a complete step‑by‑step guide—including hardware requirements, configuration, deployment, tuning, multi‑tenant setup, and troubleshooting—to replace Prometheus local TSDB with VictoriaMetrics for long‑term, high‑performance monitoring.
Overview
Prometheus stores data locally and is designed for short‑term retention (15‑30 days). In large‑scale environments (200+ nodes, 50 k+ active series, ~500 k samples/s) a single Prometheus instance runs out of disk and memory, and native high‑availability is missing because each instance writes its own data. The article evaluates three alternatives—Thanos, Cortex/Mimir, and VictoriaMetrics—showing that VictoriaMetrics offers the simplest deployment, 7‑10× higher compression, millisecond query latency, and lower resource consumption.
In the author’s production cluster, VictoriaMetrics reduced storage cost by ~60 % and query latency from seconds to <200 ms.
Technical Characteristics
Compression : custom algorithm achieves 10:1–15:1 compression (e.g., 1 M series with 15 s interval for one month uses 20‑30 GB).
PromQL Compatibility : fully compatible with PromQL; supports MetricsQL extensions such as range_median() and optimized histogram_quantile().
Deployment : single binary for the standalone version; cluster mode consists of three components with no external dependencies: the stateless vminsert and vmselect, and the stateful vmstorage, which holds the data on disk.
Multi‑tenant support : tenant ID is encoded in the URL path (e.g., /insert/0/… for single‑tenant, /insert/1/… for tenant 1).
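As a small illustration of the tenant-in-the-path scheme, the sketch below builds the write and query URLs for a given tenant ID. The host addresses and the helper names are assumptions made for this example; the ports (8480 for vminsert, 8481 for vmselect) and the /insert/ path follow the cluster setup used later in this article, and the /select/ path is the matching read-side form of the cluster URL format.

```shell
#!/bin/bash
# Hypothetical helpers: build cluster URLs for a tenant (accountID).

vm_insert_url() {  # args: vminsert_host account_id
  printf 'http://%s:8480/insert/%s/prometheus/api/v1/write\n' "$1" "$2"
}

vm_select_url() {  # args: vmselect_host account_id
  printf 'http://%s:8481/select/%s/prometheus/api/v1/query\n' "$1" "$2"
}

# Example addresses only; substitute your own vminsert/vmselect hosts.
vm_insert_url 10.0.1.21 1
vm_select_url 10.0.1.22 1
```

Tenant 1 writes and reads through its own path, so its data stays separate from tenant 0 without any extra configuration on the storage nodes.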
Deployment Steps
1. System Preparation
# Verify OS version
cat /etc/os-release
# Check CPU, memory, and disk
nproc
free -h
df -h
# Ensure time synchronization
timedatectl status
# Install chrony if needed
sudo apt install -y chrony
sudo systemctl enable --now chrony
chronyc tracking
2. Create User and Directories
# Create a non‑login user
sudo useradd -r -s /sbin/nologin victoriametrics
# Create data and config directories
sudo mkdir -p /data/victoriametrics /etc/victoriametrics /var/log/victoriametrics
sudo chown -R victoriametrics:victoriametrics /data/victoriametrics /var/log/victoriametrics
3. Install Binary
# Set version
VM_VERSION="v1.102.0"
# Download and extract
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz
tar xzf victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz
sudo mv victoria-metrics-prod /usr/local/bin/victoria-metrics
sudo chmod +x /usr/local/bin/victoria-metrics
4. Service Files
Standalone (single binary):
sudo tee /etc/systemd/system/victoriametrics.service > /dev/null << 'EOF'
[Unit]
Description=VictoriaMetrics single‑node
After=network.target
[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/victoria-metrics \
-storageDataPath=/data/victoriametrics \
-retentionPeriod=6 \
-httpListenAddr=:8428 \
-maxLabelsPerTimeseries=40 \
-search.maxUniqueTimeseries=5000000 \
-search.maxQueryDuration=60s \
-memory.allowedPercent=60 \
-dedup.minScrapeInterval=15s \
-loggerTimezone=Asia/Shanghai \
-loggerOutput=stderr
ExecStop=/bin/kill -s SIGTERM $MAINPID
Restart=always
RestartSec=5
LimitNOFILE=65536
LimitNPROC=32000
[Install]
WantedBy=multi-user.target
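EOF
After writing a unit file, run sudo systemctl daemon-reload and then sudo systemctl enable --now victoriametrics to start the service.
Cluster mode requires three unit files (vmstorage, vminsert, vmselect). Example for vmstorage.service: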
sudo tee /etc/systemd/system/vmstorage.service > /dev/null << 'EOF'
[Unit]
Description=VictoriaMetrics vmstorage
After=network.target
[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/vmstorage \
-storageDataPath=/data/vmstorage \
-retentionPeriod=6 \
-httpListenAddr=:8482 \
-vminsertAddr=:8400 \
-vmselectAddr=:8401 \
-dedup.minScrapeInterval=15s \
-memory.allowedPercent=60 \
-search.maxUniqueTimeseries=5000000 \
-loggerTimezone=Asia/Shanghai \
-loggerOutput=stderr
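Restart=always
RestartSec=5
LimitNOFILE=131072
[Install]
WantedBy=multi-user.target
EOF
Similar unit files are created for vminsert and vmselect, setting -storageNode and -replicationFactor as described later. The cluster binaries (vmstorage, vminsert, vmselect) come from the separate cluster release archive, not from the single-node tarball installed in step 3.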
5. Configure Prometheus remote_write
global:
  scrape_interval: 15s
  external_labels:
    replica: prom-a
    cluster: prod
remote_write:
  - url: "http://10.0.1.21:8480/insert/0/prometheus/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      max_shards: 30
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
Two Prometheus instances can use the same URL; VictoriaMetrics deduplicates their samples when -dedup.minScrapeInterval matches the scrape interval (15 s). Deduplication only applies to series with identical label sets, so a per-instance label such as replica has to be dropped before the write (or set to the same value on both instances).
6. Cluster Example (3 storage nodes, 2 insert, 2 select)
IP layout:
vmstorage‑1: 10.0.1.11 (ports 8482/8400/8401)
vmstorage‑2: 10.0.1.12 (same ports)
vmstorage‑3: 10.0.1.13 (same ports)
vminsert: 10.0.1.21 (port 8480)
vmselect: 10.0.1.22 (port 8481)
Key parameters:
-replicationFactor=2 : each sample is stored on two storage nodes.
-search.maxUniqueTimeseries=10000000 : raises the series limit for large queries.
-search.maxSamplesPerQuery=500000000 : protects against OOM on full-scan queries.
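To make the wiring between the components concrete, here is a minimal sketch that assembles the -storageNode value for vminsert and vmselect from the IP layout above. The storage_nodes helper is hypothetical and the script only prints the command lines; it assumes vminsert connects to the port opened by -vminsertAddr (8400) and vmselect to the port opened by -vmselectAddr (8401), as in the vmstorage unit file.

```shell
#!/bin/bash
# Hypothetical helper: join vmstorage IPs with a port into the
# comma-separated list that -storageNode expects.
storage_nodes() {  # args: port ip [ip...]
  local port=$1; shift
  local out="" ip
  for ip in "$@"; do
    out="${out:+$out,}${ip}:${port}"
  done
  printf '%s\n' "$out"
}

STORAGE_IPS="10.0.1.11 10.0.1.12 10.0.1.13"   # from the IP layout above

# vminsert writes to the vmstorage port opened with -vminsertAddr (8400).
# $STORAGE_IPS is left unquoted on purpose so it splits into separate arguments.
echo "vminsert -httpListenAddr=:8480 -replicationFactor=2 -storageNode=$(storage_nodes 8400 $STORAGE_IPS)"

# vmselect reads from the vmstorage port opened with -vmselectAddr (8401).
echo "vmselect -httpListenAddr=:8481 -replicationFactor=2 -dedup.minScrapeInterval=15s -storageNode=$(storage_nodes 8401 $STORAGE_IPS)"
```

The printed lines can be pasted into the ExecStart of the vminsert and vmselect unit files; generating both lists from one variable helps keep the two components pointed at the same set of storage nodes.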
Best Practices & Pitfalls
Storage Capacity Planning
Formula: Disk = ActiveSeries × CompressedSampleSize × SamplesPerDay × RetentionDays × ReplicationFactor. Example: 5 M series, 15 s interval (5 760 samples/day), 1 byte per compressed sample, 180 days, replication 2 → 5 000 000 × 1 B × 5 760 × 180 × 2 ≈ 10.4 TB.
Empirical compression rates:
Counters ≈ 0.4 B/sample
Gauges ≈ 1.0‑1.5 B/sample
Histograms ≈ 0.8 B/sample
Reserve at least 30 % free space because merge operations need temporary space.
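The sizing formula and the 30 % headroom rule can be combined in a short calculation. The following is a sketch with hypothetical function names; it assumes decimal terabytes (1 TB = 10^12 bytes) and treats the headroom as "data may fill at most 70 % of the disk".

```shell
#!/bin/bash
# Disk = ActiveSeries x BytesPerSample x SamplesPerDay x RetentionDays x ReplicationFactor
estimate_disk_bytes() {
  # args: active_series bytes_per_sample scrape_interval_s retention_days replication_factor
  awk -v s="$1" -v b="$2" -v i="$3" -v d="$4" -v r="$5" \
    'BEGIN { printf "%.0f\n", s * b * (86400 / i) * d * r }'
}

# Disk size to provision so that free_pct percent of it stays free.
provisioned_tb() {
  # args: data_bytes free_pct
  awk -v n="$1" -v f="$2" 'BEGIN { printf "%.1f\n", n / (1 - f / 100) / 1e12 }'
}

# The example from the text: 5 M series, 1 B/sample, 15 s interval, 180 days, replication 2.
data=$(estimate_disk_bytes 5000000 1 15 180 2)
echo "data: $data bytes"
echo "provision: $(provisioned_tb "$data" 30) TB"
```

Swapping the bytes-per-sample argument for the empirical rates above (for example 0.4 for counter-heavy workloads) gives a quick range for best and worst case.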
Deduplication Settings
Set -dedup.minScrapeInterval to the Prometheus scrape_interval (or larger). The same flag must be configured on both vmstorage and vmselect. Mismatched values cause duplicate data or query glitches.
Memory Limits
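Set -memory.allowedPercent explicitly (60 is the documented default) and lower it when other services share the host. The flag bounds the memory used by VictoriaMetrics caches rather than acting as a hard cap, so heavy queries can still push the process into the OOM killer; pair it with the -search.* query limits.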
Remote Write Tuning
Adjust Prometheus queue_config:
max_samples_per_send – raise to 10 000 or higher for high‑throughput environments (the default depends on the Prometheus version; older releases use 500).
max_shards – raise from 30 to 50+ when write latency appears.
Use write_relabel_configs to drop high‑cardinality or unnecessary metrics (e.g., go_.*).
Performance Tuning
Increase file descriptor limit: LimitNOFILE=131072 in systemd units and nofile limits in /etc/security/limits.conf.
Set kernel parameters for networking and mmap: vm.max_map_count=262144, net.core.somaxconn=65535, etc.
Use SSD; set I/O scheduler to none and increase nr_requests for better throughput.
Enable slow‑query logging with -search.logSlowQueryDuration=5s and increase -search.maxConcurrentRequests when query load grows.
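The kernel parameters listed above can be persisted in a sysctl drop-in file. The sketch below only renders the file content, using the two values named in this section; the function name and the target file name are assumptions for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print sysctl settings for a VictoriaMetrics host.
render_sysctl_conf() {
  cat <<'EOF'
vm.max_map_count = 262144
net.core.somaxconn = 65535
EOF
}

render_sysctl_conf
```

To apply it, pipe the output into a drop-in file, for example render_sysctl_conf | sudo tee /etc/sysctl.d/99-victoriametrics.conf, and then run sudo sysctl --system to load it.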
Security Hardening
Typical measures:
Place the HTTP endpoints behind an Nginx reverse proxy, with basic auth for read paths and an IP whitelist for write paths.
Bind services to internal IPs only (e.g., -httpListenAddr=10.0.1.11:8482).
Set auth keys for delete, snapshot, and reset‑cache operations (-deleteAuthKey, -snapshotAuthKey, -search.resetCacheAuthKey).
Restrict firewall to allow only monitoring hosts.
Troubleshooting & Monitoring
Common Issues
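OOM : usually caused by a -memory.allowedPercent value that is too high for the host or by large full‑scan queries. Fix by lowering the flag and capping queries with -search.maxUniqueTimeseries and -search.maxSamplesPerQuery.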
Too many open files : raise LimitNOFILE in the service file and system limits.
Remote_write timeout : increase Prometheus max_shards and VM -maxConcurrentInserts, verify network latency.
Query returns “unique timeseries exceeds” : raise -search.maxUniqueTimeseries or narrow the PromQL range.
Cluster data inconsistency : ensure -replicationFactor is identical on vminsert and vmselect.
Disk space not freeing : clean old snapshots (/snapshot/delete_all) and verify the -snapshotsMaxAge setting.
Health Checks
# VM single‑node health
curl -s http://localhost:8428/health
# Cluster component health
curl -s http://10.0.1.21:8480/health # vminsert
curl -s http://10.0.1.22:8481/health # vmselect
curl -s http://10.0.1.11:8482/health # vmstorage
Key Metrics to Monitor (Prometheus scrape)
vm_rows_inserted_total – write rate.
vm_http_requests_total – query rate.
process_resident_memory_bytes – memory usage (should stay below -memory.allowedPercent).
vm_free_disk_space_bytes / vm_available_disk_space_bytes – alert when < 15 %.
vm_http_request_duration_seconds p99 – keep < 1 s, alert > 5 s.
Prometheus Alert Rules (example)
groups:
  - name: victoriametrics_alerts
    interval: 30s
    rules:
      - alert: VMWriteRateDrop
        expr: rate(vm_rows_inserted_total[5m]) < 0.5 * rate(vm_rows_inserted_total[5m] offset 1h)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VictoriaMetrics write rate dropped >50%"
      - alert: VMDiskSpaceLow
        expr: vm_free_disk_space_bytes / vm_available_disk_space_bytes < 0.15
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "VictoriaMetrics disk free space <15%"
      - alert: VMSlowQueries
        expr: histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "VictoriaMetrics P99 query latency >5s"
Backup & Restore
Backup Procedure
Use VM’s snapshot API, then rsync the snapshot directory to a backup location. Example script (simplified):
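#!/bin/bash
set -euo pipefail
VM_URL="http://localhost:8428"
BACKUP_DIR="/backup/victoriametrics"
DATE=$(date +%Y%m%d_%H%M%S)
# Create a snapshot and read its name from the JSON response
SNAPSHOT=$(curl -fsS "$VM_URL/snapshot/create" | python3 -c "import sys,json;print(json.load(sys.stdin)['snapshot'])")
mkdir -p "$BACKUP_DIR/$DATE"
# -L follows the symlinks the snapshot directory uses to reference data files
rsync -aL "/data/victoriametrics/snapshots/$SNAPSHOT/" "$BACKUP_DIR/$DATE/"
# Delete the snapshot once it has been copied
curl -fsS "$VM_URL/snapshot/delete?snapshot=$SNAPSHOT"
Restore Procedure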
# Stop services
sudo systemctl stop victoriametrics # or vmstorage/vminsert/vmselect
# Backup current data directory
sudo mv /data/victoriametrics /data/victoriametrics.bak.$(date +%Y%m%d)
# Restore backup
sudo rsync -a /backup/victoriametrics/20250115_020000/ /data/victoriametrics/
sudo chown -R victoriametrics:victoriametrics /data/victoriametrics
# Start services
sudo systemctl start victoriametrics # or cluster units
# Verify data
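# -G with --data-urlencode keeps curl from interpreting the braces and encodes the query
curl -s -G 'http://localhost:8428/api/v1/query' --data-urlencode 'query=count({__name__=~".+"})' | python3 -m json.tool
Conclusion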
The guide demonstrates that VictoriaMetrics can replace Prometheus local storage for long‑term monitoring, delivering up to tenfold storage savings, millisecond‑level query latency, and a straightforward deployment model. Key takeaways include configuring deduplication correctly, limiting memory usage, sizing storage based on realistic compression rates, and tuning remote_write parameters to avoid bottlenecks.
Further learning paths include exploring VMAlert for alert evaluation, using the VictoriaMetrics Operator for Kubernetes‑native management, and testing VictoriaLogs for unified metrics‑and‑logs observability.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
