How to Replace Prometheus Local Storage with VictoriaMetrics for High‑Performance Long‑Term Monitoring
This guide explains why Prometheus’s local TSDB struggles at scale, compares alternative remote‑storage solutions, and provides a step‑by‑step walkthrough for deploying VictoriaMetrics (single‑node or clustered), configuring remote_write, tuning performance, handling multi‑tenant use cases, and troubleshooting common issues.
Overview
Prometheus’s built‑in TSDB is designed for short‑term storage and typically retains data for only 15‑30 days. When monitoring thousands of nodes and millions of active time series, a single Prometheus instance runs out of disk space and lacks native high‑availability, requiring external tools like Thanos or federation for aggregation. VictoriaMetrics offers a single‑binary, PromQL‑compatible TSDB with 7‑10× higher compression, millisecond query latency, and low resource consumption, making it a cost‑effective remote storage solution for Prometheus.
Technical Features
High compression: the custom compression algorithm achieves 10:1–15:1 ratios, reducing 3 months of data from ~800 GB (Prometheus + Thanos) to ~120 GB on VictoriaMetrics.
PromQL compatibility: supports MetricsQL (a superset of PromQL) with additional functions such as range_median() and an optimized histogram_quantile().
Simple deployment: a single binary for single‑node mode, three components (vminsert, vmselect, vmstorage) for clusters, and no external dependencies such as Consul, etcd, or Kafka.
Multi‑tenant support: the tenant ID is encoded in the URL path (e.g., /insert/0/… for the default tenant 0, /insert/1/… for tenant 1).
Built‑in deduplication: -dedup.minScrapeInterval removes duplicate samples from HA Prometheus pairs.
Typical Use Cases
Long‑term storage of Prometheus metrics (≥3 months).
Large‑scale monitoring clusters with >500 nodes and >10 million active series.
Multi‑tenant monitoring platforms where each business unit gets an isolated tenant.
Prometheus HA setups where two instances write to the same remote storage.
Environment Requirements
Supported OS: CentOS 7+, Ubuntu 18.04+, Debian 10+ (Ubuntu 22.04 LTS recommended). Recommended hardware for single‑node: 4 CPU, 8 GB RAM; for clusters: each node 8 CPU, 16 GB RAM. SSD is strongly recommended; HDD works but incurs 2‑3× higher query latency. Network ports: 8428 (single‑node API), 8480 (vminsert), 8481 (vmselect), 8482 (vmstorage).
Deployment Steps
1. Preparation
Check OS version, CPU, memory, disk space, and time synchronization (install chrony if needed).
Create a dedicated system user victoriametrics without a login shell.
Create data and configuration directories ( /data/victoriametrics, /etc/victoriametrics, /var/log/victoriametrics) and assign ownership to the user.
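The preparation steps above can be sketched as a small script. PREFIX is an illustrative variable added here so the sketch can be dry‑run without root; on a real host you would leave it empty, and the root‑only commands (useradd, chown) are shown commented out:

```shell
#!/usr/bin/env bash
# Sketch of the preparation steps. PREFIX defaults to a local demo
# directory so the script can be dry-run unprivileged; set PREFIX=
# (empty) on a real host to create the paths under /.
set -euo pipefail
PREFIX="${PREFIX-./vm-demo}"

# Dedicated system user without a login shell (root-only, hence commented out):
# useradd --system --shell /usr/sbin/nologin --no-create-home victoriametrics

# Data, configuration, and log directories
for dir in /data/victoriametrics /etc/victoriametrics /var/log/victoriametrics; do
  mkdir -p "${PREFIX}${dir}"
done

# Hand ownership to the service user (root-only, hence commented out):
# chown -R victoriametrics:victoriametrics "${PREFIX}/data/victoriametrics"
```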
2. Installation
Download the desired version (e.g., v1.102.0) from GitHub and extract the binary:
# Set version
VM_VERSION="v1.102.0"
# Download single‑node binary
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz
# Extract and install
tar xzf victoria-metrics-linux-amd64-${VM_VERSION}.tar.gz
sudo mv victoria-metrics-prod /usr/local/bin/victoria-metrics
sudo chmod +x /usr/local/bin/victoria-metrics
# Verify
victoria-metrics --version
For clusters, also download the vminsert, vmselect, and vmstorage binaries and install them to /usr/local/bin.
3. Core Configuration
Create a systemd unit file for each component. Example for single‑node:
# /etc/systemd/system/victoriametrics.service
[Unit]
Description=VictoriaMetrics single‑node
After=network.target
[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/victoria-metrics \
-storageDataPath=/data/victoriametrics \
-retentionPeriod=6 \
-httpListenAddr=:8428 \
-maxLabelsPerTimeseries=40 \
-search.maxUniqueTimeseries=5000000 \
-search.maxQueryDuration=60s \
-search.maxConcurrentRequests=32 \
-memory.allowedPercent=60 \
-dedup.minScrapeInterval=15s \
-loggerTimezone=Asia/Shanghai \
-loggerOutput=stderr
ExecStop=/bin/kill -s SIGTERM $MAINPID
Restart=always
RestartSec=5
LimitNOFILE=65536
LimitNPROC=32000
[Install]
WantedBy=multi-user.target
Cluster components use similar service files with component‑specific flags (e.g., -storageNode for vminsert, a distinct -httpListenAddr for each component, and a matching -replicationFactor).
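As a sketch of such a cluster unit file, a vminsert service might look like the following (the storage-node IPs and the replication factor are placeholders for this example, not values from the guide):

```ini
# /etc/systemd/system/vminsert.service
[Unit]
Description=VictoriaMetrics vminsert
After=network.target

[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/vminsert \
  -storageNode=10.0.1.31:8400,10.0.1.32:8400 \
  -httpListenAddr=:8480 \
  -replicationFactor=2
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
```

vmselect and vmstorage units follow the same shape, swapping in their own binary, listen address, and flags.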
4. Prometheus remote_write Integration
Add the following to /etc/prometheus/prometheus.yml (replace IPs with your VM endpoints):
remote_write:
  - url: "http://10.0.1.21:8480/insert/0/prometheus/api/v1/write"
    queue_config:
      max_samples_per_send: 10000
      max_shards: 30
      capacity: 20000
      batch_send_deadline: 5s
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop
The tenant ID is the number after /insert/ (use 0 for the default tenant). Adjust max_shards based on scrape rate (e.g., >10k samples/s may need 50+).
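Pointing a second Prometheus or business unit at its own tenant only changes the URL path. A sketch (the IP and tenant number are placeholders):

```yaml
remote_write:
  # Tenant 1 — note the /insert/1/ path segment instead of /insert/0/
  - url: "http://10.0.1.21:8480/insert/1/prometheus/api/v1/write"
```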
5. Start Services and Verify
# Reload systemd and enable services
sudo systemctl daemon-reload
sudo systemctl enable --now victoriametrics # single‑node
# For cluster, start vmstorage first, then vminsert, then vmselect
sudo systemctl enable --now vmstorage
sudo systemctl enable --now vminsert
sudo systemctl enable --now vmselect
# Health checks
curl -s http://localhost:8428/health
curl -s http://10.0.1.21:8480/health # vminsert
curl -s http://10.0.1.22:8481/health # vmselect
Test write and query:
# Write a test metric
curl -d 'test_metric{job="test",instance="localhost"} 42' http://localhost:8428/api/v1/import/prometheus
# Query it back
curl -s 'http://localhost:8428/api/v1/query?query=test_metric' | python3 -m json.tool
Best Practices and Tuning
Storage Planning
Estimate disk usage with:
# Disk = active_series × sample_size_after_compression × samples_per_day × retention_days × replication_factor
# Example: 5 M series, 15 s interval (5760 samples/day), 1 B per sample, 180 days, RF=2 → ~10.4 TB
Typical per‑sample sizes: counters ≈0.4 B, gauges ≈1‑1.5 B, histograms ≈0.8 B. Reserve 30 % free space for merge operations.
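As a sanity check, the example figure can be reproduced with shell arithmetic (an estimate only; real per‑sample sizes vary by metric type):

```shell
# 5M series × 5760 samples/day × 1 byte/sample × 180 days × replication factor 2
bytes=$((5000000 * 5760 * 1 * 180 * 2))
echo "$bytes bytes"
awk -v b="$bytes" 'BEGIN { printf "~%.1f TB\n", b / 1e12 }'   # ~10.4 TB
```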
Downsampling
Long‑term data can be downsampled to 1‑minute or 5‑minute resolution using vmalert recording rules (open‑source) or the enterprise downsampling feature.
# Example vmalert rule for 5‑minute average of CPU usage
- record: cpu_usage_avg:5m
  expr: avg_over_time(node_cpu_seconds_total{mode!="idle"}[5m])
Deduplication
Set -dedup.minScrapeInterval to the Prometheus scrape interval (e.g., 15 s). In cluster setups, apply the same value to both vmselect and vmstorage; mismatched values cause duplicate data or missing points.
Performance Tuning
Tune -memory.allowedPercent (default 60): raise it on dedicated hosts for better caching, lower it if the host runs other services to avoid OOM kills.
Raise -search.maxUniqueTimeseries (e.g., 5‑10 M) for large queries.
Adjust -maxLabelsPerTimeseries if you have many labels (40 is safe).
Set system limits: LimitNOFILE=131072, increase nofile limits in /etc/security/limits.conf, and tune kernel parameters ( vm.max_map_count, net.core.somaxconn, etc.).
For SSDs, use the “none” I/O scheduler and increase nr_requests.
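The kernel and I/O settings above can be made persistent with config fragments like the following (the values are starting points to tune per host, and the device pattern assumes NVMe SSDs):

```ini
# /etc/sysctl.d/99-victoriametrics.conf — starting points, tune to your host
vm.max_map_count = 262144
net.core.somaxconn = 4096

# /etc/udev/rules.d/60-ssd-scheduler.rules — "none" I/O scheduler for NVMe SSDs
# ACTION=="add|change", KERNEL=="nvme[0-9]n[0-9]", ATTR{queue/scheduler}="none"
```

Apply the sysctl file with `sysctl --system` and the udev rule on the next device event or reboot.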
Security Hardening
Typical measures:
Run behind an Nginx reverse proxy with basic auth for query endpoints.
Bind services to internal IPs only (e.g., -httpListenAddr=10.0.1.11:8482).
Configure authentication keys for administrative APIs such as delete and snapshot (e.g., -deleteAuthKey, -snapshotAuthKey).
Restrict firewall access to VM ports (allow only monitoring network).
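A minimal Nginx fragment covering the first two measures might look like this (the hostname, certificate paths, and htpasswd file are placeholders for this sketch):

```nginx
# /etc/nginx/conf.d/vmselect.conf
server {
    listen 443 ssl;
    server_name vm.example.internal;
    ssl_certificate     /etc/nginx/ssl/vm.crt;
    ssl_certificate_key /etc/nginx/ssl/vm.key;

    location / {
        auth_basic           "VictoriaMetrics";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass http://10.0.1.22:8481;   # vmselect query endpoint
    }
}
```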
Troubleshooting
Common Errors
cannot allocate memory: set -memory.allowedPercent and check for large queries.
too many open files: increase LimitNOFILE and OS limits.
Prometheus remote_write context deadline exceeded: increase max_shards or maxConcurrentInserts on vminsert, verify network latency.
Query error “the number of unique timeseries exceeds …”: raise -search.maxUniqueTimeseries or narrow the PromQL range.
Cluster inconsistency: ensure -replicationFactor matches between vminsert and vmselect.
Disk not freeing after retention change: remember that VM deletes whole month directories only after they expire.
Debugging Steps
# View logs
sudo journalctl -u victoriametrics -f --no-pager
# Enable detailed logging (temporarily)
# Add -loggerLevel=DEBUG to the service ExecStart
# Enable slow‑query logging
# Add -search.logSlowQueryDuration=5s
# Check internal TSDB status
curl -s http://localhost:8428/api/v1/status/tsdb | python3 -m json.tool
# Monitor memory usage
curl -s http://localhost:8428/metrics | grep process_resident_memory_bytes
Monitoring VM Itself
Key metrics to export to Prometheus:
vm_rows_inserted_total – write rate.
vm_http_requests_total – query rate.
vm_cache_entries – active series.
vm_data_size_bytes – disk usage.
vm_merge_need_free_disk_space – merge blocked by low space.
vm_http_request_duration_seconds (p99) – query latency.
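These metrics are exposed on each component's /metrics endpoint, so a single scrape job can cover them. A sketch (the target IPs are placeholders):

```yaml
scrape_configs:
  - job_name: "victoriametrics"
    static_configs:
      - targets:
          - "10.0.1.11:8428"   # single-node
          - "10.0.1.21:8480"   # vminsert
          - "10.0.1.22:8481"   # vmselect
          - "10.0.1.31:8482"   # vmstorage
```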
Backup and Restore
Backup Procedure
# Create a snapshot
SNAPSHOT=$(curl -s http://localhost:8428/snapshot/create | python3 -c "import sys,json;print(json.load(sys.stdin)['snapshot'])")
# Rsync the snapshot directory
rsync -a /data/victoriametrics/snapshots/${SNAPSHOT}/ /backup/victoriametrics/${SNAPSHOT}/
# Delete the snapshot to free space
curl -s http://localhost:8428/snapshot/delete?snapshot=${SNAPSHOT}
# Optionally prune old backups (e.g., keep 7 days)
find /backup/victoriametrics -maxdepth 1 -type d -mtime +7 -exec rm -rf {} \;
Restore Procedure
# Stop services
sudo systemctl stop victoriametrics # or vmstorage/vminsert/vmselect
# Backup current data directory
sudo mv /data/victoriametrics /data/victoriametrics.bak.$(date +%Y%m%d)
# Restore from backup
sudo mkdir -p /data/victoriametrics
sudo rsync -a /backup/victoriametrics/20250115_020000/ /data/victoriametrics/
# Fix permissions
sudo chown -R victoriametrics:victoriametrics /data/victoriametrics
# Start services
sudo systemctl start victoriametrics # or individual components
# Verify data
sleep 10
curl -s 'http://localhost:8428/api/v1/query?query=count({__name__=~".+"})' | python3 -m json.tool
Conclusion
VictoriaMetrics provides a drop‑in, high‑performance remote storage for Prometheus, dramatically reducing storage costs and query latency while simplifying operations. Key takeaways include configuring proper retention, memory limits, deduplication, and scaling components according to load. For large‑scale or multi‑tenant environments, the three‑component cluster offers horizontal scalability and data redundancy.
MaGe Linux Operations