Eliminate Monitoring Blind Spots: Hands‑On Enterprise‑Grade Prometheus + Grafana Deployment
This comprehensive guide walks you through the end‑to‑end setup of a production‑grade Prometheus and Grafana monitoring stack, covering architecture choices, installation steps, configuration details, high‑availability designs, performance tuning, security hardening, troubleshooting, backup strategies, and best‑practice recommendations.
Overview
In a production environment with several hundred machines the team moved from manual checks to a Prometheus + Grafana stack in 2019. After more than five years the system monitors host, container, middleware and business metrics, ingesting over 20 million samples per day.
Prometheus uses a pull model, giving the monitoring side immediate visibility when a target disappears. Its built‑in TSDB can write millions of samples per second and serve queries in milliseconds. Grafana provides rich visualisation and, together with Alertmanager, completes the monitoring pipeline.
Key Features
Pull model + service discovery : Prometheus actively scrapes targets and integrates with Consul, Kubernetes and file‑based discovery. Over 400 micro‑service instances are registered automatically via Kubernetes.
PromQL : Vector operations, aggregation and prediction functions such as predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0 enable 24‑hour disk‑space warnings.
Local TSDB + remote storage : Data is stored locally by default. For larger volumes Thanos or VictoriaMetrics can be attached for long‑term storage. The production setup keeps 15 days of hot data locally and syncs older data to S3 via a Thanos Sidecar.
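The disk‑space prediction from the PromQL item above could be packaged as an alerting rule. The following is a minimal sketch rather than the team's actual rule: the file name disk_prediction.yml, the alert name, the fstype filter and the 30m hold time are all assumptions chosen for illustration.

```shell
#!/bin/bash
# Sketch: write a hypothetical alerting rule built on the predict_linear expression.
# RULES_DIR defaults to the current directory; in production it would be /etc/prometheus/rules.
RULES_DIR="${RULES_DIR:-.}"
cat > "$RULES_DIR/disk_prediction.yml" <<'EOF'
groups:
  - name: disk_prediction
    rules:
      - alert: DiskPredictedFullIn24h   # hypothetical alert name
        # Fit a line through the last 6h of samples and extrapolate 24h ahead
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0
        for: 30m                        # assumed hold time to ignore short bursts
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24h"
EOF
# Validate only if promtool is installed
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$RULES_DIR/disk_prediction.yml"
fi
```

A value below zero means the fitted line crosses "no space left" inside the 24‑hour window, which is why the comparison is against 0.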
Environment Requirements
OS: CentOS 7+ / Ubuntu 20.04+ (Ubuntu 22.04 LTS recommended for cgroup v2 support)
CPU / RAM: 4 CPU 8 GB RAM (8 CPU 16 GB for >1000 targets)
Storage: SSD, at least 100 GB for TSDB data
Prometheus: 2.45+ (LTS) or 2.53+
Grafana: 10.0+ (10.2+ recommended)
Node Exporter: 1.7+ (versions < 1.6 have memory leaks on ARM)
Installation and Configuration
Preparation
Check OS version, CPU, memory and reserve >100 GB for TSDB.
Ensure NTP is enabled; time drift >1 min corrupts data.
Create a non‑login prometheus user and required directories ( /etc/prometheus, /var/lib/prometheus, /etc/prometheus/file_sd).
Open firewall ports: 9090/tcp (Prometheus), 3000/tcp (Grafana), 9100/tcp (Node Exporter).
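The user and directory steps above can be sketched as two small functions. The function names and the optional prefix argument (useful for a dry run in a scratch directory) are illustrative assumptions; /etc/prometheus/rules is included because the main configuration loads rule files from it.

```shell
#!/bin/bash
# prepare_prometheus_dirs [PREFIX]: create the directories used in this guide.
# With no argument the real paths are created (requires root).
prepare_prometheus_dirs() {
  local root="${1:-}"
  local d
  for d in /etc/prometheus /etc/prometheus/rules /etc/prometheus/file_sd /var/lib/prometheus; do
    mkdir -p "${root}${d}" || return 1
  done
}

# create_prometheus_user: add a system account without home directory or login shell.
create_prometheus_user() {
  id prometheus >/dev/null 2>&1 || \
    useradd --system --no-create-home --shell /bin/false prometheus
}

# Real run (as root):
#   create_prometheus_user && prepare_prometheus_dirs \
#     && chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
```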
Prometheus Installation (binary)
# Download Prometheus 2.53.0
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
# Extract and install binaries
tar xzf prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
# Copy console templates
sudo cp -r consoles /etc/prometheus/
sudo cp -r console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries
# Verify installation
prometheus --version
Main Configuration (prometheus.yml)
global:
  scrape_interval: 15s # Balanced load and freshness for 200‑800 targets
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: 'prod-bj'
    environment: 'production'
rule_files:
  - "/etc/prometheus/rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093']
      timeout: 10s
scrape_configs:
  # Prometheus self‑monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'
  # Node Exporter – host metrics
  - job_name: 'node-exporter'
    file_sd_configs:
      - files: ['/etc/prometheus/file_sd/nodes.yml']
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):([0-9]+)'
        target_label: hostname
        replacement: '${1}'
  # cAdvisor – container metrics
  - job_name: 'cadvisor'
    file_sd_configs:
      - files: ['/etc/prometheus/file_sd/cadvisor.yml']
        refresh_interval: 30s
  # Application custom metrics (Spring Boot actuator example)
  - job_name: 'app-metrics'
    metrics_path: '/actuator/prometheus'
    file_sd_configs:
      - files: ['/etc/prometheus/file_sd/apps.yml']
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_filepath]
        regex: '.*/(.+)\.yml'
        target_label: source_file
Why 15 s? Tests showed that 10 s caused noticeable CPU increase when >500 targets were scraped, while 30 s missed short‑lived spikes. Fifteen seconds is the best trade‑off for medium‑scale deployments.
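The interval trade‑off can be made concrete with a rough estimate: the ingestion rate is approximately targets × series per target ÷ scrape interval. The sketch below assumes 1,000 series per target, which is an illustrative figure and not a measurement from this deployment.

```shell
#!/bin/bash
# ingest_rate TARGETS INTERVAL_SECONDS [SERIES_PER_TARGET]
# Rough samples-per-second estimate; the 1000 default is an assumption.
ingest_rate() {
  local targets="$1" interval="$2" series="${3:-1000}"
  awk -v t="$targets" -v i="$interval" -v s="$series" \
    'BEGIN { printf "%d\n", t * s / i }'
}

for interval in 10 15 30; do
  echo "500 targets @ ${interval}s: $(ingest_rate 500 "$interval") samples/s"
done
```

Under these assumptions, moving from 15 s to 10 s raises the sample rate by half, which is in line with the CPU increase the tests observed.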
File‑Based Service Discovery
# Example node list (file_sd/nodes.yml)
- targets:
    - '10.0.1.10:9100'
    - '10.0.1.11:9100'
    - '10.0.1.12:9100'
    - '10.0.1.13:9100'
    - '10.0.1.14:9100'
  labels:
    env: production
    dc: beijing
    role: app-server
- targets:
    - '10.0.2.10:9100'
    - '10.0.2.11:9100'
    - '10.0.2.12:9100'
  labels:
    env: production
    dc: beijing
    role: db-server
The file is refreshed automatically; the team runs a Bash script every five minutes that pulls the host list from a CMDB API and rewrites these files.
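Because Prometheus re‑reads these files whenever they change, a file that is still being written can be picked up mid‑update. A common precaution is to write to a temporary file in the same directory and rename it into place. The helper below is a sketch; write_sd_file is a hypothetical name, not part of Prometheus.

```shell
#!/bin/bash
# write_sd_file DIR NAME: read target definitions from stdin and publish them
# atomically, so Prometheus never observes a partially written file.
write_sd_file() {
  local dir="$1" name="$2" tmp
  tmp=$(mktemp "$dir/.${name}.XXXXXX") || return 1
  if cat > "$tmp" && [ -s "$tmp" ]; then
    chmod 644 "$tmp"
    mv -f "$tmp" "$dir/$name"   # rename within one filesystem is atomic
  else
    rm -f "$tmp"                # refuse to publish an empty file
    return 1
  fi
}

# Usage:
#   write_sd_file /etc/prometheus/file_sd nodes.yml < generated_nodes.yml
```

The temporary name starts with a dot and does not end in .yml, so it should not match a file_sd pattern such as nodes.yml while it is being written.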
Prometheus Systemd Service
# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus \
--storage.tsdb.retention.time=15d \
--storage.tsdb.retention.size=50GB \
--storage.tsdb.min-block-duration=2h \
--storage.tsdb.max-block-duration=2h \
--web.listen-address=0.0.0.0:9090 \
--web.enable-lifecycle \
--web.enable-admin-api \
--query.max-concurrency=20 \
--query.timeout=2m
Restart=always
RestartSec=5
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
--web.enable-lifecycle allows hot‑reloading via curl -X POST http://localhost:9090/-/reload. --web.enable-admin-api enables snapshot and delete operations and must be protected by firewall rules.
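Once the unit file is in place, the service still has to be loaded and started. The sketch below adds a small readiness check against Prometheus's /-/ready endpoint; the function name wait_for_prometheus and the retry count are assumptions.

```shell
#!/bin/bash
# wait_for_prometheus [URL] [TRIES]: poll the readiness endpoint once per second.
wait_for_prometheus() {
  local url="${1:-http://localhost:9090}" tries="${2:-30}" i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fs -o /dev/null "$url/-/ready"; then
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "Prometheus not ready after $tries attempts" >&2
  return 1
}

# Typical sequence after creating or editing the unit file:
#   sudo systemctl daemon-reload
#   sudo systemctl enable --now prometheus
#   wait_for_prometheus http://localhost:9090 60
```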
Node Exporter Installation
# Download and install Node Exporter 1.8.1
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Systemd unit
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes \
--collector.tcpstat \
--collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/)" \
--web.listen-address=:9100 \
--web.telemetry-path=/metrics
Restart=always
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
The --collector.filesystem.mount-points-exclude flag prevents collection of virtual filesystems, dramatically reducing series count.
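To see which mount points the exclusion pattern actually catches, it can be tried against a few sample paths. This sketch uses grep -E as a stand‑in for the Go regular‑expression engine that Node Exporter uses; the two agree for a pattern this simple. The sample paths are made up for illustration.

```shell
#!/bin/bash
# The pattern as node_exporter receives it. In the systemd unit "$$" is an
# escaped "$", so the effective regex ends in ($|/).
EXCLUDE_RE='^/(sys|proc|dev|host|etc)($|/)'

# is_excluded MOUNTPOINT: succeed when the mount point matches the pattern.
is_excluded() {
  printf '%s\n' "$1" | grep -Eq "$EXCLUDE_RE"
}

for mp in / /data /etc/hosts /proc/sys/fs /sys /etcd-data; do
  if is_excluded "$mp"; then
    echo "excluded: $mp"
  else
    echo "kept:     $mp"
  fi
done
```

The trailing ($|/) group is what keeps a data volume such as /etcd-data from being excluded along with /etc.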
Grafana Installation (APT)
# Add Grafana APT repository
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
sudo apt install -y grafana
# Minimal configuration (grafana.ini)
cat <<'GRAFANA_EOF' | sudo tee /etc/grafana/grafana.ini
[server]
http_port = 3000
[database]
type = sqlite3
path = grafana.db
[security]
admin_user = admin
admin_password = P@ssw0rd_Change_Me
[users]
allow_sign_up = false
[auth.anonymous]
enabled = false
[dashboards]
min_refresh_interval = 10s
# Only unified alerting is enabled; do not also enable the legacy [alerting] section
[unified_alerting]
enabled = true
GRAFANA_EOF
sudo systemctl enable --now grafana-server
Grafana Data Source Provisioning (API)
# Add Prometheus as the default data source
# Credentials go in -u: the "@" inside the password would make a user:pass@host URL ambiguous
curl -X POST http://localhost:3000/api/datasources \
-u 'admin:P@ssw0rd_Change_Me' \
-H 'Content-Type: application/json' \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://localhost:9090",
"access": "proxy",
"isDefault": true,
"jsonData": {
"timeInterval": "15s",
"queryTimeout": "60s",
"httpMethod": "POST"
}
}'
Setting httpMethod to POST makes Grafana send queries to Prometheus in the request body instead of the URL, which avoids URI‑length limits for complex queries; the team once hit a 414 error when a dashboard sent a long query.
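The same data source can also be declared as a provisioning file, which keeps it under version control. A sketch follows: Grafana reads YAML files from its provisioning/datasources directory at startup (typically /etc/grafana/provisioning/datasources on package installs). The PROVISIONING_DIR variable and its local default are assumptions made so the example can be tried without root.

```shell
#!/bin/bash
# Sketch: declare the Prometheus data source as a Grafana provisioning file.
# PROVISIONING_DIR defaults to a local directory for a dry run.
PROVISIONING_DIR="${PROVISIONING_DIR:-./provisioning/datasources}"
mkdir -p "$PROVISIONING_DIR"
cat > "$PROVISIONING_DIR/prometheus.yml" <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      queryTimeout: 60s
      httpMethod: POST
EOF
echo "Wrote $PROVISIONING_DIR/prometheus.yml; restart grafana-server to apply it"
```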
Real‑World Cases
CMDB‑Driven Target Sync
#!/bin/bash
set -euo pipefail
CMDB_API="http://cmdb.internal:8080/api/v1/hosts"
CMDB_TOKEN="your-cmdb-api-token"
OUTPUT_DIR="/etc/prometheus/file_sd"
LOG_FILE="/var/log/prometheus/cmdb_sync.log"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"; }
# Scratch directory inside OUTPUT_DIR, so the final mv is a rename on the same filesystem
tmp_dir=$(mktemp -d "$OUTPUT_DIR/.sync.XXXXXX")
trap 'rm -rf "$tmp_dir"' EXIT
response=$(curl -s -w "\n%{http_code}" -H "Authorization: Bearer $CMDB_TOKEN" "$CMDB_API?status=running&page_size=1000")
http_code=$(echo "$response" | tail -1)
body=$(echo "$response" | head -n -1)
if [[ "$http_code" != "200" ]]; then
  log "ERROR: CMDB API returned $http_code"
  exit 1
fi
for role in app-server db-server cache-server gateway; do
  # JSON is valid YAML, so the jq output can be stored in a .yml file
  echo "$body" | jq -r --arg role "$role" '[{targets: [.data[] | select(.role == $role) | .ip + ":9100"], labels: {env: "production", role: $role, dc: (.data[0].datacenter // "unknown")}}]' > "$tmp_dir/${role}.yml"
  count=$(echo "$body" | jq -r --arg role "$role" '[.data[] | select(.role == $role)] | length')
  if (( count > 0 )); then
    mv "$tmp_dir/${role}.yml" "$OUTPUT_DIR/nodes_${role}.yml"
    log "INFO: Synced $role with $count targets"
  else
    log "WARN: No targets for $role, skipped"
  fi
done
log "INFO: CMDB sync completed"
The script runs via cron (*/5 * * * *), so the file_sd files track the CMDB inventory within about five minutes. A role that returns no targets is skipped and keeps its previous file. Because the output files are named nodes_<role>.yml, the node-exporter job's file_sd_configs must use a matching pattern such as /etc/prometheus/file_sd/nodes_*.yml.
Storage Capacity Planning Script
#!/bin/bash
PROM_URL="http://localhost:9090"
# Active series count
active_series=$(curl -s "$PROM_URL/api/v1/query?query=prometheus_tsdb_head_series" | jq -r '.data.result[0].value[1]')
echo "Active time series: $active_series"
# Samples per second (rate of appended samples)
# -G with --data-urlencode stops curl from treating [5m] as a URL glob; sum() collapses any label split
samples_per_sec=$(curl -sG "$PROM_URL/api/v1/query" --data-urlencode 'query=sum(rate(prometheus_tsdb_head_samples_appended_total[5m]))' | jq -r '.data.result[0].value[1]' | xargs printf "%.0f")
echo "Samples per second: $samples_per_sec"
# Estimate daily storage (≈1.5 bytes per sample after compression)
bytes_per_sample=1.5
daily_bytes=$(echo "scale=2; $samples_per_sec*86400*$bytes_per_sample" | bc)
daily_gb=$(echo "scale=2; $daily_bytes/1024/1024/1024" | bc)
echo "Estimated daily data: $daily_gb GB"
for days in 7 15 30 90; do
  total=$(echo "scale=2; $daily_gb*$days" | bc)
  total_buf=$(echo "scale=2; $total*1.2" | bc) # 20 % safety buffer
  echo "Retention $days days requires ≈ $total_buf GB (incl. 20 % buffer)"
done
This script is used by the ops team to answer capacity‑planning questions on demand.
Best Practices and Caveats
Performance Optimisation
Storage optimisation : Set --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to 2h when using Thanos. Keep local retention to 15 days; older data is queried via Thanos.
Recording Rules : Pre‑aggregate heavy queries. The team reduced a dashboard load time from 12 s to 0.8 s after adding rules for CPU, memory and disk utilisation.
Scrape interval tuning : Not every job needs 15 s. Infrastructure metrics can stay at 15 s, business‑level metrics at 10 s, and slow‑changing metrics (e.g., hardware info) at 60 s.
Label cardinality control : High‑cardinality labels (e.g., user_id) explode series count. An incident where a user_id label caused series to jump from 500 k to 8 M resulted in OOM.
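As an illustration of the recording rules mentioned above, the sketch below pre‑computes CPU, memory and disk utilisation from standard Node Exporter metrics. The rule names and file name are hypothetical and are not the team's actual rules.

```shell
#!/bin/bash
# Sketch: recording rules for CPU, memory and disk utilisation.
# RULES_DIR defaults to the current directory; in production it would be /etc/prometheus/rules.
RULES_DIR="${RULES_DIR:-.}"
cat > "$RULES_DIR/node_recording_rules.yml" <<'EOF'
groups:
  - name: node_recording_rules
    rules:
      # Fraction of CPU time not spent idle, averaged over all cores per instance
      - record: instance:node_cpu_utilisation:rate5m
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Fraction of memory in use
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
      # Fraction of each filesystem in use (one series per mount point)
      - record: instance_mountpoint:node_filesystem_utilisation:ratio
        expr: 1 - node_filesystem_avail_bytes / node_filesystem_size_bytes
EOF
```

A dashboard panel can then query instance:node_cpu_utilisation:rate5m directly, so the rate() over every core is evaluated once per rule interval and not on every panel refresh.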
Security Hardening
Basic Auth : Create /etc/prometheus/web.yml with bcrypt passwords and start Prometheus with --web.config.file=/etc/prometheus/web.yml.
basic_auth_users:
  admin: $2a$12$KmR3iR5eJx5Oj5Yl5FpNOuJGQwMOsKOqJ7Mcp7hVQ8sKqGzLkjS6
TLS encryption : Configure tls_server_config with server certificate, key and client‑CA for mutual TLS.
tls_server_config:
  cert_file: /etc/prometheus/ssl/prometheus.crt
  key_file: /etc/prometheus/ssl/prometheus.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/ssl/ca.crt
Network isolation : Bind Prometheus to an internal IP (e.g., --web.listen-address=10.0.1.40:9090). Expose Grafana via an Nginx reverse proxy with IP whitelist and WAF.
Admin API protection : Enable --web.enable-admin-api only when needed and restrict access via firewall or proxy.
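The Nginx reverse proxy with an IP whitelist could look roughly like the following. This is a minimal sketch under stated assumptions: the host name and the 10.0.0.0/8 range are placeholders, and TLS termination and the WAF are left out.

```shell
#!/bin/bash
# Sketch: Nginx site that exposes Grafana only to an internal network.
# NGINX_CONF defaults to a local file; a real install would place it under /etc/nginx.
NGINX_CONF="${NGINX_CONF:-./grafana.conf}"
cat > "$NGINX_CONF" <<'EOF'
server {
    listen 80;
    server_name grafana.example.internal;   # hypothetical host name

    location / {
        allow 10.0.0.0/8;                   # assumed internal network
        deny all;                           # everyone else receives 403
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:3000;
    }
}
EOF
```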
High Availability
Dual‑instance Prometheus : Run two identical Prometheus servers and two Alertmanager instances. Alertmanager deduplicates alerts.
Thanos sidecar : Deploy a sidecar next to each Prometheus, upload blocks to S3, and query globally via Thanos Query. The team has run this setup for three years across five clusters.
Backup strategy : Create snapshots through the admin API (POST /api/v1/admin/tsdb/snapshot, which requires --web.enable-admin-api), store them on a separate volume, and rotate old backups.
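For the dual‑instance layout, each Prometheus replica usually carries one distinguishing external label so that Thanos Query can deduplicate on it, and both replicas send alerts to every Alertmanager. The generator below is a sketch: the label name replica, the Alertmanager addresses and the function name ha_fragment are assumptions.

```shell
#!/bin/bash
# ha_fragment REPLICA_NAME: print the prometheus.yml sections for one replica.
ha_fragment() {
  local replica="$1"
  cat <<EOF
global:
  external_labels:
    cluster: 'prod-bj'
    environment: 'production'
    replica: '${replica}'   # the only value that differs between replicas
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['10.0.1.41:9093', '10.0.1.42:9093']   # hypothetical addresses
EOF
}

ha_fragment a > prometheus-a.fragment.yml
ha_fragment b > prometheus-b.fragment.yml
# Thanos Query would then be started with --query.replica-label=replica
```

Keeping every other label identical matters: series from the two replicas are merged only when they differ in the replica label alone.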
Configuration Pitfalls
Reducing --storage.tsdb.retention.time makes Prometheus delete blocks older than the new window; confirm that the historical data is no longer needed before lowering it.
Modifying external_labels after data has been written breaks Thanos federation and deduplication.
Incorrect relabel_configs can unintentionally drop targets or overwrite labels. Always validate with promtool check config and reload via curl -X POST http://localhost:9090/-/reload.
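The validate‑then‑reload habit can be wrapped in one function so that a broken configuration never reaches the reload endpoint. The function name reload_prometheus is an assumption; the reload call requires --web.enable-lifecycle.

```shell
#!/bin/bash
# reload_prometheus [CONFIG] [URL]: reload only if promtool accepts the configuration.
reload_prometheus() {
  local cfg="${1:-/etc/prometheus/prometheus.yml}" url="${2:-http://localhost:9090}"
  if ! promtool check config "$cfg"; then
    echo "Config check failed; not reloading" >&2
    return 1
  fi
  curl -fsS -X POST "$url/-/reload"
}

# Usage:
#   reload_prometheus
#   reload_prometheus /etc/prometheus/prometheus.yml http://10.0.1.40:9090
```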
Self‑Monitoring
Key Metrics Queries
# Queries are passed with -G --data-urlencode so curl does not treat {...} as a URL glob
# Actual interval between scrapes (99th percentile); values well above scrape_interval indicate overload
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=prometheus_target_interval_length_seconds{quantile="0.99"}' | jq .
# Query engine latency (99th percentile)
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=prometheus_engine_query_duration_seconds{quantile="0.99"}' | jq .
# WAL size
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=prometheus_tsdb_wal_storage_size_bytes' | jq .
# Process memory usage
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' | jq .
# Number of node-exporter targets that are currently down
curl -sG "http://localhost:9090/api/v1/query" --data-urlencode 'query=count(up{job="node-exporter"} == 0)' | jq .
Self‑Monitoring Alert Rules (prometheus_self_rules.yml)
groups:
  - name: prometheus_self_monitoring
    rules:
      - alert: PrometheusTargetDown
        expr: up{job="prometheus"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus instance {{ $labels.instance }} is down"
      - alert: PrometheusHighMemory
        # The two metrics come from different jobs and share no labels, so they are matched explicitly.
        # The instance selector is a placeholder for the node exporter on the Prometheus host.
        expr: process_resident_memory_bytes{job="prometheus"} / on() group_left() node_memory_MemTotal_bytes{instance="10.0.1.40:9100"} * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus memory usage exceeds 80%"
      - alert: PrometheusHighQueryDuration
        expr: prometheus_engine_query_duration_seconds{quantile="0.99"} > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus P99 query latency > 10s"
      - alert: PrometheusTSDBCompactionsFailed
        expr: increase(prometheus_tsdb_compactions_failed_total[1h]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus TSDB compaction failed"
      - alert: PrometheusHighCardinality
        expr: prometheus_tsdb_head_series > 5000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Time series count exceeds 5 M"
Troubleshooting
Common Issues and Fixes
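TSDB corruption after power loss : Prometheus attempts to repair a damaged WAL automatically at startup, so check the startup logs first. If it still fails to start, stop Prometheus, move the wal directory aside, create an empty wal directory and restart; samples that had not yet been compacted into a block are lost.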
OOM kills : Monitor prometheus_tsdb_head_series. When series exceed 5 M, investigate high‑cardinality metrics, drop unnecessary labels via metric_relabel_configs, or split the workload across multiple Prometheus instances.
Target shows "DOWN" but service is reachable : Verify firewall rules, ensure the exporter binds to 0.0.0.0, check scrape_timeout (increase if exporter is slow), and confirm the address discovered by service discovery is correct.
High cardinality performance degradation : Use the TSDB status API to list metrics with the most series, then either remove the high‑cardinality label at source or drop it with metric_relabel_configs. For already stored data, delete the series via the admin API and run clean_tombstones.
Scrapes failing with "context deadline exceeded" : The exporter did not respond within scrape_timeout. Increase scrape_timeout (it cannot exceed scrape_interval) or optimise the exporter to respond faster.
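The cardinality workflow from the list above (inspect the TSDB status API, delete stored series, clean tombstones) maps onto three HTTP calls. The wrappers below are a sketch: the function names are assumptions, the delete and clean calls require --web.enable-admin-api, and the selector in the usage comment is a made‑up example.

```shell
#!/bin/bash
PROM_URL="${PROM_URL:-http://localhost:9090}"

# tsdb_status: raw TSDB statistics; seriesCountByMetricName lists the largest metrics.
tsdb_status() {
  curl -fsS "$PROM_URL/api/v1/status/tsdb"
}

# delete_series SELECTOR: mark all series matching the selector as deleted.
delete_series() {
  curl -fsS -g -X POST -G "$PROM_URL/api/v1/admin/tsdb/delete_series" \
    --data-urlencode "match[]=$1"
}

# clean_tombstones: remove deleted data from disk.
clean_tombstones() {
  curl -fsS -X POST "$PROM_URL/api/v1/admin/tsdb/clean_tombstones"
}

# Usage (destructive, selector is hypothetical):
#   tsdb_status | jq '.data.seriesCountByMetricName'
#   delete_series 'http_requests_total{user_id!=""}'
#   clean_tombstones
```

Deleting stored series does not stop new ones from arriving, so the label still has to be removed at the source or dropped with metric_relabel_configs first.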
Backup and Restore
Backup Script (snapshot + tar)
#!/bin/bash
set -euo pipefail
PROM_URL="http://localhost:9090"
BACKUP_DIR="/data/backup/prometheus"
TSDB_PATH="/var/lib/prometheus"
KEEP_DAYS=7
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/var/log/prometheus/backup.log"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"; }
log "Creating TSDB snapshot"
resp=$(curl -s -X POST "$PROM_URL/api/v1/admin/tsdb/snapshot")
snap=$(echo "$resp" | jq -r '.data.name')
if [[ -z "$snap" || "$snap" == "null" ]]; then
log "ERROR: Snapshot failed: $resp"
exit 1
fi
log "Snapshot $snap created"
mkdir -p "$BACKUP_DIR"
tar czf "$BACKUP_DIR/prometheus_snapshot_${DATE}.tar.gz" -C "$TSDB_PATH/snapshots" "$snap"
backup_size=$(du -sh "$BACKUP_DIR/prometheus_snapshot_${DATE}.tar.gz" | awk '{print $1}')
log "Backup size: $backup_size"
# Clean up snapshot directory
rm -rf "$TSDB_PATH/snapshots/$snap"
# Delete backups older than KEEP_DAYS
find "$BACKUP_DIR" -name "prometheus_snapshot_*.tar.gz" -mtime +$KEEP_DAYS -delete
log "Backup completed"
Restore Procedure
Stop Prometheus: sudo systemctl stop prometheus
Back up the current data directory and extract the snapshot:
sudo mv /var/lib/prometheus /var/lib/prometheus.old
sudo mkdir -p /var/lib/prometheus
sudo tar xzf /data/backup/prometheus/prometheus_snapshot_20240101_030000.tar.gz -C /var/lib/prometheus --strip-components=1
sudo chown -R prometheus:prometheus /var/lib/prometheus
Validate the TSDB integrity: promtool tsdb list /var/lib/prometheus
Start Prometheus and verify data: sudo systemctl start prometheus, then query up or any custom metric.
Conclusion
Prometheus' pull model provides immediate detection of target failures; a 15 s scrape interval balances freshness and CPU load.
Series cardinality is the primary scalability factor – keep label values low and avoid high‑cardinality identifiers.
Recording Rules dramatically improve dashboard performance; the team reduced average load time from 6 s to 1.2 s.
Grafana provisioning enables configuration‑as‑code for data sources and dashboards.
For small setups dual‑instance Prometheus + Alertmanager is sufficient; for larger clusters adopt Thanos or VictoriaMetrics for global query and long‑term storage.
Basic auth, TLS and API access control are mandatory for production deployments.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.