Eliminate Monitoring Blind Spots: Hands‑On Enterprise‑Grade Prometheus + Grafana Deployment

This comprehensive guide walks you through the end‑to‑end setup of a production‑grade Prometheus and Grafana monitoring stack, covering architecture choices, installation steps, configuration details, high‑availability designs, performance tuning, security hardening, troubleshooting, backup strategies, and best‑practice recommendations.

Raymond Ops

Overview

In a production environment with several hundred machines, the team moved from manual checks to a Prometheus + Grafana stack in 2019. After more than five years, the system monitors host, container, middleware and business metrics and ingests over 20 million samples per day.

Prometheus uses a pull model: every scrape records an up value for its target, so a target that disappears shows up as a failed scrape within one scrape interval. Its built‑in TSDB can write millions of samples per second and serve queries in milliseconds. Grafana provides rich visualisation and, together with Alertmanager, completes the monitoring pipeline.

Key Features

Pull model + service discovery : Prometheus actively scrapes targets and integrates with Consul, Kubernetes and file‑based discovery. Over 400 micro‑service instances are registered automatically via Kubernetes.

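PromQL : Vector operations, aggregation and prediction functions. For example, the expression

predict_linear(node_filesystem_avail_bytes[6h], 24*3600) < 0

extrapolates the last six hours of free-space samples and matches every filesystem that is predicted to run out of space within 24 hours, which gives a 24‑hour disk‑space warning.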

Local TSDB + remote storage : Data is stored locally by default. For larger volumes Thanos or VictoriaMetrics can be attached for long‑term storage. The production setup keeps 15 days of hot data locally and syncs older data to S3 via a Thanos Sidecar.
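
To make the PromQL item above concrete, the arithmetic behind the < 0 check can be sketched in a few lines of shell. This is a simplified two-point extrapolation with made-up numbers; predict_linear itself fits a least-squares regression over all samples in the window, and the predict_two_point helper is hypothetical.

```shell
# predict_two_point START_GB END_GB WINDOW_HOURS HORIZON_HOURS
# Extrapolates free space linearly from two readings (illustration only;
# predict_linear uses a least-squares fit over every sample in the range).
predict_two_point() {
  awk -v s="$1" -v e="$2" -v w="$3" -v h="$4" \
    'BEGIN { slope = (e - s) / w; printf "%.1f\n", e + slope * h }'
}

predict_two_point 60 54 6 24   # 60 GB -> 54 GB in 6 h: prints 30.0, no alert
predict_two_point 20 14 6 24   # 20 GB -> 14 GB in 6 h: prints -10.0, alert fires
```

Both disks lose 1 GB per hour, but only the second one crosses zero inside the 24-hour horizon, which is exactly the case the expression selects.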

Environment Requirements

OS: CentOS 7+ / Ubuntu 20.04+ (Ubuntu 22.04 LTS recommended for cgroup v2 support)

CPU / RAM: 4 CPU 8 GB RAM (8 CPU 16 GB for >1000 targets)

Storage: SSD, at least 100 GB for TSDB data

Prometheus: 2.45+ (LTS) or 2.53+

Grafana: 10.0+ (10.2+ recommended)

Node Exporter: 1.7+ (versions < 1.6 have memory leaks on ARM)

Installation and Configuration

Preparation

Check OS version, CPU, memory and reserve >100 GB for TSDB.

Ensure NTP is enabled; time drift >1 min corrupts data.

Create a non‑login prometheus user and the required directories (/etc/prometheus, /etc/prometheus/rules, /etc/prometheus/file_sd, /var/lib/prometheus).

Open firewall ports: 9090/tcp (Prometheus), 3000/tcp (Grafana), 9100/tcp (Node Exporter).
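
The preparation steps above could be scripted roughly as follows. This is a minimal sketch, assuming Ubuntu with ufw and systemd time sync; the prepare_dirs and prepare_host helpers are hypothetical names, and the optional root-prefix argument exists only so the directory layout can be tried in a scratch directory first.

```shell
# prepare_dirs [ROOT_PREFIX]: create the directory layout used in this guide.
# With no argument the directories are created under the real root.
prepare_dirs() {
  local root="${1:-}"
  mkdir -p "$root/etc/prometheus/file_sd" \
           "$root/etc/prometheus/rules" \
           "$root/var/lib/prometheus"
}

# prepare_host: run as root on the target machine (defined here, not invoked).
prepare_host() {
  useradd --system --no-create-home --shell /usr/sbin/nologin prometheus
  prepare_dirs
  chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
  timedatectl set-ntp true                      # enable NTP time sync
  for port in 9090 3000 9100; do                # assumes ufw is the firewall
    ufw allow "${port}/tcp"
  done
}
```

Calling prepare_dirs /tmp/prom-test creates the same tree under a scratch prefix, which is a cheap way to check the layout before running prepare_host with root privileges.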

Prometheus Installation (binary)

# Download Prometheus 2.53.0
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz

# Extract and install binaries
tar xzf prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

# Copy console templates
sudo cp -r consoles /etc/prometheus/
sudo cp -r console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries

# Verify installation
prometheus --version

Main Configuration (prometheus.yml)

global:
  scrape_interval: 15s   # Balanced load and freshness for 200‑800 targets
  evaluation_interval: 15s
  scrape_timeout: 10s
  external_labels:
    cluster: 'prod-bj'
    environment: 'production'

rule_files:
  - "/etc/prometheus/rules/*.yml"

alerting:
  alertmanagers:
    - timeout: 10s   # set on the Alertmanager entry; it is not a static_configs field
      static_configs:
        - targets: ['127.0.0.1:9093']

scrape_configs:
  # Prometheus self‑monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          instance: 'prometheus-server'

  # Node Exporter – host metrics
  - job_name: 'node-exporter'
    file_sd_configs:
      - files: ['/etc/prometheus/file_sd/nodes.yml']
        refresh_interval: 30s   # a file_sd option, so it belongs inside the list item
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.+):([0-9]+)'
        target_label: hostname
        replacement: '${1}'

  # cAdvisor – container metrics
  - job_name: 'cadvisor'
    file_sd_configs:
      - files: ['/etc/prometheus/file_sd/cadvisor.yml']
        refresh_interval: 30s   # a file_sd option, so it belongs inside the list item

  # Application custom metrics (Spring Boot actuator example)
  - job_name: 'app-metrics'
    metrics_path: '/actuator/prometheus'
    file_sd_configs:
      - files: ['/etc/prometheus/file_sd/apps.yml']
        refresh_interval: 30s   # a file_sd option, so it belongs inside the list item
    relabel_configs:
      - source_labels: [__meta_filepath]
        regex: '.*/(.+)\.yml'
        target_label: source_file

Why 15 s? Tests showed that 10 s caused noticeable CPU increase when >500 targets were scraped, while 30 s missed short‑lived spikes. Fifteen seconds is the best trade‑off for medium‑scale deployments.
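
The CPU trade-off follows from simple arithmetic: the ingestion rate is roughly targets × series per target ÷ scrape interval. A rough sketch, assuming 500 targets that each expose 1,000 series (the per-target figure is an illustrative assumption, not a measurement from this deployment, and samples_per_second is a hypothetical helper):

```shell
# samples_per_second TARGETS SERIES_PER_TARGET INTERVAL_SECONDS
# Integer estimate of the ingestion rate for a given scrape interval.
samples_per_second() {
  echo $(( $1 * $2 / $3 ))
}

for interval in 10 15 30; do
  echo "interval=${interval}s -> $(samples_per_second 500 1000 "$interval") samples/s"
done
```

Under these assumptions, going from 15 s to 10 s raises the load by half (about 33,000 to 50,000 samples per second), while 30 s halves it at the cost of missing spikes shorter than the interval.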

File‑Based Service Discovery

# Example node list (file_sd/nodes.yml)
- targets:
    - '10.0.1.10:9100'
    - '10.0.1.11:9100'
    - '10.0.1.12:9100'
    - '10.0.1.13:9100'
    - '10.0.1.14:9100'
  labels:
    env: production
    dc: beijing
    role: app-server

- targets:
    - '10.0.2.10:9100'
    - '10.0.2.11:9100'
    - '10.0.2.12:9100'
  labels:
    env: production
    dc: beijing
    role: db-server

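Prometheus watches these files and re‑reads them whenever they change (and at every refresh_interval as a fallback), so no reload is needed. The team runs a Bash script every five minutes that pulls the host list from a CMDB API and rewrites the files.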

Prometheus Systemd Service

# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus Monitoring System
Documentation=https://prometheus.io/docs/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h \
  --web.listen-address=0.0.0.0:9090 \
  --web.enable-lifecycle \
  --web.enable-admin-api \
  --query.max-concurrency=20 \
  --query.timeout=2m
Restart=always
RestartSec=5
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

--web.enable-lifecycle allows hot‑reloading via curl -X POST http://localhost:9090/-/reload. --web.enable-admin-api enables snapshot and delete operations and must be protected by firewall rules.

Node Exporter Installation

# Download and install Node Exporter 1.8.1
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar xzf node_exporter-1.8.1.linux-amd64.tar.gz
sudo cp node_exporter-1.8.1.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Systemd unit
cat <<'EOF' | sudo tee /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|host|etc)($$|/)" \
  --web.listen-address=:9100 \
  --web.telemetry-path=/metrics
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

The --collector.filesystem.mount-points-exclude flag prevents collection of virtual filesystems, dramatically reducing series count.

Grafana Installation (APT)

# Add Grafana APT repository
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

sudo apt update
sudo apt install -y grafana

# Minimal configuration (grafana.ini)
cat <<'GRAFANA_EOF' | sudo tee /etc/grafana/grafana.ini
[server]
http_port = 3000

[database]
type = sqlite3
path = grafana.db

[security]
admin_user = admin
admin_password = P@ssw0rd_Change_Me
allow_sign_up = false

[auth.anonymous]
enabled = false

[dashboards]
min_refresh_interval = 10s

# Legacy [alerting] is left out: Grafana 10 refuses to start when legacy and unified alerting are both enabled
[unified_alerting]
enabled = true
GRAFANA_EOF

sudo systemctl enable --now grafana-server

Grafana Data Source Provisioning (API)

# Add Prometheus as the default data source
# Credentials go through -u: the @ in the password would break user:pass@host URL syntax
curl -X POST -u 'admin:P@ssw0rd_Change_Me' http://localhost:3000/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Prometheus",
    "type": "prometheus",
    "url": "http://localhost:9090",
    "access": "proxy",
    "isDefault": true,
    "jsonData": {
      "timeInterval": "15s",
      "queryTimeout": "60s",
      "httpMethod": "POST"
    }
  }'

Using POST instead of GET avoids URI‑length limits for complex queries; the team once hit a 414 error when a dashboard sent a long query.

Real‑World Cases

CMDB‑Driven Target Sync

#!/bin/bash
set -euo pipefail
CMDB_API="http://cmdb.internal:8080/api/v1/hosts"
CMDB_TOKEN="your-cmdb-api-token"
OUTPUT_DIR="/etc/prometheus/file_sd"
LOG_FILE="/var/log/prometheus/cmdb_sync.log"

log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"; }

response=$(curl -s -w "\n%{http_code}" -H "Authorization: Bearer $CMDB_TOKEN" "$CMDB_API?status=running&page_size=1000")
http_code=$(echo "$response" | tail -1)
body=$(echo "$response" | head -n -1)

if [[ "$http_code" != "200" ]]; then
  log "ERROR: CMDB API returned $http_code"
  exit 1
fi

# Stage files inside OUTPUT_DIR so the final mv is an atomic rename on the same filesystem
tmp_dir=$(mktemp -d "$OUTPUT_DIR/.sync.XXXXXX")
trap 'rm -rf "$tmp_dir"' EXIT   # removes only this run's staging directory

for role in app-server db-server cache-server gateway; do
  count=$(echo "$body" | jq -r --arg role "$role" '[.data[] | select(.role == $role)] | length')
  if (( count > 0 )); then
    # JSON is valid YAML, so the jq output can be written to a .yml file as-is
    echo "$body" | jq --arg role "$role" '[{targets: [.data[] | select(.role == $role) | .ip + ":9100"], labels: {env: "production", role: $role, dc: (.data[0].datacenter // "unknown")}}]' > "$tmp_dir/${role}.yml"
    mv "$tmp_dir/${role}.yml" "$OUTPUT_DIR/nodes_${role}.yml"
    log "INFO: Synced $role with $count targets"
  else
    log "WARN: No targets for $role, skipped"
  fi
done
log "INFO: CMDB sync completed"

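The script runs from cron (*/5 * * * *), so the file_sd files lag the CMDB inventory by at most about five minutes. Because it writes one nodes_<role>.yml file per role, the files entry of the node-exporter job has to be a glob such as /etc/prometheus/file_sd/nodes*.yml rather than the single nodes.yml shown earlier.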

Storage Capacity Planning Script

#!/bin/bash
PROM_URL="http://localhost:9090"

# Active series count
active_series=$(curl -s "$PROM_URL/api/v1/query?query=prometheus_tsdb_head_series" | jq -r '.data.result[0].value[1]')

echo "Active time series: $active_series"

# Samples per second (rate of appended samples)
# -G with --data-urlencode keeps curl from treating [5m] as a URL glob range
samples_per_sec=$(curl -s -G "$PROM_URL/api/v1/query" --data-urlencode 'query=sum(rate(prometheus_tsdb_head_samples_appended_total[5m]))' | jq -r '.data.result[0].value[1]' | xargs printf "%.0f")

echo "Samples per second: $samples_per_sec"

# Estimate daily storage (≈1.5 bytes per sample after compression)
bytes_per_sample=1.5
daily_bytes=$(echo "scale=2; $samples_per_sec*86400*$bytes_per_sample" | bc)
daily_gb=$(echo "scale=2; $daily_bytes/1024/1024/1024" | bc)

echo "Estimated daily data: $daily_gb GB"

for days in 7 15 30 90; do
  total=$(echo "scale=2; $daily_gb*$days" | bc)
  total_buf=$(echo "scale=2; $total*1.2" | bc)  # 20 % safety buffer
  echo "Retention $days days requires ≈ $total_buf GB (incl. 20 % buffer)"
done

This script is used by the ops team to answer capacity‑planning questions on demand.

Best Practices and Caveats

Performance Optimisation

Storage optimisation : Set --storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration to 2h when using Thanos. Keep local retention to 15 days; older data is queried via Thanos.

Recording Rules : Pre‑aggregate heavy queries. The team reduced a dashboard load time from 12 s to 0.8 s after adding rules for CPU, memory and disk utilisation.

Scrape interval tuning : Not every job needs 15 s. Infrastructure metrics can stay at 15 s, business‑level metrics at 10 s, and slow‑changing metrics (e.g., hardware info) at 60 s.

Label cardinality control : High‑cardinality labels (e.g., user_id) explode series count. An incident where a user_id label caused series to jump from 500 k to 8 M resulted in OOM.
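
As an illustration of the Recording Rules item above, the following sketch writes a small rules file for CPU, memory and disk utilisation. The rule names and expressions are illustrative assumptions built on standard Node Exporter metrics, not the team's actual rules, and write_recording_rules is a hypothetical helper.

```shell
# write_recording_rules [DIR]: writes an example rules file into DIR
# (default /etc/prometheus/rules, which matches rule_files in prometheus.yml).
write_recording_rules() {
  local dir="${1:-/etc/prometheus/rules}"
  cat > "$dir/node_recording_rules.yml" <<'EOF'
groups:
  - name: node_recording_rules
    rules:
      # Names below are illustrative, following the level:metric:operation convention
      - record: instance:node_cpu_utilisation:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      - record: instance:node_memory_utilisation:ratio
        expr: 1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
      - record: instance_mountpoint:node_filesystem_utilisation:ratio
        expr: 1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}
EOF
}
```

After writing the file, promtool check rules /etc/prometheus/rules/node_recording_rules.yml validates it and a reload activates it. Dashboards can then query instance:node_cpu_utilisation:ratio directly instead of re-evaluating the rate() over raw series on every refresh.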

Security Hardening

Basic Auth : Create /etc/prometheus/web.yml with bcrypt passwords and start Prometheus with --web.config.file=/etc/prometheus/web.yml.

basic_auth_users:
  admin: $2a$12$KmR3iR5eJx5Oj5Yl5FpNOuJGQwMOsKOqJ7Mcp7hVQ8sKqGzLkjS6

TLS encryption : Configure tls_server_config with server certificate, key and client‑CA for mutual TLS.

tls_server_config:
  cert_file: /etc/prometheus/ssl/prometheus.crt
  key_file: /etc/prometheus/ssl/prometheus.key
  client_auth_type: RequireAndVerifyClientCert
  client_ca_file: /etc/prometheus/ssl/ca.crt

Network isolation : Bind Prometheus to an internal IP (e.g., --web.listen-address=10.0.1.40:9090). Expose Grafana via an Nginx reverse proxy with IP whitelist and WAF.

Admin API protection : Enable --web.enable-admin-api only when needed and restrict access via firewall or proxy.
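
For the Basic Auth item above, the bcrypt hash has to be generated somewhere. One possible approach, assuming the htpasswd tool (apache2-utils on Ubuntu) is installed; make_web_yml is a hypothetical helper, and passing the password with -b exposes it in the process list, so treat this as a sketch rather than a hardened procedure.

```shell
# make_web_yml USER PASSWORD: print a basic_auth_users section on stdout.
make_web_yml() {
  local user="$1"
  local pass="$2"
  local line
  # -n: print to stdout, -b: password from argv, -B: bcrypt, -C 12: cost factor
  line=$(htpasswd -nbBC 12 "$user" "$pass") || return 1
  # htpasswd prints "user:hash"; split on the first colon
  printf 'basic_auth_users:\n  %s: %s\n' "${line%%:*}" "${line#*:}"
}
```

A typical use would be make_web_yml admin 'a-long-passphrase' > /etc/prometheus/web.yml, followed by promtool check web-config /etc/prometheus/web.yml before restarting Prometheus with --web.config.file.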

High Availability

Dual‑instance Prometheus : Run two identical Prometheus servers and two Alertmanager instances. Alertmanager deduplicates alerts.

Thanos sidecar : Deploy a sidecar next to each Prometheus, upload blocks to S3, and query globally via Thanos Query. The team has run this setup for three years across five clusters.

Backup strategy : Create snapshots through the admin API (POST /api/v1/admin/tsdb/snapshot, which requires --web.enable-admin-api), store them on a separate volume, and rotate old backups.

Configuration Pitfalls

Reducing --storage.tsdb.retention.time makes Prometheus delete blocks older than the new limit; confirm that the historical data is no longer needed before lowering it.

Modifying external_labels after data has been written breaks Thanos federation and deduplication.

Incorrect relabel_configs can unintentionally drop targets or overwrite labels. Always validate with promtool check config and reload via curl -X POST http://localhost:9090/-/reload.
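
The validate-then-reload habit from the last item can be wrapped in a small function so that a broken file never reaches the running server. A minimal sketch; safe_reload is a hypothetical name, and the defaults assume the paths and port used in this guide with no basic auth in front of the lifecycle endpoint.

```shell
# safe_reload [CONFIG] [PROM_URL]: reload only if promtool accepts the config.
safe_reload() {
  local config="${1:-/etc/prometheus/prometheus.yml}"
  local prom_url="${2:-http://localhost:9090}"
  if ! promtool check config "$config"; then
    echo "config check failed, not reloading" >&2
    return 1
  fi
  # -f makes curl exit non-zero on HTTP errors, e.g. when --web.enable-lifecycle is off
  curl -fsS -X POST "$prom_url/-/reload"
}
```

Running safe_reload after every edit to prometheus.yml or the rule files keeps the check and the reload in one step, which is easy to call from a deployment script.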

Self‑Monitoring

Key Metrics Queries

# Actual interval between scrapes (99th percentile); should stay close to scrape_interval
# -G with --data-urlencode stops curl from globbing the {...} selector and encodes the quotes
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=prometheus_target_interval_length_seconds{quantile="0.99"}' | jq .

# Query engine latency (99th percentile)
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=prometheus_engine_query_duration_seconds{quantile="0.99"}' | jq .

# WAL size
curl -s "http://localhost:9090/api/v1/query?query=prometheus_tsdb_wal_storage_size_bytes" | jq .

# Process memory usage
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}' | jq .

# Number of node-exporter targets that are down (count, not sum: every matching sample has value 0)
curl -s -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=count(up{job="node-exporter"} == 0)' | jq .

Self‑Monitoring Alert Rules (prometheus_self_rules.yml)

groups:
- name: prometheus_self_monitoring
  rules:
  - alert: PrometheusTargetDown
    expr: up{job="prometheus"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus instance {{ $labels.instance }} is down"

  - alert: PrometheusHighMemory
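    # node_memory_MemTotal_bytes comes from the node-exporter job, so its labels never match the prometheus job directly.
    # scalar() sidesteps label matching; 10.0.1.40:9100 is an assumed Node Exporter address for the Prometheus host.
    expr: process_resident_memory_bytes{job="prometheus"} / scalar(node_memory_MemTotal_bytes{instance="10.0.1.40:9100"}) * 100 > 80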
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus memory usage exceeds 80%"

  - alert: PrometheusHighQueryDuration
    expr: prometheus_engine_query_duration_seconds{quantile="0.99"} > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Prometheus P99 query latency > 10s"

  - alert: PrometheusTSDBCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[1h]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Prometheus TSDB compaction failed"

  - alert: PrometheusHighCardinality
    expr: prometheus_tsdb_head_series > 5000000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Time series count exceeds 5 M"

Troubleshooting

Common Issues and Fixes

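TSDB corruption after power loss : Prometheus attempts to repair a damaged WAL on startup, so check the startup log first. If it still fails to start, stop Prometheus, move the wal directory aside, create an empty wal directory and restart; samples that had not yet been compacted into a block are lost.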

OOM kills : Monitor prometheus_tsdb_head_series. When series exceed 5 M, investigate high‑cardinality metrics, drop unnecessary labels via metric_relabel_configs, or split the workload across multiple Prometheus instances.

Target shows "DOWN" but service is reachable : Verify firewall rules, ensure the exporter binds to 0.0.0.0, check scrape_timeout (increase if exporter is slow), and confirm the address discovered by service discovery is correct.

High cardinality performance degradation : Use the TSDB status API to list metrics with the most series, then either remove the high‑cardinality label at source or drop it with metric_relabel_configs. For already stored data, delete the series via the admin API and run clean_tombstones.

Scrape timeouts ("context deadline exceeded") : The exporter did not answer within scrape_timeout. Increase scrape_timeout (it must not exceed the job's scrape_interval) or optimise the exporter to respond faster.
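
For the high-cardinality item above, the TSDB status API can be queried from the shell. A minimal sketch, assuming curl and jq are available and that the endpoint returns its default top-ten list; top_series_metrics is a hypothetical helper.

```shell
# top_series_metrics [PROM_URL] [N]: print "series_count<TAB>metric_name",
# largest first, as reported by /api/v1/status/tsdb.
top_series_metrics() {
  local prom_url="${1:-http://localhost:9090}"
  local n="${2:-10}"
  curl -fsS "$prom_url/api/v1/status/tsdb" \
    | jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"' \
    | head -n "$n"
}
```

A metric that dominates this list is the first candidate for a metric_relabel_configs rule that drops the offending label.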

Backup and Restore

Backup Script (snapshot + tar)

#!/bin/bash
set -euo pipefail
PROM_URL="http://localhost:9090"
BACKUP_DIR="/data/backup/prometheus"
TSDB_PATH="/var/lib/prometheus"
KEEP_DAYS=7
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/var/log/prometheus/backup.log"

log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> "$LOG_FILE"; }

log "Creating TSDB snapshot"
resp=$(curl -s -X POST "$PROM_URL/api/v1/admin/tsdb/snapshot")
snap=$(echo "$resp" | jq -r '.data.name')
if [[ -z "$snap" || "$snap" == "null" ]]; then
  log "ERROR: Snapshot failed: $resp"
  exit 1
fi
log "Snapshot $snap created"

mkdir -p "$BACKUP_DIR"
tar czf "$BACKUP_DIR/prometheus_snapshot_${DATE}.tar.gz" -C "$TSDB_PATH/snapshots" "$snap"
backup_size=$(du -sh "$BACKUP_DIR/prometheus_snapshot_${DATE}.tar.gz" | awk '{print $1}')
log "Backup size: $backup_size"

# Clean up snapshot directory
rm -rf "$TSDB_PATH/snapshots/$snap"

# Delete backups older than KEEP_DAYS
find "$BACKUP_DIR" -name "prometheus_snapshot_*.tar.gz" -mtime +$KEEP_DAYS -delete
log "Backup completed"

Restore Procedure

Stop Prometheus: sudo systemctl stop prometheus

Back up the current data directory and extract the snapshot:

sudo mv /var/lib/prometheus /var/lib/prometheus.old
sudo mkdir -p /var/lib/prometheus
sudo tar xzf /data/backup/prometheus/prometheus_snapshot_20240101_030000.tar.gz -C /var/lib/prometheus --strip-components=1
sudo chown -R prometheus:prometheus /var/lib/prometheus

Validate the TSDB integrity: promtool tsdb list /var/lib/prometheus

Start Prometheus and verify data: sudo systemctl start prometheus, then query up or any custom metric.

Conclusion

Prometheus' pull model surfaces target failures within one scrape interval; a 15 s scrape interval balances freshness and CPU load.

Series cardinality is the primary scalability factor – keep label values low and avoid high‑cardinality identifiers.

Recording Rules dramatically improve dashboard performance; the team reduced average load time from 6 s to 1.2 s.

Grafana provisioning enables configuration‑as‑code for data sources and dashboards.

For small setups dual‑instance Prometheus + Alertmanager is sufficient; for larger clusters adopt Thanos or VictoriaMetrics for global query and long‑term storage.

Basic auth, TLS and API access control are mandatory for production deployments.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.