
End‑to‑End Prometheus Monitoring: Deployment, Tuning, HA & Troubleshooting

This guide walks through the complete Prometheus monitoring lifecycle—from binary, Docker, and Kubernetes deployments to Ansible‑driven node_exporter rollout, SNMP switch and router monitoring, alert routing via WeChat, SMS and email, production‑grade tuning, high‑availability designs, and systematic troubleshooting.

AI Agent Super App

What is Prometheus

Prometheus is an open-source monitoring system and time-series database originally created at SoundCloud in 2012 and now a graduated CNCF project. It differs from traditional push-oriented systems such as Zabbix by pulling (scraping) metrics from its targets over HTTP. Its other distinguishing features are a multi-dimensional, label-based data model, the PromQL query language, automatic service discovery (for example Kubernetes and Consul), and a rich ecosystem of exporters.

1. Binary Installation (basic method)

Suitable for small or test environments. The steps are:

# Download the latest version (example v2.53.0)
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
# Extract
tar xzf prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64
# Create data and config directories
mkdir -p /data/prometheus/data /etc/prometheus
cp prometheus.yml /etc/prometheus/
cp prometheus promtool /usr/local/bin/

Configuration (/etc/prometheus/prometheus.yml) defines the global scrape and evaluation intervals, Alertmanager targets, rule files, and scrape jobs for Prometheus itself and node_exporter.

global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
rule_files:
  - "/etc/prometheus/rules/*.yml"
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]

A systemd unit (/etc/systemd/system/prometheus.service) runs Prometheus with the storage path created above and a retention of 30 days, and enables hot reloading via --web.enable-lifecycle, so that a configuration change can be applied with curl -X POST localhost:9090/-/reload.
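
For scripted reloads, the curl call above can be mirrored in Python. This is a minimal sketch using only the standard library; the function names and the default base URL are illustrative assumptions, and the endpoint only responds when Prometheus was started with --web.enable-lifecycle.

```python
import urllib.request


def build_reload_request(base_url: str = "http://localhost:9090") -> urllib.request.Request:
    """Build the POST request for Prometheus's /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # An explicit method is needed: urllib defaults to GET when no body is given.
    return urllib.request.Request(url, method="POST")


def reload_prometheus(base_url: str = "http://localhost:9090", timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for non-2xx answers, for example
    when the new configuration is rejected.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Running promtool check config before calling reload_prometheus() keeps a broken file from being submitted in the first place.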

2. Docker Deployment (recommended for < 50 servers)

Run a single container:

mkdir -p /data/prometheus/{data,config,rules}

docker run -d \
  --name prometheus \
  --restart always \
  -p 9090:9090 \
  -v /data/prometheus/config:/etc/prometheus \
  -v /data/prometheus/data:/prometheus \
  -v /data/prometheus/rules:/etc/prometheus/rules \
  prom/prometheus:v2.53.0 \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=30d \
  --web.enable-lifecycle

For a full stack, a docker‑compose.yml defines Prometheus, Grafana and Alertmanager services with shared volumes and environment variables.

3. Kubernetes Deployment (production‑grade)

The Prometheus Operator (installed via the kube‑prometheus repo) manages CRDs for Prometheus, Alertmanager, ServiceMonitors and PrometheusRules, providing declarative configuration.

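# Clone the repo, create the namespace and CRDs, then apply the remaining manifests
git clone https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
kubectl apply --server-side -f manifests/setup/
# Wait until the CRDs are registered before creating objects that use them
kubectl wait --for condition=Established --all CustomResourceDefinition --namespace=monitoring
kubectl apply -f manifests/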

A custom ServiceMonitor selects services by labels, e.g.:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  selector:
    matchLabels:
      app: myapp
  namespaceSelector:
    matchNames: ["default"]
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

Alerting rules are stored in PrometheusRule objects, for example a node‑down alert:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  namespace: monitoring
spec:
  groups:
  - name: node
    rules:
    - alert: NodeDown
      expr: up{job="node"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Node {{ $labels.instance }} is down"
    - alert: HighCPU
      expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "CPU usage > 85%"
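
To make the HighCPU expression concrete, the arithmetic it performs per instance can be reproduced in a few lines of Python. This is only an illustration: the per-core idle fractions are made-up values, the function name is hypothetical, and the for: 5m hold time is not modelled.

```python
from statistics import mean


def cpu_usage_percent(idle_rates: list[float]) -> float:
    """Mirror of: 100 - avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100.

    idle_rates holds one value per core: the fraction of each second the
    core spent idle over the rate window (0.0 to 1.0).
    """
    return 100 - mean(idle_rates) * 100


# Hypothetical 4-core host whose cores are idle 5-15% of the time.
usage = cpu_usage_percent([0.05, 0.10, 0.10, 0.15])
fires = usage > 85  # same threshold as the HighCPU rule
print(f"usage={usage:.1f}% fires={fires}")
```

Because the idle rate is averaged across cores, a single saturated core on a large host will not trip this rule on its own.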

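Persistent storage is required for production use; the source applies a patch that adds a 50 Gi volume claim on an nfs-storage StorageClass. Note that the Prometheus documentation lists NFS as unsupported for local TSDB storage, so a block-storage class is the safer choice where one is available.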

4. High‑Availability Options

Dual Prometheus + Load Balancer : Deploy two identical Prometheus instances behind HAProxy or Nginx; Alertmanager deduplicates alerts.

Thanos (large‑scale) : Run a sidecar next to each Prometheus, ship data to object storage (S3/OSS/MinIO), and query globally via Thanos Querier. Example sidecar command:

thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/objstore.yml

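VictoriaMetrics (lightweight) : A single-binary time-series database that accepts Prometheus remote write and answers PromQL-compatible queries, often chosen for its lower resource usage; a separate cluster version provides horizontal scaling. Remote-write from Prometheus: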

remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"

5. Batch Installation of node_exporter with Ansible

Define an inventory (/etc/ansible/hosts) grouping web and DB servers, then a playbook (install_node_exporter.yml) that downloads and extracts the release, copies the binary, creates a systemd unit, and starts the service.

# /etc/ansible/hosts
[webservers]
web01 ansible_host=192.168.1.10
web02 ansible_host=192.168.1.11
web03 ansible_host=192.168.1.12
[dbservers]
db01 ansible_host=192.168.1.20
db02 ansible_host=192.168.1.21
[all:vars]
ansible_user=root
ansible_ssh_private_key_file=~/.ssh/id_rsa
# install_node_exporter.yml (excerpt)
- name: Install Node Exporter
  hosts: all
  become: yes
  vars:
    node_exporter_version: "1.8.1"
  tasks:
    - name: Download node_exporter
      get_url:
        url: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /tmp/node_exporter.tar.gz
    - name: Extract node_exporter
      unarchive:
        src: /tmp/node_exporter.tar.gz
        dest: /tmp/
        remote_src: yes
    - name: Copy binary
      copy:
        src: /tmp/node_exporter-{{ node_exporter_version }}.linux-amd64/node_exporter
        dest: /usr/local/bin/node_exporter
        mode: "0755"
        remote_src: yes  # the extracted file is on the managed host, not the control node
    - name: Create systemd service
      copy:
        content: |
          [Unit]
          Description=Node Exporter
          After=network.target
          [Service]
          ExecStart=/usr/local/bin/node_exporter
          Restart=on-failure
          [Install]
          WantedBy=multi-user.target
        dest: /etc/systemd/system/node_exporter.service
    - name: Start service
      systemd:
        name: node_exporter
        state: started
        enabled: yes
        daemon_reload: yes

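After the playbook runs, generate file_sd target files (JSON) for the new hosts. Prometheus re-reads file_sd files automatically when they change, so a hot reload is only needed when prometheus.yml itself changes, for example when the file_sd_configs job is first added: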

# Example target file for webservers
[
  {
    "targets": ["192.168.1.10:9100", "192.168.1.11:9100"],
    "labels": {"job": "node", "group": "webservers"}
  }
]
# Reload Prometheus configuration
curl -X POST http://localhost:9090/-/reload
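
The target files can be generated rather than written by hand. The sketch below hard-codes the two groups from the example inventory and writes one JSON file per group; the function names are hypothetical, and the output directory is whatever path the file_sd_configs job points at (an assumption the caller supplies).

```python
import json
from pathlib import Path

# Inventory groups copied from the example /etc/ansible/hosts above.
INVENTORY = {
    "webservers": ["192.168.1.10", "192.168.1.11", "192.168.1.12"],
    "dbservers": ["192.168.1.20", "192.168.1.21"],
}


def build_file_sd(hosts: list[str], group: str, port: int = 9100) -> list[dict]:
    """Return one file_sd target group for the given hosts."""
    return [{
        "targets": [f"{host}:{port}" for host in hosts],
        "labels": {"job": "node", "group": group},
    }]


def write_file_sd(out_dir: str) -> list[Path]:
    """Write <group>.json for every inventory group and return the paths."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for group, hosts in INVENTORY.items():
        path = directory / f"{group}.json"
        # Write to a temp file and rename it, so Prometheus never reads
        # a half-written target file.
        tmp = directory / f"{group}.json.tmp"
        tmp.write_text(json.dumps(build_file_sd(hosts, group), indent=2))
        tmp.replace(path)
        written.append(path)
    return written
```

Calling write_file_sd() with the directory referenced by the scrape job produces webservers.json and dbservers.json in the same shape as the hand-written example.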

6. Monitoring Network Devices via SNMP

Switches and firewalls usually cannot run an exporter themselves and typically expose their metrics through SNMP. Deploy snmp_exporter and generate its snmp.yml with the generator tool.

# Example generator.yml fragment
modules:
  if_mib:
    walk:
      - sysUpTime
      - interfaces
      - ip
    lookups:
      - source_indexes: [ifIndex]
        lookup: ifAlias
      - source_indexes: [ifIndex]
        lookup: ifDescr

Run the generator in a container to produce snmp.yml, then start the exporter:

docker run -d \
  --name snmp_exporter \
  -p 9116:9116 \
  -v "$(pwd)/snmp.yml":/etc/snmp_exporter/snmp.yml \
  prom/snmp-exporter:v0.26.0

Add a scrape job for switches:

scrape_configs:
  - job_name: "switch"
    metrics_path: /snmp
    params:
      module: [if_mib]
    static_configs:
      - targets: ["192.168.1.1", "192.168.1.2", "192.168.1.3"]
    relabel_configs:
      # Pass the device address to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the device address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself instead of the device
      - target_label: __address__
        replacement: snmp_exporter:9116
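
The relabeling makes Prometheus contact the exporter rather than the device, with the device address passed as the target query parameter of the exporter's /snmp endpoint. As a rough aid for manual testing, the following sketch builds the equivalent URL for each switch; the helper name is hypothetical, and the exporter address is taken from the relabel rule above.

```python
from urllib.parse import urlencode


def snmp_probe_url(device: str, module: str = "if_mib",
                   exporter: str = "snmp_exporter:9116") -> str:
    """Build the exporter URL that corresponds to one relabeled scrape."""
    query = urlencode({"module": module, "target": device})
    return f"http://{exporter}/snmp?{query}"


for device in ["192.168.1.1", "192.168.1.2", "192.168.1.3"]:
    print(snmp_probe_url(device))
```

Fetching one of these URLs with curl from the Prometheus host shows quickly whether a failure lies in the exporter, the module, or the device's SNMP settings.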

Router monitoring follows the same pattern; for Linux‑based OpenWrt routers you can install node_exporter directly.

7. Alert Routing

WeChat : Create a group robot in a WeCom (enterprise WeChat) group chat, copy its webhook URL, and configure an Alertmanager receiver whose webhook_configs entry forwards alerts to a small Python relay.

# alertmanager.yml (excerpt)
receivers:
  - name: "wechat"
    webhook_configs:
      - url: "http://wechat-relay:8060/wechat"
        send_resolved: true
# Minimal Flask relay (excerpt)
from flask import Flask, request
import requests

app = Flask(__name__)
WECHAT_WEBHOOK = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"

@app.route("/wechat", methods=["POST"])
def wechat_alert():
    data = request.get_json(silent=True) or {}
    for alert in data.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        # One markdown line per field, joined with newlines
        content = "\n".join([
            f"**Alert**: {labels.get('alertname', 'unknown')}",
            f"**Level**: {labels.get('severity', 'unknown')}",
            f"**Instance**: {labels.get('instance', '')}",
            f"**Desc**: {annotations.get('summary', '')}",
            f"**Time**: {alert.get('startsAt', '')}",
        ])
        msg = {"msgtype": "markdown", "markdown": {"content": content}}
        requests.post(WECHAT_WEBHOOK, json=msg, timeout=5)
    return "ok"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8060)

SMS (critical alerts) : Use a cloud SMS SDK (e.g., Alibaba Cloud) inside a webhook service to send short messages. The original article's Python snippet builds a SendSmsRequest containing the alert name and instance.
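
The provider-specific call is not reproduced here, but the surrounding logic can be sketched independently of any SDK. In the example below, send_sms is a hypothetical callable standing in for whatever the chosen SMS SDK exposes, and the 70-character cap is an arbitrary assumption; only the payload fields (alerts, status, labels) come from Alertmanager's webhook format.

```python
from typing import Callable


def sms_texts(payload: dict, max_len: int = 70) -> list[str]:
    """Build one short message per firing critical alert in a webhook payload."""
    texts = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") != "firing" or labels.get("severity") != "critical":
            continue  # SMS is reserved for critical alerts that are still firing
        text = (f"[CRITICAL] {labels.get('alertname', 'unknown')} "
                f"on {labels.get('instance', 'unknown')}")
        texts.append(text[:max_len])
    return texts


def dispatch(payload: dict, send_sms: Callable[[str], None]) -> int:
    """Send every message through the injected sender and return the count sent."""
    texts = sms_texts(payload)
    for text in texts:
        send_sms(text)
    return len(texts)
```

Injecting the sender keeps the filtering testable without credentials; in a real relay, send_sms would wrap the SDK request mentioned above.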

Email (record-keeping) : Configure Alertmanager's native email_configs with SMTP credentials, subject templating, and an HTML template that lists alerts in a table.

8. Production Tuning

Scrape interval tiering : Critical services every 10 s, normal services every 30 s, low‑priority every 60 s.

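Storage flags : Retention of 30 d (--storage.tsdb.retention.time=30d), an optional size limit (--storage.tsdb.retention.size), WAL compression (--storage.tsdb.wal-compression, enabled by default in recent releases), and block duration tuning (--storage.tsdb.min-block-duration and --storage.tsdb.max-block-duration).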

Query optimisation : Avoid wide‑range queries, use recording rules (e.g., job:node_cpu_usage:avg_rate5m), limit series with topk() or limit, drop unnecessary labels via metric_relabel_configs.

Resource limits : Systemd MemoryMax=4G, CPUQuota=200%; Kubernetes requests and limits (2‑4 CPU, 4‑8 Gi memory for a typical instance).

9. Common Troubleshooting

Prometheus fails to start : Run promtool check config, verify data directory ownership, inspect journalctl -u prometheus.

Target unreachable : Check the Targets page in the UI, test with curl http://target:9100/metrics, verify firewall rules and exporter process.

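Disk exhaustion : Inspect usage with du -sh /data/prometheus/data/*, then either shorten retention or delete unneeded series through the TSDB admin API (delete_series followed by clean_tombstones, which requires --web.enable-admin-api).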

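High memory usage : Query /api/v1/status/runtimeinfo for runtime details and /api/v1/status/tsdb for head series counts; look for a high series count (high cardinality), frequent large queries, or excessive labels, and mitigate with recording rules or by dropping labels and metrics through relabeling.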

Alert not firing or duplicate : Validate rules with promtool check rules, examine current alerts via /api/v1/alerts, ensure for durations are appropriate and Alertmanager routing matches.
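
Two of the checks above, unreachable targets and high cardinality, can be scripted against the Prometheus HTTP API. The following is a minimal sketch using the /api/v1/targets and /api/v1/status/tsdb endpoints; the function names are hypothetical and the server address passed to fetch is an assumption.

```python
import json
import urllib.request


def down_targets(targets_payload: dict) -> list[tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target that is not healthy."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return [(t.get("scrapeUrl", ""), t.get("lastError", ""))
            for t in active if t.get("health") != "up"]


def top_series(tsdb_payload: dict, n: int = 5) -> list[tuple[str, int]]:
    """Return the n metric names with the most series from /api/v1/status/tsdb."""
    rows = tsdb_payload.get("data", {}).get("seriesCountByMetricName", [])
    ranked = sorted(rows, key=lambda r: r["value"], reverse=True)
    return [(r["name"], int(r["value"])) for r in ranked[:n]]


def fetch(base_url: str, path: str, timeout: float = 5.0) -> dict:
    """GET a Prometheus HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
        return json.load(resp)
```

For example, down_targets(fetch("http://localhost:9090", "/api/v1/targets")) lists failing scrapes with their last error, and top_series(fetch("http://localhost:9090", "/api/v1/status/tsdb")) points at the metrics that dominate the series count.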

10. Sizing Guidance

Resource recommendations based on monitored host count and time‑series volume:

Small (< 50 hosts, < 1 M series): 2 CPU, 4 GiB RAM, 50 GiB SSD.

Medium (50‑500 hosts, 1‑5 M series): 4 CPU, 8 GiB RAM, 200 GiB SSD.

Large (500‑2000 hosts, 5‑20 M series): 8 CPU, 16 GiB RAM, 500 GiB SSD.

Very large (> 2000 hosts): Deploy Thanos or VictoriaMetrics in a multi‑node topology (each node 8 CPU/16 GiB/500 GiB).

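Rough storage formula: disk usage ≈ number of series × samples per series over the retention period × bytes per sample, with 3-5 bytes per sample after compression as a conservative budget (the Prometheus documentation cites roughly 1-2 bytes per sample on average). For 100 hosts × 1000 series each at a 15 s scrape interval, 30 days of retention at 3 bytes per sample comes to about 52 GB (roughly 48 GiB), so a 70 GiB disk provides a safety margin.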
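
The example calculation can be checked with a short script. This sketch assumes the low end of 3 bytes is applied to each stored sample, which is the reading that reproduces the figure quoted above; the function and parameter names are illustrative.

```python
def estimate_storage_bytes(hosts: int, series_per_host: int, scrape_interval_s: int,
                           retention_days: int, bytes_per_sample: float) -> float:
    """Estimate TSDB size as samples kept over the retention period times bytes per sample."""
    series = hosts * series_per_host
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return series * samples_per_series * bytes_per_sample


size = estimate_storage_bytes(hosts=100, series_per_host=1000, scrape_interval_s=15,
                              retention_days=30, bytes_per_sample=3)
print(f"{size / 1e9:.1f} GB, {size / 2**30:.1f} GiB")  # prints "51.8 GB, 48.3 GiB"
```

Doubling the scrape interval halves the estimate, which is why interval tiering is usually the first lever to pull when disk is tight.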

Conclusion

Prometheus can cover monitoring from a single server to a massive Kubernetes fleet. Binary installs are quick for pilots, Docker Compose fits mid‑size setups, the Operator delivers production‑grade management, and Thanos/VictoriaMetrics enable scale‑out HA. Alerting should use WeChat for routine notifications, SMS for critical incidents, and email for archival. Proper tuning—scrape intervals, storage flags, recording rules, and resource limits—keeps the system performant, while the troubleshooting checklist helps resolve common issues efficiently.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
