
Production-Ready Prometheus Alerting: 50+ Core Metrics & Best Practices

This guide details production‑grade Prometheus alerting configurations, covering applicable scenarios, prerequisites, anti‑patterns, environment matrices, step‑by‑step deployment of Node Exporter, Prometheus and Alertmanager, comprehensive rule files, performance testing, troubleshooting, best practices, and ready‑to‑use scripts for backup and health checks.


Applicable Scenarios & Prerequisites

The solution targets production microservice monitoring, cloud-native observability, and infrastructure health checks. Required OS: RHEL/CentOS 7.9+ or Ubuntu 20.04+ with kernel 4.18+. Software: Prometheus 2.40+, Alertmanager 0.25+, Node Exporter 1.5+. Minimum resources: 4 CPU cores, 8 GB RAM, and a 100 GB SSD (15-day retention). Ports 9090, 9093, and 9100 must be open. Readers should be comfortable with PromQL, YAML, microservice architecture, and basic monitoring theory.

Anti‑Pattern Warnings

Very small environments (<10 servers) – use Zabbix/Nagios.

Require auto‑remediation – combine with Ansible/Kubernetes Operator.

Long‑term data storage – integrate Thanos/Cortex/VictoriaMetrics.

Log‑centric analysis – use ELK/Loki.

Windows‑centric servers – Node Exporter does not run on Windows; use windows_exporter (formerly WMI Exporter).

Sub‑second latency requirements – Prometheus scrape intervals are typically 5–15 s; it is not designed for sub‑second collection.

Alternative Solution Comparison

APM monitoring – Jaeger/SkyWalking (stronger distributed tracing).

Log aggregation – ELK/Loki (full‑text search and log analysis).

Long‑term storage – Thanos/VictoriaMetrics (multi‑year retention).

Traditional infrastructure – Zabbix (mature SNMP/IPMI support).

Environment & Version Matrix

OS: RHEL 8.7+ / CentOS Stream 9 or Ubuntu 22.04 LTS (tested).

Kernel: 4.18.0‑425+ (RHEL) / 5.15.0‑60+ (Ubuntu).

Prometheus: 2.40.7 (LTS) / 2.48.0 (latest).

Alertmanager: 0.25.0 / 0.26.0.

Node Exporter: 1.5.0 / 1.6.1.

Recommended hardware: 8 CPU cores, 16 GB RAM, 200 GB SSD.

Quick Checklist

Preparation

Verify Prometheus version: prometheus --version

Back up existing rule files: cp /etc/prometheus/rules/*.yml /backup/

Validate Alertmanager config: amtool check-config /etc/alertmanager/alertmanager.yml

Implementation

Deploy Node Exporter and enable the systemd service.

Edit prometheus.yml to add scrape targets.

Place alert rule files under /etc/prometheus/rules/.

Configure Alertmanager notification channels (Slack, Email, PagerDuty).

Hot‑reload Prometheus: curl -X POST http://localhost:9090/-/reload.

Verification

Check rule syntax: promtool check rules /etc/prometheus/rules/*.yml.

Validate targets: curl -s http://localhost:9090/api/v1/targets.

Test alert firing via UI or amtool alert add.
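For the firing test, a synthetic alert can be injected straight into Alertmanager from the CLI; a minimal sketch, with placeholder labels and a 5-minute auto-expiry:

# Inject a test alert (labels are placeholders; it expires via --end)
amtool alert add --alertmanager.url=http://localhost:9093 \
  alertname=PipelineSmokeTest severity=warning instance=test:9100 \
  --annotation=summary="alerting pipeline smoke test" \
  --end="$(date -u -d '+5 minutes' +%Y-%m-%dT%H:%M:%SZ)"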

Optimization

Adjust thresholds, add for durations to avoid flapping.

Define silences and inhibit rules to prevent alert storms.

Implementation Steps

Architecture & Data Flow

Monitored targets (servers / containers)
    ↓ expose metrics (HTTP /metrics endpoint)
Node Exporter / application exporters
    ↓ scraped periodically (default 15s)
Prometheus Server (time-series database)
    ↓ rule evaluation (evaluation_interval: 15s)
Alert rule engine (PromQL-based)
    ↓ alert fires when a condition holds
Alertmanager (alert aggregation & routing)
    ↓ grouping / inhibition / silencing
Notification channels (Slack/Email/PagerDuty/Webhook)

Step 1 – Deploy Node Exporter

RHEL/CentOS:

# Download and install Node Exporter
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# Create the service account referenced by the unit file
sudo useradd --no-create-home --shell /bin/false prometheus 2>/dev/null || true

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

Ubuntu/Debian (preferred):

# Install via package manager
sudo apt update
sudo apt install -y prometheus-node-exporter
# Service starts automatically
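Whichever route is used, a quick check confirms the exporter is up and serving metrics (note the packaged Ubuntu version may lag the upstream release):

# Confirm the exporter responds on its default port
systemctl is-active node_exporter
curl -s http://localhost:9100/metrics | grep -m 3 '^node_'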

Step 2 – Configure Prometheus Scrape Targets

# Backup existing config
sudo cp /etc/prometheus/prometheus.yml /etc/prometheus/prometheus.yml.bak.$(date +%Y%m%d_%H%M%S)

# Edit /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'

rule_files:
  - '/etc/prometheus/rules/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          env: 'production'

  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - '192.168.1.10:9100'
          - '192.168.1.11:9100'
          - '192.168.1.20:9100'
        labels:
          env: 'production'
          role: 'backend'

  - job_name: 'my_application'
    static_configs:
      - targets:
          - '192.168.1.30:8080'
        labels:
          env: 'production'
          app: 'api_server'

Validate configuration:

promtool check config /etc/prometheus/prometheus.yml
curl -X POST http://localhost:9090/-/reload
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job, health}'

Step 3 – Deploy Alert Rule Files

Infrastructure alerts (/etc/prometheus/rules/infrastructure.yml)

# /etc/prometheus/rules/infrastructure.yml
groups:
- name: infrastructure_alerts
  interval: 15s
  rules:
  - alert: NodeDown
    expr: up{job="node_exporter"} == 0
    for: 1m
    labels:
      severity: critical
      category: infrastructure
    annotations:
      summary: "Node {{ $labels.instance }} down"
      description: "Node has been offline for over 1 minute."
      runbook_url: "https://wiki.example.com/runbook/node-down"

  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
      category: infrastructure
    annotations:
      summary: "Node {{ $labels.instance }} high CPU usage"
      description: "CPU usage > 80% for 5 minutes."

  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.*"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.*"}) * 100 < 15
    for: 5m
    labels:
      severity: critical
      category: infrastructure
    annotations:
      summary: "Node {{ $labels.instance }} low disk space"
      description: "Available space < 15% on {{ $labels.mountpoint }}."

Application alerts (/etc/prometheus/rules/applications.yml)

# /etc/prometheus/rules/applications.yml
groups:
- name: application_alerts
  interval: 15s
  rules:
  - alert: HighHTTP5xxRate
    expr: (sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, instance)) / sum(rate(http_requests_total[5m])) by (job, instance) > 0.05
    for: 5m
    labels:
      severity: critical
      category: application
    annotations:
      summary: "{{ $labels.job }} high 5xx error rate"
      description: "5xx errors > 5% for 5 minutes."

  - alert: HighHTTPLatency
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)) > 2
    for: 10m
    labels:
      severity: warning
      category: application
    annotations:
      summary: "{{ $labels.job }} high latency"
      description: "P95 latency > 2s for 10 minutes."

Step 4 – Configure Alertmanager

# /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'your_password'
  smtp_require_tls: true

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  receiver: 'default-email'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      group_wait: 10s
      repeat_interval: 5m
      continue: true
    - match:
        severity: warning
      receiver: 'warning-alerts'
      group_wait: 30s
      repeat_interval: 1h
    - match:
        severity: info
      receiver: 'info-alerts'
      group_wait: 5m
      repeat_interval: 24h
    - match_re:
        category: '(database|cache)'
      receiver: 'dba-team'
    - match:
        env: 'test'
      receiver: 'null'

inhibit_rules:
  - source_match:
      alertname: 'NodeDown'
    target_match_re:
      alertname: '.*'
    equal: ['instance']
  - source_match:
      alertname: 'HighCPUUsage'
    target_match:
      alertname: 'TooManyProcesses'
    equal: ['instance']

receivers:
  - name: 'default-email'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: '[Prometheus] {{ .GroupLabels.alertname }}'
  - name: 'critical-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-critical'
        title: '🔴 P0 Alert'
        text: "*Alert:* {{ .GroupLabels.alertname }}
*Cluster:* {{ .GroupLabels.cluster }}
*Summary:* {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
        send_resolved: true
    pagerduty_configs:
      - service_key: 'your_pagerduty_service_key'
        description: '{{ .GroupLabels.alertname }}'
  - name: 'warning-alerts'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#alerts-warning'
        title: '🟠 P1 Alert'
        text: "*Alert:* {{ .GroupLabels.alertname }}"
        send_resolved: true
  - name: 'info-alerts'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: '[Info] {{ .GroupLabels.alertname }}'
  - name: 'dba-team'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: '[Database] {{ .GroupLabels.alertname }}'
  - name: 'null'
    # Blackhole for silenced test alerts
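The configuration and routing tree can be exercised offline before a restart; the label set below is an example that should reach both the critical and DBA routes:

# Validate syntax, then preview which receivers a label set would hit
amtool check-config /etc/alertmanager/alertmanager.yml
amtool config routes test --config.file=/etc/alertmanager/alertmanager.yml \
  severity=critical category=database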

PromQL Core Mechanics

Prometheus stores each metric as a time series identified by its metric name plus label set. A selector such as up{job="node_exporter"} returns all matching series. The query engine then applies range vectors, functions (e.g., rate()), aggregations (e.g., sum by(instance)), and comparisons to produce a result vector.

Alert evaluation follows the pipeline: PromQL query → vector result → for duration check → state transition (Inactive → Pending → Firing) → send to Alertmanager.
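Both stages can be inspected over the HTTP API; a short sketch using the rules deployed earlier:

# Evaluate an instant query (PromQL sent as form data)
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=up{job="node_exporter"} == 0' | jq '.data.result'

# List alerting rules with their current state (inactive/pending/firing)
curl -s http://localhost:9090/api/v1/rules |
  jq '.data.groups[].rules[] | select(.type=="alerting") | {name, state}'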

Alertmanager Processing Flow

Received alerts are grouped by group_by keys, optionally suppressed by inhibit_rules, silenced, routed according to routes, de‑duplicated, and finally dispatched to configured receivers (Slack, Email, PagerDuty, etc.). Grouping reduces alert storms; inhibition prevents redundant notifications (e.g., suppressing all other alerts when a node is down).
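Silences can also be managed from the CLI; a sketch for a 2-hour maintenance window on one host (the matcher value is a placeholder):

# Silence one instance for planned maintenance
amtool silence add --alertmanager.url=http://localhost:9093 \
  --author=ops --comment="kernel patching" --duration=2h \
  instance=192.168.1.10:9100

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093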

Observability Metrics

TSDB storage size: prometheus_tsdb_storage_blocks_bytes

Active series: prometheus_tsdb_head_series

Samples per second: rate(prometheus_tsdb_head_samples_appended_total[5m])

Rule evaluation latency (P99):

histogram_quantile(0.99, rate(prometheus_rule_evaluation_duration_seconds_bucket[5m]))

Alertmanager notification success rate:

1 - (rate(alertmanager_notifications_failed_total{integration="slack"}[5m]) / rate(alertmanager_notifications_total{integration="slack"}[5m]))

Performance Benchmarks

Scenario 1 – Scrape 100 targets exposing 1,000 metrics each. Expected CPU < 50% and memory < 4 GB on a 4-core, 8 GB node. Ingest rate ≈ 6,667 samples/s (100 × 1,000 / 15 s).

Scenario 2 – Complex query latency:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))

should return in < 200 ms.
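Both figures can be verified against a live server; a quick sketch (thresholds as stated above):

# Current ingest rate in samples/s
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total[5m])' |
  jq -r '.data.result[0].value[1]'

# Wall-clock latency of the P95 query
time curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))' \
  > /dev/null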

Common Failures & Debugging

Alert not firing – check rule loading (curl http://localhost:9090/api/v1/rules) and the PromQL result.

Alert not delivered – verify Alertmanager is running, routes are correct, and webhook/SMTP connectivity.

Metrics not scraped – ensure target is reachable, firewall open, and scrape config correct.

Disk exhaustion – monitor node_filesystem_avail_bytes, shorten --storage.tsdb.retention.time, or prune high‑cardinality labels.

PromQL timeout – reduce time range, avoid high‑cardinality aggregations, or create recording rules.

Rule evaluation failures – guard against division‑by‑zero by filtering the denominator with > 0 (see the sketch after this list), and verify the metrics exist.

Alert fatigue – increase for durations, raise thresholds, use grouping and inhibition.
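For the division-by-zero guard, filtering the denominator with > 0 drops idle series from the ratio instead of producing NaN or +Inf; a sketch using the 5xx expression from Step 3:

# Safe-division variant of the 5xx error-rate expression
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (job, instance)
    / (sum(rate(http_requests_total[5m])) by (job, instance) > 0)
  > 0.05'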

Best Practices

Always use a for clause to avoid flapping.

Add runbook_url annotations for on‑call guidance.

Avoid high‑cardinality labels (e.g., IP, UUID); move such data to logs (ELK/Loki).

Pre‑compute expensive queries with recording rules and reference them in alerts (see the sketch after this list).

Configure inhibition rules (e.g., suppress all node‑level alerts when NodeDown fires).

Run regular fire‑drill simulations (node crash, CPU spike, disk full) to validate alerts.

Use Alertmanager silences for maintenance windows.

For large deployments, consider federation or Thanos/Cortex for scalability and long‑term storage.
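A minimal recording-rule sketch for the P95 latency used by HighHTTPLatency (the file and rule names are illustrative):

# Pre-compute the expensive histogram_quantile once per evaluation interval
cat > /etc/prometheus/rules/recording.yml <<'EOF'
groups:
- name: latency_recording
  interval: 15s
  rules:
  - record: job:http_request_duration_seconds:p95_5m
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
EOF
promtool check rules /etc/prometheus/rules/recording.yml
curl -X POST http://localhost:9090/-/reload

The alert expression then reduces to job:http_request_duration_seconds:p95_5m > 2, which is cheap to evaluate every 15 s.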

FAQ Highlights

Prometheus vs Zabbix/Nagios – Prometheus is pull‑based with PromQL and a cloud‑native focus; Zabbix/Nagios are agent/SNMP‑centric and geared to traditional infrastructure.

Why use recording rules? – Improves performance, reduces CPU load, simplifies alerts, essential for federation.

How to mitigate alert fatigue? – Raise thresholds, use for, group alerts, set inhibition, regular review.

Data retention recommendations – 7‑15 days for real‑time ops, 30‑90 days for capacity planning, 1‑3 years for compliance (use Thanos/VictoriaMetrics).

Kubernetes monitoring – Deploy node‑exporter, kube‑state‑metrics, cAdvisor, configure scrape jobs with kubernetes_sd_configs.

PromQL timeouts – Reduce range, avoid high‑cardinality aggregations, use recording rules.

High availability – Deploy multiple Prometheus instances with federation or use Thanos/Cortex for HA and long‑term storage.
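A hypothetical federation job on the global instance might look like the following, assuming scrape_configs is the last section of its prometheus.yml (target addresses and match[] selectors are placeholders):

# Pull selected series from two production instances via /federate
cat >> /etc/prometheus/prometheus.yml <<'EOF'

  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node_exporter"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prom-a.example.com:9090'
          - 'prom-b.example.com:9090'
EOF
promtool check config /etc/prometheus/prometheus.yml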

Ready‑to‑Use Scripts

One‑click deployment (RHEL/CentOS 8+)

#!/bin/bash
set -e
# Variables
PROMETHEUS_VERSION="2.48.0"
ALERTMANAGER_VERSION="0.26.0"
NODE_EXPORTER_VERSION="1.6.1"
PROMETHEUS_USER="prometheus"
INSTALL_DIR="/opt/prometheus"
DATA_DIR="/var/lib/prometheus"
CONFIG_DIR="/etc/prometheus"

# Preconditions
if ! grep -qE 'CentOS|Red Hat' /etc/os-release; then
  echo "Error: Only RHEL/CentOS supported"
  exit 1
fi
if [[ $EUID -ne 0 ]]; then
  echo "Error: Run as root"
  exit 1
fi

# Create user
id -u $PROMETHEUS_USER &>/dev/null || useradd --no-create-home --shell /bin/false $PROMETHEUS_USER

# Install Prometheus
cd /tmp
wget -q https://github.com/prometheus/prometheus/releases/download/v$PROMETHEUS_VERSION/prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
tar xzf prometheus-$PROMETHEUS_VERSION.linux-amd64.tar.gz
mkdir -p $INSTALL_DIR $DATA_DIR $CONFIG_DIR/rules
cp prometheus-$PROMETHEUS_VERSION.linux-amd64/prometheus $INSTALL_DIR/
cp prometheus-$PROMETHEUS_VERSION.linux-amd64/promtool $INSTALL_DIR/
cp -r prometheus-$PROMETHEUS_VERSION.linux-amd64/consoles $CONFIG_DIR/
cp -r prometheus-$PROMETHEUS_VERSION.linux-amd64/console_libraries $CONFIG_DIR/
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER $INSTALL_DIR $DATA_DIR $CONFIG_DIR

# Prometheus config
cat > $CONFIG_DIR/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - '/etc/prometheus/rules/*.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
EOF
chown $PROMETHEUS_USER:$PROMETHEUS_USER $CONFIG_DIR/prometheus.yml

# Systemd service for Prometheus (unquoted heredoc so the variables expand)
cat > /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
Type=simple
ExecStart=$INSTALL_DIR/prometheus \
  --config.file=$CONFIG_DIR/prometheus.yml \
  --storage.tsdb.path=$DATA_DIR \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --web.console.templates=$CONFIG_DIR/consoles \
  --web.console.libraries=$CONFIG_DIR/console_libraries \
  --web.enable-lifecycle \
  --web.enable-admin-api
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Install Alertmanager
wget -q https://github.com/prometheus/alertmanager/releases/download/v$ALERTMANAGER_VERSION/alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
tar xzf alertmanager-$ALERTMANAGER_VERSION.linux-amd64.tar.gz
mkdir -p /etc/alertmanager /var/lib/alertmanager
cp alertmanager-$ALERTMANAGER_VERSION.linux-amd64/alertmanager $INSTALL_DIR/
cp alertmanager-$ALERTMANAGER_VERSION.linux-amd64/amtool $INSTALL_DIR/
cat > /etc/alertmanager/alertmanager.yml <<'EOF'
global:
  resolve_timeout: 5m

route:
  receiver: 'default-email'
  group_by: ['alertname', 'cluster']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 12h

receivers:
  - name: 'default-email'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: '[email protected]'
        auth_password: 'your_password'
EOF
chown -R $PROMETHEUS_USER:$PROMETHEUS_USER /etc/alertmanager /var/lib/alertmanager
# Unquoted heredoc so $PROMETHEUS_USER and $INSTALL_DIR expand in the unit file
cat > /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
Type=simple
ExecStart=$INSTALL_DIR/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager \
  --web.listen-address=:9093
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Install Node Exporter
wget -q https://github.com/prometheus/node_exporter/releases/download/v$NODE_EXPORTER_VERSION/node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
tar xzf node_exporter-$NODE_EXPORTER_VERSION.linux-amd64.tar.gz
cp node_exporter-$NODE_EXPORTER_VERSION.linux-amd64/node_exporter $INSTALL_DIR/
# Unquoted heredoc so $PROMETHEUS_USER and $INSTALL_DIR expand in the unit file
cat > /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=$PROMETHEUS_USER
Group=$PROMETHEUS_USER
Type=simple
ExecStart=$INSTALL_DIR/node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.tcpstat
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

# Start services
systemctl daemon-reload
systemctl enable --now prometheus alertmanager node_exporter
sleep 5

# Verify
systemctl is-active prometheus && echo "✓ Prometheus running" || echo "✗ Prometheus failed"
systemctl is-active alertmanager && echo "✓ Alertmanager running" || echo "✗ Alertmanager failed"
systemctl is-active node_exporter && echo "✓ Node Exporter running" || echo "✗ Node Exporter failed"

echo "Installation complete. Access Prometheus at http://$(hostname -I | awk '{print $1}'):9090"

Health‑check script (check_prometheus_health.sh)

#!/bin/bash
set -e
PROMETHEUS_URL="http://localhost:9090"
ALERTMANAGER_URL="http://localhost:9093"

echo "=== Prometheus Health Check ==="

# Service status
systemctl is-active prometheus && echo "✓ Prometheus running" || echo "✗ Prometheus not running"
systemctl is-active alertmanager && echo "✓ Alertmanager running" || echo "✗ Alertmanager not running"
systemctl is-active node_exporter && echo "✓ Node Exporter running" || echo "✗ Node Exporter not running"

# Port listening
ss -tuln | grep -q ':9090 ' && echo "✓ Port 9090 listening" || echo "✗ Port 9090 not listening"
ss -tuln | grep -q ':9093 ' && echo "✓ Port 9093 listening" || echo "✗ Port 9093 not listening"
ss -tuln | grep -q ':9100 ' && echo "✓ Port 9100 listening" || echo "✗ Port 9100 not listening"

# Target health
DOWN=$(curl -s $PROMETHEUS_URL/api/v1/targets | jq -r '.data.activeTargets[] | select(.health != "up") | "\(.job)/\(.instance)"')
if [ -z "$DOWN" ]; then
  echo "✓ All targets up"
else
  echo "✗ Unreachable targets:"
  echo "$DOWN"
fi

# Rule errors
ERR=$(curl -s $PROMETHEUS_URL/api/v1/rules | jq -r '.data.groups[].rules[] | select(.lastError != null and .lastError != "") | "\(.name): \(.lastError)"')
if [ -z "$ERR" ]; then
  echo "✓ No rule errors"
else
  echo "✗ Rule errors:"
  echo "$ERR"
fi

# Disk usage
USAGE=$(df -h /var/lib/prometheus | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$USAGE" -lt 80 ]; then
  echo "✓ Disk usage $USAGE% (OK)"
else
  echo "⚠ Disk usage $USAGE% (High)"
fi

# TSDB series count
SERIES=$(curl -s $PROMETHEUS_URL/api/v1/status/tsdb | jq -r '.data.numSeries')
echo "Active series: $SERIES"
if [ "${SERIES:-0}" -lt 100000 ]; then
  echo "✓ Series count normal"
else
  echo "⚠ High series count – possible high‑cardinality labels"
fi

# Active alerts
ALERTS=$(curl -s $ALERTMANAGER_URL/api/v2/alerts | jq -r '.[].labels.alertname')
if [ -z "$ALERTS" ]; then
  echo "✓ No active alerts"
else
  echo "⚠ Active alerts:"
  echo "$ALERTS"
fi

echo "=== Check complete ==="

High‑cardinality detection script (detect_high_cardinality.sh)

#!/bin/bash
PROMETHEUS_URL="http://localhost:9090"
THRESHOLD=1000
echo "=== High Cardinality Detection (threshold: $THRESHOLD series) ==="
METRICS=$(curl -s $PROMETHEUS_URL/api/v1/label/__name__/values | jq -r '.data[]')
for metric in $METRICS; do
  COUNT=$(curl -s "$PROMETHEUS_URL/api/v1/series?match[]=$metric" | jq '.data | length')
  if [ "$COUNT" -gt "$THRESHOLD" ]; then
    echo "⚠ Metric $metric has $COUNT series"
    echo "  Labels by frequency:"
    curl -s "$PROMETHEUS_URL/api/v1/series?match[]=$metric" |
      jq -r '.data[] | keys[]' |
      sort | uniq -c | sort -rn | head -n 5 |
      while read cnt label; do
        echo "    $label: present in $cnt series"
      done
  fi
done

Further Reading

Prometheus official docs: https://prometheus.io/docs/introduction/overview/

Alertmanager docs: https://prometheus.io/docs/alerting/latest/alertmanager/

PromQL basics: https://prometheus.io/docs/prometheus/latest/querying/basics/

High‑availability Prometheus: https://prometheus.io/docs/prometheus/latest/federation/

Thanos long‑term storage: https://thanos.io/

Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
