Choosing and Tuning Open‑Source Monitoring Stacks for Large‑Scale Operations

This article reviews common open-source monitoring tools, traces the evolution of monitoring on China Unicom's big-data platform, and gives practical guidance on selecting collectors, databases, and visualization components. It includes sample configurations for Prometheus and Alertmanager, Grafana tips, and techniques for automated recovery.

dbaplus Community

Apr 24, 2019

1. Common Monitoring Tool Combinations

Typical monitoring stacks include:

Nagios + Ganglia

Zabbix

Telegraf (or collectd) + InfluxDB (or Prometheus or Elasticsearch) + Grafana + Alertmanager

Nagios, Ganglia, and Zabbix are legacy tools; Grafana and Prometheus are newer, more flexible solutions. Each combination has its own strengths and weaknesses, and the best choice depends on the specific workload and scale.

2. Evolution of the China Unicom Big‑Data Platform Monitoring

Initially the platform combined Ganglia (for metric collection) with Nagios (for alerting). As data volume and service complexity grew, the duo became cumbersome: configuration was manual, historical data was missing, and scaling was limited.

In the middle stage, Zabbix was introduced, but its performance and the overhead of multi-dimensional monitoring proved problematic across thousands of nodes.

Eventually the team adopted a Prometheus‑Grafana‑Alertmanager (PGA) stack, which offered flexible data collection, powerful multi‑dimensional queries, and scalable alert routing, handling millions of metrics across a few thousand machines.

3. Component Selection and Configuration Tips

3.1 Collector Choice

Common collectors: collectd, Telegraf, jmxtrans. The author prefers Telegraf for its stability, its active community, and its lightweight Go-based agent.

3.2 Database Choice

InfluxDB is often used, but on large clusters the open-source version may hit write and read limits. Shortening the retention policy reduces the amount of data it has to keep and query, e.g.:

ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 72h REPLICATION 1 SHARD DURATION 24h DEFAULT

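Parameters explained (names as listed by SHOW RETENTION POLICIES):

duration : how long data is kept, set by the DURATION clause (0 = unlimited)

shardGroupDuration : the time span covered by each shard group, set by SHARD DURATION; this storage granularity affects query performance

replicaN : number of replicas, set by REPLICATION (has no effect on a single-node instance)

default : whether this policy is the database's default, set by the DEFAULT keyword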
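
To make the relationship between these fields concrete, here is a minimal Python sketch that assembles the same statement from its parts and estimates how many shard groups are needed to cover the retention window. The helper names `alter_retention_policy` and `shard_groups_in_window` are hypothetical, and the sketch only handles durations expressed in whole hours.

```python
import math


def alter_retention_policy(policy, database, duration_h, shard_h,
                           replication=1, default=True):
    """Build an InfluxQL 1.x ALTER RETENTION POLICY statement (hours only)."""
    stmt = (
        f'ALTER RETENTION POLICY "{policy}" ON "{database}" '
        f"DURATION {duration_h}h REPLICATION {replication} "
        f"SHARD DURATION {shard_h}h"
    )
    if default:
        stmt += " DEFAULT"
    return stmt


def shard_groups_in_window(duration_h, shard_h):
    """Number of shard groups needed to cover the retention window."""
    return math.ceil(duration_h / shard_h)


if __name__ == "__main__":
    # Values taken from the statement above: 72h retention, 24h shard groups.
    print(alter_retention_policy("autogen", "telegraf", 72, 24))
    print(shard_groups_in_window(72, 24))
```

With the article's values the window spans three shard groups. A shorter shard duration means more, smaller shard groups, which is the granularity trade-off mentioned above.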

When InfluxDB performance became a bottleneck, the author switched to Elasticsearch or to Prometheus federation.

3.3 Grafana Visualization Tips

Grafana can import dashboards as JSON, display host-level metrics (CPU, memory, I/O, network, inode usage, process and thread counts), and provide top-10 resource usage views. The original article illustrates these with screenshots of host dashboards, resource top lists, and process-level details.

4. Prometheus and Alertmanager Deep Dive

4.1 Prometheus Overview

Prometheus is a time-series database (TSDB) that stores metrics locally, supports label-based queries through PromQL, and can scrape thousands of targets from a single server. It evaluates alerting rules itself and hands the resulting alerts to Alertmanager for routing and notification.

4.2 Prometheus Features

Multi‑dimensional storage and query

Client libraries for instrumenting applications, plus exporters for services such as Redis, MySQL, Nginx, and HAProxy

Pushgateway for metrics that cannot be pulled, such as those from short-lived or batch jobs

Pushgateway workflow: agents collect metrics and push them to the gateway, and Prometheus then scrapes the gateway on its normal schedule.
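
The push half of that workflow can be sketched in Python using only the standard library. The sketch builds, but does not send, an HTTP PUT to the Pushgateway's `/metrics/job/<job>/instance/<instance>` endpoint with a body in the Prometheus text exposition format. The gateway address, job name, instance name, and metric name are all assumptions made for illustration, and `build_push_request` is a hypothetical helper.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, instance, metrics):
    """Build (but do not send) a PUT request for the Pushgateway.

    `metrics` maps metric names to numeric values; each entry becomes one
    line of the Prometheus text exposition format.
    """
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    base = gateway.rstrip("/")
    job_part = quote(job, safe="")
    instance_part = quote(instance, safe="")
    url = f"{base}/metrics/job/{job_part}/instance/{instance_part}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",  # PUT replaces all metrics in this job/instance group
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


if __name__ == "__main__":
    # Hypothetical gateway address and metric, for illustration only.
    req = build_push_request(
        "http://localhost:9091", "batch_cleanup", "node01",
        {"cleanup_last_success_timestamp_seconds": 1556064000},
    )
    # urllib.request.urlopen(req)  # uncomment to push to a real gateway
    print(req.get_method(), req.full_url)
```

Prometheus then needs an ordinary scrape job pointing at the gateway; the pushed series stay there until they are overwritten or deleted.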

4.3 Sample Prometheus Configuration

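global:
  scrape_interval: 15s      # default interval between scrapes of each target
  evaluation_interval: 15s  # how often alerting and recording rules are evaluated
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['IP:9093']  # placeholder: replace IP with the Alertmanager host
rule_files:
  - "first_rules.yml"
  - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'  # becomes the job label on every series from these targets
    scrape_interval: 15s    # per-job override of the global interval
    static_configs:
      - targets: ['localdns:9090']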
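
Once a job like this is being scraped, its health can be checked with a PromQL instant query against Prometheus's HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming the server is reachable at a hypothetical `http://localhost:9090`; it only builds the URL and parses a response body, so the actual HTTP call is left to the reader. The helper names are illustrative.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """URL for Prometheus's instant-query endpoint, /api/v1/query."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_vector(response_text):
    """Return (labels, value) pairs from an instant-vector response body."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    # Each sample looks like {"metric": {...labels}, "value": [ts, "1"]}.
    return [(s["metric"], float(s["value"][1]))
            for s in payload["data"]["result"]]


if __name__ == "__main__":
    # `up` is 1 when the last scrape of a target succeeded, 0 otherwise.
    print(instant_query_url("http://localhost:9090", 'up{job="prometheus"}'))
```

The label selector `{job="prometheus"}` is where the multi-dimensional model shows up: the same `up` metric can be filtered or aggregated by any label attached at scrape time.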

4.4 Alertmanager Overview

Alertmanager receives alerts from Prometheus, groups them, applies silencing and inhibition rules, and forwards them via email, DingTalk, WeChat, etc.

4.5 Alertmanager Features

Grouping : consolidates similar alerts into a single notification.

Inhibition : suppresses secondary alerts when a primary condition is already firing.

Silencing : temporarily mutes alerts that match given labels, for example during known maintenance windows.
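
Grouping is the easiest of the three to picture in code. The sketch below is an illustration of the idea only, not Alertmanager's implementation: alerts are plain label dictionaries, and they are bucketed by the labels named in `group_by` so that each bucket would produce a single notification. The alert names and clusters are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the `group_by` labels.

    Each alert is a dict of labels; a missing label groups under "".
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical alerts: two from one cluster, one from another.
    alerts = [
        {"alertname": "DiskFull", "cluster": "example"},
        {"alertname": "HighLoad", "cluster": "example"},
        {"alertname": "DiskFull", "cluster": "other"},
    ]
    for key, members in group_alerts(alerts, ["cluster"]).items():
        print(key, [a["alertname"] for a in members])
```

Grouping by `cluster` turns three alerts into two notifications here; grouping by `alertname` instead would merge the two `DiskFull` alerts across clusters.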

4.6 Sample Alertmanager Configuration

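global:
  resolve_timeout: 5m
templates:
  - 'template/*.tmpl'
route:
  group_by: ['cluster']   # one notification per cluster
  group_wait: 10s         # wait before the first notification for a new group
  group_interval: 20s     # wait before notifying about alerts added to a group
  repeat_interval: 30m    # resend interval while alerts keep firing
  receiver: 'host'        # default receiver; must be defined under receivers
  routes:
    - receiver: 'example'
      match:
        cluster: example
      continue: true      # keep evaluating sibling routes after this match
receivers:
  - name: 'host'          # notification settings not given in the source
  - name: 'example'
    webhook_configs:
      - url: 'http://localhost:8180/dingtalk/ops_dingding/send'
inhibit_rules:
  - source_match:         # matchers left empty in the source; fill in before use
    target_match_re:
    equal: ['ipAddress']  # inhibit only when both alerts share this label value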
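
The `equal` field in an inhibit rule is what ties a suppressed alert to the alert that suppresses it. The following Python sketch models that logic under stated assumptions: it is not Alertmanager's code, it uses exact label matching where the sample uses `target_match_re` (regular expressions), and the `severity` matchers are hypothetical stand-ins for the ones left empty in the sample.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the given value."""
    return all(labels.get(k) == v for k, v in matchers.items())


def is_inhibited(target, firing, source_match, target_match, equal):
    """True if some other firing alert inhibits `target`."""
    if not matches(target, target_match):
        return False
    for source in firing:
        if source is target or not matches(source, source_match):
            continue
        # The source and target must agree on every label listed in `equal`.
        if all(source.get(name) == target.get(name) for name in equal):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical alerts: a critical alert and two warnings on two hosts.
    critical = {"alertname": "HostDown", "severity": "critical",
                "ipAddress": "10.0.0.1"}
    warn_same = {"alertname": "HighLoad", "severity": "warning",
                 "ipAddress": "10.0.0.1"}
    warn_other = {"alertname": "HighLoad", "severity": "warning",
                  "ipAddress": "10.0.0.2"}
    firing = [critical, warn_same, warn_other]
    rule = dict(source_match={"severity": "critical"},
                target_match={"severity": "warning"},
                equal=["ipAddress"])
    for alert in firing:
        print(alert["alertname"], alert["ipAddress"],
              is_inhibited(alert, firing, **rule))
```

Only the warning on the same `ipAddress` as the critical alert is suppressed; the warning from the other host still notifies.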

5. Automated Recovery Tips

With Prometheus alerts as triggers, operators can automate remediation using tools such as Fabric or Ansible. Example scenarios include clearing swap, correcting clock drift, restarting Cloudera Manager agents, replacing failed disks, bringing downed role instances back online, and running the data balancer when storage thresholds are exceeded.

Automation should be applied judiciously; rare or high‑impact failures may still require manual intervention.
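
One way to wire alerts to remediation is an Alertmanager webhook receiver that maps alert names to playbooks. The sketch below covers only the decision step, under several assumptions: the alert names and playbook file names are hypothetical, hosts are identified by the `ipAddress` label used in the inhibit rule earlier, and commands are built but never executed. The payload fields it reads (`alerts`, `status`, `labels`) are part of Alertmanager's webhook JSON.

```python
import json

# Hypothetical mapping from alert name to an Ansible playbook.
PLAYBOOKS = {
    "SwapUsageHigh": "clear_swap.yml",
    "ClockDrift": "sync_clock.yml",
    "CMAgentDown": "restart_cm_agent.yml",
}


def plan_remediation(webhook_body):
    """Turn an Alertmanager webhook payload into (host, playbook) pairs.

    Only firing alerts with a known playbook and an `ipAddress` label are
    planned; everything else is left for a human.
    """
    payload = json.loads(webhook_body)
    plan = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        playbook = PLAYBOOKS.get(labels.get("alertname"))
        host = labels.get("ipAddress")
        if alert.get("status") == "firing" and playbook and host:
            plan.append((host, playbook))
    return plan


def ansible_command(host, playbook):
    """Command line for one planned action (built here, not executed)."""
    # The trailing comma makes Ansible treat the value as an inline host list.
    return ["ansible-playbook", "-i", f"{host},", playbook]


if __name__ == "__main__":
    body = json.dumps({"alerts": [
        {"status": "firing",
         "labels": {"alertname": "ClockDrift", "ipAddress": "10.0.0.1"}},
        {"status": "firing",
         "labels": {"alertname": "DiskFailed", "ipAddress": "10.0.0.2"}},
    ]})
    for host, playbook in plan_remediation(body):
        print(" ".join(ansible_command(host, playbook)))
```

Alerts without an entry in the mapping, such as the `DiskFailed` example, fall through untouched, which keeps rare or high-impact failures in manual hands.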
