Choosing and Tuning Open‑Source Monitoring Stacks for Large‑Scale Operations
This article reviews common open‑source monitoring tools, traces how monitoring of China Unicom's big‑data platform evolved, and offers practical guidance on selecting collectors, databases, and visualization components, with sample configurations for Prometheus and Alertmanager, Grafana tips, and automated recovery techniques.
1. Common Monitoring Tool Combinations
Typical monitoring stacks include:
Nagios + Ganglia
Zabbix
Telegraf (or collectd) + InfluxDB (or Prometheus or Elasticsearch) + Grafana + Alertmanager
Nagios, Ganglia, and Zabbix are legacy tools; Grafana and Prometheus are newer, more flexible solutions. Each combination has its own strengths and weaknesses, and the best choice depends on the specific workload and scale.
2. Evolution of the China Unicom Big‑Data Platform Monitoring
Initially the platform combined Ganglia (for metric collection) with Nagios (for alerting). As data volume and service complexity grew, the duo became cumbersome: configuration was manual, historical data was missing, and scaling was limited.
In the middle stage, Zabbix was introduced, but its performance and the overhead of multi‑dimensional monitoring became problematic once the platform reached thousands of nodes.
Eventually the team adopted a Prometheus‑Grafana‑Alertmanager (PGA) stack, which offered flexible data collection, powerful multi‑dimensional queries, and scalable alert routing, handling millions of metrics across a few thousand machines.
3. Component Selection and Configuration Tips
3.1 Collector Choice
Common collectors: collectd, Telegraf, jmxtrans. The author prefers Telegraf for its stability, active community, and lightweight Go‑based agent.
3.2 Database Choice
InfluxDB is often used, but for large clusters the open‑source version may hit write/read limits. Adjust retention policies, e.g.:
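ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 72h REPLICATION 1 SHARD DURATION 24h DEFAULT
Parameters explained (field names as listed by SHOW RETENTION POLICIES):
duration: how long data is retained (0 = unlimited); set by DURATION
shardGroupDuration: time span covered by each shard group, a storage granularity that affects query performance; set by SHARD DURATION
replicaN: number of replicas; set by REPLICATION
default: whether this policy is the database's default; set by DEFAULT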
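When retention settings differ per cluster, it can help to generate the statement instead of editing it by hand. The following is a minimal sketch, assuming InfluxDB 1.x InfluxQL syntax; the helper name `alter_retention_policy` is hypothetical and it only renders the statement text, it does not contact a server.

```python
def alter_retention_policy(name, database, duration, replication=1,
                           shard_duration=None, default=False):
    """Render an InfluxQL ALTER RETENTION POLICY statement as a string.

    Hypothetical helper for illustration: it performs no validation and
    does not talk to InfluxDB.
    """
    parts = [
        f'ALTER RETENTION POLICY "{name}" ON "{database}"',
        f"DURATION {duration}",          # retention time
        f"REPLICATION {replication}",    # number of replicas
    ]
    if shard_duration is not None:
        parts.append(f"SHARD DURATION {shard_duration}")  # shard group span
    if default:
        parts.append("DEFAULT")          # make this the default policy
    return " ".join(parts)


# Reproduces the statement used in this section.
statement = alter_retention_policy(
    "autogen", "telegraf", "72h", replication=1,
    shard_duration="24h", default=True,
)
print(statement)
```

The rendered string can then be submitted with whichever InfluxDB client or CLI is already in use.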
When InfluxDB performance became a bottleneck, the author switched to Elasticsearch or Prometheus federation.
3.3 Grafana Visualization Tips
Grafana can import JSON dashboards, display host‑level metrics (CPU, memory, I/O, network, inode, process/thread counts), and provide top‑10 resource usage views. The screenshots in the original article show host dashboards, resource top lists, and process‑level details.
4. Prometheus and Alertmanager Deep Dive
4.1 Prometheus Overview
Prometheus is a time‑series database (TSDB) that stores metrics locally, supports powerful label‑based queries via PromQL, and can scrape thousands of targets per second. It integrates with Alertmanager for rule‑based alerting.
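To make the label‑based query model concrete, here is a small sketch that builds a PromQL selector and the matching instant‑query URL for the Prometheus HTTP API (`/api/v1/query`). The metric name `node_load1`, the label values, and the server address are assumptions chosen for illustration; they do not come from the platform described here.

```python
from urllib.parse import urlencode


def selector(metric, **labels):
    """Build a PromQL instant-vector selector such as metric{k="v"}.

    Label values are not escaped here, so this sketch assumes they
    contain no quotes or backslashes.
    """
    matchers = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{matchers}}}" if matchers else metric


def instant_query_url(base_url, promql):
    """Return the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Hypothetical metric and label, for illustration only.
expr = selector("node_load1", cluster="example")
url = instant_query_url("http://localhost:9090", expr)
print(expr)
print(url)
```

Fetching the URL with any HTTP client returns a JSON document; the same expression can also be pasted into a Grafana panel.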
4.2 Prometheus Features
Multi‑dimensional storage and query
Client libraries and exporters that extend monitoring to services such as Redis, MySQL, Nginx, and HAProxy
Pushgateway for metrics that cannot be pulled
Pushgateway workflow: agents collect metrics, push them to the gateway, and Prometheus pulls them on schedule.
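The push step of that workflow can be sketched with the Python standard library alone. This is an illustrative example rather than the author's agent: the metric name, job name, and gateway address are assumptions, and a production setup would more likely use an official Prometheus client library. It renders one sample in the Prometheus text exposition format and sends it to the Pushgateway path `/metrics/job/<job>/<label>/<value>`.

```python
import urllib.request
from urllib.parse import quote


def render_metric(name, value, labels=None, metric_type="gauge"):
    """Render one sample in the Prometheus text exposition format.

    Label values are not escaped, so this sketch assumes plain values.
    """
    label_str = ""
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    # The body must end with a newline.
    return f"# TYPE {name} {metric_type}\n{name}{label_str} {value}\n"


def push_url(gateway, job, grouping=None):
    """Build the Pushgateway URL for a job and optional grouping labels."""
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    for key, val in sorted((grouping or {}).items()):
        url += f"/{quote(key, safe='')}/{quote(val, safe='')}"
    return url


def push(gateway, job, body, grouping=None):
    """PUT the rendered metrics to the gateway (needs a reachable gateway)."""
    request = urllib.request.Request(
        push_url(gateway, job, grouping),
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


# Hypothetical names, for illustration only.
body = render_metric("backup_last_success_timestamp_seconds", 1700000000,
                     {"cluster": "example"})
target = push_url("http://localhost:9091/", "nightly_backup",
                  {"instance": "node01"})
print(body, end="")
print(target)
# push("http://localhost:9091", "nightly_backup", body, {"instance": "node01"})
```

Prometheus then scrapes the gateway like any other target, so the pushed sample appears with the `job` and grouping labels attached.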
4.3 Sample Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['IP:9093']
rule_files:
  - "first_rules.yml"
  - "second_rules.yml"
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['localdns:9090']
4.4 Alertmanager Overview
Alertmanager receives alerts from Prometheus, groups them, applies silencing and inhibition rules, and forwards them via email, DingTalk, WeChat, etc.
4.5 Alertmanager Features
Grouping: consolidates similar alerts into a single notification.
Inhibition: suppresses secondary alerts when a primary condition is already firing.
Silencing: temporarily mutes alerts during known maintenance windows.
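The three features can be illustrated with a toy model. The sketch below is not Alertmanager's implementation; it is a simplified approximation that uses exact label matching only, with hypothetical alert names and addresses, to show how grouping keys, inhibition rules, and silences decide what gets notified.

```python
from collections import defaultdict


def matches(alert, matchers):
    """True if every matcher label equals the alert's label value."""
    return all(alert["labels"].get(k) == v for k, v in matchers.items())


def group_alerts(alerts, group_by):
    """Grouping: bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def is_inhibited(target, alerts, source_match, target_match, equal):
    """Inhibition: a target is muted if some other alert matches
    source_match and agrees with it on all labels listed in equal."""
    if not matches(target, target_match):
        return False
    for source in alerts:
        if source is target or not matches(source, source_match):
            continue
        if all(source["labels"].get(l) == target["labels"].get(l) for l in equal):
            return True
    return False


def is_silenced(alert, silences, now):
    """Silencing: muted while an active silence matches the alert."""
    return any(s["starts_at"] <= now < s["ends_at"] and matches(alert, s["matchers"])
               for s in silences)


# Hypothetical alerts, for illustration only.
alerts = [
    {"labels": {"alertname": "HostDown", "severity": "critical",
                "cluster": "example", "ipAddress": "10.0.0.1"}},
    {"labels": {"alertname": "DiskLatencyHigh", "severity": "warning",
                "cluster": "example", "ipAddress": "10.0.0.1"}},
    {"labels": {"alertname": "DiskLatencyHigh", "severity": "warning",
                "cluster": "example", "ipAddress": "10.0.0.2"}},
]
groups = group_alerts(alerts, ["cluster"])
inhibited = [is_inhibited(a, alerts, {"severity": "critical"},
                          {"severity": "warning"}, ["ipAddress"]) for a in alerts]
silences = [{"matchers": {"ipAddress": "10.0.0.2"}, "starts_at": 100, "ends_at": 200}]
silenced = [is_silenced(a, silences, now=150) for a in alerts]
print(len(groups[("example",)]), inhibited, silenced)
```

In this example all three alerts share one notification group, the warning on 10.0.0.1 is inhibited by the critical alert on the same address, and the warning on 10.0.0.2 is muted by the silence.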
4.6 Sample Alertmanager Configuration
global:
  resolve_timeout: 5m
templates:
  - 'template/*.tmpl'
route:
  group_by: ['cluster']
  group_wait: 10s
  group_interval: 20s
  repeat_interval: 30m
  receiver: 'host'
  routes:
    - receiver: 'example'
      match:
        cluster: example
      continue: true
receivers:
  # The default receiver 'host' used by the route must also be defined here; the source does not show it.
  - name: 'example'
    webhook_configs:
      - url: 'http://localhost:8180/dingtalk/ops_dingding/send'
inhibit_rules:
  # The source and target matchers are left empty in the source.
  - source_match:
    target_match_re:
    equal: ['ipAddress']
5. Automation Recovery Tips
Using Prometheus alerts as triggers, operators can automate remediation with tools like Fabric or Ansible. Example scenarios include clearing swap partitions, correcting clock drift, restarting Cloudera Manager agents, replacing failed disks, bringing downed role instances back online, and performing data balancer operations when storage thresholds are exceeded.
Automation should be applied judiciously; rare or high‑impact failures may still require manual intervention.
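One cautious way to wire alerts to remediation is to translate an Alertmanager webhook payload into a list of commands and review or dry‑run them before execution. The sketch below assumes hypothetical alert names and playbook files, and assumes hosts are identified by the `ipAddress` label used in the inhibition rule above; it only plans commands and executes nothing.

```python
import json

# Hypothetical mapping from alert name to Ansible playbook (not from the source).
PLAYBOOKS = {
    "SwapUsageHigh": "clear_swap.yml",
    "ClockDrift": "sync_clock.yml",
    "CMAgentDown": "restart_cm_agent.yml",
}


def plan_remediation(payload, playbooks=PLAYBOOKS, host_label="ipAddress"):
    """Turn a webhook payload into ansible-playbook command lines.

    Only firing alerts with a known playbook and a host label are
    planned; everything else is left for manual handling.
    """
    plan = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        playbook = playbooks.get(labels.get("alertname"))
        host = labels.get(host_label)
        if playbook and host:
            # "host," is Ansible's inline single-host inventory syntax;
            # the playbook is assumed to target "hosts: all".
            plan.append(["ansible-playbook", "-i", f"{host},", playbook])
    return plan


# Trimmed example of the JSON that Alertmanager posts to a webhook receiver.
payload = json.loads("""
{
  "status": "firing",
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "SwapUsageHigh", "ipAddress": "10.0.0.1"}},
    {"status": "resolved",
     "labels": {"alertname": "ClockDrift", "ipAddress": "10.0.0.2"}},
    {"status": "firing",
     "labels": {"alertname": "DiskFailed", "ipAddress": "10.0.0.3"}}
  ]
}
""")
plan = plan_remediation(payload)
for command in plan:
    print(" ".join(command))
```

Here only the swap alert yields a command: the resolved alert is skipped and the unmapped disk failure stays with a human operator, which matches the advice to automate selectively.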
dbaplus Community
