Designing a Scalable Monitoring System: From Data Collection to Alerting
This article explains how to build a comprehensive monitoring system for distributed applications. It classifies monitoring functions, describes the data types involved, outlines the core modules (collection, processing, feature extraction, and visualization), and reviews typical implementations for metrics, logs, tracing, and alerting, along with the key open‑source components.
Why Monitoring Matters
Monitoring is an essential component of any distributed system, providing early warnings, troubleshooting assistance, and decision‑making support. It applies to everything from host‑level CPU alerts to business‑level log errors and APM triggers.
Functional Division of Monitoring Systems
Monitoring can be categorized in two ways:
By data type: logs, metrics, and traces.
By function: basic monitoring, middleware monitoring, and business monitoring.
These perspectives help keep the system modular and avoid tangled designs.
Core Modules in a Monitoring Pipeline
Data Collection: Aggregating data efficiently and at scale.
Data Processing: Organizing, transmitting, and storing the collected data.
Feature Extraction: Performing large‑scale calculations to generate intermediate results.
Data Presentation: Providing attractive, multi‑functional visualizations.
Typical Implementations
1. System Monitoring
Collects host metrics (CPU, memory, network, disk, kernel) and many database or middleware indicators. Metrics usually have a fixed schema and are stored in a time‑series database such as InfluxDB. A common collector is Telegraf, which supports a wide range of system and service metrics and provides jolokia2 input plugins for JVM monitoring.
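To make the fixed-schema point concrete, here is a minimal sketch of how one host metric could be serialized into InfluxDB line protocol (measurement, tags, fields, timestamp). The function name `to_line_protocol` and the sample tag and field names are illustrative assumptions, not part of any collector's API; in practice Telegraf or a client library does this for you.

```python
from typing import Mapping, Union

FieldValue = Union[bool, int, float, str]


def _escape(text: str, chars: str) -> str:
    """Backslash-escape each character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integer fields carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # String fields are double-quoted; backslashes and quotes are escaped.
    return '"' + _escape(value, '\\"') + '"'


def to_line_protocol(
    measurement: str,
    tags: Mapping[str, str],
    fields: Mapping[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Hypothetical helper: render one point as a line-protocol string."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):  # emit tags in sorted key order
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}"
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


print(to_line_protocol(
    "cpu",
    {"host": "web 01", "region": "eu"},
    {"usage_idle": 91.5, "cores": 8},
    1700000000000000000,
))
```

Tags carry the indexed dimensions (host, region) while fields carry the measured values, which is why a stable schema per measurement keeps queries and storage predictable.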
After collection, metrics are often buffered in Kafka and then split into two streams: one copy is filtered by Logstash and stored in Elasticsearch, while the other is processed by stream‑processing jobs that compute aggregates such as QPS, average response time, or TP (top‑percentile) values like TP99.
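The aggregates mentioned above can be sketched for a single time window as follows. This is a simplified, in-memory illustration that assumes the latencies for one window have already been collected into a list; a real stream-processing job would compute the same values incrementally. The names `WindowStats`, `percentile`, and `summarize_window` are hypothetical, and the percentile uses the nearest-rank definition, which is one of several common choices.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WindowStats:
    qps: float       # requests per second within the window
    avg_ms: float    # mean response time in milliseconds
    tp99_ms: float   # 99th-percentile response time in milliseconds


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an ascending, non-empty sequence."""
    if not sorted_values:
        raise ValueError("cannot take a percentile of an empty window")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize_window(latencies_ms: Sequence[float], window_seconds: float) -> WindowStats:
    """Aggregate the raw latencies observed during one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    ordered = sorted(latencies_ms)
    return WindowStats(
        qps=len(ordered) / window_seconds,
        avg_ms=sum(ordered) / len(ordered),
        tp99_ms=percentile(ordered, 99),
    )


# 100 requests with latencies 1..100 ms observed in a 10-second window.
stats = summarize_window(list(range(1, 101)), window_seconds=10)
print(stats)
```

Averages hide tail latency, which is why TP values are usually tracked alongside them: a handful of very slow requests barely moves the mean but shows up clearly in TP99.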
Grafana is the preferred visualization tool for its aesthetics and support for iframe embedding, though its alerting capabilities are limited.
2. Log Monitoring
Log pipelines share many components with system monitoring but handle much larger, less‑structured data volumes. Reliable log collectors such as Flume or Beats are recommended over Logstash for resource efficiency. Logs are typically buffered in a message queue, filtered according to logging standards, and indexed in Elasticsearch (often with daily indices). Older logs may be archived in a log‑fortress or HDFS.
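As a rough sketch of the filtering and daily-index steps, the snippet below parses a log line against an assumed logging standard and derives the name of the daily index it would be written to. The line format (`<UTC timestamp> <LEVEL> <service> <message>`), the `app-logs` prefix, and the function names are assumptions made for illustration; the actual standard and naming scheme depend on the team.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Assumed logging standard: "2024-01-15T08:30:00Z ERROR order-service message..."
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z "
    r"(?P<level>[A-Z]+) (?P<service>[\w.-]+) (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return a structured document, or None if the line violates the standard."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    try:
        ts = datetime.strptime(match["ts"], "%Y-%m-%dT%H:%M:%S")
    except ValueError:  # e.g. month 13: shaped like a timestamp but invalid
        return None
    return {
        "timestamp": ts.replace(tzinfo=timezone.utc),
        "level": match["level"],
        "service": match["service"],
        "message": match["message"],
    }


def daily_index(prefix: str, ts: datetime) -> str:
    """Name of the daily index (in UTC) that a document belongs to."""
    return f"{prefix}-{ts.astimezone(timezone.utc):%Y.%m.%d}"


doc = parse_line("2024-01-15T08:30:00Z ERROR order-service payment timeout")
if doc is not None:
    print(daily_index("app-logs", doc["timestamp"]), doc["level"], doc["message"])
```

Daily indices make retention simple: archiving or deleting old logs becomes a matter of moving or dropping whole indices rather than deleting individual documents.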
3. Tracing (APM)
Distributed tracing adds significant complexity because it must collect massive amounts of span data across heterogeneous services. OpenTracing standardizes APIs, enabling compatibility among implementations like Zipkin, Jaeger, and Pinpoint. Key challenges include heterogeneous collection agents (Java, Go, etc.), diverse instrumentation points, asynchronous context propagation, and efficient sampling.
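One common way to approach the sampling challenge is a deterministic, probability-based decision derived from the trace ID, so that every service reaches the same keep-or-drop decision for a given trace without coordination. The sketch below illustrates that idea only; it is not the sampler of any particular tracing system, and `should_sample` is a hypothetical name.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Keep roughly `rate` of all traces, consistently for a given trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Hash the trace ID into a uniformly distributed 64-bit bucket.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Keep the trace if its bucket falls in the lowest `rate` share of the range.
    return bucket < rate * 2**64


kept = sum(should_sample(f"trace-{i}", 0.1) for i in range(10_000))
print(f"kept {kept} of 10000 traces at a 10% sampling rate")
```

Because the decision depends only on the trace ID, a trace is either recorded in full across all services or not at all, which avoids storing broken call chains.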
Jaeger’s architecture, for example, consists of a Jaeger client, an agent listening on UDP, a collector that writes to a pluggable data store (Cassandra or Elasticsearch), and a query service that serves UI visualizations.
4. Analysis and Alerting
Stream processing (e.g., Kafka Streams) aggregates raw data into metrics such as QPS, latency, or custom thresholds. The resulting analysis data is stored separately and used for both alerting and dashboards. Alert configurations may include count‑based triggers, threshold comparisons, ratio checks (period‑over‑period and year‑over‑year changes), and custom expressions, with actions like email, SMS, webhook, or phone calls.
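Two of the rule types above, threshold comparison and ratio checks, can be sketched as small pure functions over already-aggregated values. All names and the message format here are illustrative assumptions. The same ratio check covers both period-over-period and year-over-year comparisons: only the baseline differs (the previous window versus the same window a year earlier).

```python
from typing import Optional


def check_threshold(metric: str, value: float, limit: float) -> Optional[str]:
    """Return an alert message if `value` exceeds `limit`, else None."""
    if value > limit:
        return f"{metric}={value} exceeds threshold {limit}"
    return None


def change_ratio(current: float, baseline: float) -> Optional[float]:
    """Relative change against a baseline; None if the baseline is zero."""
    if baseline == 0:
        return None
    return (current - baseline) / baseline


def check_change(
    metric: str, current: float, baseline: float, max_ratio: float
) -> Optional[str]:
    """Alert when the absolute relative change exceeds `max_ratio`."""
    ratio = change_ratio(current, baseline)
    if ratio is not None and abs(ratio) > max_ratio:
        return f"{metric} changed by {ratio:+.0%} against baseline {baseline}"
    return None


alerts = [
    check_threshold("tp99_ms", 850, limit=500),
    check_change("qps", current=150, baseline=100, max_ratio=0.3),
    check_change("qps", current=110, baseline=100, max_ratio=0.3),
]
for message in filter(None, alerts):
    print(message)  # each message would be routed to email, SMS, or a webhook
```

Keeping rule evaluation separate from notification delivery makes it easier to add new channels or rule types without touching the other side.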
Component Catalog
Data Collection
Telegraf – Go‑based agent for metrics.
Flume – High‑availability log collector (Apache).
Logstash – Flexible log processor (Elastic stack).
StatsD – UDP‑based metric aggregation daemon, originally written in Node.js.
CollectD – Daemon for system and application performance metrics.
Visualization
Grafana – Feature‑rich, aesthetically focused dashboard.
Storage
InfluxDB – Open‑source time‑series database.
OpenTSDB – Time‑series layer built on HBase.
Elasticsearch – Full‑text search engine that also stores metrics, logs, and traces.
Solutions
Open‑Falcon – Xiaomi’s integrated monitoring suite.
Graphite – Metric storage and query engine, often paired with Grafana.
Prometheus – Go‑based monitoring system with strong Spring Cloud integration.
Traditional Monitoring
Zabbix – Widely used, suitable for small‑to‑medium deployments.
Nagios – Legacy solution with complex configuration.
Ganglia – Focused on system performance metrics.
Centreon – Extends Nagios with additional features.
APM Tools
CAT – Meituan‑Dianping’s internal monitoring framework.
Pinpoint – Java‑centric APM using bytecode instrumentation.
SkyWalking – Apache‑incubated, Java and Go support, uses Elasticsearch.
Zipkin – OpenTracing‑compatible tracing system.
Jaeger – Go‑based Uber tracing system with OpenTracing support.
Other
Datadog – Commercial SaaS monitoring platform with extensive integrations.
Conclusion
The monitoring ecosystem can be broken down into three data types—logs, metrics, and traces—and three processing stages—collection, processing, and application. While many components are interchangeable, the design should keep collection and application concerns separate to maintain flexibility and scalability.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
