Designing a Scalable Monitoring System: From Data Collection to Alerting

This article explains how to build a comprehensive monitoring system for distributed applications by classifying monitoring functions, describing data quadrants, outlining core modules such as collection, processing, feature extraction, and visualization, and reviewing typical implementations for metrics, logs, tracing, alerting, and the key open‑source components involved.

dbaplus Community

May 22, 2019

Why Monitoring Matters

Monitoring is an essential component of any distributed system, providing early warnings, troubleshooting assistance, and decision‑making support. It applies to everything from host‑level CPU alerts to business‑level log errors and APM triggers.

Functional Division of Monitoring Systems

Monitoring can be categorized in two ways:

Data Quadrants: logs, metrics, and tracing.

Functional Quadrants: basic monitoring, middleware monitoring, and business monitoring.

These perspectives help keep the system modular and avoid tangled designs.

Core Modules in a Monitoring Pipeline

Data Collection: Aggregating data efficiently and at scale.

Data Processing: Organizing, transmitting, and storing the collected data.

Feature Extraction: Performing large‑scale calculations to generate intermediate results.

Data Presentation: Providing attractive, multi‑functional visualizations.
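
The four modules above can be read as stages of one pipeline. The following is a minimal sketch under stated assumptions, not a reference design: the `Sample` type, the hard-coded readings, and all function names are hypothetical, and each stage stands in for a real component (agent, queue and filter, stream job, dashboard).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    host: str
    metric: str
    value: float


def collect() -> list[Sample]:
    # Stand-in for a collection agent; the readings are made up for illustration.
    return [
        Sample("web-1", "cpu", 0.42),
        Sample("web-1", "cpu", 0.58),
        Sample("web-2", "cpu", 0.91),
        Sample("web-2", "cpu", -1.0),  # malformed reading, dropped in processing
    ]


def process(samples: list[Sample]) -> dict[tuple[str, str], list[float]]:
    # Data processing: discard out-of-range values and group by (host, metric).
    grouped: dict[tuple[str, str], list[float]] = {}
    for s in samples:
        if 0.0 <= s.value <= 1.0:
            grouped.setdefault((s.host, s.metric), []).append(s.value)
    return grouped


def extract_features(grouped: dict[tuple[str, str], list[float]]) -> dict[tuple[str, str], float]:
    # Feature extraction: reduce raw values to an intermediate result (the mean).
    return {key: mean(values) for key, values in grouped.items()}


def present(features: dict[tuple[str, str], float]) -> list[str]:
    # Data presentation: render one text line per series instead of a chart.
    return [f"{host} {metric} avg={avg:.2f}" for (host, metric), avg in sorted(features.items())]


if __name__ == "__main__":
    for line in present(extract_features(process(collect()))):
        print(line)
```

Keeping each stage behind its own function boundary mirrors the point made earlier about modularity: any one stage can be swapped without touching the others.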

Typical Implementations

1. System Monitoring

Collects host metrics (CPU, memory, network, disk, kernel) along with many database and middleware indicators. Metrics usually have a fixed schema and are stored in a time‑series database such as InfluxDB. Common collectors include Telegraf, which supports a wide range of system and service metrics, and Jolokia (read through Telegraf's jolokia2 input plugins) for JVM monitoring.

After collection, metrics are often buffered in Kafka and then split into two streams: one copy is filtered by Logstash and stored in Elasticsearch, while the other is processed by stream‑processing jobs that compute aggregates such as QPS, average response time, or TP (top‑percentile) values like TP99.
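
The aggregates mentioned here can be sketched as a tumbling-window computation. This is a simplified in-memory illustration, not a stream-processing job: the input shape (pairs of timestamp in seconds and latency in milliseconds), the 10-second window, and the nearest-rank percentile method are all assumptions chosen for the example.

```python
import math
from collections import defaultdict


def percentile(sorted_values: list, pct: int):
    # Nearest-rank percentile over an already sorted, non-empty list.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def aggregate(requests, window_seconds: int = 10) -> dict:
    """Group (timestamp_s, latency_ms) pairs into fixed windows and summarize each."""
    windows = defaultdict(list)
    for ts, latency_ms in requests:
        window_start = int(ts // window_seconds) * window_seconds
        windows[window_start].append(latency_ms)

    result = {}
    for start, latencies in sorted(windows.items()):
        latencies.sort()
        result[start] = {
            "qps": len(latencies) / window_seconds,
            "avg_ms": sum(latencies) / len(latencies),
            "tp99_ms": percentile(latencies, 99),
        }
    return result


if __name__ == "__main__":
    sample = [(0, 10), (1, 20), (9.5, 30), (10, 40)]
    for start, stats in aggregate(sample).items():
        print(start, stats)
```

A real job would also have to handle late or out-of-order events, which this sketch ignores.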

Grafana is the preferred visualization tool for its aesthetics and support for iframe embedding, though its alerting capabilities are limited.

2. Log Monitoring

Log pipelines share many components with system monitoring but handle much larger, less‑structured data volumes. Reliable log collectors such as Flume or Beats are recommended over Logstash for resource efficiency. Logs are typically buffered in a message queue, filtered according to logging standards, and indexed in Elasticsearch (often with daily indices). Older logs may be archived in a log‑fortress or HDFS.
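
Daily indices make retention simple because a whole day can be archived or dropped at once. The sketch below shows one possible naming scheme and a retention check; the `app-logs` prefix, the `YYYY.MM.DD` suffix, and the function names are assumptions for illustration, and no Elasticsearch client is involved.

```python
from datetime import date, datetime, timezone


def daily_index(prefix: str, epoch_seconds: float) -> str:
    # Derive the index name from the event time in UTC, e.g. "app-logs-2019.05.22".
    day = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_to_archive(index_names: list[str], prefix: str, today: date, keep_days: int) -> list[str]:
    # Return the indices whose date suffix is more than `keep_days` days before `today`.
    expired = []
    for name in index_names:
        if not name.startswith(prefix + "-"):
            continue  # not one of ours
        index_day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        if (today - index_day).days > keep_days:
            expired.append(name)
    return expired


if __name__ == "__main__":
    print(daily_index("app-logs", 1558483200))
    print(indices_to_archive(
        ["app-logs-2019.05.01", "app-logs-2019.05.20"],
        "app-logs", date(2019, 5, 22), keep_days=7,
    ))
```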

3. Tracing (APM)

Distributed tracing adds significant complexity because it must collect massive amounts of span data across heterogeneous services. OpenTracing standardizes APIs, enabling compatibility among implementations like Zipkin, Jaeger, and Pinpoint. Key challenges include heterogeneous collection agents (Java, Go, etc.), diverse instrumentation points, asynchronous context propagation, and efficient sampling.
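
One common way to approach the sampling challenge is to decide per trace ID rather than per span, so that every service reaches the same keep-or-drop decision without coordination. The following is a hedged sketch of that idea; the function name, the use of SHA-256, and the 10% rate are assumptions for this example and do not describe any particular tracer's sampler.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` (0.0 to 1.0) of all traces."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the trace ID to a 64-bit bucket and compare against the scaled rate.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(10_000)]
    kept = sum(should_sample(t, 0.1) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, a span seen by a Java service and a span seen by a Go service in the same trace are either both kept or both dropped.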

Jaeger’s architecture, for example, consists of a Jaeger client, an agent listening on UDP, a collector that writes to a pluggable data store (Cassandra or Elasticsearch), and a query service that serves UI visualizations.

4. Analysis and Alerting

Stream processing (e.g., Kafka Streams) aggregates raw data into metrics such as QPS, latency, or custom thresholds. The resulting analysis data is stored separately and used for both alerting and dashboards. Alert configurations may include count‑based triggers, threshold comparisons, ratio checks (period‑over‑period and year‑over‑year comparisons), and custom expressions, with actions such as email, SMS, webhook, or phone calls.
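
A rule engine for the alert types described here might look roughly like the sketch below. It covers only threshold and period‑over‑period checks; the `Rule` type, the rule names, the limits, and the action labels are hypothetical, and the actions are returned as data rather than actually sent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    check: Callable[[float, float], bool]  # (current, previous) -> fired?
    actions: tuple[str, ...]


def threshold(limit: float) -> Callable[[float, float], bool]:
    # Fires when the current value exceeds a fixed limit.
    return lambda current, previous: current > limit


def period_over_period(max_growth: float) -> Callable[[float, float], bool]:
    # Fires when the value grew by more than `max_growth` (0.5 = 50%)
    # relative to the previous period.
    def check(current: float, previous: float) -> bool:
        if previous == 0:
            return current > 0
        return (current - previous) / previous > max_growth
    return check


def evaluate(rules: list[Rule], current: float, previous: float) -> list[tuple[str, str]]:
    # Return (rule name, action) pairs for every rule that fired.
    return [(r.name, action) for r in rules if r.check(current, previous) for action in r.actions]


if __name__ == "__main__":
    rules = [
        Rule("high-latency", threshold(500), ("email",)),
        Rule("latency-spike", period_over_period(0.5), ("webhook", "sms")),
    ]
    print(evaluate(rules, current=300, previous=150))
    print(evaluate(rules, current=600, previous=590))
```

A year‑over‑year rule would have the same shape as `period_over_period`, with `previous` taken from the same period one year earlier.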

Component Catalog

Data Collection

Telegraf – Go‑based agent for metrics.

Flume – High‑availability log collector (Apache).

Logstash – Flexible log processor (Elastic stack).

StatsD – UDP‑based metric aggregation daemon, originally written in Node.js.

CollectD – Daemon for system and application performance metrics.

Visualization

Grafana – Feature‑rich, aesthetically focused dashboard.

Storage

InfluxDB – Open‑source time‑series database.

OpenTSDB – Time‑series layer built on HBase.

Elasticsearch – Full‑text search engine that also stores metrics, logs, and traces.

Solutions

Open‑Falcon – Xiaomi’s integrated monitoring suite.

Graphite – Metric storage and query engine, often paired with Grafana.

Prometheus – Go‑based monitoring system with its own time‑series storage, often used with Spring Cloud applications.

Traditional Monitoring

Zabbix – Widely used, suitable for small‑to‑medium deployments.

Nagios – Legacy solution with complex configuration.

Ganglia – Focused on system performance metrics.

Centreon – Extends Nagios with additional features.

APM Tools

CAT – Meituan‑Dianping’s internal monitoring framework.

Pinpoint – Java‑centric APM using bytecode instrumentation.

SkyWalking – Apache‑incubated, Java and Go support, uses Elasticsearch.

Zipkin – OpenTracing‑compatible tracing system.

Jaeger – Go‑based Uber tracing system with OpenTracing support.

Other

Datadog – Commercial SaaS monitoring platform with extensive integrations.

Conclusion

The monitoring ecosystem can be broken down into three data types (logs, metrics, and traces) and three processing stages (collection, processing, and application). While many components are interchangeable, the design should keep collection and application concerns separate to maintain flexibility and scalability.

