How to Build Effective Monitoring for Microservices: Logs, Tracing, and Metrics Explained
This article explains the three main monitoring approaches—log collection, distributed tracing, and metric gathering—in microservice architectures, outlines the layered monitoring model, lists key system, application, and user metrics, and reviews popular open‑source time‑series monitoring tools such as Prometheus, OpenTSDB, and InfluxDB.
Monitoring in Microservice Architecture
In a microservice system, a single user request traverses multiple services. When an error occurs, you need to identify both the failing service and the metric that exposes the failure, which requires comprehensive monitoring of every service and its key indicators.
Monitoring Categories
Log monitoring (event records, often unstructured text)
Distributed tracing (call‑chain tracking)
Metrics monitoring (numeric time‑series data)
Log Monitoring
Application code, runtime frameworks and business logic emit log entries that are typically collected centrally for later search and analysis. A common implementation is the ELK stack (Elasticsearch + Logstash + Kibana). Optional Beats agents run on each host to ship raw log files to Logstash, where they are parsed, filtered and enriched before being indexed in Elasticsearch. Kibana provides visual exploration of the indexed logs.
Typical data flow: Beats → Logstash → Elasticsearch → Kibana
Both the basic stack and extended variants (e.g., with additional processing pipelines) are widely used for log‑based monitoring and debugging.
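Logs are easier to parse and filter downstream when each entry is already structured. The sketch below is one way an application might emit one JSON object per line using only Python's standard library; the `service` field and the `JsonFormatter` and `build_logger` names are illustrative assumptions, not part of the ELK stack itself.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # "service" is a hypothetical field passed via the `extra` argument.
            "service": getattr(record, "service", "unknown"),
        }
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger("orders")
logger.info("order created", extra={"service": "order-service"})
```

A log shipper can then forward these lines as-is, and the parsing stage only needs to decode JSON rather than apply custom text patterns.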
Distributed Tracing
Tracing records the complete lifecycle of a request as it propagates through multiple services, which makes it possible to pinpoint failures or performance bottlenecks. Tools such as CAT (Central Application Tracking) are often adopted in medium‑to‑large projects, though they require additional instrumentation and infrastructure.
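The core idea behind call‑chain tracking can be sketched in a few lines: every service reuses the caller's trace ID and records the caller's span as its parent. This is a simplified illustration, not how CAT or any specific tracer is implemented; the header names `X-Trace-Id` and `X-Parent-Span-Id` and the helper functions are assumptions made for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional

TRACE_HEADER = "X-Trace-Id"          # hypothetical header name
PARENT_HEADER = "X-Parent-Span-Id"   # hypothetical header name


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str              # unique per unit of work
    parent_id: Optional[str]  # span that triggered this one, if any
    service: str


def start_span(service: str, headers: Dict[str, str]) -> Span:
    """Join the caller's trace if headers are present, else start a new trace."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    parent_id = headers.get(PARENT_HEADER)
    return Span(trace_id, uuid.uuid4().hex[:16], parent_id, service)


def outgoing_headers(span: Span) -> Dict[str, str]:
    """Headers to attach to a downstream call so the callee joins this trace."""
    return {TRACE_HEADER: span.trace_id, PARENT_HEADER: span.span_id}


# A request enters at the gateway, which then calls the orders service.
gateway = start_span("gateway", {})
orders = start_span("orders", outgoing_headers(gateway))
print(orders.trace_id == gateway.trace_id)  # both spans share one trace ID
```

Collecting all spans that share a trace ID and linking them by parent ID reconstructs the call chain for a single request.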
A simple fault‑tolerance pattern is to set an active timeout on inter‑service calls: if the downstream service does not respond within the configured threshold, the caller aborts the request to avoid cascading delays.
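As a minimal sketch of the timeout pattern, the helper below runs a downstream call in a worker thread and gives up after a configured threshold, returning a fallback value instead. The names `call_with_timeout` and `slow_inventory_lookup` are hypothetical; with a real HTTP client you would more likely pass the client's own timeout setting.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn: Callable[[], T], timeout_s: float, fallback: T) -> T:
    """Run fn in a worker thread; return fallback if it exceeds timeout_s."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a thread that is already running cannot be
        # interrupted, so the caller simply stops waiting for it.
        future.cancel()
        return fallback


def slow_inventory_lookup() -> str:
    """Stand-in for a downstream service that responds too slowly."""
    time.sleep(0.5)
    return "in stock"


print(call_with_timeout(slow_inventory_lookup, 0.05, "unknown"))
```

The caller is released after 50 ms even though the simulated downstream call takes 500 ms, which is what keeps one slow service from stalling the whole chain.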
Metrics Monitoring
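Metrics are stored in time‑series databases (TSDB) as numeric values associated with timestamps. They support aggregation and trend analysis, and they are the primary source for alerting. Five fundamental metric types are commonly used (the names follow metrics libraries such as Dropwizard Metrics; other systems define slightly different sets):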
Gauges – instantaneous values
Counters – monotonically increasing counts
Histograms – distribution of observed values
Meters – rate calculations (e.g., transactions per second)
Timers – duration measurements
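To make the five types concrete, here is a deliberately simplified, in‑memory sketch of each one. It is not the implementation of any real metrics library: the class and method names are illustrative, the histogram keeps every observation instead of sampling, and the meter reports only a mean rate since creation.

```python
import math
import time
from typing import Callable, List


class Gauge:
    """Instantaneous value that can go up or down."""
    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Counter:
    """Monotonically increasing count."""
    def __init__(self) -> None:
        self.count = 0

    def inc(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("counters only increase")
        self.count += n


class Histogram:
    """Distribution of observed values (keeps all samples for simplicity)."""
    def __init__(self) -> None:
        self.values: List[float] = []

    def observe(self, value: float) -> None:
        self.values.append(value)

    def quantile(self, q: float) -> float:
        """Nearest-rank quantile for q in (0, 1]."""
        if not self.values:
            raise ValueError("no observations")
        ordered = sorted(self.values)
        index = max(0, math.ceil(q * len(ordered)) - 1)
        return ordered[index]


class Meter:
    """Mean event rate since the meter was created."""
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n: int = 1) -> None:
        self.count += n

    def mean_rate(self) -> float:
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


class Timer:
    """Context manager that records durations into a histogram."""
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self.durations = Histogram()

    def __enter__(self) -> "Timer":
        self._started = self._clock()
        return self

    def __exit__(self, *exc_info) -> None:
        self.durations.observe(self._clock() - self._started)
```

The injectable `clock` argument is a design choice for testability: passing a fake clock makes rates and durations deterministic in unit tests.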
Monitoring Layers and Core Indicators
Monitoring is usually organized into three layers:
System layer – CPU, disk, memory, network (operations focus)
Application layer – service health, API status, internal error codes (development focus)
User layer – business‑level metrics such as conversion rate or revenue (product focus)
Typical key indicators across these layers include:
Latency – e.g., average HTTP response time of 100 ms
Request volume – throughput such as QPS (queries per second)
Error rate – proportion of failed calls over a time window
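The three indicators above can be derived from the same stream of request records. The following sketch assumes a hypothetical `RequestRecord` type and a caller who has already selected the records that fall inside the time window.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestRecord:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the call failed


def summarize(records: List[RequestRecord], window_s: float) -> Dict[str, float]:
    """Compute average latency, QPS and error rate for one time window."""
    total = len(records)
    if total == 0 or window_s <= 0:
        return {"avg_latency_ms": 0.0, "qps": 0.0, "error_rate": 0.0}
    failures = sum(1 for r in records if not r.ok)
    return {
        "avg_latency_ms": sum(r.latency_ms for r in records) / total,
        "qps": total / window_s,
        "error_rate": failures / total,
    }


window = [
    RequestRecord(80.0, True),
    RequestRecord(120.0, True),
    RequestRecord(100.0, False),
    RequestRecord(100.0, True),
]
print(summarize(window, window_s=2.0))
```

Averages hide outliers, so production dashboards usually track latency percentiles alongside the mean.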
Open‑Source Time‑Series Monitoring Solutions
Prometheus
Prometheus, started at SoundCloud in 2012, is an open‑source monitoring framework built around a TSDB. It primarily uses a pull model: the Prometheus server scrapes metrics from instrumented applications or from exporters. For workloads that cannot be scraped (e.g., short‑lived batch jobs), a Pushgateway can receive pushed metrics, which Prometheus then pulls.
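In the pull model, an application exposes its current metric values as plain text on an HTTP endpoint, and the server scrapes that text periodically. The sketch below renders a counter in the Prometheus text exposition format; `render_counter` and the metric name are made up for illustration, and in practice a client library such as `prometheus_client` would normally produce this output.

```python
from typing import Dict


def render_counter(
    name: str, help_text: str, label: str, samples: Dict[str, float]
) -> str:
    """Render one counter with a single label in the text exposition format."""
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} counter",
    ]
    for label_value, value in sorted(samples.items()):
        # Produces lines such as: http_requests_total{status="200"} 42
        lines.append(f'{name}{{{label}="{label_value}"}} {value}')
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    "status",
    {"200": 42, "500": 3},
)
print(body, end="")
```

Serving this string from a `/metrics` path is all a scrape target needs to do; the server attaches timestamps and stores the resulting time series.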
Configuration can be static or driven by service‑discovery mechanisms (Kubernetes, Consul, etc.). Core components:
PromQL – a flexible query language for selecting and aggregating time‑series data
Alertmanager – handles alert routing, silencing and notification (email, Slack, webhook, etc.)
Web UI – basic graphing; most users pair Prometheus with Grafana for advanced dashboards
OpenTSDB
OpenTSDB, launched in 2010, is a distributed TSDB that stores metrics in HBase. It follows a push model: agents or applications push metric points to OpenTSDB’s HTTP API. The system provides a built‑in Web UI and integrates smoothly with Grafana for visualization. OpenTSDB does not include a native alerting component, so external alerting solutions must be added.
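A pushed data point is a small JSON document sent to the HTTP API's `/api/put` endpoint. The sketch below only builds the request body and does not send it; the `build_put_payload` helper and the metric and tag names are assumptions for the example.

```python
import json
from typing import Dict, Union


def build_put_payload(
    metric: str,
    timestamp: int,
    value: Union[int, float],
    tags: Dict[str, str],
) -> str:
    """Build the JSON body for one data point pushed to /api/put."""
    if not tags:
        # OpenTSDB expects every data point to carry at least one tag.
        raise ValueError("at least one tag is required")
    return json.dumps(
        {"metric": metric, "timestamp": timestamp, "value": value, "tags": tags}
    )


payload = build_put_payload(
    "sys.cpu.user", 1700000000, 42.5, {"host": "web-1"}
)
print(payload)
```

An agent would POST this body to the OpenTSDB server on a fixed interval, typically batching several points per request.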
InfluxDB
InfluxDB, open‑sourced in 2013, is another TSDB that accepts metrics via a push API (line protocol). It includes a Web UI for query and exploration and can be visualized with Grafana. Unlike OpenTSDB, InfluxDB provides basic alerting rules, but many deployments still rely on external alert managers for production‑grade notifications.
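Line protocol encodes each point as a measurement name, optional tags, one or more fields, and a timestamp. The formatter below is a simplified sketch that covers common cases only (it does not escape special characters in the measurement name); `to_line` and the sample values are illustrative.

```python
from typing import Dict, Union

FieldValue = Union[bool, int, float, str]


def _escape_key(text: str) -> str:
    """Escape commas, equals signs and spaces in tag keys, tag values, field keys."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'        # strings are double-quoted


def to_line(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp_ns: int,
) -> str:
    """Format one point as: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{_escape_key(k)}={_format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


line = to_line(
    "cpu",
    {"host": "web-1", "region": "eu"},
    {"usage": 0.64, "cores": 4},
    1700000000000000000,
)
print(line)
```

Tags are indexed and suited to values you filter or group by, while fields hold the measured numbers themselves.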
In summary, effective monitoring of microservice systems combines log collection, distributed tracing, and time‑series metric gathering. The three‑layer monitoring model (system, application, user) guides indicator selection, while mature open‑source TSDB solutions such as Prometheus, OpenTSDB and InfluxDB provide the foundation for scalable metric storage, querying and alerting.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
