How to Build a Scalable Log Monitoring System for Hundreds of Microservices

In large‑scale microservice environments, centralized log collection, filtering, and visualization using Filebeat, Elastic APM, Kafka Streams, Grafana and Prometheus can turn scattered logs into actionable operational data while controlling resource costs.

Java Architect Essentials

Sep 28, 2020

Background

In large microservice environments, each service writes its logs to local disk, so the logs needed for troubleshooting, performance analysis, or business insight end up scattered across hosts and are hard to locate.

Architecture Overview

Each service node runs a logging agent that streams logs in real time to a central Kafka cluster. The processing pipeline is:

Filebeat → Kafka → Log Streams (Kafka Streams) → Elasticsearch / Prometheus → Grafana & Kibana

Log Collection with Filebeat

Filebeat is deployed on every host through an automated release platform. Operators configure each Filebeat instance in a web UI, assigning it to Kafka topics in a one-to-one or many-to-one mapping depending on log volume. Besides business-service logs, Filebeat also ships MySQL slow-query and error logs, Nginx access and error logs, and other third-party logs.
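
The volume-based topic mapping can be illustrated with a small sketch. It assumes that high-volume log sources get a dedicated topic (one-to-one) while low-volume sources share one topic (many-to-one). The function name, the threshold, and the topic names are hypothetical and do not come from the original system.

```python
def assign_topics(daily_volume_gb, dedicated_threshold_gb=50.0, shared_topic="logs-shared"):
    """Map each log source to a Kafka topic name.

    Hypothetical rule: sources at or above the threshold get their own
    topic (one-to-one); all others share a single topic (many-to-one).
    """
    assignment = {}
    for source, volume in daily_volume_gb.items():
        if volume >= dedicated_threshold_gb:
            assignment[source] = f"logs-{source}"
        else:
            assignment[source] = shared_topic
    return assignment
```

Keeping noisy services on their own topics means their partitions and consumers can be scaled without affecting the shared topic.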

Application Performance Monitoring

Elastic APM agents are attached to services without modifying application code. The agents capture:

HTTP request traces and call stacks

SQL statements executed by the service

CPU, memory and other process metrics

Limitations of Elastic APM:

Not all languages are supported (e.g., C)

Non‑error business logs are not collected

Custom business exceptions may be mis‑classified as system errors, causing noisy alerts

Extended Agent Metrics

Custom extensions to the APM agents collect detailed GC, heap, memory and thread information for deeper diagnostics.

Host‑level Metrics

Prometheus scrapes standard host metrics (CPU, memory, network, disk I/O, etc.) from each node and stores them as time‑series data.
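
CPU metrics of this kind are usually exposed as cumulative counters, so utilization has to be derived from the difference between two scrapes. The sketch below assumes a cumulative idle-CPU-seconds counter of the kind node_exporter publishes as node_cpu_seconds_total; the article does not say which exporter is used, and the function name is hypothetical.

```python
def cpu_utilization(idle_start, idle_end, t_start, t_end, cores=1):
    """Estimate CPU utilization (0.0 to 1.0) between two scrapes.

    idle_start / idle_end: cumulative idle CPU seconds summed over all cores.
    t_start / t_end: scrape timestamps in seconds.
    """
    elapsed = (t_end - t_start) * cores
    if elapsed <= 0:
        raise ValueError("scrape timestamps must be increasing and cores positive")
    idle_fraction = (idle_end - idle_start) / elapsed
    return 1.0 - idle_fraction
```

In practice this arithmetic is normally written as a query inside Prometheus or Grafana rather than in application code; the function only shows what is being computed.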

Log Filtering and Resource Management (Log Streams)

All logs are first ingested into Kafka with a short retention window (default 1 hour). A dedicated Log Streams service built on Kafka Streams performs ETL filtering and cleaning to reduce storage and processing costs. Configuration is UI‑driven and supports the following rules:

Default collection of all logs at error level.

Windowed collection around error timestamps to also capture surrounding info logs (configurable N seconds before/after the error event).

Per‑service whitelist of up to 100 key log patterns, collected in full.

Business‑specific slow‑SQL filtering with configurable latency thresholds.

Real‑time aggregation of business SQL statistics (e.g., query frequency per hour) to aid DBA optimization.

Dynamic adjustment of log-level weights, per-service limits, and time windows during peak load.

Automatic shrinking of windows when system load is high.

Index naming convention {service}_{level}_{date} (e.g., order-service_error_20230418) to preserve familiar search patterns in Elasticsearch.
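
The first three rules and the index naming convention can be sketched in a few lines. This is a simplified batch version in plain Python, not the Kafka Streams implementation: a real stream processor would have to buffer recent events to recover the info logs that precede an error. The LogEvent type and the function names are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEvent:
    service: str
    level: str
    timestamp: datetime
    message: str


def select_events(events, window_seconds, whitelist):
    """Return the events to keep, in their original order.

    An event is kept if it is error level, matches one of its service's
    whitelisted patterns, or falls within window_seconds of an error
    logged by the same service.
    """
    window = timedelta(seconds=window_seconds)

    # Collect error timestamps per service for the window rule.
    error_times = {}
    for event in events:
        if event.level == "error":
            error_times.setdefault(event.service, []).append(event.timestamp)

    kept = []
    for event in events:
        patterns = whitelist.get(event.service, [])
        near_error = any(
            abs(event.timestamp - t) <= window
            for t in error_times.get(event.service, [])
        )
        if (
            event.level == "error"
            or any(p.search(event.message) for p in patterns)
            or near_error
        ):
            kept.append(event)
    return kept


def index_name(event):
    """Build the {service}_{level}_{date} index name for an event."""
    return f"{event.service}_{event.level}_{event.timestamp:%Y%m%d}"
```

Shrinking the window under load then amounts to passing a smaller window_seconds, which is why the window is a parameter rather than a constant.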

Visualization

Grafana dashboards consume Prometheus metrics and Elasticsearch indices for real‑time monitoring, alerting and ad‑hoc queries. Kibana is used for detailed APM trace analysis.
