How to Build a Scalable Log Monitoring System for Hundreds of Microservices
In large‑scale microservice environments, centralized log collection, filtering, and visualization using Filebeat, Elastic APM, Kafka Streams, Grafana and Prometheus can turn scattered logs into actionable operational data while controlling resource costs.
Background
In large microservice environments, each service writes its logs locally, which makes it hard to locate the relevant logs for troubleshooting, performance analysis, or business insight.
Architecture Overview
Each service node runs a logging agent that streams logs in real time to a central Kafka cluster. The processing pipeline is:
Filebeat → Kafka → Log Streams (Kafka Streams) → Elasticsearch / Prometheus → Grafana & Kibana
Log Collection with Filebeat
Filebeat is deployed on every host via an automated release platform. Operators configure each Filebeat instance through a web UI, assigning it to one or more Kafka topics (one‑to‑one or many‑to‑one) based on log volume. In addition to business‑service logs, Filebeat also ships MySQL slow‑query and error logs, Nginx access/error logs, and other third‑party logs.
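The volume-based topic assignment can be sketched roughly as follows. This is only an illustration, not the platform's actual logic: the threshold value, the `logs-` topic naming, and the function `assign_topics` are all hypothetical, and the sketch assigns per service rather than per Filebeat instance for simplicity.

```python
from typing import Dict

# Hypothetical threshold: a service at or above this volume gets its own topic.
DEDICATED_THRESHOLD_MB_PER_HOUR = 500.0


def assign_topics(volumes_mb_per_hour: Dict[str, float],
                  shared_topic: str = "logs-shared") -> Dict[str, str]:
    """Map each service to a Kafka topic name based on its log volume."""
    assignment: Dict[str, str] = {}
    for service, volume in volumes_mb_per_hour.items():
        if volume >= DEDICATED_THRESHOLD_MB_PER_HOUR:
            # One-to-one: a high-volume service gets a dedicated topic.
            assignment[service] = f"logs-{service}"
        else:
            # Many-to-one: low-volume services share a single topic.
            assignment[service] = shared_topic
    return assignment
```

For example, `assign_topics({"order-service": 900, "coupon-service": 20})` would place `order-service` on its own topic and `coupon-service` on the shared one. Keeping high-volume producers on dedicated topics lets their partitions and retention be tuned separately.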
Application Performance Monitoring
Elastic APM agents are attached to services without modifying application code. The agents capture:
HTTP request traces and call stacks
SQL statements executed by the service
CPU, memory and other process metrics
Limitations of Elastic APM:
Not all languages are supported (e.g., C)
Non‑error business logs are not collected
Custom business exceptions may be mis‑classified as system errors, causing noisy alerts
Extended Agent Metrics
Custom extensions to the APM agents collect detailed GC, heap, memory and thread information for deeper diagnostics.
Host‑level Metrics
Prometheus scrapes standard host metrics (CPU, memory, network, disk I/O, etc.) from each node and stores them as time‑series data.
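To make "time-series data" concrete, here is a minimal sketch of reading one sample line in the Prometheus text exposition format into a metric name, a label set, and a value. The helper `parse_sample` is hypothetical and deliberately simplified: it assumes no trailing timestamp and no escaped quotes inside label values, both of which the real format allows.

```python
import re
from typing import Dict, Tuple

# Matches simple samples such as:  node_load1 0.42   or   name{a="b"} 1.5
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one exposition-format line into (metric name, labels, value)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a simple sample line: {line!r}")
    name, raw_labels, value = match.groups()
    # A metric without a {...} section has an empty label set.
    labels = dict(_LABEL.findall(raw_labels or ""))
    return name, labels, float(value)
```

Each distinct combination of metric name and labels identifies one time series; every scrape appends a new (timestamp, value) point to it.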
Log Filtering and Resource Management (Log Streams)
All logs are first ingested into Kafka with a short retention window (default 1 hour). A dedicated Log Streams service built on Kafka Streams performs ETL filtering and cleaning to reduce storage and processing costs. Configuration is UI‑driven and supports the following rules:
Default collection of all logs at error level.
Windowed collection around error timestamps to also capture surrounding info logs (configurable N seconds before/after the error event).
Per‑service whitelist of up to 100 key log patterns, collected in full.
Business‑specific slow‑SQL filtering with configurable latency thresholds.
Real‑time aggregation of business SQL statistics (e.g., query frequency per hour) to aid DBA optimization.
Dynamic weighting of log levels, service‑wise limits and time‑window adjustments during peak load.
Automatic shrinking of windows when system load is high.
Index naming convention {service}_{level}_{date} (e.g., order-service_error_20230418) to preserve familiar search patterns in Elasticsearch.
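The first three rules and the index naming convention can be sketched as below. This is a batch-style illustration under stated assumptions, not the Log Streams implementation: the `LogEvent` type, the function names, and substring matching for whitelist patterns are all hypothetical. A real Kafka Streams job would also need to buffer recent events, because the info logs that precede an error arrive before the error itself is seen.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Mapping, Sequence


@dataclass
class LogEvent:
    service: str
    level: str            # e.g. "error", "info"
    timestamp: datetime
    message: str


def select_events(events: Sequence[LogEvent],
                  window_seconds: int,
                  whitelist: Mapping[str, Sequence[str]]) -> List[LogEvent]:
    """Keep errors, whitelisted messages, and logs near an error of the same service."""
    window = timedelta(seconds=window_seconds)

    # Collect error timestamps per service so windows do not leak across services.
    error_times: Dict[str, List[datetime]] = {}
    for event in events:
        if event.level == "error":
            error_times.setdefault(event.service, []).append(event.timestamp)

    kept: List[LogEvent] = []
    for event in events:
        near_error = any(abs(event.timestamp - t) <= window
                         for t in error_times.get(event.service, []))
        # Assumption: a whitelist pattern is a plain substring of the message.
        whitelisted = any(pattern in event.message
                          for pattern in whitelist.get(event.service, ()))
        if event.level == "error" or whitelisted or near_error:
            kept.append(event)
    return kept


def index_name(event: LogEvent) -> str:
    """Build an index name following {service}_{level}_{date}."""
    return f"{event.service}_{event.level}_{event.timestamp:%Y%m%d}"
```

With a 30-second window, an info line logged 10 seconds before an error in the same service is kept, while one logged 60 seconds earlier is dropped unless it matches a whitelist pattern. An error from `order-service` on 18 April 2023 maps to the index `order-service_error_20230418`.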
Visualization
Grafana dashboards consume Prometheus metrics and Elasticsearch indices for real‑time monitoring, alerting and ad‑hoc queries. Kibana is used for detailed APM trace analysis.
Java Architect Essentials
