How to Build a TB‑Scale Log Monitoring System with ELK Stack
This article explains how to design and implement a TB‑level log monitoring platform for micro‑service environments using ELK Stack, Filebeat, Elastic APM, Kafka Streams, Prometheus, and Grafana, covering data collection, filtering, storage, and visualization while addressing cost and resource constraints.
In large micro‑service deployments, thousands of services generate massive amounts of logs that are essential for troubleshooting, performance tuning, and business analysis. Storing logs locally on each node makes it difficult to locate relevant data and to extract business value.
The proposed solution centralizes log collection, processing, and visualization using the ELK Stack (Elasticsearch, Logstash, Kibana) together with complementary tools such as Filebeat, Elastic APM, Kafka Streams, Prometheus, and Grafana.
Solution Overview
The architecture consists of the following key steps:
Uniform log collection and cleansing.
Generation of visual dashboards, alerts, and searchable indices.
Log Collection Architecture
Each service node runs a Filebeat instance configured via a backend UI. Filebeat forwards logs to Kafka topics; the mapping can be one‑to‑one or many‑to‑one depending on log volume. In addition to business logs, MySQL slow‑query logs, error logs, and third‑party service logs (e.g., Nginx) are also collected.
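The choice between one‑to‑one and many‑to‑one mapping can be illustrated with a small sketch. It assumes the mapping is from services to Kafka topics and that a volume threshold drives the decision; the threshold value, the topic names, and the `assign_topics` helper are all hypothetical and do not come from the original system.

```python
# Hypothetical threshold: services at or above this daily volume get their own topic.
DEDICATED_TOPIC_THRESHOLD_GB = 50


def assign_topics(daily_volume_gb, shared_topic="logs-shared"):
    """Map each service name to a Kafka topic name based on its daily log volume."""
    mapping = {}
    for service, volume in daily_volume_gb.items():
        if volume >= DEDICATED_TOPIC_THRESHOLD_GB:
            # High-volume service: one-to-one mapping to a dedicated topic.
            mapping[service] = f"logs-{service}"
        else:
            # Low-volume service: many-to-one mapping onto a shared topic.
            mapping[service] = shared_topic
    return mapping


# Illustrative volumes only, not measurements from the article.
volumes = {"order": 120, "payment": 80, "coupon": 3, "profile": 1}
topics = assign_topics(volumes)
```

With a mapping like this, noisy services cannot crowd out quiet ones on a shared topic, while small services avoid the overhead of a topic each.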
Elastic APM agents are deployed to capture HTTP call traces, method call stacks, SQL statements, and CPU and memory metrics without modifying application code. Although APM data is enough to detect more than 80% of incidents, the agents do not support every language (e.g., C) and cannot capture non‑error or custom business logs, so Filebeat remains necessary.
Additional components include:
Custom extensions to the APM agent for detailed GC, heap, thread, and memory metrics.
Prometheus for server‑level metrics collection.
Log Filtering and Storage Strategy
Because unlimited log retention is infeasible, logs are first ingested into a Kafka cluster with a short retention window (default one hour). A custom service called Log Streams performs ETL filtering using Kafka Streams, applying dynamic rules such as:
Default full collection of Error‑level logs.
Windowed collection of the logs surrounding each error timestamp (within a configurable N minutes), at INFO level by default.
Per‑service configuration of up to 100 key logs, collected in full.
Business‑specific slow‑SQL filtering and re‑aggregation.
Real‑time statistics of business SQL for DBA optimization.
Dynamic weighting and quota limits per service during peak periods.
Time‑based window adjustments.
Index naming per service and log level (debug, info, error, custom) with date suffixes.
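The first three rules can be sketched in plain Python. This is a batch simplification for illustration, not the Log Streams service itself, which the article says is built on Kafka Streams; the `LogEvent` type, the `filter_events` function, the level ranking, and the substring matching used for key logs are all assumptions.

```python
from dataclasses import dataclass

# Assumed severity ranking used to compare log levels.
LEVEL_RANK = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


@dataclass(frozen=True)
class LogEvent:
    service: str
    level: str        # one of the keys in LEVEL_RANK
    timestamp: float  # seconds since epoch
    message: str


def filter_events(events, window_seconds=120, key_phrases=(), min_level="INFO"):
    """Keep ERROR events, key-log matches, and min_level+ events near an error."""
    # Collect error timestamps per service so windows are service-local.
    error_times = {}
    for e in events:
        if e.level == "ERROR":
            error_times.setdefault(e.service, []).append(e.timestamp)

    kept = []
    for e in events:
        near_error = any(
            abs(e.timestamp - t) <= window_seconds
            for t in error_times.get(e.service, ())
        )
        if e.level == "ERROR":
            kept.append(e)  # rule 1: all errors are collected
        elif any(p in e.message for p in key_phrases):
            kept.append(e)  # rule 3: configured key logs are collected in full
        elif near_error and LEVEL_RANK[e.level] >= LEVEL_RANK[min_level]:
            kept.append(e)  # rule 2: context around an error, INFO and above
    return kept


events = [
    LogEvent("order", "INFO", 100, "start checkout"),
    LogEvent("order", "DEBUG", 110, "cache miss"),
    LogEvent("order", "ERROR", 150, "payment timeout"),
    LogEvent("order", "INFO", 1000, "heartbeat"),
    LogEvent("coupon", "INFO", 140, "coupon applied"),
    LogEvent("coupon", "DEBUG", 5000, "ORDER_CREATED id=1"),
]
kept = filter_events(events, window_seconds=120, key_phrases=("ORDER_CREATED",))
kept_messages = [e.message for e in kept]
```

A streaming version would have to buffer recent events for the length of the window, because logs that precede an error only become relevant once the error arrives. The one‑hour Kafka retention leaves room for that kind of look‑back.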
Filtered logs are stored in Elasticsearch, while metrics are stored in Prometheus. Grafana visualizes both data sources, and Kibana is used for APM‑specific analysis.
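The per‑service, per‑level index naming with date suffixes could look like the sketch below. The `logs` prefix, the field order, the `YYYY.MM.DD` date format, and the `index_name` helper are assumptions made for illustration; only the components (service, level, date) come from the article. Names are lowercased because Elasticsearch requires lowercase index names.

```python
from datetime import datetime, timezone

# Levels that get their own index; anything else falls into "custom".
KNOWN_LEVELS = {"debug", "info", "error"}


def index_name(service, level, timestamp, prefix="logs"):
    """Build an index name such as logs-order-error-2024.01.31 (UTC date)."""
    level = level.lower()
    bucket = level if level in KNOWN_LEVELS else "custom"
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-{service.lower()}-{bucket}-{day}"


error_index = index_name("Order", "ERROR", 0)
custom_index = index_name("order", "audit", 86400)
```

Daily indices per service and level make retention simple: expired data is removed by deleting whole indices rather than individual documents, and each level can be given its own retention period.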
Visualization
The final dashboards provide searchable log interfaces, alerting rules, and performance charts, enabling developers and operations teams to quickly pinpoint issues and monitor system health.
IT Architects Alliance